April 7, 2008 (Lecture 29)

April 7, 2008 (Lecture 29)

Reading

Chow and Johnson: 6.1 - 6.4
Coulouris, et al: 8.1 - 8.6

Introduction: What Is a Distributed File Systems (DFS)

Last class we discussed DSM. Today we'll discuss the construction of the distributed version of another one of our old friends from operating systems: The file system. So, what is a DFS?
Let's start to answer that question by reminding ourselves about regular, vanilla flavored, traditional file systems. What are they? What do they do? Let's begin by asking ourselves a more fundamental question: What is a file?
A file is a collection of data organized by the user. The data within a file isn't necessarily meaningful to the operating system. Instead, a file is created by the user and meaninful to the user. It is the job of the operating system to maintain this unit, without understanding or caring why.
A file system is the component of an operating system that is responsible for managing files. File systems typically implement persistent storage, although volatile file systems are also possible (/proc is such an example). So what do we mean when we say manage files? Well, here are some examples:

Name files in meaningful ways. The file system should allow a user to locate a file using a human-friendly name. In many cases, this is done using a hierarchical naming scheme like those we know and love in Windows, UNIX, Linux, &c.
Access files. Create, destroy, read, write, append, truncate, keep track of position within, &c
Physical allocation. Decide where to store things. Reduce fragmentation. Keep related things close together, &c.
Security and protection. Ensure privacy, prevent accidental (or malicious) damage, &c.
Resource administration. Enforce quotas, implement priorities, &c.

So what is it that a DFS does? Well, basically the same thing. The big difference isn't what it does but the environment in which it lives. A traditional file system typically has all of the users and all of the storage resident on the same machine. A distributed file system typically operates in an enviornment where the data may be spread out across many, many hosts on a network -- and the users of the system may be equally distributed.
For the purposes of our conversation, we'll assume that each node of our distributed system has a rudimentary local file system. Our goal will be to coordinate these file systems and hide their existence from the user. The user should, as often as possible, believe that they are using a local file system. This means that the naming scheme needs to hide the machine names, &c. This also means that the system should mitigate the latency imposed by network communication. And the system should reduce the increase risk of accidental and malicious damage to user data assocaited with vulnerabilities in network communication. We could actually continue this list. To do so, pick a property of a local file system and append transparency.

Why Would we want a DFS?

There are many reasons that we might want a DFS. Some of the more common ones are listed below:

More storage than can fit on a single system
More fault tolerance than can be achieved if "all of the eggs are in one basket."
The user is "distributed" and needs to access the file system from many places

How Do We Build a DFS?

Let's take a look at how to construct a design a DFS. What questions do we need to ask. Well, let's walk through some of the bigger design decisions.

Grafting a Name Space Into the Tree

For the most part, it isn't too hard to tie a distributed file system into the name space of the local file system. The process of associating a remote directory tree with a particular point in the local directory tree is known as mounting the local directory that represents the root of the remote directory is typically known as a mount point. Directories are often mounted as the result of an action on the part of an administrator. In the context of UNIX systems, this can be via the "mount" command. The mounts may be done "by hand" or as part of the system initialization. Many UNIX systems automatically associate remote directories with local ones based on "/etc/vfstab". Other mechanisms may also be available. Solaris, for example, can bind remote directories to local mount points "as needed" and "on demand" through the "automounter".
Since mounting hides the location of files from the user, it allows the files to be spread out across multiple servers and/or replicated, without the users knowledge. AFS implements read-only replication. Coda supports read-write replication.

Implementing The Operations

Once the name spaces are glued, the operations on the distributed file system need to be defined. In other words, once the know that the user wants to access a directory entry that lives somewhere else, how does it know what to do? In the context of UNIX and UNIX-like operating systems, this is typically done via the virtual file system (VFS) interface. Recall from operating systems that VFS and vnodes provide an easy way to keep the same interface and the same operations, but to implement them differently. Many other operating systems provide similar object-oriented interfaces.
In a traditional operating system, the operations on a vnode access a local device and local cache. In a similar fashion, we can write operations to implement the VFS and vnode interfaces that will go across the network to get our data. On systems that don't have as nice an interface, we have to do more heavy kernel hacking -- but implementation is still possible. We just implement the DFS using whatever tools are provided to implement a local file system, but ultimately look to the network for the data.

Unit of Transit

So, now that we can move data across the network, how much do we move? Well, there are two intuitive answers to this question: whole files and blocks. Since a file is the unit of storage known to the user, it might make sense to move a whole file each time the user asks to access one. This makes sense, because we don't know how the file is organized, so we don't know what the user might want to do -- except that the user has indicated that the data within the file is related. Another option is to use a unit that is convenient to the file system. Perhaps we agree to a particular block size, and move only one block at a time, as demanded.
Whole file systems are convenient in the sense that once the file arrives, the user will not encounter any delays. Block-based systems are convenient because the user can "get started" after the first block arrives and does not need to wait for the whole (and perhaps very large) file to arrive. As we'll talk about soon, it is easier to maintain file-level coherency if a whole file based system is used and block-level coherency if a block-based system is used. But, in practice, the user is not likely to notice the difference between block-level and file-level systems -- the overwhelming majority of files are smaller than a single block.

Reads, Writes and Coherency

For the moment, let's assume that there is no caching and that a user can atomically read from a file, write to it, and write it back. If there is only one user, it doesn't matter what the she or he might do, or he or she does it -- the file will always be as expected when all is said and done.
But if multiple users are playing with the file, this is not the case. In a file-based system, if two users write to two different parts of the file, the final result will be the file from the perspective of one user, or the other, but not both. This is because one of the writes is destroyed by the subsequent writes. In a block-based system, if the writes lie in different blocks, the final file may represent the changes of both users -- this is because only the block containing each write is written -- not the whole file.
This might be good, because neither user's change is lost. This might also be bad, because the final file isn't what either user expects. It might be better to have one broken application than two broken applications. This is completely a semantic decision.
In a traditional UNIX file system, the unit of sharing is typically a byte. Unless flock(), lockf(), or some other voluntary locking is used, files are not protected from inconsistency caused by multiple users and conflicting writes.
Andrew File System (AFS) version 1 and 2 implemented whole file semantics, as does Coda. NFS and AFS version 3 implement block level semantics. None of these popular systems implement byte-level semantics as does a local file system. This is because the network overhead of sending a message is too high to justify such a small payload.
AFS changed from file level to block level access to reduce the lead time associated with opening and closing a file, and also to improve cache efficiency (more soon). It is unclear if actual workloads, or simply perception, prompted the change. My own belief is that it was simply perception (good NFS salespeople?) This is because most files are smaller than a block -- databases should not be distributed via a DFS!

Caching and Server State

There is typically a fairly high latency associated with the network. We really don't want to ask the file server for help each time the user accesses a file -- we need caching. If only one user would access a unit (file or block) at a time, this wouldn't be hard. Upon a close, we could simply move it to the client, let the client play with it, and then move it back (if written) or simply invalidate the cached copy (if read).
But life gets much harder in a multiuser system. What happens if two different users want to read the same file? Well, I suppose we could just give both of them copies of the file, or block, as appropriate. But then, what happens if one of them, or another user somewhere else writes to the file? How do the clients hodling cached copies know?
Well, we can apply one of several solutions to this problem. The first solution is to apply the Ostrich principle -- we could just hope it doesn't happen and live with the inconsistency if it does. This isn't a bad technique, because, in practice, files are rarely written by multiple users -- never mind concurrently.
Or, perhaps we can periodically validate our cache by checking with the server: "I have checksome [blah]. Is this still valid?" or "I have timestamp [blah] is this still valid?" In this case we can eventually detect a change an invalidate the cache, but there is a window of vulnerability. But, the good thing is that the user can access the data without bringing it back from the network each time. (And, even that approach can still allow for inconsistency, since it takes non-zero time). The cache and validate approach is the technique used by NFS.
An alternative to this approach is used by AFS and Coda. These systems employ smarter servers that keep track of the users. They do this by issuing a callback promise to each client as it collects a file/block. The server promises the client that it will be informed immediately if the file changes. This notice allows the client to invalidate its cache entry.
The callback-based approach is optimistic and saves the network overhead of the periodic validations, while ensuring a more strict consistency model. But it does complicate the server and reduce its robustness. This is because the server must keep track of the callbacks and problems can arise if the server fails, forgets about these promises, and restarts. Typically callbacks have timelimits associated with them, to reduce the amount of damage in the event that a client crashes and doesn't close() the file, or server fails and cannot issue a callback.