Before discussing distributed file systems, it makes sense to discuss file systems. And, it makes sense to begin with the concept of a file. What is a file?
A file is an abstraction. It is a collection of data that is organized by its users. The data within the file isn't necessarily meaningful to the OS; the OS may not know how it is organized -- or even why some collection of bytes has been organized together as a file. Nonetheless, it is the job of the operating system to provide a convenient way of locating each such collection of data and manipulating it, while protecting it from unintended or malicious damage by those who should not have access to it, and ensuring its privacy, as appropriate.
Introduction: File Systems
A file system is nothing more than the component of the operating system charged with managing files. It is responsible for interacting with the lower-level IO subsystem used to access the file data, as well as managing the files, themselves, and providing the API by which application programmers can manipulate the files.
Factors In Filesystem Design
- storage layout
- failure resilience
- efficiency (lost space is not recovered when a process ends, as it is with RAM; the penalty for frequent access is also higher -- by a factor of roughly 10^6)
- sharing and concurrency
The simplest type of naming scheme is a flat space of objects. In this model, there are only two real issues: naming and aliasing.
- Syntax/format of names
- Legal characters
- Upper/lower case issues
- Length of names
Aliasing is the ability to have more than one name for the same file. If aliasing is to be permitted, we must determine which types to allow. Aliasing is useful for several reasons:
- Some programs expect certain names. Sharing the name between two such programs is painful without aliasing.
- Manual version management: foo.1, foo.2, foo.3
- Convenience for the user -- the same file can appear in several places, near the other things that relate to it.
There are two basic types:
- early binding: the target of the name is determined at the time the link is created. In UNIX aliases that are bound early are called hard links.
- late binding: the target is re-determined on each use, not just once. That is to say, the target is bound to the name every time it is used. In UNIX, aliases that are bound late are called soft links or symbolic links. Symbolic links can dangle; that is, they can reference an object that has been destroyed.
In order to implement hard links, we must have low-level names:
- invariant across renaming
- no aliasing of low-level names
- each file has exactly one low-level name and at least one high-level name (The link count is the number of high-level names associated with a single low-level name.)
- the OS must ensure that the link count is 0 before removing a file (and that no one has it open)
UNIX has low-level names; they are called inodes. The pair (device number, inode #) is unique. The inode also serves as the data structure that represents the file within the OS, keeping track of all of its metadata. In contrast, MS-DOS uniquely names files by their location on disk -- a scheme that does not allow for hard links.
Real systems use hierarchical names, not flat names. The reason for this relates to scale. The human mind copes with large scale in a hierarchical fashion. It is essentially a human cognitive limitation: we deal with large numbers of things by categorizing them. Every large human organization is hierarchical: armies, companies, churches, etc.
Furthermore, too many names are hard to remember and it can be hard to generate unique names.
With a hierarchical name space, only a small fraction of the full namespace is visible at any level. Internal nodes are directories and leaf nodes are files. The pathname is a representation of the path from the root of the tree to the node.
The process of translating a pathname is known as name resolution. We must translate the pathname one step at a time to allow for symbolic links.
Every process is associated with a current directory, whose low-level name is determined when chdir() is called. If we follow a symbolic link to a location and try to "cd ..", we won't follow the symbolic link back to our original location -- the system doesn't remember how we got there; it takes us to the parent directory.
The ".." relationship superimposes a Directed Acyclic Graph (DAG) onto the directory structure, which may contain cycles via links.
Have you ever seen duplicate listings for the same page in Web search engines? This is because it is impossible to impose a DAG onto the Web -- not only is it not a DAG at any level, it is very highly connected.
Each directory is created with two implicit components
- "." and ".."
- path to the root is obtained by travelling up ".."
- getwd() and pwd (shell) report the current directory
- "." allows you to supply current working directory to system calls without calling getwd() first
- relative names remain valid, even if the entire tree is relocated
- "." and ".." are the same only for the root directory
What exactly is inside of each directory entry, aside from the file or directory name?
UNIX directory entries are simple: name and inode #. The inode contains all of the metadata about the file -- everything you see when you type "ls -l". It also contains the information about where (which sectors) on disk the file is stored.
MS-DOS directory entries are much more complex. They actually contain the metadata about the file:
- name -- 8 bytes
- extension -- 3 bytes
- attributes (file/directory/volume label, read-only/hidden/system) -- 1 byte
- reserved -- 10 bytes (used by OS/2 and Windows 9x)
- time -- 2 bytes
- date -- 2 bytes
- cluster # -- 2 bytes (more soon)
- size -- 4 bytes
Unix keeps similar information in the inode. We'll discuss the inode in detail very soon.
File System Operations
File system operations generally fall into one of three categories:
- Directory operations modify the name space of files. Examples include mkdir(), rename(), creat(), mount(), link(), and unlink().
- File operations obtain or modify the characteristics of objects. Examples include stat(), chmod(), and chown().
- I/O operations access the contents of a file. I/O operations, unlike file operations, modify the actual contents of the file, not the metadata associated with it. Examples include read(), write(), and lseek(). These operations typically take much longer than the other two kinds. That is to say, applications spend much more time per byte of data performing I/O operations than directory operations or file operations.
From open() to the inode
Before going any farther, I'd like to review a few details from 15-213, where certain file system data structures, and the process of opening a file, are discussed. Just to make sure that we're on the same page. Then, I'll charge forward into storage allocation, which is new in 15-412.
The operating system maintains two data structures representing the state of open files: the per-process file descriptor table and the system-wide open file table.
When a process calls open(), a new entry is created in the open file table. A pointer to this entry is stored in the process's file descriptor table. The file descriptor table is a simple array of pointers into the open file table. We call the index into the file descriptor table a file descriptor. It is this file descriptor that is returned by open(). When a process accesses a file, it uses the file descriptor to index into the file descriptor table and locate the corresponding entry in the open file table.
The open file table contains several pieces of information about each file:
- the current offset (the next position to be accessed in the file)
- a reference count (we'll explain below in the section about fork())
- the file mode (permissions),
- the flags passed into the open() (read-only, write-only, create, &c),
- a pointer to an in-RAM version of the inode (a slightly lighter-weight version of the inode for each open file is kept in RAM -- the others stay on disk), and
- A pointer to the structure containing pointers to the functions that implement the behaviors like read(), write(), close(), lseek(), &c on the file system that contains this file. This is the same structure we looked at last week when we discussed the file system interface to I/O devices.
Each entry in the open file table maintains its own read/write pointer for three important reasons:
- Reads by one process don't affect the file position in another process
- Writes are visible to all processes, if the file pointer subsequently reaches the location of the write
- The program doesn't have to supply this information on each call.
One important note: in modern operating systems, the "open file table" is usually a doubly linked list, not a static table. This ensures that it is typically a reasonable size while capable of accommodating workloads that use massive numbers of files.
Consider the cost of many reads or writes to one file.
- Each operation could require pathname resolution, protection checking, &c.
- Implicit information, such as the current location (offset) in the file, must be maintained
- Long-term state must also be maintained, especially in light of the fact that several processes using the file might require different views
- Caches or buffers may need to be initialized
The solution is to amortize the cost of this overhead over many operations by viewing operations on a file as within a session. open() creates a session and returns a handle and close() ends the session and destroys the state. The overhead can be paid once and shared by all operations.
Consequences of Fork()ing
In the absence of fork(), there is a one-to-one mapping from the file descriptor table to the open file table. But fork() introduces several complications, since the parent's file descriptor table is cloned. In other words, the child process inherits all of the parent's file descriptors -- but new entries are not created in the system-wide open file table.
One interesting consequence of this is that reads and writes in one process can affect another process. If the parent reads or writes, it will move the offset pointer in the open file table entry -- this will affect the parent and all children. The same is of course true of operations performed by the children.
What happens when the parent or child closes a shared file descriptor?
- remember that open file table entries contain a reference count.
- this reference count is decremented by a close
- the file's storage is not reclaimed as long as the reference count is non-zero, indicating that at least one open file entry for it exists
- once the reference count reaches zero, the storage can be reclaimed
- i.e., "rm" may reduce the link count to 0, but the file hangs around until all "opens" are matched by "closes" on that file.
Why clone the file descriptors on fork()?
- it is consistent with the notion of fork creating an exact copy of the parent
- it allows the use of anonymous files by children. They never need to know the names of the files they are using -- in fact, the files may no longer have names.
- The most common use of this involves the shell's implementation of I/O redirection (< and >). Remember doing this?