Advanced File Systems

Disclaimer

As always, these are "Additional Notes". They reinforce what we covered in class -- and often go deeper. In the case of these notes, we just touched on many of these ideas. For ease of use and topical completeness, the bread and butter of what we covered is grouped with last class's lecture materials.

A Quick Look Back At Traditional File Systems

We've looked at "general-purpose inode-based file systems" such as UFS and ext2. They are the workhorses of the world. They are reasonably fast, but have some limitations, including:

- slow crash recovery: the entire file system must be checked (fsck) after a failure
- statically allocated inodes
- limited scalability to very large files and file systems

Hybrid file systems

Today we are going to talk about a newer generation of file systems that keep the best characteristics of traditional file systems, add some improvements, and include logging to increase availability in the event of failure. These file systems support much larger file systems than could reasonably be managed using the older file systems, and do so more robustly -- and often faster.

Much like the traditional file systems we discussed share common characteristics -- similar inode structures, buffer cache organizations, &c -- these file systems will often share some of the same characteristics:

- journaling (logging) of metadata changes to speed and strengthen crash recovery
- B-tree based (B+ or B*) organization of directories and metadata
- dynamically allocated inodes
- smarter space management, often extent-based

ReiserFS

ReiserFS isn't the most sophisticated file system in this class, but it is a reasonably new one. Furthermore, despite the availability of journaling file systems for other platforms, ReiserFS was among the first available for Linux, and it is the first, and currently the only, hybrid file system that is part of the official Linux kernel distribution.

As with the other file systems that we discussed, ReiserFS journals only metadata. It is based on a variation of the B+ tree, the B* tree. Unlike the B+ tree, which splits one full node into two half-full nodes, the B* tree splits two full nodes into three nodes, each about two-thirds full. This increases the overall packing density of the tree at the expense of only a small amount of code complexity.

It also offers a unique tail optimization. This feature helps to mitigate internal fragmentation. It allows the tails of files, the end portions of files that occupy less than a whole block, to be stored together to more completely fill a block.
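
To make the savings concrete, here is a minimal sketch in C, with hypothetical tail sizes and ReiserFS's 4KB block size; the numbers are illustrative only:

    #include <stdio.h>

    #define BLOCK_SIZE 4096   /* ReiserFS block size */

    int main(void)
    {
        /* Hypothetical tails: the bytes left over past each
         * file's last full block. */
        int tails[] = { 300, 700, 1500 };
        int n = sizeof(tails) / sizeof(tails[0]);
        int used = 0;

        for (int i = 0; i < n; i++)
            used += tails[i];

        /* Without tail packing, each tail wastes most of a block. */
        printf("unpacked: %d blocks, %d bytes wasted\n",
               n, n * BLOCK_SIZE - used);

        /* With tail packing, the tails share formatted nodes. */
        int blocks = (used + BLOCK_SIZE - 1) / BLOCK_SIZE;
        printf("packed:   %d block(s), %d bytes wasted\n",
               blocks, blocks * BLOCK_SIZE - used);
        return 0;
    }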

Unlike the other file systems, its space management is still pretty "old-school". It uses a simple block-based allocator and manages free space using a simple bitmap, instead of a more efficient extent-based allocator and/or B-tree based free-space management. Currently the block size is 4KB, the maximum file size is 4GB, and the maximum file system size is 16TB. Furthermore, ReiserFS doesn't support sparse files -- all blocks of a file are mapped. Reiser4, scheduled for release this fall, will address some of these limitations by including extents and variable block sizes of up to 64KB.

For the moment, free blocks are found using a linear search of the bitmap. The search is in order of increasing block number to match the rotation of the disk. It tries to keep related data together by beginning the search at the bitmap position representing the file's left neighbor; this starting point was determined empirically to work better than the alternatives that were considered.
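
A minimal sketch of that search, assuming a simple in-memory bitmap in which a set bit means "in use"; the hint would be the block number of the left neighbor, and the wrap-around is my own addition for completeness:

    #include <stdint.h>

    /* Scan the free-space bitmap in order of increasing block
     * number, starting at a hint (e.g., the left neighbor's block)
     * so that related data lands close together on disk. */
    static long find_free_block(const uint8_t *bitmap, long nblocks,
                                long hint)
    {
        for (long i = 0; i < nblocks; i++) {
            long b = (hint + i) % nblocks;        /* wrap at the end */
            if (!(bitmap[b / 8] & (1u << (b % 8))))
                return b;                          /* first free block */
        }
        return -1;                                 /* no free blocks */
    }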

ReiserFS allows for the dynamic allocation of inodes and keeps inodes and the directory structure organized within a single B* tree. This tree organizes four different types of items:

- stat data items, which hold a file's attributes (the traditional inode metadata)
- directory items, which map names to the items they identify
- indirect items, which point to unformatted nodes holding whole blocks of data
- direct items, which hold the tails of files themselves

Items are stored in the tree using a key, which is a tuple:

<parent directory ID, offset within object, item type/uniqueness>, where the parent directory ID identifies the directory containing the item, the offset gives the item's position within its object, and the type/uniqueness field distinguishes the kind of item (for example, a direct item versus an indirect item).

Each key structure also contains a unique item number, basically the inode number, but this isn't used to determine ordering. Instead, the tree sorts keys by comparing the tuple components in order of position. This organizes the files in the tree in a way that keeps files within the same directory together, sorted by file or directory name.
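
A sketch of this ordering in C; the field names are illustrative, not the kernel's actual definitions:

    #include <stdint.h>

    /* Sketch of a ReiserFS-style key. */
    struct key {
        uint32_t parent_dir_id;   /* groups items by directory       */
        uint32_t object_id;       /* unique item (inode) number --   */
                                  /* present, but NOT used to order  */
        uint64_t offset;          /* offset within the object        */
        uint32_t type;            /* item type / uniqueness          */
    };

    /* Compare keys component by component, in tuple order.  Items
     * in the same directory sort next to each other, keeping them
     * together on disk. */
    static int key_cmp(const struct key *a, const struct key *b)
    {
        if (a->parent_dir_id != b->parent_dir_id)
            return a->parent_dir_id < b->parent_dir_id ? -1 : 1;
        if (a->offset != b->offset)
            return a->offset < b->offset ? -1 : 1;
        if (a->type != b->type)
            return a->type < b->type ? -1 : 1;
        return 0;
    }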

The leaf nodes are data nodes. Unformatted nodes contain whole blocks of data. "Formatted" nodes hold the tails of files; they are formatted to allow more than one tail to be stored within the same block. Since the tree is balanced, the path to any of these data nodes is the same length.

A file is composed of a set of indirect items and, at most, two direct items for the tail. Why not always one? If a tail is smaller than a whole block but larger than the space available in a single formatted node, it needs to be broken apart and placed into two direct items.

SGI's XFS

In many ways SGI's XFS is similar to ReiserFS, but it is also more sophisticated -- perhaps the most sophisticated among the systems we'll consider. That said, unlike ReiserFS, XFS uses B+ trees instead of B* trees.

The extent-based allocator is rather sophisticated. In particular, it has three pretty cool features. First, it allows for delayed allocation. Basically, this allows the system to build a virtual extent in RAM and then allocate it in one piece at the end. This mitigates the "and one more thing" syndrome that can lead to a bunch of small extents instead of one big one. Second, it allows for the preallocation of an extent. This allows the system to reserve a sufficiently large extent in advance so that the right-sized extent can be used -- without consuming memory for delayed allocation or running the risk of running out of space later on. Third, it allows for the coalescing of extents as they are freed, to reduce fragmentation.
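
A minimal sketch of the delayed-allocation idea; the names are illustrative, not XFS's actual interfaces:

    #include <stdint.h>

    /* A "virtual" extent: buffered writes accumulate here in RAM. */
    struct pending_extent {
        uint64_t file_offset;  /* where the data starts in the file */
        uint64_t length;       /* bytes buffered so far             */
    };

    /* Each buffered write merely extends the in-memory extent... */
    static void buffered_write(struct pending_extent *pe, uint64_t nbytes)
    {
        pe->length += nbytes;
    }

    /* ...and a single allocation of the final size happens at flush
     * time, yielding one big extent instead of many small ones. */
    static uint64_t flush(struct pending_extent *pe,
                          uint64_t (*alloc_extent)(uint64_t len))
    {
        return alloc_extent(pe->length);
    }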

The file system is organized into different partitions called allocation groups (AGs). Each allocation group has its own data structures -- for practical purposes, the AGs are separate instances of the same file system class. This helps keep the data structures at a manageable scale. It also allows for parallel activity on multiple AGs without concurrency-control mechanisms creating hot spots.

Inodes are created dynamically in chunks of 64 inodes. Each inode is numbered using a tuple that includes both the chunk number and the inode's index within its chunk. The location of an inode can be discovered by looking up its chunk number in a B+ tree. The B+ tree also contains a bitmap showing which inodes within each chunk are in use.
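
A sketch of this numbering scheme; the packing of chunk number and index into one integer is illustrative, not XFS's actual on-disk layout:

    #include <stdint.h>

    #define INODES_PER_CHUNK 64   /* inodes are allocated 64 at a time */

    /* Pack the (chunk, index) tuple into one inode number. */
    static inline uint64_t ino_make(uint64_t chunk, unsigned index)
    {
        return chunk * INODES_PER_CHUNK + index;
    }

    static inline uint64_t ino_chunk(uint64_t ino)
    {
        return ino / INODES_PER_CHUNK;   /* B+ tree lookup key */
    }

    static inline unsigned ino_index(uint64_t ino)
    {
        return ino % INODES_PER_CHUNK;   /* slot within the chunk */
    }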

Free space is managed using two different B+ trees of extents. One B+ tree is organized by size, whereas the other is organized by location. This allows for efficient allocation -- both by size and by locality.
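
To see why two indexes help, consider a best-fit lookup; in this sketch, a size-sorted array stands in for the size-organized B+ tree, and the names are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    struct extent { uint64_t start, len; };

    /* Best fit: the first entry in the size-sorted view that is big
     * enough.  (A location-sorted view would instead answer "free
     * extent nearest this block" for locality.) */
    static const struct extent *best_fit(const struct extent *by_size,
                                         int n, uint64_t want)
    {
        for (int i = 0; i < n; i++)
            if (by_size[i].len >= want)
                return &by_size[i];
        return 0;   /* no single extent is large enough */
    }

    int main(void)
    {
        struct extent by_size[] = { {900, 2}, {100, 8}, {400, 32} };
        const struct extent *e = best_fit(by_size, 3, 5);
        printf("alloc at %llu\n", (unsigned long long)e->start); /* 100 */
        return 0;
    }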

Directories are also stored in a B+ tree. Instead of storing the name itself in the tree, a hash of the name is stored. This is done because it is more complicated to organize a B-tree around names of different sizes; regardless of its length, a name hashes to a key of the same size.
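
The idea is easy to sketch; XFS uses its own hash function, so the FNV-1a variant below is purely illustrative:

    #include <stdint.h>
    #include <stdio.h>

    /* Reduce a variable-length name to one fixed-size B+ tree key. */
    static uint32_t name_hash(const char *name)
    {
        uint32_t h = 2166136261u;          /* FNV-1a offset basis */
        while (*name) {
            h ^= (unsigned char)*name++;
            h *= 16777619u;                /* FNV-1a prime */
        }
        return h;
    }

    int main(void)
    {
        /* Short or long, every name yields a 4-byte key. */
        printf("%08x %08x\n", name_hash("a"),
               name_hash("a-very-long-file-name.txt"));
        return 0;
    }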

Each file within this tree contains its own storage map (inode). Initially, the inode stores block offsets and extent sizes, measured in blocks, directly. When the file grows and its map overflows the inode, the storage allocation is stored in a B+ tree rooted at the inode. This tree is indexed by the file offset of each extent and stores the extent's size. In this way, the directory structure is really a tree of inodes, which in turn are trees of the files' actual storage.

Much like ReiserFS, XFS logs only metadata changes, not changes to the files' actual data. In the event of a crash, it replays these logs to obtain consistent metadata. XFS also includes a repair program, similar to fsck, that is capable of fixing other types of corruption. This repair tool was not in the first release of XFS, but was demanded by customers and added later. Logging can be done to a separate device to prevent the log from becoming a hot spot in high-throughput applications. Normally asynchronous logging is used, but synchronous logging is possible (albeit expensive).
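
Returning to the per-file storage map, here is a minimal sketch of an offset-indexed extent record and the lookup it enables; the field names are illustrative, and a sorted array stands in for the B+ tree:

    #include <stdint.h>
    #include <stdio.h>

    /* One extent-map entry: keyed by file offset, mapping a run of
     * file blocks to a run of disk blocks. */
    struct extent_rec {
        uint64_t file_offset;  /* key: starting block within the file */
        uint64_t disk_block;   /* where that run begins on disk       */
        uint32_t length;       /* extent size, in blocks              */
    };

    /* Translate a file block to a disk block by searching the
     * offset-sorted map. */
    static uint64_t bmap(const struct extent_rec *map, int n,
                         uint64_t fblock)
    {
        for (int i = 0; i < n; i++)
            if (fblock >= map[i].file_offset &&
                fblock <  map[i].file_offset + map[i].length)
                return map[i].disk_block +
                       (fblock - map[i].file_offset);
        return 0;   /* not mapped */
    }

    int main(void)
    {
        struct extent_rec map[] = { {0, 5000, 16}, {16, 9000, 8} };
        printf("file block 18 -> disk block %llu\n",
               (unsigned long long)bmap(map, 2, 18));   /* 9002 */
        return 0;
    }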

XFS offers variable block sizes ranging from 512 bytes to 64KB and an extent-based allocator. The maximum file size is 9 thousand petabytes (9 exabytes). The maximum file system size is 18 thousand petabytes (18 exabytes).

IBM's JFS

IBM's JFS isn't one of the best performers among this class of file systems. But that is probably because it was one of the first. What to say? Things get better over time -- and I think everyone benefited from IBM's experience here.

File system partitions correspond to what are known in DFS as aggregates. Within each partition lives an allocation group, similar to that of XFS. Within each allocation group are one or more filesets. A fileset is nothing more than a mountable tree. JFS supports extents within each allocation group.

Much like XFS, JFS uses a B+ tree to store directories. And, again, it also uses a B+ tree to track allocations within a file. Unlike XFS, however, the B+ tree is used to track even small allocations. The only exception is an optimization that allows symlinks to live directly in the inode.

Free space is represented as an array with one bit per block. This bit array can be viewed as an array of 32-bit words. These words then form a binary tree sorted by size. This makes it easy to find a contiguous chunk of space of the right size without a linear search of the available blocks. The same array is also indexed by another tree as a "binary buddy". This allows for easy coalescing and easy tracking of the allocated size.
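
A toy sketch of the binary-buddy coalescing idea, with a small in-memory map standing in for JFS's real on-disk structures:

    #include <stdint.h>
    #include <stdio.h>

    #define NBLOCKS 64
    static int free_map[NBLOCKS];   /* 1 = block is free (toy stand-in) */

    /* A chunk of 2^order blocks is free iff all of its blocks are. */
    static int chunk_free(unsigned block, unsigned order)
    {
        for (unsigned i = 0; i < (1u << order); i++)
            if (!free_map[block + i])
                return 0;
        return 1;
    }

    /* The buddy of an aligned chunk differs only in bit `order`. */
    static unsigned buddy_of(unsigned block, unsigned order)
    {
        return block ^ (1u << order);
    }

    /* Merge upward while the buddy is also free: two 2^k chunks
     * coalesce into one 2^(k+1) chunk, so freed space re-forms into
     * the largest contiguous pieces possible. */
    static unsigned coalesce(unsigned *block, unsigned order)
    {
        while ((1u << (order + 1)) <= NBLOCKS &&
               chunk_free(buddy_of(*block, order), order)) {
            *block &= ~(1u << order);   /* merged chunk starts lower */
            order++;
        }
        return order;
    }

    int main(void)
    {
        for (unsigned i = 8; i < 16; i++)   /* blocks 8..15 are free */
            free_map[i] = 1;
        unsigned blk = 8, order = coalesce(&blk, 0);
        printf("chunk at %u, %u blocks\n", blk, 1u << order); /* 8, 8 */
        return 0;
    }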

These trees actually have a somewhat complicated structure, which we won't spend the time to cover in detail here. This really was one of the "original attempts" and is not very efficient. I can provide you with some references if you'd like more detail.

As for the sideline statistics, the block size can be 512B, 1KB, 2KB, or 4KB. The maximum file size ranges from 512TB with a 512-byte block size to 4PB with a 4KB block size. Similarly, the maximum file system size ranges from 4PB with 512-byte blocks to 32PB with a 4KB block size.

Ext3

Ext3 isn't really a new file system. It is basically a journaling layer on top of ext2, the "standard" Linux file system. It is both forward and backward compatible with ext2: one can mount any ext2 file system as ext3, or mount any ext3 file system as ext2. This file system is particularly noteworthy because it is backed by Red Hat and is their "official" file system of choice.

Basically, Red Hat wanted to give its customers a path into journaling file systems, but with as little transitional headache and risk as possible. Ext3 offers all of this. There is no need, in any real sense, to convert an existing ext2 file system to it -- really, ext3 just needs to be enabled. Furthermore, the unhappy customer can always go back to ext2. And, in a pinch, the file system can always be mounted as ext2, and the old fsck remains perfectly effective.

The journaling layer of ext3 is really separate from the file system layer. There are only two differences between ext2 and ext3. The first, which really isn't a change to ext2 proper, is that ext3 has a "logging layer" to log the file system changes. The second is the addition in ext3 of a communication interface from the file system to the logging layer. Additionally, one ext2 inode is used for the log file, but this really doesn't matter from a compatibility point of view -- unless the ext2 file system is (or otherwise would be) completely full.

Three types of things are logged by ext3. These must be logged atomically (all or none):

- metadata blocks -- the full contents of changed file system blocks
- descriptor blocks -- which record where the journaled metadata blocks belong on disk
- header blocks -- which track the head and tail of the journal itself

Periodically, the in-memory log is checkpointed by writing outstanding entries to the journal. Because writes to the log file are cached like any other writes, the journal itself is committed to disk periodically. The level of journaling is a mount option. The classic performance-versus-recency trade-off governs how often we sync the log to disk.
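
A toy model of this logging discipline: updates for one operation form a transaction, and a trailing commit record makes the transaction all-or-none on replay. The interface names are illustrative, not the kernel's actual ones:

    #include <stdio.h>

    #define LOG_CAP 64
    static long log_rec[LOG_CAP];
    static int  log_len, txn_start;

    static void txn_begin(void)
    {
        txn_start = log_len;            /* mark where this txn begins */
    }

    static void txn_log_block(long blocknr)
    {
        log_rec[log_len++] = blocknr;   /* journal one updated block */
    }

    static void txn_commit(void)
    {
        /* The commit record goes last: if a crash happens before it
         * reaches disk, recovery discards the whole transaction. */
        log_rec[log_len++] = -1;
    }

    int main(void)
    {
        txn_begin();
        txn_log_block(42);              /* e.g., an inode block */
        txn_log_block(97);              /* e.g., a bitmap block */
        txn_commit();
        printf("txn of %d records (incl. commit)\n",
               log_len - txn_start);    /* 3 */
        return 0;
    }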

As for the sideline stats, the block size is variable between 1KB and 4KB. The maximum file size is 2GB and the maximum filesystem size is 4TB.

As you can see, this is nothing more than a version of ext2 that supports a journaling/logging layer, providing a faster, and optionally more thorough, recovery mode. I think Red Hat made the wrong choice. My bet is that people want more than compatibility -- more than ext3 offers. Instead, I think that the ultimate winner will be the new version of ReiserFS or XFS. Or, perhaps, something new -- but not this.