Disclaimer
As always these are "Additional Notes". They reinforce what we covered in class -- and often go deeper. In the case of these notes, we just touched on many of these ideas. For ease of use and topical completeness, the bread and butter of what we covered is grouped with last class's lecture materials.
A Quick Look Back At Traditional File Systems
We've looked at "General Purpose inode-based file systems" such as UFS and ext2. They are the workhorses of the world. They are reasonably fast, but have some limitations, including:
- Internal fragmentation: waste w/in allocated blocks
- Slow disk performance due to seeking (although some, like FFS, try to reduce this using extents -- they keep track of multi-sector blocks of various sizes and can allocate that way)
- Large logical blocks decrease transfer cost and metadata overhead (number of indirect blocks needed, &c), but increase internal fragmentation.
- Small logical blocks do the opposite: they reduce fragmentation, but increase inode overhead and transfer time. (A small sketch of this trade-off follows this list.)
- Opportunity for meta-data inconsistency due to crashes. Long recovery time due to the intensive process used by tools such as the venerable fsck to discover lost information or otherwise force consistency.
- Block sizes vs. meta-data, and the need for fsck to scan whole partitions, make these inappropriate for large partitions -- too much wastage, and too much time wasted in fsck for high availability.
- Free space tracking, typically a bit for each block, is also a big RAM waste and time-consumer for large file systems.
- Directory lookups are painfully slow -- we sequentially open and search directory files. Caching helps a good bit, but there are plenty of cold misses and capacity misses, &c.
- Static inode allocation limits the number of files and wastes space if not needed.
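To make the block-size trade-off above concrete, here is a minimal sketch (not drawn from any particular file system -- the file sizes are invented purely for illustration) that computes the internal fragmentation and block count for a few block sizes:

```c
/* Sketch: internal fragmentation vs. block count for different block sizes.
 * The file sizes below are made up purely for illustration. */
#include <stdio.h>

int main(void) {
    const long file_sizes[] = { 100, 2000, 5000, 70000, 1500000 };  /* bytes */
    const long block_sizes[] = { 1024, 4096, 8192 };
    const int nfiles = sizeof(file_sizes) / sizeof(file_sizes[0]);
    const int nblks  = sizeof(block_sizes) / sizeof(block_sizes[0]);

    for (int b = 0; b < nblks; b++) {
        long bs = block_sizes[b];
        long total_blocks = 0, total_waste = 0;
        for (int f = 0; f < nfiles; f++) {
            long blocks = (file_sizes[f] + bs - 1) / bs;  /* round up */
            total_blocks += blocks;
            total_waste  += blocks * bs - file_sizes[f];  /* unused bytes in last block */
        }
        printf("block size %5ld: %4ld blocks, %7ld bytes of internal fragmentation\n",
               bs, total_blocks, total_waste);
    }
    return 0;
}
```

Larger blocks mean fewer blocks to track (less metadata, fewer indirect blocks), but more wasted space in each partially filled last block; smaller blocks reverse the trade-off.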
Hybrid file systems
Today we are going to talk about a newer generation of file systems that keep the best characteristics of traditional file systems, plus some improvements, and also add logging to increase availability in the event of failure. These file systems, in particular, support much larger file systems than could reasonably be managed using the older file systems and do so more robustly -- and often faster.
Much like the traditional file systems that we talked about have common characteristics, such as similar inode structures, buffer cache organizations, &c, these file systems will often share some of the same characteristics:
- They will log only the metadata, not the data. This allows for a fast recovery at boot time, but doesn't provide any consistency or correctness guarantees about the data blocks of the files, themselves.
- Unlike LFS, these file systems will manage the data blocks (or extents) in a random-access way -- writes to data blocks will replace the original data, not land at the end of the log file. (The log, itself, however, is an append-only log.)
- Make extensive use of the B+ tree data structure to organize information.
- The free block (or extent) list is one such case -- it is often maintained by B+ trees. A common organization keeps two trees -- one ordered by size and another ordered by location (physical address). This makes it fast to find an extent that is large enough, and also to find one that is near the file's other data -- this minimizes seek delay. (See the sketch after this list.)
- "Give me an extent of at least size X" is fast,
- as is "Give me a block or extent near X".
- Directory files are replaced with B+ trees.
- Typically one per-directory B+ tree of entries,
- But some use one file-system-wide B+ tree containing all directories and their entries.
- Organize inodes by file name using a B+ tree
- Keep small-sized files' data directly in the inodes.
- Keep medium-sized files' data in extents or blocks directly named from the inodes
- Keep large-sized files' blocks or extents organized in B+ tree indexed by offset and named by inode.
- Many allow dynamic inode allocation to avoid wasted space - there's no more need to index into an array by number. The inode is directly named by the B+ tree
- In general, performance comparable to standard file systems, with more space efficiency and higher reliability.
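As a rough sketch of the dual-tree idea referenced in the list above (free extents indexed both by size and by location), the toy allocator below uses two sorted arrays in place of B+ trees; the extents and names are invented for illustration only:

```c
/* Sketch: free extents indexed two ways -- by size ("give me >= N blocks")
 * and by starting address ("give me something near block X").
 * Sorted arrays stand in for the B+ trees a real file system would use. */
#include <stdio.h>
#include <stdlib.h>

struct extent { long start, len; };   /* starting block and length, in blocks */

static int by_size(const void *pa, const void *pb) {
    const struct extent *a = pa, *b = pb;
    return (a->len > b->len) - (a->len < b->len);
}
static int by_start(const void *pa, const void *pb) {
    const struct extent *a = pa, *b = pb;
    return (a->start > b->start) - (a->start < b->start);
}

/* Smallest free extent with at least 'want' blocks (array sorted by length). */
static const struct extent *find_by_size(const struct extent *e, int n, long want) {
    for (int i = 0; i < n; i++)
        if (e[i].len >= want) return &e[i];
    return NULL;
}

/* Free extent whose start is closest to 'target' (for locality). */
static const struct extent *find_by_location(const struct extent *e, int n, long target) {
    const struct extent *best = NULL;
    long best_dist = 0;
    for (int i = 0; i < n; i++) {
        long d = labs(e[i].start - target);
        if (best == NULL || d < best_dist) { best = &e[i]; best_dist = d; }
    }
    return best;
}

int main(void) {
    struct extent by_len[]  = { {100, 8}, {40, 2}, {900, 64}, {300, 16} };
    struct extent by_addr[] = { {100, 8}, {40, 2}, {900, 64}, {300, 16} };
    int n = 4;
    qsort(by_len,  n, sizeof(struct extent), by_size);
    qsort(by_addr, n, sizeof(struct extent), by_start);

    const struct extent *a = find_by_size(by_len, n, 10);
    const struct extent *b = find_by_location(by_addr, n, 310);
    if (a) printf("at least 10 blocks: start=%ld len=%ld\n", a->start, a->len);
    if (b) printf("nearest to block 310: start=%ld len=%ld\n", b->start, b->len);
    return 0;
}
```

A real implementation would answer both queries with logarithmic B+ tree lookups; the linear scans here just make the two orderings easy to see.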
ReiserFS
The ReiserFS isn't the most sophisticated among this class of filesystems, but it is a reasonably new filesystem. Furthermore, despite the availability of journaling file systems for other platforms, Reiser was among the first available for Linux and is the first, and only, hybrid file system currently part of the official Linux kernel distribution.
As with the other filesystems that we discussed, ReiserFS only journals metadata. And, it is based on a variation of the B+ tree, the B* tree. Unlike the B+ tree, which does 1-2 splits, the B* tree does 2-3 splits. This increases the overall packing density of the tree at the expense of only a small amount of code complexity.
It also offers a unique tail optimization. This feature helps to mitigate internal fragmentation. It allows the tails of files, the end portions of files that occupy less than a whole block, to be stored together to more completely fill a block.
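As a rough illustration of why tail packing helps (the tail sizes and the simple first-fit packing policy here are invented -- this is not ReiserFS's actual layout), consider how many 4KB blocks a handful of small tails need with and without packing:

```c
/* Sketch: packing file tails together vs. giving each its own block.
 * A simple first-fit packer over invented tail sizes, 4KB blocks. */
#include <stdio.h>

#define BLOCK 4096
#define MAXBLOCKS 16

int main(void) {
    const int tails[] = { 700, 120, 3000, 512, 60, 2048, 900 };  /* leftover bytes */
    const int ntails = sizeof(tails) / sizeof(tails[0]);

    int space_left[MAXBLOCKS];   /* bytes remaining in each shared block */
    int packed_blocks = 0;

    for (int i = 0; i < ntails; i++) {
        int placed = 0;
        for (int b = 0; b < packed_blocks && !placed; b++) {
            if (space_left[b] >= tails[i]) {            /* first fit */
                space_left[b] -= tails[i];
                placed = 1;
            }
        }
        if (!placed) {                                   /* open a new shared block */
            space_left[packed_blocks] = BLOCK - tails[i];
            packed_blocks++;
        }
    }
    printf("unpacked: %d blocks (one per tail)\n", ntails);
    printf("packed:   %d blocks\n", packed_blocks);
    return 0;
}
```

Here seven tails that would otherwise waste most of seven blocks fit into two shared blocks.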
Unlike the other file systems, its space management is still pretty "old-school". It uses a simple block-based allocator and manages free space using a simple bit-map, instead of a more efficient extent-based allocator and/or B-tree based free space management. Currently the block size is 4KB, the maximum file size is 4GB, and the maximum file system size is 16TB. Furthermore, ReiserFS doesn't support sparse files -- all blocks of a file are mapped. Reiser4, scheduled for release this fall, will address some of these limitations by including extents and a variable block size of up to 64KB.
For the moment, free blocks are found using a linear search of the bitmap. The search is in the order of increasing block number, to match disk spin. It tries to keep things together by searching the bitmap beginning at the position representing the left neighbor. (A sketch of this scan follows the list below.) This was empirically determined to be the best of the following:
- Starting at the beginning (no locality, really)
- Starting at the right neighbor (begins past us, given disk spin)
- Starting at the left neighbor (if space, right in-between; but costly to find left neighbor)
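A minimal sketch of the bitmap scan described above: search a free-block bitmap in increasing block order, starting from a hint (the position of the left neighbor) and wrapping around. The bitmap contents and the hint are invented.

```c
/* Sketch: linear scan of a free-block bitmap, starting from a hint position
 * (e.g., the block of the file's left neighbor) and wrapping around. */
#include <stdio.h>

#define NBLOCKS 32

/* Returns the first free block at or after 'hint' (wrapping), or -1 if none. */
static int find_free(const unsigned char *bitmap, int hint) {
    for (int i = 0; i < NBLOCKS; i++) {
        int blk = (hint + i) % NBLOCKS;
        int byte = blk / 8, bit = blk % 8;
        if (!(bitmap[byte] & (1u << bit)))   /* 0 bit == free */
            return blk;
    }
    return -1;
}

int main(void) {
    unsigned char bitmap[NBLOCKS / 8] = { 0xff, 0x0f, 0x00, 0xf0 };  /* blocks 0-11, 28-31 in use */
    int left_neighbor = 9;
    int blk = find_free(bitmap, left_neighbor);
    printf("first free block at or after %d: %d\n", left_neighbor, blk);
    return 0;
}
```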
ReiserFS allows for the dynamic allocation of inodes and keeps inodes and the directory structure organized within a single B* tree. This tree organizes four different types of items:
- Direct items - tails of files packed together or one small file
- Indirect items - unformatted [data] nodes; hold whole blocks of file data
- Directory items - key for first directory entry, plus number of directory entries
- Stat items - metadata (configuration option to combine these with directory item)
Items are stored in the tree using a key, which is a tuple:
<parent directory ID, offset within object, item type/uniqueness>, where
- parent ID is ID of parent object
- For files, the offset indicates the offset of the first byte stored in this item. For directories, it contains the first 4 bytes of the filename of the first file stored within the node
- The item type/uniqueness field indicates the type of the node:
- 0 - stat
- -1 - direct
- -2 - indirect
- 500 - directory + unique number for files matching in first 4 bytes
Each key structure also contains a unique item number, basically the inode number. But, this isn't used to determine ordering. Instead, the tree sorts keys by comparing the tuple fields in order of position. This orders the items in the tree in a way that keeps files within the same directory together, and then sorts these by file or directory name.
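Here is a rough sketch of how sorting on the key tuple keeps a directory's items together. The struct layout and comparison are simplified for illustration and are not ReiserFS's exact on-disk key format.

```c
/* Sketch: sort items by (parent directory ID, offset, type), so all items that
 * belong to the same directory cluster together in the tree.
 * Simplified layout -- not the real ReiserFS on-disk key. */
#include <stdio.h>
#include <stdlib.h>

struct key {
    unsigned long dir_id;    /* parent directory's object ID */
    unsigned long offset;    /* byte offset for files, name prefix for dir entries */
    int type;                /* 0 = stat, -1 = direct, -2 = indirect, 500 = directory */
};

static int cmp_key(const void *pa, const void *pb) {
    const struct key *a = pa, *b = pb;
    if (a->dir_id != b->dir_id) return a->dir_id < b->dir_id ? -1 : 1;
    if (a->offset != b->offset) return a->offset < b->offset ? -1 : 1;
    return (a->type > b->type) - (a->type < b->type);
}

int main(void) {
    struct key items[] = {
        { 7, 0,    0 },     /* stat item of a file in directory 7 */
        { 3, 4096, -2 },    /* indirect item of a file in directory 3 */
        { 7, 8192, -1 },    /* direct item (tail) of a file in directory 7 */
        { 3, 0,    0 },     /* stat item of a file in directory 3 */
    };
    int n = sizeof(items) / sizeof(items[0]);
    qsort(items, n, sizeof(struct key), cmp_key);
    for (int i = 0; i < n; i++)
        printf("dir %lu  offset %5lu  type %d\n",
               items[i].dir_id, items[i].offset, items[i].type);
    return 0;
}
```

After the sort, everything belonging to directory 3 comes out before everything belonging to directory 7, which is exactly the clustering the text describes.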
The leaf nodes are data nodes. Unformatted nodes contain whole blocks of data. "Formatted" nodes hold the tails of files. They are formatted to allow more than one tail to be stored within the same block. Since the tree is balanced, the path to any of these data nodes is the same length.
A file is composed of a set of indirect items and at most 2 direct items for the tail. (Why not always one? If a tail is smaller than an unformatted node, but larger than what fits in a formatted node, it needs to be broken apart and placed into two direct items.)
SGI's XFS
In many ways SGI's XFS is similar to ReiserFS. But, it is in many ways more sophisticated. It may be the most sophisticated among the systems we'll consider. This being said, unlike ReiserFS, XFS uses B+ trees instead of B* trees.
The extent-based allocator is rather sophisticated. In particular, it has three pretty cool features. First, it allows for delayed allocation. Basically, this allows the system to build a virtual extent in RAM and then allocate it in one piece at the end. This mitigates the "and one more thing" syndrome that can lead to a bunch of small extents instead of one big one. It also allows for the preallocation of an extent. This allows the system to reserve an extent that is big enough in advance so that the right-sized extent can be used -- without consuming memory for delayed allocation or running the risk of running out of space later on. The system also allows for the coalescing of extents as they are freed to reduce fragmentation.
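A toy illustration of the delayed-allocation idea (the write sizes and policy are invented, and this is not XFS's actual allocator): buffering several small appends in memory and allocating once yields a single extent, where eager allocation would have produced one small extent per write.

```c
/* Sketch: delayed allocation. Buffer appends in RAM and allocate one extent
 * at flush time, instead of allocating a small extent on every write. */
#include <stdio.h>

int main(void) {
    const long appends[] = { 512, 2048, 4096, 1024, 8192 };   /* bytes per write call */
    const int n = sizeof(appends) / sizeof(appends[0]);

    /* Eager allocation: one extent per write. */
    int eager_extents = n;

    /* Delayed allocation: accumulate the pending bytes, allocate once at flush. */
    long pending = 0;
    for (int i = 0; i < n; i++)
        pending += appends[i];
    int delayed_extents = 1;       /* a single extent covering 'pending' bytes */

    printf("eager:   %d extents\n", eager_extents);
    printf("delayed: %d extent of %ld bytes\n", delayed_extents, pending);
    return 0;
}
```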
The file system is organized into different partitions called allocation groups (AGs). Each allocation group has its own data structures -- for practical purposes, they are separate instances of the same file system class. This helps to keep the data structures at a manageable scale. It also allows for parallel activity on multiple AGs, without concurrency control mechanisms creating hot spots.
Inodes are created dynamically in chunks of 64 inodes. Each inode is numbered using a tuple that includes both the chunk number and the inode's index within its chunk. The location of an inode can be discovered by a lookup in a B+ tree by chunk number. The B+ tree also contains a bitmap showing which inodes within each chunk are used.
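A small sketch of the chunk-plus-index numbering just described (the chunk size of 64 comes from the text; the chunk records, block numbers, and bitmap here are simplified stand-ins for XFS's actual B+ tree):

```c
/* Sketch: locating a dynamically allocated inode numbered as (chunk, slot).
 * A real implementation would look the chunk up in a B+ tree; a small array
 * of chunk records stands in for that here. */
#include <stdio.h>
#include <stdint.h>

#define INODES_PER_CHUNK 64

struct chunk {
    long start_block;     /* where this chunk of 64 inodes lives on disk */
    uint64_t used;        /* bitmap: which of the 64 inodes are allocated */
};

int main(void) {
    /* Two chunks stand in for what a B+ tree, keyed by chunk number, would hold. */
    struct chunk chunks[] = {
        { 1000, 0xffffffffffffffffULL },  /* chunk 0: all 64 inodes in use */
        { 5000, 0x0000000000000007ULL },  /* chunk 1: only slots 0-2 in use */
    };

    long inode_number = 1 * INODES_PER_CHUNK + 2;   /* i.e., chunk 1, slot 2 */
    long chunk_no = inode_number / INODES_PER_CHUNK;
    int  slot     = inode_number % INODES_PER_CHUNK;

    struct chunk *c = &chunks[chunk_no];            /* the "B+ tree lookup" by chunk number */
    int in_use = (int)((c->used >> slot) & 1);

    printf("inode %ld: chunk %ld (starts at block %ld), slot %d, %s\n",
           inode_number, chunk_no, c->start_block, slot,
           in_use ? "allocated" : "free");
    return 0;
}
```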
Free space is managed using two different B+ trees of extents. One B+ tree is organized by size, whereas the other is organized by location. This allows for efficient allocation -- both by size and by locality.
Directories are also stored in a B+ tree. Instead of storing the name, itself, in the tree, a hash of the name is stored. This is done because it is more complicated to organize a B+ tree to work with names of different sizes. But, regardless of the size of the name, it will hash to a key of the same size.
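A minimal sketch of hashing variable-length names into fixed-size B+ tree keys. The hash below is a generic string hash (djb2), used only as a stand-in -- it is not XFS's actual directory hash.

```c
/* Sketch: hash variable-length file names into fixed-size keys, so a B+ tree
 * never has to compare variable-length strings. djb2 is a stand-in for
 * whatever hash the real file system uses. */
#include <stdio.h>

static unsigned long hash_name(const char *name) {
    unsigned long h = 5381;
    for (; *name; name++)
        h = h * 33 + (unsigned char)*name;   /* djb2 */
    return h;
}

int main(void) {
    const char *names[] = { "a", "notes.txt", "a-very-long-file-name.tar.gz" };
    for (int i = 0; i < 3; i++)
        printf("%-32s -> key %lu\n", names[i], hash_name(names[i]));
    return 0;
}
```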
Each file within this tree contains its own storage map (inode). Initially, each inode stores block offsets and extent sizes measured in blocks. When the file grows and overflows the inode, the storage allocation is stored in a tree rooted at the inode. This tree is indexed by the offset of the extent and stores the size of the extent. In this way, the directory structure is really a tree of inodes, which in turn are trees of the file's actual storage.
Much like ReiserFS, XFS logs only metadata changes, not changes to the file's actual data. In the event of a crash, it replays these logs to obtain consistent metadata. XFS also includes a repair program, similar to fsck, that is capable of fixing other types of corruption. This repair tool was not in the first release of XFS, but was demanded by customers and added later. Logging can be done to a separate device to prevent the log from becoming a hot-spot in high-throughput applications. Normally asynchronous logging is used, but synchronous is possible (albeit expensive).
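Looking back at the per-file storage map described at the start of the previous paragraph, here is a sketch of mapping a file offset through an extent map indexed by offset. The extents are invented; a real inode would root a B+ tree rather than a small array once the map overflows.

```c
/* Sketch: map a block offset within a file to a disk block, using a list of
 * extents sorted by file offset. A B+ tree rooted at the inode would replace
 * the array once the map grows large. */
#include <stdio.h>

struct extent {
    long file_offset;   /* starting offset within the file, in blocks */
    long disk_block;    /* starting block on disk */
    long length;        /* length in blocks */
};

/* Find the disk block that holds 'block' (a block offset within the file). */
static long lookup(const struct extent *map, int n, long block) {
    for (int i = 0; i < n; i++)
        if (block >= map[i].file_offset && block < map[i].file_offset + map[i].length)
            return map[i].disk_block + (block - map[i].file_offset);
    return -1;  /* hole or past EOF */
}

int main(void) {
    struct extent map[] = { {0, 800, 4}, {4, 2000, 16}, {20, 9000, 8} };
    long blk = lookup(map, 3, 10);   /* file block 10 falls in the second extent */
    printf("file block 10 -> disk block %ld\n", blk);
    return 0;
}
```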
XFS offers a variable block size ranging from 512 bytes to 64KB and an extent-based allocator. The maximum file size is 9 thousand petabytes. The maximum file system size is 18 thousand petabytes.
IBM's JFS
IBM's JFS isn't one of the best performers among this class of file system. But, that is probably because it was one of the first. What to say? Things get better over time -- and I think everyone benefitted from IBM's experience here.
File system partitions correspond to what are known in DFS as aggregates. Within each partition lives an allocation group, similar to that of XFS. Within each allocation group are one or more filesets. A fileset is nothing more than a mountable tree. JFS supports extents within each allocation group.
Much like XFS, JFS uses a B+ tree to store directories. And, again, it also uses a B+ tree to track allocations within a file. Unlike XFS, the B+ tree is used to track even small allocations. The only exception is an optimization that allows symlinks to live directly in the inode.
Free space is represented as an array with 1 bit per block. This bit array can be viewed as an array of 32-bit words. These words then form a binary tree sorted by size. This makes it easy to find a contiguous chunk of space of the right size, without a linear search of the available blocks. The same array is also indexed by another tree as a "binary buddy". This allows for easy coalescing and easy tracking of the allocated size.
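A rough sketch of the binary-buddy coalescing idea (a generic buddy computation over a tiny region, not JFS's actual on-disk trees): when a chunk is freed, its buddy's address is found by flipping one bit, and if the buddy is also free the two merge into a chunk twice the size.

```c
/* Sketch: binary-buddy coalescing over a tiny region of 16 blocks.
 * free_order[b] >= 0 means a free chunk of 2^order blocks starts at block b.
 * Generic illustration only -- not JFS's actual tree layout. */
#include <stdio.h>

#define NBLOCKS 16
static int free_order[NBLOCKS];   /* -1 = not the start of a free chunk */

static void free_chunk(int block, int order) {
    /* Keep merging with the buddy while the buddy is free and the same size. */
    while (order < 4) {                          /* 2^4 = the whole 16-block region */
        int buddy = block ^ (1 << order);        /* flip one bit to find the buddy */
        if (free_order[buddy] != order)
            break;                               /* buddy not free at this size */
        free_order[buddy] = -1;                  /* absorb the buddy ... */
        if (buddy < block) block = buddy;        /* ... merged chunk starts at the lower one */
        order++;
    }
    free_order[block] = order;
}

int main(void) {
    for (int i = 0; i < NBLOCKS; i++) free_order[i] = -1;

    free_chunk(4, 1);   /* free blocks 4-5 */
    free_chunk(6, 1);   /* free blocks 6-7: buddies merge into 4-7 (order 2) */
    free_chunk(0, 2);   /* free blocks 0-3: merges with 4-7 into 0-7 (order 3) */

    for (int i = 0; i < NBLOCKS; i++)
        if (free_order[i] >= 0)
            printf("free chunk: blocks %d-%d (order %d)\n",
                   i, i + (1 << free_order[i]) - 1, free_order[i]);
    return 0;
}
```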
These trees actually have a somewhat complicated structure. We won't spend the time here to cover it in detail. This really was one of the "original attempts" and not very efficient. I can provide you with some references, if you'd like more detail.
As for sideline statistics, the block size can be 512B, 1KB, 2KB, or 4KB. The maximum file size ranges from 512TB with a 512-byte block size to 4 petabytes with a 4KB block size. Similarly, the maximum file system size ranges from 4 petabytes with a 512-byte block size to 32 petabytes with a 4KB block size.
Ext3
Ext3 isn't really a new file system. It is basically a journaling layer on top of Ext2, the "standard" Linux file system. It is both forward and backward compatible with Ext2. One can actually mount any ext2 file system as ext3, or mount any ext3 filesystem as ext2. This filesystem is particularly noteworthy because it is backed by Red Hat and is their "official" file system of choice.
Basically Red Hat wanted to have a path into journaling file systems for their customers, but also wanted as little transitional headache and risk as possible. Ext3 offers all of this. There is no need, in any real sense, to convert an existing ext2 file system to it -- really ext3 just needs to be enabled. Furthermore, the unhappy customer can always go back to ext2. And, in a pinch, the file system can always be mounted as ext2 and the old fsck remains perfectly effective.
The journaling layer of ext3 is really separate from the filesystem layer. There are only two differences between ext2 and ext3. The first, which really isn't a change to ext2-proper, is that ext3 has a "logging layer" to log the file system changes. The second change is the addition in ext3 of a communication interface from the file system to the logging layer. Additionally, one ext2 inode is used for the log file, but this really doesn't matter from a compatibility point of view -- unless the ext2 file system is (or otherwise would be) completely full.
Three types of things are logged by ext3. These must be logged atomically (all or none).
- Metadata - The whole block of updated metadata (even if only small part of block changed). This is basically a shadow copy of the updated block.
- Descriptor Blocks - These tell where each metadata block should be copied on recovery. These are written before the metadata blocks. Remember, the metadata blocks are just unformatted blocks of data -- the descriptor blocks are necessary to tell us which are which.
- Header Blocks - These describe the log file, itself. In particular, they record the head and tail of the journal file, as well as the current sequence number -- they tell us where the current head and tail of the log are for updates.
Periodically, the in-memory log is check-pointed by writing outstanding entries to the journal, and the journal is committed periodically to disk. The level of journaling is a mount option. Basically, writes to the log file are cached, like any other writes. The classic performance versus recency trade-off involves how often we sync the log to disk.
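Here is a toy sketch of the descriptor / metadata / commit sequence and recovery-time replay described above. The block numbers, record layout, and block contents are invented; real ext3/JBD records are more involved.

```c
/* Sketch: write-ahead journaling of whole metadata blocks.
 * A transaction is: descriptor info (home addresses) + copies of the updated
 * metadata blocks + a commit record. At recovery, only committed transactions
 * are replayed to their home locations. Invented layout, not the real JBD format. */
#include <stdio.h>
#include <string.h>

#define BLKSZ     16
#define NBLOCKS   8
#define MAXLOGGED 4

struct txn {
    int  nblocks;
    int  home[MAXLOGGED];            /* where each logged block belongs on disk */
    char copy[MAXLOGGED][BLKSZ];     /* shadow copies of the updated blocks */
    int  committed;                  /* commit record present? */
};

static char disk[NBLOCKS][BLKSZ];    /* the "real" file system blocks */

static void replay(const struct txn *jlog, int ntxn) {
    for (int t = 0; t < ntxn; t++) {
        if (!jlog[t].committed)
            continue;                          /* never committed: ignore it */
        for (int i = 0; i < jlog[t].nblocks; i++)
            memcpy(disk[jlog[t].home[i]], jlog[t].copy[i], BLKSZ);
    }
}

int main(void) {
    struct txn jlog[2];
    memset(jlog, 0, sizeof jlog);
    strcpy(disk[2], "old inode");
    strcpy(disk[5], "old bitmap");

    /* Transaction 0: fully written and committed before the crash. */
    jlog[0].nblocks = 2;
    jlog[0].home[0] = 2; strcpy(jlog[0].copy[0], "new inode");
    jlog[0].home[1] = 5; strcpy(jlog[0].copy[1], "new bitmap");
    jlog[0].committed = 1;

    /* Transaction 1: the crash hit before its commit record reached disk. */
    jlog[1].nblocks = 1;
    jlog[1].home[0] = 2; strcpy(jlog[1].copy[0], "half-done");
    jlog[1].committed = 0;

    replay(jlog, 2);   /* what recovery does at the next mount */
    printf("block 2: %s\nblock 5: %s\n", disk[2], disk[5]);
    return 0;
}
```

Because whole metadata blocks are logged before being committed, recovery only has to copy committed shadow blocks back to their home locations -- it never has to scan the whole partition the way fsck does.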
As for the sideline stats, the block size is variable between 1KB and 4KB. The maximum file size is 2GB and the maximum filesystem size is 4TB.
As you can see, this is nothing more than a version of ext2, which supports a journaling/logging layer that provides for a faster, and optionally more thorough, recovery mode. I think Red Hat made the wrong choice. My bet is that people want more than compatibility - more than Ext3 offers. Instead, I think that the ultimate winner will be the new version of ReiserFS or XFS. Or, perhaps, something new -- but not this.