Lecture 7 (February 3, 2011)

Why Study the File Allocation Table (FAT) System

When it comes to organizing a general purpose file system, the system used by DOS in the 80's and 90's is, in many ways, the "old standby". We study it for its simplicity, for its historical significance, and for its tendency to be deployed on emerging, low-capacity media devices. These days, we'll often find it on USB flash drives and storage cards.

Why does it continue to be reborn? As we'll see, although inefficient for large volumes, it is very simple to implement. And, because it has been around for years, it is well understood, widely accepted, and accessible within almost all operating environments.

Data Structures Composing FAT File Systems

There are three important data structures within a FAT file system: the boot sector, the File Allocation Table (FAT), and the directory entry. The boot sector contains the metadata that describes the particulars of the file system, as well as, in the case of a bootable partition, the code that gets the boot-strapping started. The FATs keep track of which sectors of the disk are associated with each file. And, the directory entries maintain the directory tree structure, by organizing each directory as a list of directory entries, each of which contains the name of a file or directory, a pointer into the FATs, and some metadata about the file, such as when it was created.

The File Allocation Tables (FATs)

As we've discussed, the fundamental unit of storage on a disk is the sector. But, there are many, many sectors. Keeping track of them individually would impose a hefty cost in terms of both time and storage. As a result, most file systems simplify the problem by grouping adjacent sectors together and managing these groups. FAT-based systems call these groups clusters.

Because files can grow at any time, even after other files have been created, they can be composed of non-adjacent clusters. The result is that logically adjacent bytes might not be adjacent on the physical media. And, because files can be deleted, and compaction is an expensive operation that isn't automatic, the set of physical clusters in use might itself become fragmented, resulting in fragmentation of the free clusters as well. The upshot is that when moving linearly through a single file, one can be bouncing around the physical media, paying a penalty for each seek.

In general, the larger the cluster size, the lower the storage efficiency of the file system. This is because we cannot create a file, or a file tail, that is smaller than a cluster. So, the space left over within the cluster, what we call slack space, is essentially wasted space. For this reason, it is usually desirable to have the smallest cluster size possible, as this reduces this type of wastage.
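To make the wastage concrete, here is a minimal sketch of the arithmetic (slack_bytes is a hypothetical helper of my own; the sizes are just illustrative):

    #include <stdio.h>

    /* Slack space: the unused tail of a file's last cluster.  A file
       only fills its final cluster completely when its size is an
       exact multiple of the cluster size. */
    unsigned long slack_bytes(unsigned long file_size, unsigned long cluster_size)
    {
        unsigned long tail = file_size % cluster_size;
        return (tail == 0) ? 0 : cluster_size - tail;
    }

    int main(void)
    {
        /* A 100-byte file in a file system with 4096-byte clusters
           wastes 3996 bytes. */
        printf("%lu bytes of slack\n", slack_bytes(100, 4096));
        return 0;
    }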

Having said this, as we'll see as we learn more about the FAT table, seeking through the clusters of a file is a linear search. So, if one knows that the file system is going to be used to store mostly large files, e.g. long videos, one might intentionally choose to use fewer, larger clusters.

FAT-based systems maintain the list of clusters associated with each file through the use of a redundant FAT table. The FAT table has one entry for each and every cluster on the media. The number of entries within the FAT table depends on the version of the FAT file system in use. In the days when media was small, FAT tables were indexed with 12-bit numbers. But, as media grew, the FAT tables did, too. FAT16 replaced FAT12. And, eventually FAT32 replaced FAT16. Interestingly enough, FAT32 systems only use 28-bit indexes.

Since the number of entries in the FAT table is fixed by the version of the file system in use, in general, the cluster size varies with the capacity of the media. Basically, the cluster size is equal to the number of sectors on the media, divided by the number of FAT table entries (rounded up). This provides the smallest cluster size possible, reducing the wastage. But, as mentioned earlier, there are occasions where we might want to have fewer, larger clusters. And, we can accommodate this by configuring the file system such that the FAT table uses fewer than the maximum number of entries.
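As a sketch of that calculation (a simplification; real formatters also round the sectors-per-cluster value up to a power of two):

    /* With a fixed number of FAT entries, the cluster size must
       scale with the capacity of the media. */
    unsigned long sectors_per_cluster(unsigned long total_sectors,
                                      unsigned long fat_entries)
    {
        return (total_sectors + fat_entries - 1) / fat_entries; /* round up */
    }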

There are a couple of special entries at the beginning of the FAT table. But, other than these, each entry is just a pointer to another entry, allowing them to be configured into lists. The smallest possible list contains no FAT entries. The largest possible list contains each and every entry. And, multiple lists can be constructed, by having several chains. As you might expect, there is a sentinel value that is used to mark the end of a chain. The "head pointers" are stored within the directory entries. In other words, when we look up a file and find its directory entry, it gives us the first cluster of the file. From there, we can consult that cluster's entry in the FAT to find the second cluster, and so on.
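A minimal sketch of walking one of these chains, assuming a FAT16 table already loaded into memory as an array (walk_chain, fat, and first_cluster are names of my own, not part of any real API):

    #include <stdint.h>
    #include <stdio.h>

    /* Follow a file's cluster chain, starting from the "head pointer"
       found in its directory entry.  Valid FAT16 data clusters run
       from 2 through 0xFFEF; anything outside that range is a sentinel. */
    void walk_chain(const uint16_t *fat, uint16_t first_cluster)
    {
        uint16_t cluster = first_cluster;
        while (cluster >= 0x0002 && cluster <= 0xFFEF) {
            printf("cluster %u\n", cluster);
            cluster = fat[cluster];   /* this entry names the next cluster */
        }
        /* The loop ends on the end-of-chain sentinel (0xFFF8-0xFFFF);
           landing on 0x0000 (free) or 0xFFF7 (bad) would indicate a
           corrupted chain. */
    }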

In addition to a sentinel value for the last cluster in a file, there are special sentinel values that indicate empty clusters (not part of any file's list) and bad clusters. A cluster is marked bad if the underlying device reports that a sector within it is bad. So, if a hard drive is able to relocate a bad sector using a spare sector, this will be transparent to the FAT file system and will not result in the cluster being marked bad. Clusters get marked bad when the underlying device (a) lacks the capacity to hide bad sectors, or (b) has exhausted that capacity, e.g. has already consumed all of its spare sectors.

FAT12 is a little odd in that 12 bits is not a whole number of bytes, so entries span byte boundaries: any group of three bytes (24 bits) contains two 12-bit entries. Be careful in reading documentation here. Various folks report bizarre things about how the bytes split into the two entries, because they are ignoring endianness. If we have the bytes ab cd ef and read them naively, we might expect the two entries to be abc and def. But, the values are stored little-endian. So, the first entry is 0xdab (the low nibble of the second byte, followed by the first byte), and the second entry is 0xefc (the third byte, followed by the high nibble of the second byte).
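Here is a sketch of the unpacking, which reproduces the dab/efc example above:

    #include <stdint.h>

    /* Extract 12-bit entry n from a raw FAT12 table.  Entry n begins
       at byte offset floor(n * 1.5). */
    uint16_t fat12_entry(const uint8_t *fat, unsigned n)
    {
        unsigned i = n + n / 2;                 /* floor(n * 1.5) */
        if (n & 1)  /* odd entries: high nibble of byte i, then byte i+1 */
            return (fat[i] >> 4) | ((uint16_t)fat[i + 1] << 4);
        else        /* even entries: byte i, then low nibble of byte i+1 */
            return fat[i] | (((uint16_t)fat[i + 1] & 0x0F) << 8);
    }

    /* Given the bytes 0xab 0xcd 0xef: entry 0 is 0xdab, entry 1 is 0xefc. */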

With respect to the sentinel values, all-0 means free, e.g. 0000. The end of a chain is all Fs on Microsoft-based systems, but any value from xff8 through xfff actually marks the end of a chain; Linux has usually used xff8. xff7 marks a bad cluster. And, xff0-xff6 are not used. What is meant by "x"? All of the leading bits are 1s (hex Fs), but the number of them depends on whether it is a 12-bit, 16-bit, or 32-bit FAT entry.
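Putting the sentinel values together, a FAT16-flavored classifier might look like this sketch (FAT12 and FAT32 are analogous, with the appropriate number of leading Fs):

    #include <stdint.h>

    /* Classify a FAT16 entry value. */
    const char *classify_entry(uint16_t entry)
    {
        if (entry == 0x0000) return "free";
        if (entry == 0xFFF7) return "bad cluster";
        if (entry >= 0xFFF8) return "end of chain";
        if (entry >= 0xFFF0) return "reserved (not used)";
        return "in use: points to the next cluster";
    }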

Because, at the least, the area covered by the first two FAT entries is the "reserved area", including the boot sector, these entries are not needed to represent data storage. The first byte of the first entry stores a redundant copy of the "media descriptor", which is also contained within the boot sector, and described there. The rest of the first entry is all 1 bits, e.g. hex Fs. The second entry stores an end-of-chain marker, but its high-order bits may be used for administrative flags in FAT16 and FAT32 systems, from the high-order bit down, as follows: the topmost bit indicates that the volume was cleanly unmounted, and the next bit indicates that no hard I/O errors were encountered.

In order to determine whether a system is FAT12, FAT16, or FAT32, first look at the number of clusters reported in the boot sector. If it is less than 4085, it is FAT12. If it is 65525 or more, it can't be FAT16, so it is FAT32. Otherwise, it is FAT16.
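The rule is short enough to write down directly, using the thresholds above:

    /* Determine the FAT variant from the cluster count reported in
       the boot sector. */
    const char *fat_type(unsigned long cluster_count)
    {
        if (cluster_count < 4085)  return "FAT12";
        if (cluster_count < 65525) return "FAT16";
        return "FAT32";
    }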

The FAT tables are stored next to each other right after the boot area. Unfortunately, this means that, although redundant, they are often damaged together. Additionally, problems in the first table are often copied into the second as various disk utilities attempt to "repair" systems by making the two copies consistent.

Directory Entries

A directory is essentially a structured file that contains a list of files and some information about them. Specifically, a basic directory file is composed of fixed-size, 32-byte entries, laid out byte-by-byte as sketched below.
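Rendered as a C struct (the field names are my own; the offsets follow the standard 32-byte format):

    #include <stdint.h>

    #pragma pack(push, 1)            /* no padding: offsets must match the disk */
    struct fat_dirent {
        uint8_t  name[8];            /* file name, space padded               */
        uint8_t  ext[3];             /* extension, space padded               */
        uint8_t  attributes;         /* read-only, hidden, system, label, ... */
        uint8_t  reserved;
        uint8_t  create_time_tenths; /* creation time, tenths of a second     */
        uint16_t create_time;
        uint16_t create_date;
        uint16_t access_date;        /* last access                           */
        uint16_t first_cluster_hi;   /* high 16 bits of first cluster (FAT32) */
        uint16_t modify_time;
        uint16_t modify_date;
        uint16_t first_cluster_lo;   /* the "head pointer" into the FAT       */
        uint32_t file_size;          /* in bytes                              */
    };
    #pragma pack(pop)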

On pre-FAT32 systems, the root directory is a fixed-size area that comes immediately after the boot area and the FATs. On FAT32 systems, the root directory is treated as a file and, by default, begins at cluster 2.

Virtual FAT (VFAT)

VFAT is a Windows 95 hack to provide long file names, while preserving backward compatibility with older FAT systems. It does this by (a) making use of some of the previously undefined bits in a directory entry, specifically bytes 12-21, (b) giving long file names short nicknames for backward compatibility, and (c) hiding the full long file name in a series of directory entries with a new structure that won't be reported in directory listings on older systems, because the entries appear to be hidden to those systems. Additionally, long file names can include spaces, upper and lower case, and some other characters previously disallowed.

A long file's nickname is constructed by taking the first 6 characters of its name, appending a ~, and then appending a sequence number to distinguish between multiple long file names with the same short prefix. The extension is retained, but truncated to three characters, if necessary. All characters are made upper case. Any character previously disallowed is replaced with an _. So, we end up with short file names like "HI_THE~1.TXT".
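A simplified sketch of the construction (make_nickname is a hypothetical helper; real implementations have additional character rules and probe the directory to pick a sequence number that is actually unique):

    #include <ctype.h>
    #include <stdio.h>

    /* Build an 8.3 nickname like "HI_THE~1.TXT" from a long name.
       'out' must hold at least 13 bytes. */
    void make_nickname(const char *base, const char *ext, int seq, char *out)
    {
        int n = 0;
        for (int i = 0; base[i] != '\0' && n < 6; i++) {
            unsigned char c = toupper((unsigned char)base[i]);
            out[n++] = isalnum(c) ? c : '_';  /* replace disallowed characters */
        }
        n += sprintf(out + n, "~%d", seq);    /* disambiguating suffix */
        if (ext != NULL && ext[0] != '\0') {
            out[n++] = '.';
            for (int i = 0; ext[i] != '\0' && i < 3; i++)  /* truncate to three */
                out[n++] = toupper((unsigned char)ext[i]);
        }
        out[n] = '\0';
    }

Calling make_nickname("hi there", "txt", 1, buf) yields "HI_THE~1.TXT", matching the example above.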

The long file name, up to 255 characters, is encoded in a series of adjacent VFAT directory entries. These entries have the following format, by byte:
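As a sketch (each piece is again 32 bytes, so it fits where an ordinary directory entry would go; the field names are my own):

    #include <stdint.h>

    #pragma pack(push, 1)
    struct vfat_lfn_entry {
        uint8_t  sequence;       /* piece number; bit 6 marks the last piece   */
        uint16_t name1[5];       /* characters 1-5 (UTF-16)                    */
        uint8_t  attributes;     /* always 0x0F: read-only|hidden|system|label */
        uint8_t  type;           /* always 0                                   */
        uint8_t  checksum;       /* checksum of the short (8.3) name           */
        uint16_t name2[6];       /* characters 6-11                            */
        uint16_t first_cluster;  /* always 0                                   */
        uint16_t name3[2];       /* characters 12-13                           */
    };
    #pragma pack(pop)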

Notice that each entry can encode 13 characters. The sequence number indicates which substring this is of the full string. The last substring has bit 6 of the sequence number set. These entries are stored before the short entry, tail first.

If a VFAT file system is mounted under an operating system that supports FAT, but not VFAT, these special entries are ignored, as they appear to be empty hidden files. But, some utilities, designed to do things like reorder directory entries, could separate them so that they are no longer adjacent to their short entries. This could really screw things up. As a result, a checksum computed from the short file name is stored within each of these entries. If the checksums don't match, the long-name entries are invalid.
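The checksum is computed over the eleven bytes of the stored short name (eight name bytes plus three extension bytes, space padded, with no dot). A sketch of the standard algorithm:

    #include <stdint.h>

    /* Rotate-right-and-add checksum of the 8.3 short name, as stored
       in each long-name entry. */
    uint8_t lfn_checksum(const uint8_t shortname[11])
    {
        uint8_t sum = 0;
        for (int i = 0; i < 11; i++)
            sum = (uint8_t)(((sum & 1) << 7) + (sum >> 1) + shortname[i]);
        return sum;
    }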

If the first byte of a file name is 0xe5, it means that the file has been deleted. It can still be recovered, if its clusters have not been reused. If a file name actually does start with 0xe5, that byte is recorded as 0x05. Yes, it seems to me, too, like we could have simplified this!

The Boot Sector

The boot sector contains a bunch of metadata about the file system, as well as the code that gets the OS bootstrapping started. Every bootable file system has some type of boot sector, which contains a jump to the bootstrap code at the very beginning, followed by file system specific information. For reference, the roughly analogous structure in UNIX file systems is called the superblock.

The format of the FAT12 boot sector, again byte-by-byte, is sketched below.
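As a C struct (the field names are my own; the widths follow the original DOS-era format, and later versions widened the hidden-sectors field to 32 bits):

    #include <stdint.h>

    #pragma pack(push, 1)
    struct fat12_boot_sector {
        uint8_t  jump[3];            /* jump over the data to the bootstrap code */
        uint8_t  oem_name[8];
        uint16_t bytes_per_sector;
        uint8_t  sectors_per_cluster;
        uint16_t reserved_sectors;   /* the "reserved area", incl. this sector  */
        uint8_t  fat_count;          /* usually 2: the redundant copies         */
        uint16_t root_dir_entries;
        uint16_t total_sectors;
        uint8_t  media_descriptor;   /* echoed in the first FAT entry           */
        uint16_t sectors_per_fat;
        uint16_t sectors_per_track;
        uint16_t head_count;
        uint16_t hidden_sectors;
        /* ... bootstrap code, then the 0x55 0xAA signature at the end ... */
    };
    #pragma pack(pop)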

You can look up the format of the FAT16 and FAT32 boot sectors -- they are much longer, but begin as above.

NTFS

NTFS is the file system introduced with Windows NT. It is very complex, and few people outside of its developers understand it in every detail. So, we aren't going to examine it in the detail that we did FAT. Instead, we are going to look at the parts that are of most interest to forensics folks:

The Master File Table (MFT)

The MFT is the closest thing that NTFS has to the FAT table. But, it is more like a database, or a Java Map, that maps from the name of a file to the attributes of the file. Some of the attributes are resident within the file's MFT entry. Others are larger, stored outside of the file's entry, and "pointed to" by the file's entry. Check out Wikipedia and other sources for the details. But, here is a quick rundown of some of the standard attributes: $STANDARD_INFORMATION (timestamps and basic flags), $FILE_NAME (the name, plus a reference to the parent directory), $DATA (the file's contents; there can be more than one, which is what makes alternate data streams possible), and, for directories, $INDEX_ROOT and $INDEX_ALLOCATION (the index of the directory's children).

One quick note is that NTFS is very smart about allocating nearby clusters to the same file, reducing fragmentation. Having said this, the MFT is one huge file, growing over time, which can constantly tack non-resident attributes onto old entries. As a result, it can, itself, suffer from heavy fragmentation.

The Log File

There isn't much to say about the log file, since its structure is not completely well-known. It is used to get the file system up and running quickly after restarts, especially to regain consistency after a crash. Its format is Microsoft's domain. I'm guessing they've never published it, so they can change it. The opaqueness of this structure is among the things that have made complete non-Microsoft implementations challenging.

The contents of this file can be of forensic significance, but that is rare. In the event you want to dig into it, there are tools that can give you a look at significant chunks.

The Change Journal ($UsnJrnl)

The change journal is of dramatic forensic value. It notes every time a file is changed, whether the change is to data or metadata. It doesn't contain any actual data, but it notes a lot of information: the name of the file, the timestamp, the security and other attributes, the size of the file, and whether the data, or an alternate data stream, was overwritten, extended, truncated, created, deleted, and so on.

Alternate Data Streams

Alternate Data Streams are, in some sense, files hidden within files. They exist because a single file can have multiple data attributes. These alternate data streams do not show up in directory listings, and their size is not reflected in the file size that those listings report.

Many tools have been coded such that they are not capable of interacting with alternate data streams. But, for those that can, one names a stream as follows: filename.ext:ads.ext. It is that easy. Try creating some alternate data streams using notepad from the command line, e.g. notepad.exe foo.txt:bar.txt. Ain't that cool? And, yeah, some people hide things here. And, yeah, you can execute programs stored within alternate data streams.

Why do they exist? Apple's HFS has the ability to associate things like icons with files. Microsoft wanted a generalized mechanism for doing the same thing. Ta-da!

Thinking Like A Forensics Analyst

Let's think about FAT systems. When files are deleted, they are just marked as deleted. If we replace the deletion marker (0xe5) in the first byte of the directory entry with another character, and no other file is using any of the file's old space, we can get the file back.

Even when data is overwritten, short files and file tails may leave earlier data in the slack space. This data may be of use to us. And, even unallocated clusters may have previously been part of files and may contain useful data.

The same is essentially true in NTFS, both because clusters are recycled in much the same lazy way and because the Change Journal keeps its own record of what happened to files.

The NTFS Change Journal is a treasure trove for forensics, because it gives us a ton of information about how and when the system and its files were used. But, it is a circular log, with a configurable maximum size. So, it will not necessarily reach all the way back to the beginning of the volume's history.

And, well, alternate data streams might be innocuous, but they might also hold maliciously hidden data, or malware, that is of interest in understanding the security posture of a system.

Because NTFS is very good about keeping allocations associated with the same file nearby, the location of unallocated clusters might well give us a clue that they were previously associated. FAT suffers from much more fragmentation, so this is less true there, but we may be able to make inferences, because we know the other allocations and that the FAT table does allocations via a sequential scan.

Warning to all Readers

These are unrefined notes. They are not published documents. They are not citable. They should not be relied upon for forensics practice. They do not define any legal process or strategy, standard of care, evidentiary standard, or process for conducting investigations or analysis. Instead, they are designed for, and serve, a single purpose, to help students to jog their memory of classroom discussions and assist them in thinking critically about the issues presented. The author is certainly not an attorney and is absolutely not giving any legal advice.