Thursday, July 12, 2014 (Lecture 20)

Redundant Arrays of Inexpensive/Independent Disks (RAIDs)

We'll shortly discuss the Lustre file system, which is essentially a distributed RAID. But, before we do that, I'd like to take a minute to discuss RAIDs for those who might be unfamiliar.

Disks up to a certain capacity are fairly cheap. But, beyond some point, they get dramatically more expensive really quickly. This has been true for decades. So, it is more cost-effective to build up capacity by using multiple drives, instead of by buying larger ones. And, in fact, it is possible to build up more capacity than is possible to purchase in a single drive -- for any amount of money.

And, as it turns out, the benefits go beyond capacity. We can get dramatically better performance, as well. Again, slower disks are far cheaper per byte than faster ones. And, beyond a certain point, regardless of the financial inducement, disks just don't get any faster. But, if we organize our system correctly, we can use multiple disks to provide not only a greater capacity -- but also more paths to our data.

If we divide a file into 4 stripes and fetch each stripe in parallel from a different disk, we can get the data roughly 4x faster than we could sequentially from a single disk. This is because each independent disk provides its own independent path to data. If we can keep them all working at the same time, at least much of the time, we get much more data moving than any single disk can accomplish.
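
As a rough sketch of the idea (in Python, and standing well clear of any real RAID implementation -- the stripe size and the in-memory "disks" below are invented for illustration), striping deals fixed-size chunks round-robin across the drives and re-interleaves them on the way back. In a real array the per-disk reads would be issued concurrently, one per spindle; that concurrency is where the speedup comes from.

    # Illustrative sketch of striping across 4 disks; "disks" are just lists
    # standing in for independent devices.

    STRIPE_SIZE = 4096   # bytes per stripe unit (chosen arbitrarily for illustration)
    NUM_DISKS = 4

    def stripe(data):
        """Deal fixed-size chunks of data round-robin across NUM_DISKS disks."""
        disks = [[] for _ in range(NUM_DISKS)]
        chunks = [data[i:i + STRIPE_SIZE] for i in range(0, len(data), STRIPE_SIZE)]
        for i, chunk in enumerate(chunks):
            disks[i % NUM_DISKS].append(chunk)
        return disks

    def read_striped(disks):
        """Re-interleave the per-disk chunks back into the original byte order."""
        out = []
        for i in range(max(len(d) for d in disks)):
            for d in disks:
                if i < len(d):
                    out.append(d[i])
        return b"".join(out)

    data = bytes(range(256)) * 100
    assert read_striped(stripe(data)) == data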

As a result, we can get not only more storage at a lower price -- but also more and better storage at a lower price. And, we can even get more and better storage than is possible -- at any price.

But, if the approach to accomplishing this is a naive divide-and-conquer, the probability of failure becomes dramatically higher. Big drives don't fail more frequently than small ones -- but each drive we add contributes its own likelihood of failure to the system. As a result, although a single large drive might be very reliable, a system composed of many smaller ones can be undependable.

As a consequence, as we use additional drives to increase our storage capability, we need to add extra drives with some type of redundancy. This redundancy can allow us to tolerate some failure, depending on the configuration, in a fail-safe or fail-soft way. In some cases, data is mirrored; in other cases, parity or checksum disks are used, etc. But, the common theme is that by combining some type of redundancy or error-correction code with some level of parallelism, we can have our data and eat it, too.
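
To make the error-correction idea concrete, here is a minimal sketch of XOR parity in Python, the arithmetic behind RAID 4/5-style redundancy. Real arrays do this per block in the controller or driver; take this only as the math.

    # Minimal sketch of XOR parity: one parity block protects several data stripes.

    def xor_blocks(blocks):
        """XOR equal-length byte blocks together, byte by byte."""
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                result[i] ^= b
        return bytes(result)

    stripes = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # data on four disks
    parity = xor_blocks(stripes)                     # stored on a fifth disk

    # Suppose disk 2 fails; rebuild its stripe from the survivors plus parity.
    survivors = [s for i, s in enumerate(stripes) if i != 2]
    rebuilt = xor_blocks(survivors + [parity])
    assert rebuilt == stripes[2]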

The real-world practical application for this in storage was first observed in the late 1980s by a research team at Berkeley, which included Garth Gibson, now one of our faculty members.

Lustre: A File System for Cluster Computing

The name Lustre is a contraction of "Linux" and "cluster". It is a scalable cluster file system designed for use by any organization that needs a large, scalable general-purpose file system. It is presently a product of Sun Microsystems, but it has a Carnegie Mellon tie. It was developed by Peter Braam, who at the time was a faculty member here. He subsequently commercialized his work through the formation of a company, Cluster File Systems, Inc., which was more recently acquired by Sun Microsystems for this technology.

It is basically a network-based RAID. Imagine a system where each node of the cluster file system provides a huge chunk of storage. Now, imagine managing these chunks of storage like a RAID. Files are broken into objects, very similar to stripes. These objects can be stored by different nodes. The result is that we get the same type of scalability in capacity that we saw with RAIDs -- but at a scale an order of magnitude larger. Each node might, itself, behind the scenes, be connected to a large RAID, or even a storage area network (SAN). And, as with RAIDs, we see the same performance improvement correlated with the number of stripes in flight at a time.

Under the hood, when a client opens a file using a standard POSIX function, e.g. open(), a request is sent to a metadata server. This server responds by giving the client the metadata about the file -- including the mapping of the file to objects on various nodes. For those who happen to be familiar with the internals of a traditional UNIX file system, this mapping essentially replaces the block-to-storage mapping present in an inode.
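
As a rough illustration (this is not Lustre's actual protocol or layout format; the field names, stripe size, and node names below are invented), the stripe map handed back by the metadata server lets the client compute, for any byte offset, which object on which storage node holds it:

    # Hypothetical file-to-object stripe map, in the spirit of the layout a
    # metadata server might return on open().

    STRIPE_SIZE = 1 << 20   # 1 MB per stripe unit (assumed)

    layout = {
        "path": "/scratch/results.dat",
        # (object_id, storage_node) pairs, striped round-robin
        "objects": [(101, "ost3"), (102, "ost7"), (103, "ost1"), (104, "ost5")],
    }

    def locate(layout, offset):
        """Map a byte offset in the file to (storage_node, object_id, offset_in_object)."""
        stripe_index = offset // STRIPE_SIZE
        obj_id, node = layout["objects"][stripe_index % len(layout["objects"])]
        round_number = stripe_index // len(layout["objects"])
        return node, obj_id, round_number * STRIPE_SIZE + offset % STRIPE_SIZE

    print(locate(layout, 5 * (1 << 20) + 12345))   # -> ('ost7', 102, 1060921)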

Robustness isn't intrinsically a concern, because each node is internally robust, with its own storage based on a RAID or SAN, and its own back-up system. Lustre does, as you might imagine, provide tools to facilitate backup, repair, and recovery of the metadata server. Since even the temporary loss of this server can disable the entire system, Lustre supports a standby server, which is essentially an always-available mirror.

Beyond a UNIX-like File System Interface

The file system interface, the way a file system interacts with an application program, has long been dominated by the early UNIX file systems. These early systems are the basis for the modern day POSIX standard, as well as the landscape in which many non-POSIX-compliant systems, such as AFS, were developed.

It is easy to think about file systems only in terms of random-access reads and writes and user/group/world permissions, with the prevailing cases being that reads dominate writes, certain reads are dramatically more common than others, and many operations are sequential reads, such as loading or processing all of the data, or sequential writes, such as logging events over time.

But, many distributed systems are used to tackle very different problems, with very different modalities. If we realize the differences between these applications and the cases to which we are accustomed, we can often make dramatically different design decisions and obtain better performance.

MogileFS and HDFS

Today, we are going to take a really quick, somewhat shallow look at two of these file systems: MogileFS and HDFS. We'll take a more detailed look at HDFS a little later, when we discuss Hadoop.

These two file systems have (at least) two important things in common. They are both implemented at the user-level, rather than in the kernel, and they both support semantics that are quite different from, and in some important ways more limited than, POSIX's. What they gain is performance for the special class of applications they are intended to support. In the case of MogileFS, it was designed to support a file hosting site, e.g. a photo hosting site. In the case of HDFS, it was designed to support a paradigm of distributed computing known as the Map-Reduce model. In-place edits, e.g. random access writes, are not important in either model. And, additionally, HDFS benefits from location-awareness -- exactly the opposite of most applications, and the POSIX interface, which are based on location transparency.

Both applications are implemented at the user-level, rather than in-kernel. In other words, they are written as application-level programs that make use of the existing POSIX-compliant file system supported by each local operating system. This is done because it gives them much more flexibility to support a variety of underlying hardware.

Real-world distributed systems are often heterogeneous. It is impossible to keep sufficiently large systems running the same version of the OS at all times. If the file system depends on the OS, it requires a lot of coordination and support for the same interface across multiple kernel versions. HDFS can support many different operating systems and local file systems. Although MogileFS does this in principle, in practice some of the details are Linux-centric. Nonetheless, the ability to run across many Linux distributions, and versions of distributions, with different local file systems, is huge, and is good enough in practice for many.

MogileFS: Serving Objects

A big part of our world is sharing. We share all sorts of things. Let's think about photo sharing sites, video sharing sites, music sharing sites. Indeed, we could solve the problem of storing the objects, e.g. photos, videos, and music, using a traditional file system. But, we can do better if we recognize how this workload is likely to be different.

First, these objects are accessed in fewer, simpler ways than in other models. They are never edited in place. No one is changing photos -- they are uploading them. In fact, they are never edited at all -- no one is adding on to existing videos or music -- even if they might subsequently upload longer or extended versions. And, no one wants a chunk from the middle of a photo. The same is true for music or videos. Instead, these objects are downloaded from start to finish. So, what we need is a file system that can efficiently support the uploading and downloading of files from start to finish, even if it doesn't allow random reads, random writes, or appends.

The world interacts with MogileFS only through the very rich interface of the Web site. It isn't necessary to support nested directory trees to create a hierarchy for human convenience. Instead, it is only necessary to allow some simpler way to disambiguate files from different applications (or higher-level domains of some kind).

And, finally, since the users are few and fall under the same administrative domain, protections are enforced by the applications, not the file system.

But, what we do need to do is to deliver many of these files very rapidly and very reliably to many clients. We might have, for example, a bunch of Web servers from our farm hitting the file system at the same time, each of which will want a fat pipe. And, we don't want to lose any of the objects we are charged with preserving and distributing.

How does MogileFS accomplish this? It works a lot like Lustre. At a high level, it has the same idea -- a distributed RAID. But, it is different in some of the details. It doesn't rely on each node providing robust storage; instead, it replicates objects across servers. The number of replicas is associated with the class of the file, so, for example, photos might have three replicas each, but thumbnails, which can be recreated from the original photos, might have only one replica each. This reduces the cost of the storage by allowing less expensive components.
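
A minimal sketch of such a policy, with the class names, replica counts, and default below invented for illustration (MogileFS lets administrators configure replication per file class):

    # Hypothetical per-class replication policy, in the spirit of MogileFS file classes.

    REPLICA_COUNT = {
        "photo": 3,       # originals: lose these and they're gone
        "thumbnail": 1,   # cheap to regenerate from the original
    }

    def choose_replica_nodes(file_class, available_nodes):
        """Pick as many distinct storage nodes as the file's class requires."""
        needed = REPLICA_COUNT.get(file_class, 2)   # assumed default of 2
        if len(available_nodes) < needed:
            raise RuntimeError("not enough storage nodes for class %r" % file_class)
        return available_nodes[:needed]

    print(choose_replica_nodes("photo", ["node1", "node2", "node3", "node4"]))
    # -> ['node1', 'node2', 'node3']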

Additionally, MogileFS uses HTTP to serve objects from each replica, as opposed to a home-grown protocol, for portability. For the same reason, it keeps its metadata in a standard MySQL database. Since, unlike in Lustre, in-place writes aren't permitted, locks aren't very frequently needed, so the database can maintain a sufficiently high throughput.
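
A hypothetical client-side fetch, assuming the replica URLs have already been looked up from the metadata database (the hostnames and path format below are invented): try each replica over plain HTTP until one answers.

    # Hypothetical fetch of an object from its replicas over plain HTTP.
    import urllib.request

    def fetch_object(replica_urls):
        for url in replica_urls:
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    return resp.read()
            except OSError:
                continue   # replica down or unreachable; try the next one
        raise IOError("no reachable replica")

    data = fetch_object([
        "http://node7.example.com/dev3/0/000/123.fid",
        "http://node2.example.com/dev1/0/000/123.fid",
    ])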

Lastly, it maintains simple namespaces, rather than directory trees. We can imagine that several different applications, e.g. several different Web sites of different kinds with different objects to serve, might use the same MogileFS, and accidentally have objects with the same name. As long as they use different namespaces, this isn't a problem -- and is much simpler and more efficient than a full-blown directory system. Similarly, the lack of a complex permission/ownership scheme, though less important, helps to keep things simple.

Hadoop File System (HDFS)

The Hadoop File System (HDFS) is designed to support Hadoop, an open framework for a specialized type of very scalable distributed computing often known as the Map-Reduce paradigm. Both Hadoop and HDFS are based on Google's papers describing early, presumably landmark, versions of their framework. These days, there are many champions and users of Hadoop. It is worth mentioning Yahoo!, by name, because of their particularly deep involvement and advocacy.

We'll spend a good chunk of time talking about the Map-Reduce paradigm, Hadoop, and HDFS later in the semester. But let's see if we can sketch out the problem. Imagine that you've got a truly huge number of records. Oh, for example, records capturing descriptive observations of some approximation of each and every Web page in the world.

The bad news is that we've got a huge amount of data. The good news is that it won't be edited in place. We'll just be collecting it, adding to it. And, the better news is that since the observations are somewhat independent, we don't have to be too careful about the order in which we append them -- just as long as we don't lose or corrupt any. And, as we'll discuss a little bit in a minute, and more later this semester, we'll also be reading them in fairly limited ways, too.

Now, let's suggest that you want to look at these records and decide whether they match some search criteria. It would be impossible to look at them sequentially. So, you want to look at them in parallel. So, somehow, the records are going to need to be very spread out for processing. If we keep the computation and the storage segregated, for example by maintaining separate computing and storage farms, a huge amount of data will need to move, placing impossible demands upon any deployable network.

Instead we see that it would be ideal to scatter the records across many systems, and have the distributed computing scattered in the same way. We want the computing to be local to the data upon which it is operating, for example attached to the same switch.
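
A toy sketch of that placement preference (not Hadoop's actual scheduler; all node and rack names are invented): prefer an idle node that already holds a replica of the data, then any idle node on the same rack/switch as a replica, then anywhere.

    # Toy locality-aware placement: node-local, then rack-local, then anywhere.

    def pick_node(replica_nodes, rack_of, idle_nodes):
        """Choose an idle node to process a block whose replicas live on replica_nodes."""
        # 1. Node-local: an idle node that already holds a replica.
        for node in replica_nodes:
            if node in idle_nodes:
                return node
        # 2. Rack-local: an idle node on the same switch/rack as some replica.
        replica_racks = {rack_of[n] for n in replica_nodes}
        for node in idle_nodes:
            if rack_of[node] in replica_racks:
                return node
        # 3. Fall back to any idle node; the data will cross the core network.
        return next(iter(idle_nodes))

    rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
    print(pick_node(["n1"], rack_of, {"n2", "n4"}))   # -> 'n2' (same rack as n1's replica)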

Because there is so much data, it isn't practical to use any type of off-line backup. So, instead it should simply use replication -- which has the added benefit of providing extra replicas for parallel processing.

And, this leads us to some important aspects of the design of HDFS. It has to allow appends, but not in-place edits. Data is written once and read many times. The data and the processing have to be local to each other, so the system requires location-awareness. And, the data needs to be very heavily distributed, to allow for very heavily distributed processing. And, as we discussed earlier, it needs to be implemented in a portable way at the user-level, to be maintainable at scale.