Tuesday, March 30, 2010 (Lecture 19)

Overview

MapReduce is a programming model first published by Google in 2004, specifically in an OSDI paper titled MapReduce: Simplified Data Processing on Large Clusters (Dean and Ghemawat). It is presently the basis for a large part of the magic at Google, Yahoo!, Facebook, Amazon and other organizations in the business of "cloud computing" for data-intensive scalable computing (DISC).

MapReduce, itself, is a paradigm or technique -- not a system. It can be implemented in different languages and different environments. It is important to realize that the MapReduce model is a technique for expressing an idea in a way that is massively and automatically parallelizable. It is not the machinery that actually compiles a program, distributes the computation across a cloud, feeds it data, provides communication, manages failure, or manages the computation. These tasks belong to the actual system components, hardware and software. The bottom line is that MapReduce is a "Technique", a "Model", a "Paradigm", or a "Way of thinking about and structuring a solution to a problem", but it is not actual hardware and/or software machinery.

Google has its own infrastructure to implement MapReduce in their environment. As you might imagine, this makes use of a specialized file system, programming language components, and system management components. Never mind people.

Unsurprisingly, Google's implementation is very proprietary. I imagine that among the few things more important to them is what they know about user behavior as it relates to the utility of information, including ads. Your best chance to learn about it is to intern at Google, get a real job there, or get a really geeky Googler thoroughly drunk -- which doesn't sound like a terribly good idea.

But, not to fear, we'll be using an open-source alternative called Hadoop. By all accounts, it isn't Google's system. But, it seems to work well enough for Yahoo!, Facebook, and Amazon, among others. And, Google's been pretty supportive of the effort, too. So have IBM and the NSF, among others. Whatever it is, Hadoop surely isn't a cheap imitation "wall hanger".

The Big Picture

Imagine that you've got a huge file containing bunches of records and you want to process it somehow. Maybe you want to search it for a particular record. Or, maybe you want to compute some summary stats about the information contained within the records. Or, maybe you want to extract only certain information from each record. How do you go about this?

The basis of the traditional approach is probably a huge loop:

while not eof
do
  read record
  process record

  update stats or other aggregate information
  -- and/or --
  write new record
done

write any stats or other aggregate information

And, this approach has some advantages. It is easy to understand, systematic, and will get the job done. It is also pretty efficient if you've got one disk from which to read (possibly another disk with which to write) and one processor to do the chewing.

But, if you've got a bunch of disks and a bunch of processors, it isn't taking advantage of them. It would be much better to divide the problem up into smaller pieces, process those smaller pieces in parallel, and merge the results together. And, as it turns out, for many, if not most, truly large scale data processing problems, the data is already distributed across many nodes, simply because it is too large to store in any single place. Bonus!

Regardless, therein lies the rub. Many modern programming models have neither a way of representing parallelizable data nor a way of describing parallel processing. They can describe a linear, indexed list of data, and they can describe iterating through it -- but not attacking it in parallel. There is no way, for example, to describe "Apply this function to every element of this array" other than to specifically ask for it to be done one element at a time, sequentially.

Fortunately, many functional programming languages do have an approach that is more natural for parallelization -- and for many other types of problems where "one after the other" isn't actually part of the strategy, just part of the implementation. You guys may be familiar with the model from 15-212: Map and Reduce functions.

What is known in "data-intensive scalable computing (DISC)", a.k.a. "cloud computing", as the MapReduce paradigm is really an extension of the techniques used in functional programming into the domain of distributed systems.

The MapReduce model views the inputs, not as a linear list, but as a partitionable, parallelizable body of individual records. It allows the programmer to define, through Map operations, functions to be applied, in parallel, to each partition. Likewise, it allows the programmer to define Reduce functions that aggregate the results of the Map functions together. Since the Map and Reduce functions are programmer designed, the model is very flexible and very powerful.

It is probably also worth noting that the technique also allows the generation of large data sets in parallel. You might imagine that, for example, you want to generate a bunch of random student records, where the names are all assembled from some pools of first and last names at random, as are the course selections, grades, &c. This could be done in a massively parallel way by generating records using Map and merging them together using Reduce. In this way the technique works for both processing and generation.
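
Just to make that concrete, here is a minimal sketch, in plain Java rather than any particular MapReduce framework, of a Map step that ignores its input and generates a record at random. The class name, the name pools, and the record format are all made up for illustration.

import java.util.*;

public class RandomRecordMapper {
    private static final String[] FIRST = {"Alice", "Bob", "Carol", "Dave"};
    private static final String[] LAST  = {"Chen", "Garcia", "Nakamura", "Smith"};
    private static final String[] COURSES = {"15-213", "15-440", "15-451"};

    private final Random rng = new Random();

    // "map": the input key is just a sequence number saying which record to make;
    // the output is a (studentId, record) pair.
    public Map.Entry<String, String> map(long seqNo) {
        String name = FIRST[rng.nextInt(FIRST.length)] + " " +
                      LAST[rng.nextInt(LAST.length)];
        String course = COURSES[rng.nextInt(COURSES.length)];
        int grade = 60 + rng.nextInt(41);           // a grade between 60 and 100
        String record = name + "," + course + "," + grade;
        return new AbstractMap.SimpleEntry<>("student-" + seqNo, record);
    }

    public static void main(String[] args) {
        RandomRecordMapper m = new RandomRecordMapper();
        for (long i = 0; i < 5; i++) {
            Map.Entry<String, String> kv = m.map(i);
            System.out.println(kv.getKey() + " -> " + kv.getValue());
        }
    }
}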

A Closer Look at the Data, Map, and Reduce

Key-value pairs are important to the MapReduce model. Input is presented as key-value pairs. And, output is generated as key-value pairs. Interestingly enough, this convention makes it easy to form multi-stage processing by directly using the output from one stage as the input to another stage.

Keys are usually simple. But, the values associated with them can be large and complex. For example, consider a URL and the Web page it describes as a key-value pair. The value, the contents of the page, can be complex and information rich.

Unsurprisingly, the MapReduce model is based on a pair of functions: Map and Reduce. The Map function begins with a single -input- key-value pair and produces one or more -intermediate- key-value pairs. There may be, and often are, multiple intermediate pairs for a single input pair. And, the key is not necessarily unique within the intermediate pairs -- there can be, and often are, duplicates.

And, that's exactly where the Reduce operation comes in. Reduce takes a key and a -list- of all associated values and reduces that list to a smaller list -- typically of size -zero- or -one-. In other words, typical Reduce functions will produce a single value, or no value -- but producing a list of more than one value is, in practice, not often useful.
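
To summarize the shape of the two functions, here is a sketch of their types written as Java interfaces. The types follow the description in the OSDI paper; the interface names themselves are made up and are not any real framework's API.

import java.util.List;
import java.util.Map;

// map: one input pair in, zero or more intermediate pairs out.
interface Mapper<K1, V1, K2, V2> {
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);
}

// reduce: one intermediate key plus *all* of its associated values in,
// a (typically zero- or one-element) list of values out.
interface Reducer<K2, V2> {
    List<V2> reduce(K2 key, List<V2> values);
}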

The Reduce function is also executed in parallel, resulting in more than one output set of key-value pairs. These can be further processed by Map operations, which don't necessarily need them to be in one partition, further combined via Reduces, or given back to the User program, which might well be able to handle the results spread out across some number of files.

So, there you have it. The Map function, which is performed independently, in a massively parallel way, on separate partitions of the data, does a bunch of user-defined processing, and structures its output as key-value pairs. These pairs are then aggregated by the Reduce function, and processed some more or handed back to the user.

The Canonical Example

The example that is almost universally cited is from the OSDI 2004 paper. Please note that it is written in pseudocode, not Java, C, C++, or any other real language. It illustrates how to build a histogram of word usage from a document:

map (String key, String value):  
  // key: document name
  // value: document contents

  for each word w in value:
    EmitIntermediate (w, "1");


reduce (String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit (AsString(result));

Notice how this works. The Map function finds each word in the document and emits a pair <word, 1>, indicating that it found one instance of that particular word. Since a word may occur multiple times within the same document, it might emit the observation that it found a single instance of the word multiple times.

The list of all of these word counts, each of a single observation, is then fed to the Reduce function. But, before being fed to the Reduce function, they are grouped together such that, associated with each word, is a list of all associated values. In this case, associated with each word is a list of 1's, with a single 1 for each instance of the word in the original document. The Reduce function simply charges through and adds these up to arrive at the tally for the key word.
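
For the curious, here is roughly what the same example looks like in Hadoop's Java API, following the standard WordCount example. Treat it as a sketch -- the driver boilerplate is omitted, and the exact class and package details can vary between Hadoop versions.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // key: byte offset of the line; value: one line of the document
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit <word, 1>
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // key: a word; values: all of the 1's emitted for that word
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}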

Inputs to Reduce Are Sorted

One surprising detail of the system is that the intermediate key-value pairs are sorted in increasing order by key prior to being fed to any instance of Reduce. This is nice for three reasons.

The first reason is human. Oftentimes the ultimate results of data processing are viewed by humans -- and we prefer them in sorted order. Consider, for example, the word frequency program described above. It produces the results sorted alphabetically by word -- nice!

The second reason is that, after all of the data processing, the results are sometimes searched using techniques like binary search, which benefit from sorted data. Binary search is surely nicer than brute-force linear searching. But, it has to be remembered that these results might well still be too large for memory. And, even binary search is ugly on disk. But, nonetheless, it can be helpful.

The most compelling reason is that Map functions often yield a huge number of duplicate keys, but the keys are not necessarily grouped together when sucked in by a Reduce. Consider our word frequency example -- certain words surely get used a whole lot, but at different locations in different documents. Before these can be fed to Reduce, they must be grouped together by key. It is easy to see that sorting by key groups the pairs with the same keys together -- with the super-convenient side-effect of them being sorted.

As an interesting aside, since the keys are sorted en route to a reduce, it is possible to write a massively distributed sort via MapReduce. And, this is pretty cool.
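
As a sketch of that idea, in the same Hadoop style as the WordCount example above (the class names here are hypothetical): let the Map re-emit each record keyed by the field you want to sort on, and let the Reduce be the identity. The sort-before-reduce machinery does all of the real work. Note that each Reduce task's output file is individually sorted; getting one globally sorted order across all of the output files additionally requires partitioning the keys by range rather than by hash.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistributedSort {

    public static class SortMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assume each input line is "<sortKey><TAB><rest of record>".
            String[] parts = line.toString().split("\t", 2);
            context.write(new Text(parts[0]),
                          new Text(parts.length > 1 ? parts[1] : ""));
        }
    }

    public static class IdentityReducer
            extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Keys arrive here already sorted; just write everything back out.
            for (Text v : values) {
                context.write(key, v);
            }
        }
    }
}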

Where's the Magic?

Repeat after me: There is no magic. There is no magic. There is no magic. But, there might just be more than meets the eye at first glance. Let's step back and see why this model has the potential to be powerful.

Remember that the typical application for this technique involves the processing of a huge quantity of information, typically structured as a pile of records. It is straight-forward to partition data so structured into chunks for parallel processing, as each record is independent. Given this, we can take our input files, slice them and dice them, and send them off to a bunch of different workers for processing. These workers can then Map and Reduce away -- all in parallel. No "thinking" is required in order to parallelize the task.

The mapping between keys and values by the Map function, and the reduction in the list of values associated with a key by the reduce function are arbitrary. This is the beauty of the model -- the programmer can implement these steps, and pipeline them together. At the end of the day, seemingly complex things might well be representable as highly-parallel combinations of simple things.

Big-O is Back

It is important to note that you can stick whatever code you'd like into your Map and Reduce functions -- but some common sense is required. Remember that the list fed to Reduce can be large. And, remember that the value being fed to Map can be large, too. And, by "Large", I do mean "Huge".

As a result, you want these functions to operate with O(1) memory use. You don't want memory use to grow as the data does -- that is bad for massively parallel computing. Think about this carefully -- and, when you get into coding, stay out of the tar pits here.
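
Here is the difference in a small, framework-free Java sketch. Both methods compute the same sum, but the first keeps O(1) state no matter how long the list of values is, while the second buffers everything first and grows with the data.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReduceMemory {

    // Good: one running total, constant memory regardless of list length.
    static long streamingSum(Iterator<Integer> values) {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // Bad: materializes every value before doing anything with them,
    // so memory use grows as O(n) with the size of the value list.
    static long bufferedSum(Iterator<Integer> values) {
        List<Integer> all = new ArrayList<Integer>();
        while (values.hasNext()) {
            all.add(values.next());
        }
        long sum = 0;
        for (int v : all) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> ones = Arrays.asList(1, 1, 1, 1, 1);
        System.out.println(streamingSum(ones.iterator()));  // 5
        System.out.println(bufferedSum(ones.iterator()));   // 5, but with O(n) scratch space
    }
}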

No Magic Implies No Magic Bullet.

First and foremost, this model is not the solution to all problems in distributed computation. It works well only when the data is well-structured and is composed of a massive number of independent records. If the records are not well-structured, or if the information within the records only finds meaning in the context of other records, this might not be the right model to solve the problem.

And, it only works if we can represent our processing in Map and Reduce phases -- and works best if the resulting tree is wide across the data and shallow in the number of phases. In most cases, the Map phases give us the massive parallel fan-out that justifies the technique. But, the greater the number of phases, the more sequential the technique is and the more effort is spent combining versus operating in parallel on each record.

There is no guarantee that any arbitrary problem will be best solvable with this technique. And, some problems and some data representations can be quite poor choices for it, to be sure.

The System Model

Okay, so at this point, we understand that we need a programming environment that supports Map and Reduce operations. But, we are left with more questions than we yet have answers. Some of these questions are unique to this model, whereas others are common to any distributed system.

In order to answer these questions, we're going to walk through the entire process of executing a MapReduce program, from start-to-finish. But, this time, we're not going to worry about the details of the program -- instead we are going to consider the role of the machinery underneath.

Work Flow

When a user program starts, it starts up MapReduce. One of the most important early steps is for MapReduce to carve up the input file(s) into chunks, known as splits. Each split is of the same size, which is user configurable anywhere from a dozen to several dozen megabytes.

MapReduce then initializes a whole bunch of instances across many nodes. One of these instances is the Master that is responsible for coordination. The other instances are Workers that will each perform Map and/or Reduce operations. The Master will assign idle workers Map or Reduce tasks.

But, it does not assign more than one task per worker. If there is more work to be done than workers available, the Master will hold onto it until some Worker becomes idle and can immediately accept it. By keeping the de facto work queue at the Master, rather than on the Workers, the Master is able to improve load balancing. This is because Workers will likely finish at unpredictable and different times, making it hard to optimally allocate all work initially.

The Map Worker does its thing and churns out the results -- the intermediate key-value pairs. These results are buffered in memory for a while, but periodically written to disk. As they are written to disk, the key-value pairs are hashed into Regions, based on their key. The data is divided into Regions to provide chunks that can be processed in parallel by Reduce workers. By dividing the output using a hash function, the buckets associated with each worker will be approximately the same size.
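
The partitioning function itself is simple. The paper's default is hash(key) mod R, where R is the number of Reduce tasks, and Hadoop's default partitioner does essentially the same thing. A minimal sketch in plain Java:

public class HashPartition {
    // The intermediate key decides which of the R regions (and hence which
    // Reduce task) a pair lands in.
    static int regionFor(String key, int numReduceTasks) {
        // Mask off the sign bit so the result is a valid index even if
        // hashCode() happens to be negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int R = 4;
        for (String k : new String[] {"the", "quick", "brown", "fox"}) {
            System.out.println(k + " -> region " + regionFor(k, R));
        }
    }
}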

As each Region is written, the Master is informed. This allows the Master to assign the work to a Reduce Worker, which will read the data from the intermediate file using a remote read, such as an RPC call.

Number of Nodes, Workers, Maps and Reduces

The number of Workers is determined by how many nodes we'd like to have in play. And, the answer to the question, "How many Nodes do we want?" is almost always, "The more, the merrier." Having access to more Nodes is realistically always better -- you'd have to have some disproportionately large number of nodes before it wouldn't be worth it to use them. And, at that point, I'm not sure that the game counts as "Data intensive" any more.

But, given some number of Nodes, we also need to determine how many units of work we want. In other words, given some large chunk of data, how many pieces do we cut it into before the Map? And, how many groups of intermediate results do we create for Reducing?

How Many Maps?

The number of Map tasks into which the job is divided is decided by the user, based on intuition, the number of available Workers, the amount of data, and the amount of computation within a map.

In some sense, the greater the number of Map operations, the better. If, for example, we have fewer Map operations than Workers, some Workers will initially be idle. If we have as many Maps as Workers, all nodes will initially be busy, but, Reduces aside, some will finish before others, leaving idle time. But, if the number of Map operations is really high, leaving many small pieces, they can constantly be fed to nodes as they become available, allowing for much finer-grained load balancing.

The cost is, of course, overhead. The more Map tasks there are, the more effort is made by the Master to keep track of the state of computation. Since the Master needs to manage each Map and each Reduce, it must make O(M+R) scheduling decisions. And, since there is state associated with each Map-Reduce pair, it must maintain O(M*R) state information. Regardless, this overhead, especially the storage overhead, has a reasonably small coefficient.

How Many Reduces?

The number of Reduces is governed by the same load-balancing and overhead concerns as the number of Maps. Additionally, since each reduce produces a different user-visible, final output file, the user often prefers fewer Reduce tasks -- independent of system efficiencies.

In order to keep things even modestly busy and load balanced for much of the program's lifetime, one can imagine that the number of Reduces should be a few times the number of processors, whereas the number of Map tasks can be many, many times higher than that.

Locality

It is easy to see that performance is dramatically improved when the underlying system is location aware. If our processing Nodes are near our Storage Nodes, moving things around is much faster. What is meant by near? On the same host? In the same rack? On the same switch? Very few switches in between? The closer the better.

If input files were to be stored whole, such as is the case in many traditional file systems, they would somehow need to be sliced and diced before being distributed to the Workers' nodes. Given that the entire goal of the system is to perform MapReduce, it makes sense for the file system to do this in advance.

So, it does. Files are stored in chunks which are spread out across the system, making the pieces local to different nodes. This enables them to be processed in parallel by multiple Maps, with little communications overhead.

Furthermore, multiple copies of each block are stored in different places. By having multiple replicas, the data is not only made more robust in light of failure, it can also be spread out in more places -- allowing for more options when load-balancing the Map workers.

Three (3) replicas is the magic number reported, once upon a time, by Google as their choice. There doesn't appear to be much science here. Who knows if they use it today, or if it even still works this way. But, it is an example of an intuitively reasonable "Guesstimate".

Combining Maps and Reduces

Sometimes a Map function produces a huge number of pairs with overlapping keys. In this case, it might make sense to reduce the output, before allowing it to be shipped off somewhere else for a Reduce. There is certainly an economy in running a Reduce, especially a data-intensive one, on the Worker that already has the output file.

For this reason, the MapReduce model allows a Reduce to be more-or-less attached to the end of a Map. When this is done, the Reducer is known as a Combiner. One critical difference between a Reducer and Combiner is that the Combiner's results, like the Map's results, are written to an intermediate file that needs to be subsequently reduced to generate a user-visible output file. A Reducer's output can be a final, user-visible result.

A Combiner function needs to be commutative and associative. This is because a Reducer function, doing essentially the same thing, will follow it. Remember, the Combiner only operates on the output of one Map to reduce its size -- unlike a Reduce, it does not merge the results of many Maps.
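
In Hadoop, plugging in a Combiner is typically a one-line affair in the job driver: because summing word counts is commutative and associative, the word-count Reducer can be reused as the Combiner. The sketch below assumes the TokenizerMapper and IntSumReducer classes from the earlier WordCount sketch; the exact driver boilerplate varies a bit between Hadoop versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);  // local, per-Map reduction
        job.setReducerClass(WordCount.IntSumReducer.class);   // the real Reduce

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}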

Worker Failure

On a system of this scale, failure is commonplace. It is the job of the Master to periodically ping the Workers. If a Worker doesn't answer, it is marked as bad, and its work is rescheduled to another Worker. Furthermore, any Reduce that was scheduled to get results from the old Worker is told to begin getting them from the new Worker, instead.

When a Map Worker dies, its work needs to be re-executed from scratch. The reason for this is that the results are stored on the Worker's local disk and are now inaccessible to Reduces. But, should a Reduce Worker fail, its results remain available in the global file system.

Why the difference? Well, remember, the results of a Reduce are designed for consumption by the end user. Because of this, they are placed in a distributed file system such that the program can get to all of them in one place.

By contrast, the results of the Map Workers are intended only for consumption by particular Reduce Workers, so they are left in place. Instead, the Reduce Workers are told of their location by the Master and they suck in the data explicitly, in a location-aware way, by an RPC-like mechanism.

Master Failure

What if a Master fails? Well, one could apply a traditional distributed systems approach: checkpoint the data structures into the global file system and have the user library periodically and invisibly ping the Master. If it doesn't answer, the user library can conjure up a new Master and instruct it to recover its state from the checkpoint.

But, why? The Master isn't scaled out -- it is just one central Master. Like your desktop, its failures are years apart. And, checkpointing things will waste tons of time.

Instead, if a computation times out, the program can just restart the computation anew, perhaps after checking the status of and with the Master, etc.

Bad Data -- and Bad Code

In playing with huge amounts of data, some of the records are bound to be badly formed, corrupted, etc. This certainly can't be allowed to break a large-scale computation. At least not the commonplace kind that isn't looking for an exact, unique result.

More commonly, we are looking for a "Good result" or a "Good approximation" of the answer, not some sort of "Perfect" answer. So, should a particular record repeatedly cause crashes, the system simply skips exactly that record.

It is worth noting that this feature also protects the robustness of the system from other "Features", e.g. bugs in the Map or Reduce code. Even if the record is good, and it is a code defect in the Map or Reduce that is causing the choking, this mechanism can come to the rescue.

You can imagine that this feature is implemented by keeping track of the record that is currently being processed by a Worker. If the Worker crashes, an Exception or Signal handler sends this record number to the Master before the Worker dies. If the Master sees that the same record is repeatedly causing a crash, it can re-issue the task, with instructions to skip that record.
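
Here is a sketch of that Master-side bookkeeping in plain Java -- not real Google or Hadoop code, and the crash threshold is an arbitrary, hypothetical choice -- just to show how little state it takes.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BadRecordTracker {
    // How many crashes a single record is allowed to cause before it is
    // skipped on the next attempt (hypothetical threshold).
    private static final int MAX_CRASHES = 2;

    private final Map<Long, Integer> crashCounts = new HashMap<Long, Integer>();
    private final Set<Long> skipList = new HashSet<Long>();

    // Called when a Worker's exception/signal handler reports the record
    // it was processing just before dying.
    public void reportCrash(long recordNumber) {
        Integer previous = crashCounts.get(recordNumber);
        int count = (previous == null) ? 1 : previous + 1;
        crashCounts.put(recordNumber, count);
        if (count >= MAX_CRASHES) {
            skipList.add(recordNumber);
        }
    }

    // Consulted when the task is re-issued to another Worker.
    public boolean shouldSkip(long recordNumber) {
        return skipList.contains(recordNumber);
    }
}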

How Many Map-Reduce Phases Is Optimal?

One question that we've gotten a bunch over the last few days is, "How many Map-Reduce phases should we have?", which is sometimes phrased, "In designing a Map-Reduce approach, should we use many phases or just a few?" The answer to this question is, "Ideally, it would be possible to have exactly one phase -- but it often isn't."

Much of the power in a distributed Map-Reduce comes from the work that is distributed in the Map phase. In an ideal world, the Mappers will keep a lot of workers busy for a long time. Keep in mind that, whereas the nature of the data and processing determines the number of Mappers that can efficiently run concurrently, the number of Reducers is limited by the number of output files that the end-user application is willing to accept. So, although we can go really wide on a Map, and as a consequence get a lot done at a time, the Reduce can be a bottleneck.

In an ideal world, a metric boat load of Mappers each process a relatively small chunk of the data in parallel and the results are locally combined into a much smaller set. These are then sorted, perhaps externally, and fed into relatively few Reducers, each of which performs only a very small amount of work to take the new information and add it to the current bucket.

Although multiple Map-Reduce phases are possible on the same data, it is almost always better to structure these into fewer phases, if possible. Remember, Mappers read their data from the global file system and write it into a local cache. The Reducers get the data from this local cache into their own local cache via RPC calls and then write the results back into the global file system, which distributes and replicates it.

If we can do more processing on a single unit of data in the first pass, we cut out a huge amount of overhead. We save the work of shipping cached temporary results into the global file system, where they get replicated, etc. We also save the overhead of sucking them back into cache file systems, which might or might not be on the same nodes. There's also the overhead of managing another phase of computation.

When Do Multiple Phases Make Sense?

There are times when multiple phases do make sense. We've seen one example of this already: consider the first lab. In the first phase, we counted the word occurrences. In the second phase, we flipped them to sort by count rather than key. If we could have done this in one phase, it surely would have been more efficient -- but we couldn't. So, we either had to do it in two phases -- or with post-processing after the fact. In the case of the lab, we did it with a second Map-Reduce phase.
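
As a sketch of what the second, "flip" phase's Map might look like, in the same Hadoop style as the earlier examples (hypothetical -- your lab code may differ): read the <word, count> lines produced by the first phase and emit <count, word>, so that the sort-by-key step orders the results by count. The default order is ascending; getting descending order takes a custom sort comparator.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlipMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each input line is assumed to be "<word><TAB><count>", as produced
        // by the first phase. Emit <count, word> so the keys are the counts.
        String[] parts = line.toString().split("\t");
        if (parts.length == 2) {
            context.write(new IntWritable(Integer.parseInt(parts[1])),
                          new Text(parts[0]));
        }
    }
}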

Another general situation that might involve multiple Map-Reduce phases is when we need to draw inferences across the output of the first phase, rather than about the individual elements. For example, "Find some list of records that match X, and then determine the Y of those".

Hadoop

At this point, we punted to the Hadoop Tutorial.