How Many Reduce Instances Should We Have? Can I Make Things Better By Having More Reducers?
Generally speaking, the number of reducers you need is determined by the number of output files you need. If you can get away with more output files, that is probably a win. For example, if you can take the "Top M" items from each of n output files, that might be better than reducing these files to a single file and selecting the top n*m items (Yes, I know that these two sets are not exactly the same). The reason for this is that, if you can take this shortcut, you are saving some work.
It doesn't usually make sense to reduce your maps to some large number of files and then repeat identity-maps and reduces to perform a multi-phase merge to one large file. The reason for this is that Hadoop can do this, without wasting the time on the Maps. It can do large external merge sorts.
The only time you'd want to do this is if you could throw away results with the inital reduces or sumsequent merges, making the problem substantially smaller, reducing the amount of actual work to be done.
Processing Multiple Data Files
One question that has come up with a couple of teams during office hours has been about using the Map-Reduce paradigm with multiple files. Map can take its input from multiple files at the same time. It can even munge on records of different types, if need be. But, a relational database, a Map-Reduce engine is surely not. Things which are very easily expressed in SQL can take a lot of work to express in Map-Reduce code. The basic problem we run into is that a very common place database operation, the join, is not well expressed in Map-Reduce. It takes a lot of work to express the join -- and, this is appropriate, as joins can be expensive.
A table in a database is a lot like a file that contains structured records of the same type. Each table has only one type of record -- but each table can have a different type of record. The basic idea is that each record is a row and each field is a column. A join operation matches records in one table with those in another table that share a common key. If you ever take databases, you'll learn that there are a lot of different ways of expressing a join -- each with its own implementation efficiencies. But, for Map-Reduce, there are no efficiencies.
In class, we discussed, essentially, what database folks call an inner join. It is one, of many, types of relational operations that are challenging for Map-Reduce. The idea beahind an inner-join is that we have two tables that share at least one field. In effect, we match the records in the two table based on this one field to produce a new table in which each record contains the union of the matching records from both tables. We then filter these results based on whatever criteria we'd like This criteria can include fields that were originally part of different tables, as we are now, at least logically, looking at uber-records that contain both sets of fields.
In order to implement this in Map-reduce, we end up going through more than one phase. Here I'll describe one idiom -- but not the only one. The first phase uses a Map operations to process each file independently. The Map produces each record in a new format that contains the common element as the key and a value composed of the rest of the fields. In addition, it performs any per-record filtering that does not depend on records in the other file. The new record format might be very specific, e.g. <field1, field2, field3>, or it might be more flexible, e.g. <TYPE, field1, field2, field3>.
Although there were two different types of Map operations, one for each field type, they produced a common output format. These can be hashed to the same set of Reduce operations. This is, in effect, where the join happens. As you know, the output from the Map is sorted en route to the Reduce so that the records with the same key can be brought together. And the reduce does exactly this, in effect producing the join table. As this is happening, the filter criteria that depends on the relation between the two tables can be applied and only those records that satisfy both criteria need be produced.
At this point, what we have is essentially what the join table. It is important to note that we might do the filtering upon the indivudal records in the same pass as we render the records into a new format, or in a prior or subsequent pass. The same is true of the filtering based on the new record. But, we can only rarely use a combiner to join the records -- as the two records to be joined are necessarily the output of different Map operations and will only come together at a join.
It might also be important to note that since the records from two differnet files are converging at a single reducer, there are likely to be a huge number of records. In practice, this means a huge sort, likely external, will need to occur at the reducer. In reality, this might need to be handled as a distributed sort, with multiple reduce phases.
This is a lot of work for an operation that is, essentially, the backbone of modern databases. In class, we illustrated this by with the example of finding the "Top Customer This Week" from a Web access log <sourceIP, destinationURL, date, $$> and also determining the average rank of the pages s/he visited from a datafile
. This problem was solvable, but only after several phases of processing. This problem was adapted from a database column article.
Big Picture View
It is fairly straight-forward to represent any problem that involves processing individual records independently as a Map Reduce problem. It is striaght-forward to represent any problem that is the aggregation of the results of such individual processing as a Map Reduce problem -- as long as the aggregation can be performed in a "running" way and without any data structures that grow as one goes.
It is challenging to solve problems that involve relating different types of data to each other in the Map-Reduce paradigm, because these involve "matching" rather than individual records. The more relational our problem, the more we have to match to find our answer, the more phases we are likely to need. And, we could end up processing a lot of intermediate records. A whole lot of intermediate records -- and that can make storage of this intermediate output a concern. Also, multiple reduce phases means that we are going deep, and wide is really where the advantage lies.
But, again, you've got to look at the problem. Some problems are just big and deep -- and that's not only okay. It is the only way.