Project 3: Design and Implementation Guidelines

General

  1. If you make any assumptions or simplifications, please state them clearly in your report.
  2. You may use either Java's RMI or your own RMI framework from Project 2.
  3. You are not allowed to store any data in AFS to share it between components of your system. As an exception, you may use AFS for purposes that do not replace functionality of your system. For example, for DFS bootstrapping, it is acceptable to keep the original input files in AFS so that all the data nodes can read them and create replicas as needed. Once the file system is set up, the data in AFS must no longer be used.

Report

  1. As mentioned in the project handout, avoid describing your design in terms of classes and methods. We are interested in which components work together to accomplish the goals of your system, and how they do it. E.g., we would like to know how the servers and clients are deployed (where, how many, etc.), rather than what classes and methods realize the server functionality. The latter is best placed in comments in your code.

MapReduce Computation Model

  1. Your MapReduce framework is not required to explicitly support global sorting of final results; it suffices for each Reducer to produce locally sorted results. If you would like to support global sorting, you are welcome to make the intermediate key-value pairs produced by Mappers available to Reducers running on other machines. This can be done, for example, by writing them to your distributed file system instead of the Mapper nodes' local file systems, or by transferring them directly to Reducers over the network (a sketch of one possible partitioning scheme follows this item).
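
As a concrete illustration of the global-sorting option, here is a minimal Java sketch of range partitioning: intermediate keys are routed to Reducers using sorted boundary keys, so that concatenating the locally sorted outputs of Reducer 0 through R-1 yields a globally sorted result. The class and method names are illustrative only, and in practice the boundary keys would be chosen by sampling the Mappers' output.

    import java.util.Arrays;

    public class RangePartitioner {
        // Sorted boundary keys; length = numReducers - 1. Keys below
        // boundaries[0] go to Reducer 0; keys in the range
        // [boundaries[i-1], boundaries[i]) go to Reducer i; and so on.
        private final String[] boundaries;

        public RangePartitioner(String[] boundaries) {
            this.boundaries = boundaries;
        }

        // Returns the index of the Reducer responsible for this key.
        public int partition(String key) {
            int idx = Arrays.binarySearch(boundaries, key);
            // binarySearch returns -(insertionPoint) - 1 for absent keys.
            return idx >= 0 ? idx + 1 : -(idx + 1);
        }
    }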

Scheduling

  1. The scheduler should try to be work-conserving, launching as many Mappers and Reducers as the task's computation model allows. Specifically, it should exploit multiple CPU cores on each compute node. On Andrew machines, reading /proc/cpuinfo will give you the number of cores (see the first sketch after this list).
  2. The scheduler should have a reasonable policy that weighs the cost of data accesses (local or remote) against overall task throughput. For example, it is undesirable to leave available cores idle simply because Mappers running on them could only access their input data remotely (see the second sketch after this list).
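
A minimal Java sketch of the core-counting step mentioned in item 1: it counts the "processor" entries in /proc/cpuinfo, falling back to the JVM's portable query if the file cannot be read. This is one possible approach, not a required one.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class CoreCount {
        // Each logical CPU appears in /proc/cpuinfo as a line starting
        // with "processor"; counting those lines gives the core count.
        public static int numCores() {
            int cores = 0;
            try (BufferedReader in = new BufferedReader(new FileReader("/proc/cpuinfo"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("processor")) {
                        cores++;
                    }
                }
            } catch (IOException e) {
                // Fall through to the portable JVM query below.
            }
            return cores > 0 ? cores : Runtime.getRuntime().availableProcessors();
        }
    }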
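
And for item 2, a hypothetical sketch of a locality-aware placement policy: prefer a node that holds a local replica of the input chunk, but rather than leaving cores idle, fall back to any node with a free core and read the data remotely. The Node interface and its methods are illustrative, not part of any required API.

    import java.util.List;

    public class LocalityScheduler {
        // Illustrative view of a compute node as seen by the scheduler.
        interface Node {
            int freeCores();
            boolean hasReplica(String chunkId);
        }

        // Returns a node on which to run the Mapper for this chunk, or
        // null if every core in the cluster is busy and the task must wait.
        public static Node pickNode(List<Node> nodes, String chunkId) {
            Node fallback = null;
            for (Node n : nodes) {
                if (n.freeCores() == 0) continue;
                if (n.hasReplica(chunkId)) return n; // local read: cheapest
                if (fallback == null) fallback = n;  // remote read beats idling
            }
            return fallback;
        }
    }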

Distributed File System

  1. You may assume that input files for Mappers are text files and that Mappers process input line by line. You can then split files into replicas at line breaks, so that individual input entries processed by Mappers do not span multiple replicas. This also simplifies determining the cost of a particular scheduling decision (a sketch appears after this list).
  2. Upon data node failures, it is recommended (although not required) that your file system create new replicas to replace those lost on the failed nodes, keeping the replication factor enforced. At bootstrapping time, the replication factor must be enforced.
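
The line-break splitting simplification in item 1 can be implemented along these lines. This is a minimal Java sketch under the stated assumptions (text input, line-oriented Mappers); the class name and chunk-size parameter are chosen for illustration.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class LineAlignedSplitter {
        // Splits a text file into chunks of at least chunkSize characters,
        // always ending a chunk at a line break so that no input line
        // spans two replicas.
        public static List<String> split(String path, int chunkSize) throws IOException {
            List<String> chunks = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    current.append(line).append('\n');
                    if (current.length() >= chunkSize) { // boundary falls on a line break
                        chunks.add(current.toString());
                        current.setLength(0);
                    }
                }
            }
            if (current.length() > 0) {
                chunks.add(current.toString());
            }
            return chunks;
        }
    }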