Lab #3
Due: To be negotiated in class on Thursday

Overview

The purpose of this assignment is for you to solve a problem of your choosing using Hadoop. You may use your choice of the Google-IBM cluster, Amazon's AWS, your own computer, a machine in WeH 5419, or Dave Anderson's research cluster. Access details for the clusters will be provided in recitation on Wednesday.

No Partners

...unless you turned in the "Step 0" checkpoint.

If you did turn it in, you'll receive feedback today, either asking you to see us or letting you know that everything looks good. But, as long as you turned it in, you do not need to wait for feedback before partnering. If you did not turn it in, you may continue -- but without a partner.

First Step: Pick a Problem

Examine the small chunk of the Wiki dump available in the handout directory and/or check out the data associated with the NetFlix Prize, which requires registration.

Think about what you'd like to learn about the data. Find a problem that is a good fit for the Map-Reduce paradigm.

Drop a short .txt file into "handin/lab3-1" that describes a) the problem you'd like to solve, at a human level, and b) the reason you believe it will be addressable with Map-Reduce. You do not need to have all of the details worked out -- just an idea or general strategy.
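
As a rough illustration of what "a good fit" can look like, here is a minimal Python sketch of one possible problem: counting how often each page is linked to in wiki text. It is only an example of the shape (a map step that emits key/value pairs from one record at a time, and a reduce step that combines all values for one key), not a suggested project. The function names, the sample lines, and the simplified link pattern are all assumptions made for illustration; `run_local` merely stands in for the grouping that a framework such as Hadoop would do for you.

```python
import re
from collections import defaultdict

# Simplified, assumed pattern: matches [[Target]] and [[Target|label]],
# capturing only the target. Real wiki markup has more cases than this.
LINK_RE = re.compile(r"\[\[([^\]|#]+)")

def map_links(line):
    """Map step: emit (target, 1) for each wiki link found in one line."""
    for match in LINK_RE.finditer(line):
        yield match.group(1).strip(), 1

def reduce_counts(target, counts):
    """Reduce step: combine all the counts emitted for one target."""
    return target, sum(counts)

def run_local(lines):
    """Stand-in for the framework: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_links(line):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())

# Hypothetical sample input, not taken from the actual dump.
sample = [
    "See [[Hadoop]] and [[MapReduce|the paper]].",
    "[[Hadoop]] runs on clusters.",
]
result = run_local(sample)
print(result)
```

The property to look for in your own problem is the same: each input record can be processed without seeing any other record, and the per-key combining step does not care about the order in which values arrive.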

We'll take a look at these as they come in and provide you feedback within 24 hours. If you want it sooner, please ask in person or via email. But, please don't wait for our feedback -- just get started.

Second Step: Solve the Problem

Solve your problem using MapReduce. You may immediately begin working on a cluster. But, you might find it easier to begin work using a small sliver of the data on your laptop or a WeH 5419 machine first. This will let you do at least rough debugging on the code, before you launch a monster. This is especially important if you are using AWS, as you only have $100 of credit (more might be available, if you get into a bind).
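
One simple way to get a "small sliver" for local debugging is to copy only the first part of a large input file. The Python sketch below does that; the function name, the default line limit, and the file names in the usage note are hypothetical, and taking the first N lines is just one assumed sampling strategy.

```python
import itertools

def take_sliver(src_path, dst_path, max_lines=10000):
    """Copy at most max_lines lines from src_path to dst_path.

    Returns the number of lines actually written. The source is read
    lazily, so a very large file is never loaded into memory at once.
    """
    written = 0
    with open(src_path, "r", encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in itertools.islice(src, max_lines):
            dst.write(line)
            written += 1
    return written
```

For example, `take_sliver("wiki-chunk.xml", "sliver.xml", 5000)` would write the first 5000 lines of a (hypothetical) `wiki-chunk.xml` to `sliver.xml`. Note that cutting a structured file at an arbitrary line can leave the last record incomplete, so your code should tolerate a truncated final record, or you should trim it by hand.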

Third Step: Tell the Story

Please write a brief report that explains a) the problem, b) your initial strategy, c) the strategy that worked, d) what you learned along the way, e) your results in human-readable form, f) your results, and g) advice for next semester's students.

Then, turn this in to "handin/lab3-2", along with your source code.

About the Wiki Dump

You might find the following resources helpful:

We're Here To Help!

As always -- remember, we're here to help!