Carnegie Mellon University
Big Data and Large Scale Computing
95-869 - Spring Mini-4 2017





Coursework consists of (grading in parentheses):
  • Class Participation (10%)
  • Homework (40%)
  • Project (20%)
  • Final exam (30%)
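
As a quick illustration of how the weights above combine, here is a minimal sketch in Python. The weights come from the list above; the function name, the dictionary keys, and the example scores are hypothetical. The sketch also assumes each component has already been reduced to a single 0-100 score; how the four homeworks combine into one homework score is not specified here.

```python
# Weights taken from the coursework list; everything else is illustrative.
WEIGHTS = {
    "participation": 0.10,
    "homework": 0.40,
    "project": 0.20,
    "final_exam": 0.30,
}


def weighted_grade(scores):
    """Combine per-component scores (each on a 0-100 scale) into one course score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    # Weighted sum: each component contributes its weight times its score.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example_scores = {
    "participation": 90,
    "homework": 85,
    "project": 80,
    "final_exam": 75,
}
print(weighted_grade(example_scores))
```

This is only a way to sanity-check the arithmetic; the course staff's actual grade computation may differ.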

NOTE: All assignments are to be done individually. Please see the Collaboration policy.


Assignment                                  | Note                               | Out    | Due                            | Weight
Installation, Set up (HBH 2106, 6:30pm-8pm) |                                    | Mar 21 |                                |
Homework 1                                  | Programming in MapReduce and Spark | Mar 28 | Apr 6                          |
Homework 2                                  | Regression in Spark                | Apr 6  | Apr 18                         |
Homework 3                                  | Classification in Spark            | Apr 18 | Apr 27                         |
Homework 4                                  | Data Analysis with PCA in Spark    | Apr 27 | May 4                          |
Open-ended problem (NO LATE DAYS!)          |                                    | May 4  | May 6                          |
Final Exam                                  |                                    |        | May 8, 08:30am, HBH 1002, 1005 |


Homework should be turned in at the beginning of class on the day it is due. If you are taking late day(s), please email your homework to the TA and also submit a hard copy at the next class. Note the number of late days you used at the top of the first page of your homework.

We ask that you submit all code used to complete the assignment electronically via Blackboard only (no printouts).


There will be a final exam. Note: The exam will be open book, notes, papers, etc., but you are not allowed to use a computer. The tentative dates are posted above; the finalized dates will be announced during the semester.


Your class project is an opportunity for you to explore a machine learning problem in the context of a real-world data set using big data analysis tools.

For the project, we will provide you with a large dataset as well as a list of machine learning problems possible on the provided data. Your task will be to choose one of those ML problems, or define your own, on the provided dataset and address the problem of your choice with the big data analysis tools you learned during the course as well as others you explore based on the APIs.

By design, the project is open-ended; you are free to decide how you want to approach the problem and what tools you want to employ. We want to see a best-effort solution that utilizes what you learned in class and potentially tries new things beyond it.

Important things to note:

  • You have to use the data we have provided you. You cannot choose your own dataset.
  • You will be given 48 hours to work on the project. Late days may not be used for this submission.
  • The project is to be done individually. No collaboration is allowed. Students who use each other's ideas or code will be heavily penalized.

Your project will be worth 20% of your final class grade.

Project Writeup:

Course staff will use the following rubric when grading your final project.
  • Introduction/Motivation/Problem Definition (10%)
    • Identify, define, and motivate the problem that you are addressing.
    • How (precisely) will a machine learning solution address the problem?

  • Data Understanding and Preparation (15%)
    • What preliminary analyses have you performed on the data? What observations have you made? How did those observations help shape your approach?
    • Provide the preliminary data analysis results and your observations.
    • Specify how the data will be transformed to the format required for machine learning.

  • Methodology (35%)
    This is where you give a detailed description of your primary contributions. It is especially important that this part be clear and well written so that we can fully understand what you did.
    • Specify the type of model(s) built and/or information/knowledge extracted.
    • Discuss choices for machine learning algorithm: what are other alternatives, and what are their pros and cons (in the context of the problem and as compared to your proposed solution)?
    • Discuss why and how this model should "solve" the problem (i.e., improve along some dimension of interest).
    • Outline the big data analysis tools and libraries you have used.

    How well your method performs matters less than (a) how thorough and careful your methodology is, and (b) how interesting and clever the approaches you took and the tools you used are.

  • Evaluation and Results (30%)
    We are interested in seeing a clear and conclusive set of experiments which successfully evaluate the problem you set out to solve. Make sure to interpret the results and talk about what we can conclude and learn from your approach.
    • How do you evaluate your machine learning solution to the specific question(s) you have addressed?
    • What do these results tell you about your solution?
    • Present and discuss your evaluation results and findings. You may use tables or figures (e.g., an ROC plot) to visualize your results.

  • Style and writing (10%)
    Overall writing, organization, figures and illustrations.

Please follow the instructions in the IPython notebook that will be handed out to you. You will fill in your answers in the IPython notebook, upload a zipped folder containing the IPython notebook and its HTML output to Blackboard, and submit a printed hard copy of the HTML output to the TA.
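
For the zipped-folder step, one possible way to package a submission is sketched below in Python. The function name and the folder and archive names are hypothetical, and the sketch assumes the HTML output has already been saved into the same folder as the notebook (for example with `jupyter nbconvert --to html`, which is one common way to export it). If the handed-out notebook gives different packaging instructions, those take precedence.

```python
import shutil
from pathlib import Path


def zip_submission(folder, archive_stem):
    """Zip `folder` into `<archive_stem>.zip` and return the path of the archive.

    The folder is expected to hold the notebook (.ipynb) and its HTML export.
    """
    folder = Path(folder)
    notebooks = list(folder.glob("*.ipynb"))
    html_files = list(folder.glob("*.html"))
    # Refuse to build an incomplete archive.
    if not notebooks or not html_files:
        raise FileNotFoundError(
            "expected at least one .ipynb and one .html file in the folder"
        )
    # base_dir keeps the folder itself as the top-level entry in the archive.
    return shutil.make_archive(
        str(archive_stem), "zip", root_dir=folder.parent, base_dir=folder.name
    )
```

A call such as `zip_submission("project", "project_submission")` would write `project_submission.zip`, with the `project` folder as its top-level entry.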