Carnegie Mellon University
Big Data and Large Scale Computing
95-869 - Spring Mini-4 2017


Tentative Syllabus


Date    Lectures and Readings    Out / Due


3/21

 

Review

(Recitation) Lecture 0: Set up

  • Installation of Hadoop and Spark on your local machine
  • Setting up AWS clusters

Please take this Python mini-quiz before the course, and take this Python mini-course if you need to learn Python or refresh your knowledge.

   

3/21

 

Lecture 1: Introduction

  • Big Data applications
  • Technologies for handling big data
  • Apache Hadoop and Spark overview


3/23

3/28

Lecture 2: Hadoop Fundamentals

  • Hadoop architecture
  • HDFS and the MapReduce paradigm
  • Hadoop ecosystem: Mahout, Pig, Hive, HBase, Spark



HW1 out


3/28

3/30

Lecture 3: Introduction to Apache Spark

  • Big data and hardware trends
  • History of Apache Spark
  • Spark's Resilient Distributed Datasets (RDDs)
  • Transformations and actions

   

4/4

Lecture 4: Machine Learning Overview

  • Basic machine learning concepts
  • Steps of typical supervised learning pipelines
  • Linear algebra review
  • Computational complexity / Big O notation review

   

   

4/6


4/11

Lecture 5: Linear Regression and Distributed ML Principles

  • Linear regression
    • formulation and closed-form solution
    • gradient descent
    • grid search
  • Distributed machine learning principles
    • computation, storage, and communication
HW1 due    HW2 out





4/13


4/18

Lecture 6: Logistic Regression and Click-through Rate Prediction

  • Online advertising
  • Linear classification
  • Logistic regression
    • working with probabilistic predictions
    • categorical data and one-hot encoding
    • feature hashing for dimensionality reduction

   

   

HW2 due    HW3 out

4/20

No classes; Spring Carnival

   

4/18

4/25

Lecture 7: Principal Component Analysis and Neuroimaging

  • Exploratory data analysis
  • Principal Component Analysis (PCA)
  • Formulations and solution
  • Distributed PCA

 

4/27

 

Lecture 8: Big Data ML with MLlib

  • k-means Clustering
  • Decision Trees and Random Forests
  • Recommenders
HW3 due    HW4 out



5/2

Lecture 9: Introduction to SparkSQL

  • Working with tables in Spark
  • Higher-level declarative programming

   

5/4

Lecture 10: Analyzing Networks with GraphX

  • Understanding network structure
  • Computing graph statistics
HW4 due    Project out

TBD

Final Exam



Last modified by Leman Akoglu, Mar 2017