Week

Lectures and Readings

Out / Due


Review
Please take some of these Python mini-quizzes before the course, and take this Python mini-course if you need to refresh your Python knowledge.


Week 1

Lecture 1: Introduction
 Big Data applications
 Technologies for handling big data
 Apache Hadoop and Spark overview
 

Week 2

Lecture 2: Hadoop Fundamentals
 Hadoop architecture
 HDFS and the MapReduce paradigm
 Hadoop ecosystem: Mahout, Pig, Hive, HBase, Spark

HW0 out 

Lecture 3: Introduction to Apache Spark
 Big data and hardware trends
 History of Apache Spark
 Spark's Resilient Distributed Datasets (RDDs)
 Transformations and actions

HW1 out

Week 3

Lecture 4: Machine Learning Overview
 Basic machine learning concepts
 Steps of typical supervised learning pipelines
 Linear algebra review
 Computational complexity / Big O notation review

Week 4

Lecture 5: Linear Regression and Distributed ML Principles
 Linear regression
 formulation and closed-form solution
 gradient descent
 grid search
 Distributed machine learning principles
 computation, storage, and communication

HW1 due, HW2 out


Week 5

Lecture 6: Logistic Regression and Click-through Rate Prediction
 Online advertising
 Linear classification
 Logistic regression
 working with probabilistic predictions
 categorical data and one-hot encoding
 feature hashing for dimensionality reduction

HW2 due, HW3 out

Week 6

Lecture 7: Principal Component Analysis and Neuroimaging
 Exploratory data analysis
 Principal Component Analysis (PCA)
 Formulations and solution
 Distributed PCA

HW3 due, HW4 out


Week 7

Lecture 8: Big Data ML with MLlib
 k-means clustering
 Decision Trees and Random Forests
 Recommenders

HW4 due, HW5 out

Lecture 9: Introduction to SparkSQL
 Working with tables in Spark
 Higher-level declarative programming


Bonus Lecture

Lecture 10: Analyzing Networks with GraphX
 Understanding network structure
 Computing graph statistics

HW5 due 
See here

Final Exam
