Date

Lectures and Readings

Out / Due


Review
Please take this Python mini-quiz before the course. If you need to learn Python or refresh your knowledge, take this Python mini-course.


8/27

Lecture 1: Introduction
 Big Data applications
 Technologies for handling big data
 Apache Hadoop and Spark overview
 

8/27
8/29

Lecture 2: Hadoop Fundamentals
 Hadoop architecture
 HDFS and the MapReduce paradigm
 Hadoop ecosystem: Mahout, Pig, Hive, HBase, Spark

HW0 out 
9/3

Lecture 3: Introduction to Apache Spark
 Big data and hardware trends
 History of Apache Spark
 Spark's Resilient Distributed Datasets (RDDs)
 Transformations and actions

HW1 out 
9/10

Lecture 4: Machine Learning Overview
 Basic machine learning concepts
 Steps of typical supervised learning pipelines
 Linear algebra review
 Computational complexity / Big O notation review

9/12
9/17

Lecture 5: Linear Regression and Distributed ML Principles
 Linear regression
 formulation and closed-form solution
 gradient descent
 grid search
 Distributed machine learning principles
 computation, storage, and communication
HW1 due, HW2 out


9/19
9/24

Lecture 6: Logistic Regression and Click-through Rate Prediction
 Online advertising
 Linear classification
 Logistic regression
 working with probabilistic predictions
 categorical data and one-hot encoding
 feature hashing for dimensionality reduction

HW2 due, HW3 out

9/26
10/1

Lecture 7: Principal Component Analysis and Neuroimaging
 Exploratory data analysis
 Principal Component Analysis (PCA)
 Formulations and solution
 Distributed PCA

HW3 due, HW4 out


10/3

Lecture 8: Big Data ML with MLlib
 k-means Clustering
 Decision Trees and Random Forests
 Recommenders

HW4 due, HW5 out
10/8

Lecture 9: Introduction to Spark SQL
 Working with tables in Spark
 Higher-level declarative programming


10/10

Lecture 10: Analyzing Networks with GraphX
 Understanding network structure
 Computing graph statistics

HW5 due 
See here

Final Exam
