Class time and location:
Instructor: George Chen (georgechen [at symbol] cmu.edu)
Teaching assistants:
Office hours:
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:
We will be coding lots of Python and working with Amazon Web Services (AWS) for cloud computing (including using GPUs).
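To give a flavor of the Python we'll write, here is a minimal word-frequency sketch in the spirit of Lecture 1's frequency analysis (an illustrative example, not course code):

```python
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())

freqs = word_frequencies("the quick brown fox jumps over the lazy dog the end")
print(freqs.most_common(3))  # most frequent words first
```

Real text processing involves more care (punctuation, stop words, tokenization), which we'll cover in lecture.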
Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken a Heinz introductory Python course (90812, 95880, 95888) or the Applied Data Science course (16791). If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken and what Python experience you have.
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.
Grading: Homework 20%, mid-mini quiz 35%, final exam 45%. If you do better on the final exam than on the mid-mini quiz, then your final exam score clobbers your mid-mini quiz score (i.e., the quiz does not count, and your final exam instead counts for 80% of your grade).
Syllabus: [Pittsburgh] [Adelaide]
Warning: As this course is still relatively new, the lecture slides are a bit rough and may contain bugs. To provide feedback/bug reports, please directly contact the instructor, George (georgechen [at symbol] cmu.edu). The Spring 2018 mini 4 course website is available here.
Date  Topic  Supplemental Material 

Part I. Exploratory data analysis  
Tue Oct 23 
Lecture 1: Course overview, basic text processing and frequency analysis
HW1 released (check Canvas)! 

Thur Oct 25 
Lecture 2: Basic text analysis demo, co-occurrence analysis


Fri Oct 26 
Recitation 1: Basic Python review


Tue Oct 30 
Lecture 3: Finding possibly related entities, PCA, Isomap

Causality additional reading:
PCA additional reading (technical): 
Thur Nov 1 
Lecture 4: t-SNE

Python examples for dimensionality reduction:
Additional dimensionality reduction reading (technical):

Fri Nov 2 
Recitation 2: t-SNE


Tue Nov 6 
Lecture 5: Introduction to clustering, k-means, Gaussian mixture models
HW1 due 4:30pm (start of class), HW2 released! 
Additional clustering reading (technical):
Python cluster evaluation: 
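The CH (Calinski-Harabasz) index used for choosing k compares between-cluster to within-cluster dispersion (larger is better). In practice you would likely use scikit-learn's `calinski_harabasz_score`, but the definition is short enough to sketch in pure Python (illustrative only, not course code):

```python
def ch_index(points, labels):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k)), where B is
    between-cluster dispersion and W is within-cluster dispersion."""
    n = len(points)
    cluster_ids = sorted(set(labels))
    k = len(cluster_ids)
    dim = len(points[0])
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    B = 0.0  # between-cluster dispersion
    W = 0.0  # within-cluster dispersion
    for c in cluster_ids:
        cluster = [p for p, l in zip(points, labels) if l == c]
        centroid = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]
        B += len(cluster) * sum((centroid[d] - overall[d]) ** 2 for d in range(dim))
        W += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in cluster)
    return (B / (k - 1)) / (W / (n - k))

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = ch_index(points, [0, 0, 1, 1])  # well-separated clustering
bad = ch_index(points, [0, 1, 0, 1])   # mixed-up clustering
```

To select k automatically, compute the index for each candidate clustering and keep the k with the highest score.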
Thur Nov 8 
Lecture 6: Clustering and clustering interpretation demo, automatic selection of k with the CH index


Fri Nov 9 
Recitation 3: Quiz review session 

Tue Nov 13  Mid-mini quiz  
Wed Nov 14  George's regular office hours are cancelled for this week  
Thur Nov 15 
Lecture 7: Clustering (wrap-up), topic modeling 
Additional reading: 
Part II. Predictive data analysis  
Fri Nov 16 
Lecture 8 (this is not a typo): Introduction to predictive analytics, nearest neighbors, evaluating prediction methods 

Tue Nov 20 
Lecture 9: Neural nets and deep learning
HW2 due 4:30pm (start of class); HW3 released! 
Mike Jordan's Medium article on where AI is at (April 2018):
The Atlantic's article about Judea Pearl's concerns with AI not focusing on causal reasoning (May 2018): 
Thur Nov 22 
Thanksgiving: no class 

Tue Nov 27 
Lecture 10: Image analysis with CNNs (also called convnets) 
Video introduction on neural nets:
Additional reading: 
Thur Nov 29 
Lecture 11: Time series analysis with RNNs, roughly how learning a deep net works (gradient descent and variants) 
LSTM reading:
Videos on learning neural nets (warning: the loss function used is not the same as what we are using): 
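Since Lecture 11 touches on gradient descent, here is the idea in its simplest form: repeatedly step opposite the gradient until you reach a minimum (a one-variable toy sketch, not the loss function used in the course):

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Repeatedly step in the direction opposite the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2; its gradient is f'(x) = 2 * (x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Training a deep net applies the same update to millions of parameters at once, with the gradient computed by backpropagation; the "variants" mentioned above (e.g., stochastic gradient descent) change how the gradient is estimated and how the step size adapts.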
Fri Nov 30 
Recitation 4
HW3 due 8:29am (end of fall classes for Adelaide) 

Tue Dec 4 
Lecture 12: Interpreting what a deep net is learning, other deep learning topics, wrap-up 
Some interesting reads: 
Thur Dec 6 
No class 

Fri Dec 7 
Recitation 5: Final exam review session 

Fri Dec 14 
Final exam 1pm HBH 1002 
Date  Topic  Supplemental Material 

Part I. Exploratory data analysis  
Tue Oct 23 
Lecture 1: Course overview, basic text processing, co-occurrence analysis (finding possibly related entities with discrete outcomes)
HW1 released (check Canvas)!
Pittsburgh Recitation 1: Basic Python review


Tue Oct 30 
Lecture 2: Finding possibly related entities, PCA, manifold learning (Isomap, t-SNE)
Pittsburgh Recitation 2: t-SNE

Causality additional reading:
Python examples for dimensionality reduction:
PCA additional reading (technical):
Additional dimensionality reduction reading (technical):

Tue Nov 6 
Lecture 3: Introduction to clustering, k-means, Gaussian mixture models, automatic selection of k with the CH index

Additional clustering reading (technical):
Python cluster evaluation: 
Wed Nov 7  HW1 due at 8am (start of Pittsburgh class); HW2 released! 

Tue Nov 13 
Lecture 4: Clustering (wrap-up), topic modeling 
Additional reading: 
Part II. Predictive data analysis  
Tue Nov 20 
Lecture 5: Introduction to predictive analytics, nearest neighbors, evaluating prediction methods, neural nets and deep learning 
Video introduction on neural nets:
Additional reading:
Mike Jordan's Medium article on where AI is at (April 2018):
The Atlantic's article about Judea Pearl's concerns with AI not focusing on causal reasoning (May 2018): 
Wed Nov 21  HW2 due at 8am (start of Pittsburgh class); HW3 released! 

Tue Nov 27 
Lecture 6: Image analysis with CNNs (also called convnets), time series analysis with RNNs, roughly how learning a deep net works (gradient descent and variants) 
LSTM reading:
Videos on learning neural nets (warning: the loss function used is not the same as what we are using):
Some interesting reads: 
Fri Nov 30 
HW3 due 11:59pm 

Tue Dec 4 
Final exam 