Class time and location:
Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol
Teaching assistants: Georgia Fu (qiaoyaf ♣ andrew.cmu.edu), Cyndi Wang (xiangyu4 ♣ andrew.cmu.edu)
Office hours (starting second week of class):
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often referred to as "unstructured". This course takes a practical approach to unstructured data analysis via a two-step approach:
Prerequisite: If you are a Heinz student, then you must have already taken 95-791 "Data Mining" and also either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.
Grading: HW1 8%, HW2 8%, HW3 4%, mid-mini quiz 35%, final project proposal 10%, final project 35%. HW3 is shorter than HW1 and HW2. Letter grades are determined based on a curve.
Syllabus: [pdf]
Previous lecture slides and demos: spring 2019 version of 94-775 (click to see course webpage)
Date  Topic  Supplemental Material 

Part I. Exploratory data analysis  
Tue Jan 14 
Lecture 1: Course overview, basic text processing and frequency analysis
HW1 released (check Canvas)! 

Thur Jan 16 
Lecture 2: Basic text analysis demo, co-occurrence analysis


Fri Jan 17 
Recitation 1: Basic Python review


Tue Jan 21 
Lecture 3: Finding possibly related entities

What is the maximum value of phi-square/chi-square value? (technical)
Causality additional reading: 
Thur Jan 23 
Lecture 4: Visualizing high-dimensional data with PCA and Isomap

Python examples for dimensionality reduction:
Additional dimensionality reduction reading (technical):

Fri Jan 24 
Recitation 2: More on PCA, bookkeeping with np.argsort
HW1 due 11:59pm, HW2 released

Tue Jan 28 
Lecture 5: Manifold learning (Isomap, t-SNE)

See supplementary materials from the previous lecture; in addition, here's some reading for t-SNE (technical):

Thur Jan 30 
Lecture 6: Clustering

Additional clustering reading (technical): 
Fri Jan 31 
Lecture 7: More clustering, topic modeling

Python cluster evaluation:
Additional reading on topic modeling: 
Tue Feb 4 
Quiz review held by TA 

Thur Feb 6  Quiz (same time/place as lecture)  
Fri Feb 7 
No recitation; please schedule time with the TAs to get feedback on final project ideas

Part II. Predictive data analysis
Mon Feb 10 
HW2 and final project proposal due 11:59pm, HW3 released 

Tue Feb 11 
Lecture 8: Wrap up topic modeling, introduction to predictive analytics, nearest neighbors, evaluating prediction methods

Some nuanced details on cross-validation:

Thur Feb 13 
Lecture 9: More on model evaluation (including confusion matrices, ROC curves), decision trees & forests


Fri Feb 14 
Recitation 3: ROC curves 

Tue Feb 18 
Lecture 10: Introduction to neural nets and deep learning
Mike Jordan's Medium article on where AI is at (April 2018): 
PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
Additional reading:
Video introduction on neural nets: 
Thur Feb 20 
Lecture 11: Image analysis with convolutional neural nets

Additional reading:

Fri Feb 21 
Recitation 4: Sentiment analysis, extensions to topic modeling
HW3 due 11:59pm

Tue Feb 25 
Lecture 12: Time series analysis with recurrent neural nets (application: sentiment analysis in IMDb reviews)

LSTM reading: 
Thur Feb 27 
Lecture 13: Other deep learning topics, wrap-up

Additional reading:

Fri Feb 28 
No recitation; please schedule time with the TAs to get feedback on final projects

Tue Mar 3 
Final project presentations in-class

Thur Mar 5 
No class
Final projects due 11:59pm (slide deck + Jupyter notebook)