Lectures:
Note that the current plan is for Section A4 and K4 lectures to be recorded.
Recitations:
Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol
Teaching assistants:
Office hours (starting second week of class):
Regardless of which section you are in, you are welcome to attend office hours for any of the course staff. We have scattered the office hours across different times to reach as many of your time zones as possible. I suggest adding all of the times below to your calendar via Google Calendar, using its time zone feature so that each time is automatically converted to your local time (Pittsburgh time is labeled "Eastern Time - New York" and Adelaide time is listed as "Central Australia Time - Adelaide"). All office hours are held remotely over Zoom; the Zoom links are posted in Canvas.
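If you would rather check a conversion yourself, here is a minimal Python sketch using the standard-library `zoneinfo` module. It assumes Python 3.9+ and that your system has a time zone database available. The helper name `pittsburgh_to_adelaide` is made up for illustration, and the year 2021 is an assumption (this page does not state the year). The example converts the HW1 deadline listed in the schedule below.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# IANA zone names matching the two Google Calendar labels mentioned above
PITTSBURGH = ZoneInfo("America/New_York")   # "Eastern Time - New York"
ADELAIDE = ZoneInfo("Australia/Adelaide")   # "Central Australia Time - Adelaide"


def pittsburgh_to_adelaide(year, month, day, hour, minute):
    """Convert a Pittsburgh wall-clock time to Adelaide local time.

    Hypothetical helper for illustration only; not part of any course code.
    """
    pittsburgh_time = datetime(year, month, day, hour, minute, tzinfo=PITTSBURGH)
    return pittsburgh_time.astimezone(ADELAIDE)


# HW1 deadline: 11:59pm Wed Apr 7 Pittsburgh time (year 2021 is an assumption)
deadline = pittsburgh_to_adelaide(2021, 4, 7, 23, 59)
print(deadline.strftime("%a %b %d, %I:%M%p"))
```

Because `zoneinfo` applies each region's daylight saving rules for the given date, the same helper should work for any date in the schedule.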
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often referred to as "unstructured". This course takes a practical approach to unstructured data analysis via a two-step approach:
We will be coding lots of Python and dabbling a bit with GPU computing (Google Colab).
Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.
Grading: Homework 20%, quiz 1 40%, quiz 2 40%*
*Students with the most instructor-endorsed answers on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their quiz 2 score (a maximum of 5 bonus points; quiz 2 is out of 100 points prior to any bonus points being added).
Syllabus (updated 3/22 11:24pm Pittsburgh time): [pdf]
Previous version of course (including lecture slides and demos): 95-865 Fall 2020 mini 2
Date  Topic  Supplemental Material 

Part I. Exploratory data analysis  
Mon Mar 22 
Lecture 1: Course overview


Wed Mar 24 
Lecture 2: Basic text analysis, co-occurrence analysis
For the basic text analysis demo to work, please
install Anaconda Python 3, Jupyter, and spaCy first


Fri Mar 26 
Recitation: Basic Python review


Mon Mar 29 
Lecture 3: Finding possibly related entities

What is the maximum phi-squared/chi-squared value? (technical)

Wed Mar 31 
Lecture 4: Visualizing high-dimensional data with PCA

Causality additional reading:
PCA additional reading (technical): 
Fri Apr 2 
Recitation slot — Lecture 5: Manifold learning with Isomap

Python examples for dimensionality reduction: 
Mon Apr 5 
No class (CMU break day) 

Wed Apr 7 
Lecture 6: Wrap up manifold learning (t-SNE), a first look at analyzing images, and an introduction to clustering phenomena
HW1 due 11:59pm Pittsburgh time 
See supplementary materials from the previous lecture; in addition, here's some reading for t-SNE (technical):

Fri Apr 9 
Recitation: More on PCA, practice with argsort


Mon Apr 12 
Lecture 7: Distance and similarity functions, clustering (k-means, GMMs)

Clustering additional reading (technical): 
Wed Apr 14 
Lecture 8: More on clustering (interpreting clustering results, automatically choosing the number of clusters for GMM-related models)

Reading on DP-means, DP mixture models for which a DP-GMM is a special case (technical):

Fri Apr 16  No class (CMU break day)  
Mon Apr 19 
Lecture 9: Topic modeling

Topic modeling reading:

Wed Apr 21 
Lecture 10: Wrap up topic modeling; wrap up clustering; a glimpse of predictive data analytics


Thur Apr 22 
HW2 due 11:59pm Pittsburgh time 

Fri Apr 23 
Quiz 1:


Part II. Predictive data analysis  
Mon Apr 26 
Lecture 11: Intro to predictive data analytics

Some nuanced details on cross-validation (technical):

Wed Apr 28 
Lecture 12: Wrap up basic prediction concepts


Fri Apr 30 
Recitation: More practice on model evaluation


Mon May 3 
Lecture 13: Intro to neural nets and deep learning

PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
Additional reading:
Video introduction on neural nets:
Mike Jordan's Medium article on where AI is at (April 2018): 
Wed May 5 
Lecture 14: Image analysis with convolutional neural nets

Additional reading:

Fri May 7 
Recitation slot: Lecture 15 on time series analysis with recurrent neural nets; more deep learning topics; course wrap-up
The demo will not be covered during the recitation slot and is instead covered in the April 30 Section K4 recitation Zoom recording by your TA Erick (check Canvas Zoom recordings)

Additional reading:

Mon May 10 
HW3 due 11:59pm Pittsburgh time 

Thur May 13 
Quiz 2:

Date  Topic  Supplemental Material 

Part I. Exploratory data analysis  
Wed Mar 24 
Lecture 1: Course overview


Fri Mar 26 
Lecture 2: Basic text analysis, co-occurrence analysis
For the basic text analysis demo to work, please
install Anaconda Python 3, Jupyter, and spaCy first
Recitation: Basic Python review


Wed Mar 31 
Lecture 3: Finding possibly related entities

What is the maximum phi-squared/chi-squared value? (technical)

Fri Apr 2 
No class (Good Friday) 

Wed Apr 7 
Lecture 4: Visualizing high-dimensional data with PCA

Causality additional reading:
PCA additional reading (technical): 
Thur Apr 8 
HW1 due 1:29pm Adelaide time (corresponds to 11:59pm Wed Apr 7 Pittsburgh time) 

Fri Apr 9 
Lecture 5: Manifold learning with Isomap
Extended recitation slot (5:30pm-8:30pm Adelaide time): Lectures 6 and 7 on wrapping up manifold learning (t-SNE), a first look at analyzing images, and an introduction to clustering (k-means, GMMs)

Python examples for dimensionality reduction:
t-SNE additional reading (technical):
Clustering additional reading (technical): 
Wed Apr 14 
Lecture 8: More on clustering (interpreting clustering results, automatically choosing the number of clusters for GMM-related models)
Quiz 1 review session (7pm-8:30pm Adelaide time) 
Reading on DP-means, DP mixture models for which a DP-GMM is a special case (technical):

Fri Apr 16 
Lecture 9: Wrap up clustering (density-based clustering with DBSCAN, final remarks); topic modeling
Recitation slot: Quiz 1 (80 minutes to match amount of time that will be given to Pittsburgh students) 
Topic modeling reading:

Part II. Predictive data analysis  
Wed Apr 21 
Lecture 10: Intro to predictive data analytics

Some nuanced details on cross-validation (technical):

Fri Apr 23 
HW2 due 1:29pm Adelaide time (corresponds to 11:59pm Thur Apr 22 Pittsburgh time)
Lecture 11: Wrap up basic prediction concepts; intro to neural nets and deep learning
Recitation slot — Lecture 12: Intro to neural nets and deep learning

PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
Additional reading:
Video introduction on neural nets:
Mike Jordan's Medium article on where AI is at (April 2018): 
Wed Apr 28 
Lecture 13: Image analysis with convolutional neural nets

Additional reading:

Fri Apr 30 
Lecture 14: Time series analysis with recurrent neural nets; some other deep learning topics; course wrap-up
Recitation: sentiment analysis with IMDB reviews; more on word embeddings and fine tuning; some PyTorch code examples

Additional reading:

Final exam period May 3-7 
HW3 due May 6, 11:59pm Adelaide time
Quiz 2: May 7, 10:30am-11:50am Adelaide time 