All times listed are in Pittsburgh time (US Eastern Time)
Lectures, time and location: Currently, the plan is for lectures prior to Thanksgiving break to be held in person and streamed live at the same time (i.e., I teach in a classroom and start a Zoom session). After Thanksgiving, all instruction will be purely remote. Note that the Tue/Thur lectures are recorded and the Mon/Wed lectures are not.
Recitations: Fridays 1:30pm-2:50pm, remote (Zoom)
Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol
Teaching assistants: Xinyu Yao (xinyuyao ♣ andrew.cmu.edu), Xuejian Wang (xuejianw ♣ andrew.cmu.edu)
Office hours:
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often referred to as "unstructured". This course takes a practical two-step approach to unstructured data analysis:
We will be coding lots of Python and dabbling a bit with GPU computing (Google Colab).
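As a rough taste of the kind of Python involved, here is a minimal sketch of counting word frequencies in a piece of text, in the spirit of the first lecture topic (analyzing text using frequencies). It is not taken from the course materials: the function name `word_frequencies`, the sample sentence, and the simple regular-expression tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lowercase word to how often it occurs.

    Hypothetical helper for illustration; the tokenizer is deliberately
    simple (runs of letters and apostrophes) and ignores punctuation.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Made-up sample text, used only to show the function in action.
sample = "The cat sat on the mat. The mat was flat."
freqs = word_frequencies(sample)

# Print the three most frequent words together with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

Real text usually needs more careful preprocessing (for example, handling stop words or different word forms), which this sketch leaves out.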
Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more
Grading: Homework 20%, quiz 1 40%, quiz 2 40%*
*Students with the most instructor-endorsed answers on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their quiz 2 score (a maximum of 5 bonus points; quiz 2 is out of 100 points prior to any bonus points being added).
Syllabus: [pdf]
Previous version of course (including lecture slides and demos): 95-865 Spring 2020 mini 3
Date  Topic  Supplemental Material 

Part I. Exploratory data analysis  
Week 1: Oct 26-30
HW1 was released Oct 26 (check Canvas)
Lecture 1 (Oct 26-27): Course overview, analyzing text using frequencies
Lecture 2 (Oct 28-29): Text analysis demo, co-occurrence analysis
Recitation (Oct 30): Basic Python review


Week 2: Nov 2-6
Lecture 3 (Nov 2-3): Finding possibly related entities
Lecture 4 (Nov 4-5): Visualizing high-dimensional data (PCA)
Recitation (Nov 6): More on PCA, practice with argsort
HW1 due Friday Nov 6, 11:59pm 
What is the maximum phi-squared/chi-squared value? (technical)
Causality additional reading:
PCA additional reading (technical): 
Week 3: Nov 9-13
HW2 released at the start of the week
Lecture 5 (Nov 9-10): Manifold learning (Isomap, t-SNE)
Lecture 6 (Nov 11-12): Wrap up manifold learning, begin clustering (k-means)
Recitation (Nov 13): Quiz 1 review

Python examples for dimensionality reduction:
Some details on t-SNE including code (from a past UDA recitation):
Additional dimensionality reduction reading (technical):
Additional clustering reading (technical):

Week 4: Nov 16-20
Lecture 7 (Nov 16-17): Clustering (k-means, GMMs)
Lecture 8 (Nov 18-19): More clustering (automatically choosing k with CH-index, DP-GMMs, and DP-means)
Friday Nov 20: no recitation, instead Quiz 1 — upon opening the quiz, you have 80 minutes to complete it 
Python cluster evaluation:
DP-means paper (technical):
Hierarchical clustering reading (technical):

Week 5: Nov 23-27
Lecture 9 (Nov 23-24): Topic modeling
Thanksgiving: no class Wednesday through Friday (note that to keep the two sections synced, there is no Wednesday class!) 
Topic modeling reading:

Part II. Predictive data analysis
Week 6: Nov 30-Dec 4
HW2 due Monday Nov 30, 11:59pm
HW3 released early in the week
Instruction becomes purely remote at the start of this week: do not show up to HBH 1204
Lecture 10 (Nov 30-Dec 1): Intro to predictive data analytics (some terminology, kNN classification, model evaluation)
Lecture 11 (Dec 2-3): Wrap up predictive model evaluation, classical classifiers; intro to neural nets and deep learning
Lecture 12 during Dec 4 recitation slot:
Wrap up intro to neural nets and deep learning; image analysis with convolutional neural nets

Some nuanced details on cross-validation (technical):
PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
Additional reading:
Video introduction on neural nets:
Mike Jordan's Medium article on where AI is at (April 2018): 
Week 7: Dec 7-11
Lecture 13 (Dec 7-8): Time series analysis with recurrent neural nets
Lecture 14 (Dec 9-10): More on deep learning and course wrap-up
Recitation: Quiz 2 review

Additional reading:
Some bonus reading (a student asked about image segmentation, and here's an introduction):

Final exam period: Dec 14-20
HW3 due Monday Dec 14, 11:59pm
Friday Dec 18: Quiz 2 (upon opening the quiz, you have 80 minutes to complete it)