Class time and location:
Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol
Teaching assistant: Johnna Sundberg (jsundber ♣ andrew.cmu.edu)
Office hours (starting second week of class): Check the course Canvas homepage for the office hour times and locations.
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:
Prerequisite: If you are a Heinz student, then you must have already completed 90-803 "Machine Learning Foundations with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state which Python and machine learning courses you have taken (or what relevant experience you have).
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.
Grading:
Letter grades are determined based on a curve.
Date | Topic | Supplemental Materials |
---|---|---|
Part I. Exploratory data analysis | ||
Week 1 | ||
Tue Mar 11 | Lecture 1: Course overview [slides] Please install Anaconda Python 3 and spaCy by following this tutorial (needed for HW1 and the demo next lecture): [slides] Note: Anaconda Python 3 includes support for Jupyter notebooks, which we use extensively in this class | |
Wed Mar 12 | HW1 released | |
Thur Mar 13 | Lecture 2: Basic text analysis (requires Anaconda Python 3 & spaCy) [slides] [Jupyter notebook (basic text analysis)] | |
Fri Mar 14 | Recitation slot: Lecture 3 - Basic text analysis (cont'd), co-occurrence analysis [slides] [Jupyter notebook (basic text analysis using arrays)] [Jupyter notebook (co-occurrence analysis toy example)] | As we saw in class, PMI is defined in terms of log probabilities. Here's additional reading that provides some intuition on log probabilities (technical): [Section 1.2 of lecture notes from CMU 10-704 "Information Processing and Learning" Lecture 1 (Fall 2016) discusses the "information content" of random outcomes, which is defined in terms of log probabilities] |
Week 2 | ||
Tue Mar 18 | Lecture 4: Co-occurrence analysis (cont'd), visualizing high-dimensional data with PCA [slides] [Jupyter notebook (text generation using n-grams)] | Additional reading (technical): [Abdi and Williams's PCA review] Supplemental videos: [StatQuest: PCA main ideas in only 5 minutes!!!] [StatQuest: Principal Component Analysis (PCA) Step-by-Step (note that this is a more technical introduction than mine using SVD/eigenvalues)] [StatQuest: PCA - Practical Tips] [StatQuest: PCA in Python (note that this video is more Pandas-focused whereas 94-775 is taught in a manner that is more numpy-focused to better prep for working with PyTorch later)] |
Thur Mar 20 | Lecture 5: PCA (cont'd), manifold learning (Isomap, MDS) [slides] [Jupyter notebook (PCA)] [Jupyter notebook (manifold learning)] | Additional reading (technical): [The original Isomap paper (Tenenbaum et al. 2000)] |
Fri Mar 21 | Recitation slot: More on dimensionality reduction | |
Week 3 | ||
Tue Mar 25 | HW1 due 11:59pm; Lecture 6: Manifold learning, intro to clustering | |
Thur Mar 27 | Lecture 7: Clustering | |
Fri Mar 28 | Recitation slot: Quiz 1 — material coverage: everything up to and including Fri Mar 21 (i.e., weeks 1-2) | |
Week 4 | ||
Tue Apr 1 | Final project proposals due 11:59pm (1 email per group); Lecture 8: Clustering (cont'd) | |
Thur Apr 3 & Fri Apr 4 | No class (CMU Spring Carnival) 🎪 | |
Week 5 | ||
Tue Apr 8 | Lecture 9: Wrap up clustering, topic modeling | |
Part II. Predictive data analysis | ||
Thur Apr 10 | Lecture 10: Intro to predictive data analysis | |
Fri Apr 11 | Recitation slot: Quiz 2 — material coverage: Tue Mar 25 up to Tue Apr 8 (i.e., weeks 3-4 as well as Lecture 9) | |
Week 6 | ||
Tue Apr 15 | HW2 due 11:59pm; Lecture 11: Intro to neural nets & deep learning | |
Thur Apr 17 | Lecture 12: Image analysis with convolutional neural nets (also called CNNs or convnets) | |
Fri Apr 18 | Recitation slot: TBD | |
Week 7 | ||
Tue Apr 22 | Lecture 13: Text generation with generative pretrained transformers (GPTs) | |
Thur Apr 24 | Lecture 14: Other deep learning topics; course wrap-up | |
Fri Apr 25 | Recitation slot: Final project presentations | |
Final exam week | ||
Mon Apr 28 | Final project slide decks + Jupyter notebooks due 11:59pm by email (1 email per group) | |