Class time and location:
Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol
Teaching assistant: Shurui Cao (shuruic ♣ andrew.cmu.edu)
Office hours (starting second week of class): Check the course Canvas homepage for the office hour times and locations.
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often called "unstructured". This course takes a practical, two-step approach to unstructured data analysis:
Prerequisite: If you are a Heinz student, then you must have already completed 90-803 "Machine Learning Foundations with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python and machine learning courses you have taken (or relevant experience).
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.
Grading:
Letter grades are determined based on a curve.
| Date | Topic | Supplemental Materials |
|---|---|---|
| Part I. Exploratory data analysis | ||
| Week 1 | ||
| Tue Mar 10 |
Lecture 1: Course overview
[slides] |
|
| Thur Mar 12 |
Lecture 2: Basic text analysis (requires Anaconda Python 3 & spaCy)
[slides] [slides on how to install Anaconda Python 3 and spaCy (needed for HW1 and lecture demos)] [Jupyter notebook (basic text analysis)] HW1 released (check Canvas) |
|
| Fri Mar 13 |
Recitation slot: Lecture 3 — Basic text analysis (cont'd), co-occurrence analysis
[slides] [Jupyter notebook (basic text analysis using arrays)] [Jupyter notebook (co-occurrence analysis toy example)] |
|
| Week 2 | ||
| Tue Mar 17 |
Lecture 4: Co-occurrence analysis (cont'd), visualizing high-dimensional data with PCA
[slides] [Jupyter notebook (text generation using n-grams)] |
Additional reading (technical):
[Abdi and Williams's PCA review] Supplemental videos: [StatQuest: PCA main ideas in only 5 minutes!!!] [StatQuest: Principal Component Analysis (PCA) Step-by-Step (note that this is a more technical introduction than mine using SVD/eigenvalues)] [StatQuest: PCA - Practical Tips] [StatQuest: PCA in Python (note that this video is more Pandas-focused whereas 94-775 is taught in a manner that is more numpy-focused to better prep for working with PyTorch later)] |
| Thur Mar 19 |
Lecture 5: PCA (cont'd), manifold learning (Isomap, MDS)
[slides] [Jupyter notebook (PCA)] [Jupyter notebook (manifold learning)] |
Additional reading (technical):
[The original Isomap paper (Tenenbaum et al 2000)] Python examples for manifold learning: [scikit-learn example (Isomap, t-SNE, and many other methods)] |
| Fri Mar 20 |
Recitation: More on dimensionality reduction
[slides (how to save a Jupyter notebook as PDF)] [Jupyter notebook (more on PCA, argsort)] [Jupyter notebook (analyzing the 20 Newsgroups dataset)] |
|
| Week 3 | ||
| Tue Mar 24 |
HW1 due 11:59pm
Lecture 6: Manifold learning (cont'd) [slides] [continuation of demo from last Thursday's lecture: Jupyter notebook (manifold learning)] [required reading: "How to Use t-SNE Effectively" (Wattenberg et al 2016)] [Jupyter notebook (PCA, t-SNE, and UMAP with images)***] ***For the demo with images to work, you will need to install some packages: pip install torch torchvision
|
Supplemental videos (warning: StatQuest focuses on highlighting datasets with clustering structure for t-SNE and UMAP but these manifold learning methods can also work with datasets that don't have clustering structure such as the Swiss roll):
[StatQuest: t-SNE, clearly explained] [StatQuest: UMAP Dimension Reduction, Main Ideas!!!] Additional reading (technical): [some technical slides on t-SNE by George for 94-775] [Simon Carbonnelle's much more technical t-SNE slides] [t-SNE webpage] [Coenen and Pearce's "Understanding UMAP"] [Adele Jackson's "The mathematics of UMAP" article] [The original UMAP paper by McInnes et al (2018)] |
| Wed Mar 25 | Quiz 1 review session 7pm-8pm over Zoom, run by your TA Shurui (check Canvas for Zoom link) | |
| Thur Mar 26 |
Lecture 7: Clustering
[slides] [Jupyter notebook (preprocessing 20 Newsgroups dataset)] [Jupyter notebook (clustering 20 Newsgroups dataset)] |
Additional reading on clustering (technical):
[see Section 14.3 of the book "Elements of Statistical Learning"] Supplemental video: [StatQuest: K-means clustering (warning: the elbow method is specific to using total variation (i.e., residual sum of squares) as a score function; the elbow method is not always the approach you should use with other score functions)] |
| Fri Mar 27 | Recitation slot: Quiz 1 — material coverage: everything up to and including Fri Mar 20 (i.e., weeks 1-2) | |
| Week 4 | ||
| Tue Mar 31 |
Lecture 8: Clustering (cont'd)
[slides] [we resume the demo from last time: Jupyter notebook (clustering 20 Newsgroups dataset)] [Jupyter notebook (clustering metrics on toy synthetic datasets)] |
|
| Thur Apr 2 |
Lecture 9: Clustering (cont'd); topic modeling
[slides] [Jupyter notebook (DBSCAN and HDBSCAN on toy synthetic dataset)] [Jupyter notebook (clustering on text revisited using TF-IDF, normalizing using Euclidean norm)] |
Supplemental reading:
["Mastering HDBSCAN: Clustering Variable Density Data Made Easy"] ["How HDBSCAN Works"] [David Blei's general intro to topic modeling] [Maria Antoniak's practical guide for using LDA] |
| Fri Apr 3 |
Final project proposals due 11:59pm (1 email per group)
Recitation: More on clustering [Jupyter notebook (clustering 20 Newsgroups dataset, extended version of original demo with clustering metrics and DBSCAN)] [Jupyter notebook (clustering on images)] |
|
| Week 5 | ||
| Tue Apr 7 |
Lecture 10: Topic modeling (cont'd)
[slides] [Jupyter notebook (topic modeling with LDA)] [Jupyter notebook (LDA: choosing the number of topics)] |
|
| Thur Apr 9 & Fri Apr 10 |
No class (CMU Spring Carnival)
🎪 |
|
| Part II. Predictive data analysis | ||
| Week 6 | ||
| Tue Apr 14 |
HW2 due 11:59pm
Lecture 11: Review of basic prediction concepts; intro to neural nets & deep learning [slides] [Jupyter notebook (prediction and model validation)] |
Additional reading on basic neural networks:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning] Video introduction on neural nets: ["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown] StatQuest series of videos on neural nets and deep learning: [YouTube playlist (note: there are a lot of videos in this playlist, some of which go into more detail than you're expected to know for 94-775; make sure that you understand concepts at the level of how they are presented in 94-775 lectures/recitations)] |
| Thur Apr 16 |
Lecture 12: Neural net basics
[slides] For the neural net demo below to work, you will need to install some packages: pip install torch torchvision torchinfo
[Jupyter notebook (handwritten digit recognition with neural nets; be sure to use the "Download ZIP" link so that you download all the files, including UDA_pytorch_utils.py)] |
PyTorch tutorial (it suffices to go over the first page or so of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU; later parts of the tutorial get more technical):
[PyTorch tutorial] |
| Fri Apr 17 | Recitation slot: Quiz 2 — material coverage: Tue Mar 24 up to Tue Apr 7 (i.e., weeks 3-5) | |
| Week 7 | ||
| Tue Apr 21 |
Lecture 13: Wrap up neural net basics; image analysis with convolutional neural nets (CNNs); start coverage of time series analysis
[slides] [we resume the demo from last time: Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)] |
Supplemental reading and video for convolutional neural networks (CNNs):
[Stanford CS231n Convolutional Neural Networks for Visual Recognition] [(technical) Richard Zhang's fix for max pooling] The StatQuest YouTube playlist (from the previous lecture's supplemental materials) also includes a video on CNNs |
| Thur Apr 23 |
Lecture 14: Generative pre-trained transformers (GPTs), a few other deep learning topics, course wrap-up
[slides] |
Additional reading/videos:
[Andrej Karpathy's "Neural Networks: Zero to Hero" lecture series (including a more detailed GPT lecture)] Software for explaining neural nets: [Captum] Some articles on being careful with explanation methods (technical): ["The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective" (Krishna et al 2022)] ["Do Feature Attribution Methods Correctly Attribute Features?" (Zhou et al 2022)] ["The false hope of current approaches to explainable artificial intelligence in health care" (Ghassemi et al 2021)] |
| Fri Apr 24 |
Recitation slot: How to use BERT embeddings, how to use a neural topic model (BERTopic), and how to pre-train a GPT
[Jupyter notebook (how to use a BERT word embedding model from Hugging Face)] [Jupyter notebook (BERTopic and sentiment analysis)] [slides (to help understand GPT demo)] [Jupyter notebook (text generation with a GPT)] |
A reading and an extra demo (technical):
[A tutorial on BERT word embeddings] [Jupyter notebook (sentiment analysis with IMDb reviews using BERT-Tiny)] |
| Final exam week | ||
| Mon Apr 27 |
8:30am-11:30am HBH 1202 (note that we might not use this entire time window but we will start at 8:30am): final group project presentations
Final project slide decks + Jupyter notebooks due 11:59pm (1 email per group) |
|