Lectures:
Note that the current plan is for Section B4 to be recorded.
Recitations (shared across sections): Fridays 5pm-6:20pm, HBH A301
Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol
Teaching assistants:
Office hours (starting second week of class): Check the course Canvas homepage for the office hour times and locations.
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:
We will be coding lots of Python and dabbling a bit with GPU computing (Google Colab).
Prerequisite: If you are a Heinz student, then you must have taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.
Grading: Homework (30%), Quiz 1 (35%), Quiz 2 (35%*)
*Students with the most instructor-endorsed posts on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their Quiz 2 score (a maximum of 10 bonus points, so that it is possible to get 110 out of 100 points on Quiz 2).
Letter grades are determined based on a curve.
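As a rough illustration of how the weights and the Piazza bonus combine, here is a minimal sketch. It assumes each component is scored out of 100 and that the raw course score is a simple weighted sum of the three components; the function name and the example scores are hypothetical, and the final letter grade still depends on the curve.

```python
def weighted_course_score(homework, quiz1, quiz2, piazza_bonus=0.0):
    """Combine component scores (each assumed to be out of 100) using the stated weights."""
    # The bonus is capped at 10 points and added directly to the Quiz 2 score,
    # so Quiz 2 can reach 110 out of 100.
    bonus = min(max(piazza_bonus, 0.0), 10.0)
    quiz2_with_bonus = quiz2 + bonus
    # Weights: Homework 30%, Quiz 1 35%, Quiz 2 35%.
    return 0.30 * homework + 0.35 * quiz1 + 0.35 * quiz2_with_bonus


# Hypothetical scores, for illustration only.
score = weighted_course_score(homework=90, quiz1=80, quiz2=100, piazza_bonus=10)
print(f"Weighted score before curving: {score:.1f}")
```

Under these assumptions, the bonus only affects the Quiz 2 term, so at most it adds 0.35 × 10 = 3.5 points to the weighted score.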
Previous version of course (including lecture slides and demos): 95-865 Fall 2025 mini 2
| Date | Topic | Supplemental Materials |
|---|---|---|
| Part I. Exploratory data analysis | ||
| Week 1 | ||
| Mon Mar 9 |
Lecture 1: Course overview
[slides] |
|
| Wed Mar 11 |
Lecture 2: Basic text analysis (requires Anaconda Python 3 & spaCy)
[lecture slides] [slides on how to install Anaconda Python 3 and spaCy (needed for HW1 and lecture demos)] Note: Anaconda Python 3 includes support for Jupyter notebooks, which we use extensively in this class [Jupyter notebook (basic text analysis)] |
|
| Thur Mar 12 | HW1 released (check Canvas) | |
| Fri Mar 13 |
Recitation slot: Lecture 3 — Basic text analysis (cont'd), co-occurrence analysis
[slides] [Jupyter notebook (basic text analysis using arrays)] [Jupyter notebook (co-occurrence analysis toy example)] |
|
| Week 2 | ||
| Mon Mar 16 |
Lecture 4: Co-occurrence analysis (cont'd), visualizing high-dimensional data with PCA
[slides] [Jupyter notebook (text generation using n-grams)] [Jupyter notebook (PCA)] |
Additional reading (technical):
[Abdi and Williams's PCA review] Supplemental videos: [StatQuest: PCA main ideas in only 5 minutes!!!] [StatQuest: Principal Component Analysis (PCA) Step-by-Step (note that this is a more technical introduction than mine using SVD/eigenvalues)] [StatQuest: PCA - Practical Tips] [StatQuest: PCA in Python (note that this video is more Pandas-focused whereas 95-865 is taught in a manner that is more numpy-focused to better prep for working with PyTorch later)] |
| Wed Mar 18 |
Lecture 5: PCA (cont'd), manifold learning
[slides] [Jupyter notebook (PCA)] [Jupyter notebook (manifold learning)] |
Additional reading (technical):
[The original Isomap paper (Tenenbaum et al 2000)] Python examples for manifold learning: [scikit-learn example (Isomap, t-SNE, and many other methods)] |
| Fri Mar 20 |
Recitation: More on dimensionality reduction
[slides (how to save a Jupyter notebook as PDF)] [Jupyter notebook (more on PCA, argsort)] [Jupyter notebook (analyzing the 20 Newsgroups dataset)] |
|
| Week 3 | ||
| Mon Mar 23 |
HW1 due 11:59pm
Lecture 6: Manifold learning (cont'd) [slides] [continuation of demo from last Wednesday's lecture: Jupyter notebook (manifold learning)] [required reading: "How to Use t-SNE Effectively" (Wattenberg et al 2016)] [Jupyter notebook (PCA, t-SNE, and UMAP with images)***] ***For the demo with images to work, you will need to install some packages: pip install torch torchvision
|
Supplemental videos (warning: StatQuest focuses on highlighting datasets with clustering structure for t-SNE and UMAP but these manifold learning methods can also work with datasets that don't have clustering structure such as the Swiss roll):
[StatQuest: t-SNE, clearly explained] [StatQuest: UMAP Dimension Reduction, Main Ideas!!!] Additional reading (technical): [some technical slides on t-SNE by George for 95-865] [Simon Carbonnelle's much more technical t-SNE slides] [t-SNE webpage] [Coenen and Pearce's "Understanding UMAP"] [Adele Jackson's "The mathematics of UMAP" article] [The original UMAP paper by McInnes et al (2018)] |
| Tue Mar 24 | Quiz 1 review session 8pm-9pm over Zoom, run by your TA Yidi (check Canvas for Zoom link) | |
| Wed Mar 25 |
Lecture 7: Clustering
[slides] [Jupyter notebook (preprocessing 20 Newsgroups dataset)] [Jupyter notebook (clustering 20 Newsgroups dataset)] |
Additional reading on clustering (technical):
[see Section 14.3 of the book "Elements of Statistical Learning"] Supplemental video: [StatQuest: K-means clustering (warning: the elbow method is specific to using total variation (i.e., residual sum of squares) as a score function; the elbow method is not always the approach you should use with other score functions)] |
| Fri Mar 27 | Quiz 1 (80-minute exam): covers material up to and including Monday Mar 23's lecture (Lecture 6) | |
| Week 4 | ||
| Mon Mar 30 |
Lecture 8: Clustering (cont'd)
[slides] [we resume the demo from last time: Jupyter notebook (clustering 20 Newsgroups dataset)] [Jupyter notebook (clustering metrics on toy synthetic datasets)] |
|
| Wed Apr 1 |
Lecture 9: Clustering (cont'd), TF-IDF representation, topic modeling
[slides] [Jupyter notebook (DBSCAN and HDBSCAN on toy synthetic dataset)] [Jupyter notebook (clustering on text revisited using TF-IDF, normalizing using Euclidean norm)] |
Supplemental reading:
["Mastering HDBSCAN: Clustering Variable Density Data Made Easy"] ["How HDBSCAN Works"] [David Blei's general intro to topic modeling] [Maria Antoniak's practical guide for using LDA] |
| Fri Apr 3 |
Recitation: More on clustering
[Jupyter notebook (clustering 20 Newsgroups dataset, extended version of original demo with clustering metrics and DBSCAN)] [Jupyter notebook (clustering on images)] |
|
| Week 5 | ||
| Mon Apr 6 |
Lecture 10: Wrap up topic modeling
[slides] [Jupyter notebook (topic modeling with LDA)] [Jupyter notebook (LDA: choosing the number of topics)] |
|
| Part II. Predictive data analysis | ||
| Wed Apr 8 |
Lecture 11: Introduction to prediction, neural nets, and deep learning
[slides] [Jupyter notebook (prediction and model validation)] |
Additional reading on basic neural networks:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning] Video introduction on neural nets: ["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown] StatQuest series of videos on neural nets and deep learning: [YouTube playlist (note: there are a lot of videos in this playlist, some of which go into more detail than you're expected to know for 95-865; make sure that you understand concepts at the level of how they are presented in 95-865 lectures/recitations)] |
| Fri Apr 10 |
No class (CMU Spring Carnival)
🎪 | |
| Week 6 | ||
| Mon Apr 13 |
HW2 due 11:59pm
Lecture 12: Neural net basics [slides] For the neural net demo below to work, you will need to install some packages: pip install torch torchvision torchinfo
[Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to use the "Download ZIP" link to download all the files (especially so that you also download UDA_pytorch_utils.py))] |
PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
[PyTorch tutorial] |
| Wed Apr 15 |
Lecture 13: Wrap up neural net basics; image analysis with convolutional neural nets (CNNs/convnets); start time series analysis with recurrent neural nets (RNNs)
[slides] [we resume the demo from last time: Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)] |
Supplemental reading and video for convolutional neural networks (CNNs): [Stanford CS231n Convolutional Neural Networks for Visual Recognition] [(technical) Richard Zhang's fix for max pooling] In the StatQuest YouTube playlist (from the previous lecture's supplemental materials), there's a video in the playlist on CNNs and also a video on RNNs |
| Fri Apr 17 |
Recitation: More on prediction and PyTorch
[slides] [note: parameter counting for CNNs uses the same demo as in lectures 12 and 13] [Jupyter notebook (more on prediction evaluation)] |
|
| Week 7 | ||
| Mon Apr 20 |
Lecture 14: Wrap up RNNs; glimpse of word embeddings; start coverage on text generation
[slides] For the neural net demos below to work, you will need to install the Hugging Face transformers package (in addition to the packages needed to run previous neural net demos): pip install transformers
[required reading: Jupyter notebook (quick intro on how to use BERT/BERT-Tiny)] [required reading: Jupyter notebook (sentiment analysis with IMDb reviews version 2 (uses BERT-Tiny and LSTM); requires UDA_pytorch_utils.py from the previous lecture's demo)] [required reading: Jupyter notebook (sentiment analysis with IMDb reviews version 1 (learns a static word embedding (no BERT-Tiny) and uses a vanilla ReLU RNN (not an LSTM); even though the resulting model does not work as well as the one in the previous demo, this notebook can be helpful to better understand what's going on with a much simpler model))] |
BERT word embeddings (technical):
[A tutorial on BERT word embeddings] Extra notebooks: [Jupyter notebook (slight variant on the sentiment analysis RNN demo where the BERT-Tiny model is treated as frozen, so that model training only learns parameters for the LSTM and Linear layers)] |
| Wed Apr 22 |
Lecture 15: Text generation with generative pretrained transformers; course wrap-up
[slides] [Jupyter notebook (text generation with neural nets)] |
Additional reading/videos:
[Andrej Karpathy's "Neural Networks: Zero to Hero" lecture series (including a more detailed GPT lecture)] Software for explaining neural nets: [Captum] Some articles on being careful with explanation methods (technical): ["The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective" (Krishna et al 2022)] ["Do Feature Attribution Methods Correctly Attribute Features?" (Zhou et al 2022)] ["The false hope of current approaches to explainable artificial intelligence in health care" (Ghassemi et al 2021)] |
| Fri Apr 24 |
Recitation: How to use BERT word embeddings from Hugging Face, details on the GPT lecture demo, an example of a neural topic model (BERTopic), how training a neural net roughly works
[Jupyter notebook (quick intro on how to use BERT/BERT-Tiny - also listed as required reading under Lecture 14)] [Jupyter notebook (text generation with neural nets - also listed as the demo corresponding to Lecture 15)] [Jupyter notebook (BERTopic)] [slides on how to train a deep net] |
|
| Final exam week | ||
| Mon Apr 27 | HW3 due 11:59pm | |
| Wed Apr 29, 5:30pm-6:50pm HBH A301 |
Quiz 2 (80-minute exam)
Quiz 2 focuses on material from Wed Mar 25's lecture (Lecture 7) onwards. Because of how the course is set up, material from Lecture 7 onwards at times relates to material from Lectures 1–6, so some ideas from these earlier lectures could still show up on Quiz 2; please focus your studying on material from Lecture 7 onwards. |
|