94-775: Unstructured Data Analytics for Policy
(Spring 2026 Mini 4; listed as 94-475 for undergrads)

Unstructured Data Analytics

Class time and location:

Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol

Teaching assistant: Shurui Cao (shuruic ♣ andrew.cmu.edu)

Office hours (starting second week of class): Check the course Canvas homepage for the office hour times and locations.

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data is often called "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.

Many examples are given of how these methods help solve real problems faced by organizations. The course includes a final project, which must address a policy question.

Note regarding GenAI (including large language models): As you are likely all aware, there are now technologies like (Chat)GPT, Gemini, Claude, Llama, DeepSeek, etc., which will all keep getting better over time. If you use any of these in your homework, please cite them. For the purposes of the class, I will view these as external collaborators (no different than if you got help from a human friend). For exams, I want to make sure that you actually understand the material and are not just telling me what someone else or an AI assistant knows. This is important so that in the future, if you get help from an AI assistant (or a human) with your unstructured data analysis, you have enough background knowledge to check for yourself whether the solution the AI (or human) gives you is correct. For this reason, exams in this class will explicitly not allow electronics.

Prerequisite: If you are a Heinz student, then you must have already completed 90-803 "Machine Learning Foundations with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python and machine learning courses you have taken (or relevant experience).

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading:

*Students with the most instructor-endorsed posts on Piazza will get a bonus of up to 20 points on Quiz 2 (so that it is possible to get 120 out of 100 points).

Letter grades are determined based on a curve.

Calendar (tentative)

Date Topic Supplemental Materials
Part I. Exploratory data analysis
Week 1
Tue Mar 10 Lecture 1: Course overview
[slides]

Thur Mar 12 Lecture 2: Basic text analysis (requires Anaconda Python 3 & spaCy)
[slides]
[slides on how to install Anaconda Python 3 and spaCy (needed for HW1 and lecture demos)]
[Jupyter notebook (basic text analysis)]

HW1 released (check Canvas)
Fri Mar 13 Recitation slot: Lecture 3 — Basic text analysis (cont'd), co-occurrence analysis
[slides]
[Jupyter notebook (basic text analysis using arrays)]
[Jupyter notebook (co-occurrence analysis toy example)]
Week 2
Tue Mar 17 Lecture 4: Co-occurrence analysis (cont'd), visualizing high-dimensional data with PCA
[slides]
[Jupyter notebook (text generation using n-grams)]
Additional reading (technical):
[Abdi and Williams's PCA review]

Supplemental videos:

[StatQuest: PCA main ideas in only 5 minutes!!!]
[StatQuest: Principal Component Analysis (PCA) Step-by-Step (note that this introduction is more technical than mine and uses SVD/eigenvalues)]
[StatQuest: PCA - Practical Tips]
[StatQuest: PCA in Python (note that this video is more Pandas-focused whereas 94-775 is taught in a manner that is more numpy-focused to better prep for working with PyTorch later)]
Thur Mar 19 Lecture 5: PCA (cont'd), manifold learning (Isomap, MDS)
[slides]
[Jupyter notebook (PCA)]
[Jupyter notebook (manifold learning)]
Additional reading (technical):
[The original Isomap paper (Tenenbaum et al 2000)]

Python examples for manifold learning:

[scikit-learn example (Isomap, t-SNE, and many other methods)]
Fri Mar 20 Recitation: More on dimensionality reduction
[slides (how to save a Jupyter notebook as PDF)]
[Jupyter notebook (more on PCA, argsort)]
[Jupyter notebook (analyzing the 20 Newsgroups dataset)]
Week 3
Tue Mar 24 HW1 due 11:59pm

Lecture 6: Manifold learning (cont'd)
[slides]
[continuation of demo from last Thursday's lecture: Jupyter notebook (manifold learning)]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al 2016)]
[Jupyter notebook (PCA, t-SNE, and UMAP with images)***]
***For the demo with images to work, you will need to install some packages:
pip install torch torchvision
Supplemental videos (warning: StatQuest focuses on highlighting datasets with clustering structure for t-SNE and UMAP but these manifold learning methods can also work with datasets that don't have clustering structure such as the Swiss roll):
[StatQuest: t-SNE, clearly explained]
[StatQuest: UMAP Dimension Reduction, Main Ideas!!!]

Additional reading (technical):

[some technical slides on t-SNE by George for 94-775]
[Simon Carbonnelle's much more technical t-SNE slides]
[t-SNE webpage]
[Coenen and Pearce's "Understanding UMAP"]
[Adele Jackson's "The mathematics of UMAP" article]
[The original UMAP paper by McInnes et al (2018)]
Wed Mar 25 Quiz 1 review session 7pm-8pm over Zoom, run by your TA Shurui (check Canvas for Zoom link)
Thur Mar 26 Lecture 7: Clustering
[slides]
[Jupyter notebook (preprocessing 20 Newsgroups dataset)]
[Jupyter notebook (clustering 20 Newsgroups dataset)]
Additional reading on clustering (technical):
[see Section 14.3 of the book "Elements of Statistical Learning"]

Supplemental video:

[StatQuest: K-means clustering (warning: the elbow method is specific to using total variation (i.e., residual sum of squares) as a score function; the elbow method is not always the approach you should use with other score functions)]
Fri Mar 27 Recitation slot: Quiz 1 — material coverage: everything up to and including Fri Mar 20 (i.e., weeks 1-2)
Week 4
Tue Mar 31 Lecture 8: Clustering (cont'd)
[slides]
[we resume the demo from last time: Jupyter notebook (clustering 20 Newsgroups dataset)]
[Jupyter notebook (clustering metrics on toy synthetic datasets)]
Thur Apr 2 Lecture 9: Clustering (cont'd); topic modeling
[slides]
[Jupyter notebook (DBSCAN and HDBSCAN on toy synthetic dataset)]
[Jupyter notebook (clustering on text revisited using TF-IDF, normalizing using Euclidean norm)]
Supplemental reading:
["Mastering HDBSCAN: Clustering Variable Density Data Made Easy"]
["How HDBSCAN Works"]
[David Blei's general intro to topic modeling]
[Maria Antoniak's practical guide for using LDA]
Fri Apr 3 Final project proposals due 11:59pm (1 email per group)

Recitation: More on clustering
[Jupyter notebook (clustering 20 Newsgroups dataset, extended version of original demo with clustering metrics and DBSCAN)]
[Jupyter notebook (clustering on images)]
Week 5
Tue Apr 7 Lecture 10: Topic modeling (cont'd)
[slides]
[Jupyter notebook (topic modeling with LDA)]
[Jupyter notebook (LDA: choosing the number of topics)]
Thur Apr 9 & Fri Apr 10 No class (CMU Spring Carnival)
🎪
Part II. Predictive data analysis
Week 6
Tue Apr 14 HW2 due 11:59pm

Lecture 11: Review of basic prediction concepts; intro to neural nets & deep learning
[slides]
[Jupyter notebook (prediction and model validation)]
Additional reading on basic neural networks:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Video introduction on neural nets:

["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

StatQuest series of videos on neural nets and deep learning:

[YouTube playlist (note: there are a lot of videos in this playlist, some of which go into more detail than you're expected to know for 94-775; make sure that you understand concepts at the level of how they are presented in 94-775 lectures/recitations)]
Thur Apr 16 Lecture 12: Neural net basics
[slides]
For the neural net demo below to work, you will need to install some packages:
pip install torch torchvision torchinfo
[Jupyter notebook (handwritten digit recognition with neural nets; be sure to use the "Download ZIP" link to download all the files (especially so that you also download UDA_pytorch_utils.py))]
PyTorch tutorial (it suffices to go over the first page or so of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU; later parts of the tutorial get more technical):
[PyTorch tutorial]
Fri Apr 17 Recitation slot: Quiz 2 — material coverage: Tue Mar 24 up to Tue Apr 7 (i.e., weeks 3-5)
Week 7
Tue Apr 21 Lecture 13: Wrap up neural net basics; image analysis with convolutional neural nets (CNNs); start coverage of time series analysis
[slides]
[we resume the demo from last time: Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)]
Supplemental reading and video for convolutional neural networks (CNNs):
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]
[(technical) Richard Zhang's fix for max pooling]
The StatQuest YouTube playlist (from Lecture 11's supplemental materials) includes a video on CNNs
Thur Apr 23 Lecture 14: Generative pre-trained transformers (GPTs), a few other deep learning topics, course wrap-up
[slides]
Additional reading/videos:
[Andrej Karpathy's "Neural Networks: Zero to Hero" lecture series (including a more detailed GPT lecture)]

Software for explaining neural nets:

[Captum]

Some articles on being careful with explanation methods (technical):

["The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective" (Krishna et al 2022)]
["Do Feature Attribution Methods Correctly Attribute Features?" (Zhou et al 2022)]
["The false hope of current approaches to explainable artificial intelligence in health care" (Ghassemi et al 2021)]
Fri Apr 24 Recitation slot: How to use BERT embeddings, how to use a neural topic model (BERTopic), and how to pre-train a GPT
[Jupyter notebook (how to use a BERT word embedding model from Hugging Face)]
[Jupyter notebook (BERTopic and sentiment analysis)]
[slides (to help understand GPT demo)]
[Jupyter notebook (text generation with a GPT)]
A reading and an extra demo (technical):
[A tutorial on BERT word embeddings]
[Jupyter notebook (sentiment analysis with IMDb reviews using BERT-Tiny)]
Final exam week
Mon Apr 27 8:30am-11:30am HBH 1202 (note that we might not use this entire time window but we will start at 8:30am): final group project presentations

Final project slide decks + Jupyter notebooks due 11:59pm (1 email per group)