94-775: Unstructured Data Analytics for Policy
(Spring 2022 Mini 4; also listed as 94-475)


All times are listed in Pittsburgh time

Class time and location:

Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol

Teaching assistants:

Office hours (starting second week of class):
Office hours are all held remotely over Zoom; Zoom links for office hours are posted in Canvas.

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know what structure underlies the data ahead of time, which is why the data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.
Many examples are given of how these methods help solve real problems faced by organizations. The course has a final project, which must address a policy question.

Prerequisite: If you are a Heinz student, then you must have already completed 95-791 "Data Mining" and also one of either 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading:

*Students with the most instructor-endorsed posts on Piazza will get a bonus of up to 10 points on the quiz (so that it is possible to get 110 out of 100 points).

Letter grades are determined based on a curve.

Syllabus: [pdf]

Calendar (tentative)

Previous lecture slides and demos: spring 2021 version of 94-775 (click to see course webpage)

Date Topic Supplemental Material
Part I. Exploratory data analysis
Tue Mar 15 Lecture 1: Course overview, analyzing text using frequencies
[slides]

Please install Anaconda Python 3 and spaCy by following this tutorial (needed for HW1 and the demo next lecture):

[slides]
Note: Anaconda Python 3 includes support for Jupyter notebooks, which we use extensively in this class

HW1 released (check Canvas)
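To sanity-check the install before next lecture's demo, here is a minimal sketch (not part of HW1) that loads spaCy's small English model and counts word frequencies; it assumes you downloaded the en_core_web_sm model as part of the setup tutorial:

```python
# Minimal sanity check for the Anaconda + spaCy install (not the official HW1 setup).
# Assumes the small English model was downloaded, e.g. via:
#   python -m spacy download en_core_web_sm
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Governments collect massive amounts of text. Text data are unstructured.")

# Count lowercase word frequencies, skipping punctuation and whitespace
word_counts = Counter(token.text.lower()
                      for token in doc
                      if not token.is_punct and not token.is_space)
print(word_counts.most_common(5))
```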
Thur Mar 17 Lecture 2: Basic text analysis demo (requires Anaconda Python 3 & spaCy), co-occurrence analysis
[slides]
[Jupyter notebook (basic text analysis)]
Fri Mar 18 Recitation: Basic Python review
[Jupyter notebook]
Tue Mar 22 Lecture 3: Co-occurrence analysis (cont'd), visualizing high-dimensional data
[slides]
[Jupyter notebook (co-occurrence analysis)]
What is the maximum phi-squared/chi-squared value? (technical)
[Stack Exchange answer]
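As a toy illustration of the co-occurrence computations from lecture (my own example, not the lecture demo), the sketch below builds a 2x2 contingency table for two words across a handful of documents and computes chi-squared with SciPy; dividing by the number of documents gives phi-squared:

```python
# Toy co-occurrence example (not the lecture demo): do the words "budget" and
# "deficit" co-occur in documents more often than chance would suggest?
import numpy as np
from scipy.stats import chi2_contingency

docs = [
    "the city budget has a growing deficit",
    "the budget vote is next week",
    "the deficit widened despite the new budget",
    "parks and recreation funding was discussed",
    "the transit plan was approved",
]

has_budget = np.array(["budget" in d for d in docs])
has_deficit = np.array(["deficit" in d for d in docs])

# 2x2 contingency table: rows = budget present/absent, cols = deficit present/absent
table = np.array([
    [np.sum(has_budget & has_deficit), np.sum(has_budget & ~has_deficit)],
    [np.sum(~has_budget & has_deficit), np.sum(~has_budget & ~has_deficit)],
])

chi2, p_value, dof, expected = chi2_contingency(table)
phi_squared = chi2 / len(docs)  # phi-squared = chi-squared / number of documents
print(table)
print(chi2, phi_squared)
```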
Thur Mar 24 Lecture 4: PCA, manifold learning
[slides]
[Jupyter notebook (PCA)]


Additional reading (technical):

[Abdi and Williams's PCA review]
[The original Isomap paper (Tenenbaum et al 2000)]
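For reference, a minimal scikit-learn sketch of PCA on synthetic data (the notebook above uses a real dataset and goes into much more detail):

```python
# Minimal PCA sketch on synthetic data (the lecture notebook uses a real dataset).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 100 points, 10 features

# Standardize features first (common practice when features are on different scales)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                      # (100, 2)
print(pca.explained_variance_ratio_)   # fraction of variance captured per component
```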
Fri Mar 25 Recitation: More on PCA, practice with argsort
[Jupyter notebook]
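If you want a quick warm-up before the recitation notebook, here is a tiny argsort refresher (my own example):

```python
# Tiny argsort refresher: argsort returns the *indices* that would sort an array.
import numpy as np

scores = np.array([0.2, 0.9, 0.5, 0.1])
order = np.argsort(scores)           # indices from smallest to largest: [3 0 2 1]
top2 = np.argsort(scores)[::-1][:2]  # indices of the 2 largest scores: [1 2]

print(order)
print(top2)
print(scores[top2])  # the 2 largest values, in descending order
```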
Tue Mar 29 Lecture 5: Manifold learning (cont'd), clustering
[slides]
[Jupyter notebook (manifold learning)]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]

HW1 due 11:59pm
Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

Some technical details for t-SNE:

[slides]

Even more technical reading for t-SNE:

[Simon Carbonnelle's t-SNE slides]
[t-SNE webpage]
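A minimal t-SNE sketch on synthetic data (not the lecture demo); as the Distill article stresses, try several perplexity values rather than trusting a single plot:

```python
# Minimal t-SNE sketch (synthetic data, not the lecture demo).
# As the Distill article stresses, try multiple perplexity values;
# a single t-SNE plot can be misleading.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic clusters in 50 dimensions
X = np.vstack([rng.normal(0, 1, size=(100, 50)),
               rng.normal(5, 1, size=(100, 50))])

for perplexity in [5, 30, 50]:
    X_2d = TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=0).fit_transform(X)
    print(perplexity, X_2d.shape)
```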
Thur Mar 31 Lecture 6: Clustering (cont'd)
[slides]
[Jupyter notebook (dimensionality reduction with images)***]
***For the demo on t-SNE with images to work, you will need to install some packages:
pip install torch torchvision
[Jupyter notebook (dimensionality reduction and clustering with drug data)]
Clustering additional reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]
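A minimal k-means sketch on synthetic 2D data (the drug-data notebook above clusters after dimensionality reduction and is the demo to actually follow):

```python
# Minimal k-means clustering sketch (synthetic data, not the lecture demo).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(3, 0.5, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])       # cluster assignment for the first 10 points
print(kmeans.cluster_centers_)   # one center per cluster
```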
Fri Apr 1 Recitation slot — Lecture 7: Clustering (cont'd) — interpreting GMMs, automatically selecting the number of clusters
[slides]
We continue using the same demo from last time:
[Jupyter notebook (dimensionality reduction and clustering with drug data)]
See supplemental clustering reading posted for previous lecture
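As a toy illustration of automatically choosing the number of clusters, the sketch below fits GMMs with different numbers of components and keeps the one with the lowest BIC; this is one common criterion, and the lecture's approach may differ:

```python
# Toy sketch: pick the number of GMM clusters by minimizing BIC.
# (One common criterion; the lecture may use a different one.)
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),
               rng.normal(4, 0.5, size=(100, 2)),
               rng.normal(8, 0.5, size=(100, 2))])

bic_scores = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic_scores[k] = gmm.bic(X)  # lower BIC is better

best_k = min(bic_scores, key=bic_scores.get)
print(bic_scores)
print("best k:", best_k)
```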
Tue Apr 5 Lecture 8: Clustering (cont'd), topic modeling
[slides]
[Jupyter notebook (topic modeling with LDA)]

Final project proposals due 11:59pm by email (1 email per group)
Topic modeling reading:
[David Blei's general intro to topic modeling]
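A minimal LDA sketch on a toy corpus (not the lecture demo; assumes a recent scikit-learn for get_feature_names_out):

```python
# Minimal LDA topic modeling sketch (toy corpus, not the lecture demo).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the city budget and tax revenue",
    "budget deficit and spending cuts",
    "new transit line and bus routes",
    "bus fares and transit funding",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Print the top words per topic
words = vectorizer.get_feature_names_out()
for topic_idx, word_weights in enumerate(lda.components_):
    top_words = [words[i] for i in word_weights.argsort()[::-1][:4]]
    print(f"topic {topic_idx}: {top_words}")
```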
Thur Apr 7
Fri Apr 8
No class (CMU Spring Carnival)
Mon Apr 11 HW2 due 11:59pm (note: HW solutions are released 2 days after the due date in case anyone uses 2 late days, so this due date ensures you get the solutions before the quiz)
Tue Apr 12 Lecture slot: Quiz review
Thur Apr 14 Lecture slot: Quiz
Fri Apr 15 No recitation (instead sign up to meet with TAs to discuss final project status)
Part II. Predictive data analysis
Tue Apr 19 Lecture 9: Topic modeling (cont'd), intro to predictive data analysis
[slides]
Thur Apr 21 Lecture 10: Hyperparameter tuning, classifier evaluation, intro to neural nets & deep learning
[slides]
[Jupyter notebook (prediction and model validation)]
Some nuanced details on cross-validation (technical):
[Andrew Ng's article Preventing "Overfitting" of Cross-Validation Data]
[Braga-Neto and Dougherty's article Is cross-validation valid for small-sample microarray classification? (this article applies more generally rather than only to microarray data from biology)]
[Bias and variance as we change the number of folds in k-fold cross-validation]

Mike Jordan's Medium article on where AI is at (April 2018):

["Artificial Intelligence - The Revolution Hasn't Happened Yet"]
Fri Apr 22 Recitation: More on classifier evaluation
[slides]
[Jupyter notebook]
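A minimal classifier-evaluation sketch with toy labels (not the recitation demo), showing a confusion matrix and per-class precision/recall/F1:

```python
# Minimal classifier evaluation sketch: confusion matrix and per-class metrics
# (toy labels, not the recitation demo).
from sklearn.metrics import classification_report, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))  # precision, recall, F1 per class
```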
Tue Apr 26 Lecture 11: Neural nets & deep learning
[slides]
For the neural net demo below to work, you will need to install some packages:
pip install torch torchvision torchaudio
pip install torchsummaryX
pip install pytorch-nlp
[Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)]
Be sure to edit two pytorch-nlp files as indicated in the following slides (resolves some issues with recent updates to PyTorch & spaCy):
[slides]

HW3 due 11:59pm


PyTorch tutorial (at the very least, I suggest going over the first page of this tutorial to familiarize yourself with converting between NumPy arrays and PyTorch tensors and with how tensors can reside on either the CPU or a GPU):

[PyTorch tutorial]

Additional reading:

[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Video introduction on neural nets:

["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]
Thur Apr 28 Lecture 12: Image analysis with CNNs, time series analysis with RNNs, deep learning and course wrap-up
[slides]
We continue using the demo from the previous lecture
Additional reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]
[(technical) Richard Zhang's fix for max pooling]
[Christopher Olah's "Understanding LSTM Networks"]
[A tutorial on word2vec word embeddings]
[A tutorial on BERT word embeddings]

RNN demo — Sentiment analysis with IMDb reviews (95-865 demo):

[Jupyter notebook (sentiment analysis with IMDb reviews; requires UDA_pytorch_utils.py from the previous demo)]
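A minimal CNN sketch in PyTorch (a toy architecture, not the lecture demo), showing the convolution, pooling, and fully connected pieces discussed in lecture:

```python
# Minimal CNN sketch in PyTorch (toy architecture, not the lecture demo):
# convolution -> ReLU -> max pooling -> fully connected layer.
import torch
from torch import nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2)          # 28x28 -> 14x14
        self.fc = nn.Linear(8 * 14 * 14, num_classes)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        x = x.flatten(start_dim=1)                       # (batch, 8*14*14)
        return self.fc(x)

model = TinyCNN()
fake_images = torch.randn(4, 1, 28, 28)  # batch of 4 grayscale 28x28 images
print(model(fake_images).shape)          # torch.Size([4, 10])
```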
Fri Apr 29 Recitation slot: Final project presentations
Mon May 2 Final project slide decks + Jupyter notebooks due 11:59pm by email (1 email per group)