94-775: Unstructured Data Analytics for Policy
(Spring 2021 Mini 4)


All times are listed in Pittsburgh time unless otherwise stated

Class time and location:

Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol

Teaching assistants:

Office hours (starting second week of class); these are all over Zoom (see Canvas for Zoom links):

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

course description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.

Many examples are given for how these methods help solve real problems faced by organizations. The course includes a final project, which must address a policy question.

Prerequisite: If you are a Heinz student, then you must have already taken 95-791 "Data Mining" and also either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading: HW1 8%, HW2 8%, HW3 4%, mid-mini quiz 35%, final project proposal 10%, final project 35%. HW3 is shorter than HW1 and HW2. Letter grades are determined based on a curve. Students with the most instructor-endorsed posts on Piazza will receive a quiz bonus worth 5% of the quiz score (so it is possible to score over 100% on the quiz).

Syllabus (updated 4/9 to mention Piazza quiz bonus incentive): [pdf]

calendar (tentative)

Previous lecture slides and demos: spring 2020 version of 94-775 (click to see course webpage)

Date Topic Supplemental Material
Part I. Exploratory data analysis
Tue Mar 23

Lecture 1: Course overview
[slides]

Thur Mar 25

Lecture 2: Basic text analysis, co-occurrence analysis
[slides]

For the basic text analysis demo to work, please install Anaconda Python 3, Jupyter, and spaCy first
[Jupyter notebook (basic text analysis)]

Fri Mar 26

Recitation: Basic Python review
[Jupyter notebook]

Tue Mar 30

Lecture 3: Finding possibly related entities
[slides]
[Jupyter notebook (co-occurrence analysis)]

What is the maximum phi-squared/chi-squared value? (technical)
[stack exchange answer]

Thur Apr 1

Lecture 4: Visualizing high-dimensional data with PCA
[slides]
[Jupyter notebook (PCA)]

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]

PCA additional reading (technical):
[Abdi and Williams's PCA review]

Fri Apr 2

Recitation slot — Lecture 5: Manifold learning with Isomap
[slides]
[Jupyter notebook (manifold learning)]

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

Tue Apr 6

Lecture 6: Wrap up manifold learning (t-SNE), a first look at analyzing images, and an introduction to clustering phenomena
[slides]
[Jupyter notebook (manifold learning); same demo as previous lecture]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al., Distill 2016)]
For the demo below to work (on t-SNE with images), you will need to install some packages:
pip install torch torchvision
[Jupyter notebook (t-SNE with images)]
[Jupyter notebook (PCA, t-SNE, clustering on UCI drug consumption data)]
[slides with some technical details for t-SNE]

HW1 due 11:59pm

See supplementary materials from the previous lecture; in addition, here's some reading for t-SNE (technical):
[Simon Carbonnelle's t-SNE slides]
[t-SNE webpage]

Thur Apr 8

Lecture 7: Distance and similarity functions, clustering (k-means, GMMs)
[slides]
[Jupyter notebook (PCA, t-SNE, clustering on UCI drug consumption data); same demo as previous lecture]

Additional clustering reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

Fri Apr 9

Recitation slot — Lecture 8: More on clustering (interpreting clustering results, automatically choosing the number of clusters for GMM-related models, density-based clustering with DBSCAN)
[slides]
[Jupyter notebook (PCA, t-SNE, clustering on UCI drug consumption data); same demo as previous lecture]

Reading on DP-means and DP mixture models, of which a DP-GMM is a special case (technical):
[Revisiting k-means: New Algorithms via Bayesian Nonparametrics]

Tue Apr 13

Lecture 9: Wrap up clustering; topic modeling
[slides]
[Jupyter notebook (topic modeling with LDA)]

Topic modeling reading:
[David Blei's general intro to topic modeling]
[(technical) Topic Modelling Meets Deep Neural Networks: A Survey]

Thur Apr 15

No class (CMU break day)

Fri Apr 16

No class (CMU break day)

Tue Apr 20

Lecture slot: Quiz review

Thur Apr 22

Lecture slot: Quiz

Part II. Predictive data analysis
Fri Apr 23

Recitation slot — Lecture 10: Introduction to predictive data analytics
[slides]
For the demo below to work, you will need to install some packages:
pip install torch torchvision
[Jupyter notebook (prediction and model validation)]

Some nuanced details on cross-validation (technical):
[Andrew Ng's article Preventing "Overfitting" of Cross-Validation Data]
[Braga-Neto and Dougherty's article Is cross-validation valid for small-sample microarray classification? (this article applies more generally rather than only to microarray data from biology)]
[Bias and variance as we change the number of folds in k-fold cross-validation]

Mon Apr 26

HW2 and final project proposal due 11:59pm

Tue Apr 27

Lecture 11: Wrap up predictive model evaluation and classical classifiers
[slides]
[Jupyter notebook (prediction and model validation; same demo as last time)]

Thur Apr 29

Lecture 12: Intro to neural nets and deep learning
[slides]
For the neural net demo below to work, you will need to install some packages:
pip install torch torchvision torchaudio
pip install torchsummaryX
python -m spacy download en
pip install pytorch-nlp
[Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)]
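
A quick way to confirm that the packages listed above installed correctly is to check that each one can be found by Python before opening the notebook. The snippet below is a minimal sketch, not part of the course materials: the function name missing_modules and the list REQUIRED_MODULES are hypothetical, and the import names are assumptions based on how these packages are usually imported (for example, pytorch-nlp is imported as torchnlp).

```python
from importlib.util import find_spec

# Import names assumed for the packages installed above. The pip name and the
# import name can differ (pytorch-nlp is imported as torchnlp).
REQUIRED_MODULES = [
    "torch",
    "torchvision",
    "torchaudio",
    "torchsummaryX",
    "torchnlp",
    "spacy",
]


def missing_modules(module_names):
    """Return the module names that Python cannot find in this environment."""
    return [name for name in module_names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not found (try installing these again):", ", ".join(missing))
else:
    print("All required modules were found.")
```

This only checks that the modules can be located; it does not verify versions or that the spaCy English model was downloaded.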

PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with converting between NumPy arrays and PyTorch tensors, and to understand the basic explanation of how tensors can reside on either the CPU or a GPU):
[PyTorch tutorial]
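
As a small illustration of the two ideas mentioned above (converting between NumPy arrays and PyTorch tensors, and placing tensors on the CPU or a GPU), here is a minimal sketch. It is not taken from the tutorial or the course demos; the variable names are made up for illustration, and the code falls back to the CPU when no GPU is available.

```python
import numpy as np
import torch

# Start from an ordinary NumPy array (float32 is the usual dtype for PyTorch).
array = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)

# NumPy array -> PyTorch tensor (on the CPU, sharing memory with the array).
tensor = torch.from_numpy(array)

# Pick a GPU if one is available; otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor_on_device = tensor.to(device)

# Computation happens on whichever device the tensor resides on.
doubled = tensor_on_device * 2

# PyTorch tensor -> NumPy array; the tensor must be moved back to the CPU first.
result = doubled.cpu().numpy()
print(result)
```

The call to .cpu() does nothing if the tensor is already on the CPU, so the same code runs with or without a GPU.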

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Mike Jordan's Medium article on where AI is at (April 2018):
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

Fri Apr 30

No recitation; please schedule a project checkup with your TA, Jingbo

Mon May 3

HW3 due 11:59pm

Tue May 4

Lecture 13: Image analysis with convolutional neural nets
[slides]
[Jupyter notebook (handwritten digit recognition with neural nets; same demo as previous lecture)]

Additional reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]
[(technical) Richard Zhang's fix for max pooling (presented at ICML 2019)]

Thur May 6

Lecture 14: Time series analysis with recurrent neural nets; some other deep learning topics; course wrap-up
[slides]

Supplemental demo (not covered in lecture due to time constraints):
For the demo below to work, be sure to install the prerequisite packages as mentioned for the lecture 12 demo.
[Jupyter notebook (sentiment analysis with IMDb reviews)]

Additional reading:
[A tutorial on word2vec word embeddings]
[A tutorial on BERT word embeddings]

Fri May 7

Recitation slot: final project presentations

Final projects due 11:59pm (slide deck + Jupyter notebook)