95-865: Unstructured Data Analytics (Fall 2020 Mini 2)


All times listed are in Pittsburgh time (US Eastern Time)

Lectures, time and location: Currently, the plan is for lectures prior to Thanksgiving break to be in person with a simultaneous live stream (i.e., I teach in a classroom while running a Zoom session). After Thanksgiving, all instruction will be purely remote. Note that the Tue/Thu lectures are recorded; the Mon/Wed lectures are not.

Recitations: Fridays 1:30pm-2:50pm, remote (Zoom)

Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol

Teaching assistants: Xinyu Yao (xinyuyao ♣ andrew.cmu.edu), Xuejian Wang (xuejianw ♣ andrew.cmu.edu)

Office hours:

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

course description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know what structure underlies the data ahead of time, which is why such data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.
Many examples are given of how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods for analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.

We will be coding lots of Python and dabbling a bit in GPU computing (via Google Colab).

Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading: Homework 20%, quiz 1 40%, quiz 2 40%*

*Students with the most instructor-endorsed answers on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their quiz 2 score (a maximum of 5 bonus points; quiz 2 is out of 100 points prior to any bonus points being added).

Syllabus: [pdf]

calendar (subject to revision)

Previous version of course (including lecture slides and demos): 95-865 Spring 2020 mini 3

Part I. Exploratory data analysis

Week 1: Oct 26-30

HW1 released Oct 26 (check Canvas)

Lecture 1 (Oct 26|27): Course overview, analyzing text using frequencies
[slides]

Lecture 2 (Oct 28|29): Text analysis demo, co-occurrence analysis
[slides]
[Jupyter notebook (basic text analysis)]
[Jupyter notebook (co-occurrence analysis)]
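
A minimal sketch (toy documents made up for illustration; not the lecture demo) of the word-frequency and co-occurrence counting these lectures cover, using only the Python standard library:

from collections import Counter
from itertools import combinations

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Term frequencies across the whole toy corpus
term_counts = Counter()
for doc in docs:
    term_counts.update(doc.split())
print(term_counts.most_common(3))  # [('the', 4), ('sat', 2), ('on', 2)]

# Co-occurrence counts: how often two distinct words appear in the same document
co_counts = Counter()
for doc in docs:
    for w1, w2 in combinations(sorted(set(doc.split())), 2):
        co_counts[(w1, w2)] += 1
print(co_counts[('on', 'sat')])  # 2: they co-occur in both documents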

Recitation (Oct 30): Basic Python review
[Jupyter notebook (without solutions)]
[Jupyter notebook (with solutions)]

Week 2: Nov 2-6

Lecture 3 (Nov 2|3): Finding possibly related entities
[slides]
[Jupyter notebook (co-occurrence analysis); same demo as previous lecture]

Lecture 4 (Nov 4|5): Visualizing high-dimensional data (PCA)
[slides]
[Jupyter notebook (PCA)]

Recitation (Nov 6): More on PCA, practice with argsort
[Jupyter notebook]
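
A small argsort refresher in the spirit of this recitation (toy scores made up for illustration):

import numpy as np

scores = np.array([0.2, 0.9, 0.5, 0.1])
order = np.argsort(scores)           # indices from smallest to largest: [3 0 2 1]
top2 = np.argsort(scores)[::-1][:2]  # indices of the two largest scores: [1 2]
print(order, top2)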

HW1 due Friday Nov 6, 11:59pm

What is the maximum phi-squared/chi-squared value? (technical)
[stack exchange answer]
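
Relatedly, a minimal sketch (assumed setup with a made-up 2x2 contingency table) of computing a chi-squared statistic for two words' co-occurrence using SciPy:

import numpy as np
from scipy.stats import chi2_contingency

# Rows: word A present/absent; columns: word B present/absent
table = np.array([[30, 10],
                  [15, 45]])
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)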

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]

PCA additional reading (technical):
[Abdi and Williams's PCA review]
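
A minimal PCA sketch with scikit-learn (toy random data; not the lecture demo):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))  # 100 points in 50 dimensions (toy data)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)     # project down to 2 dimensions for plotting
print(X_2d.shape, pca.explained_variance_ratio_)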

Week 3: Nov 9-13

HW2 released start of the week

Lecture 5 (Nov 9|10): Manifold learning (Isomap, t-SNE)
[slides]
[Jupyter notebook (manifold learning)]
[slides with some technical details for t-SNE]

Lecture 6 (Nov 11|12): Wrap up manifold learning, begin clustering (k-means)
[slides]
[Jupyter notebook (manifold learning); same demo as previous lecture]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]
[Jupyter notebook (t-SNE with images)]
[Jupyter notebook (PCA, t-SNE, clustering with drug data)]

Recitation (Nov 13): Quiz 1 review
Check Canvas for the Gradescope link (released at the start of recitation)

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]
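
A minimal sketch (assumed setup; not the course demo) of running scikit-learn's Isomap and t-SNE on toy Swiss roll data:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, TSNE

X, _ = make_swiss_roll(n_samples=500, random_state=0)  # 500 points in 3D

X_isomap = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_isomap.shape, X_tsne.shape)  # both (500, 2)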

Some details on t-SNE including code (from a past UDA recitation):
[some extra t-SNE slides]
[Jupyter notebook (t-SNE)]

Additional dimensionality reduction reading (technical):
[Abdi and Williams's PCA review]
[Isomap webpage]
[Why scikit-learn's Isomap uses a kernel version of PCA to do MDS]
[Simon Carbonnelle's t-SNE slides]
[t-SNE webpage]
["An Analysis of the t-SNE Algorithm for Data Visualization" (Arora et al, COLT 2018)]

Additional clustering reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

Week 4: Nov 16-20

Lecture 7 (Nov 16|17): Clustering (k-means, GMMs)
[slides]
[Jupyter notebook (continuation of previous drug data demo: cluster interpretation)]

Lecture 8 (Nov 18|19): More clustering (automatically choosing k with CH-index, DP-GMMs, and DP-means)
[slides]
[Jupyter notebook (continuation of previous drug data demo: automatically choosing k)]

Friday Nov 20: no recitation, instead Quiz 1 — upon opening the quiz, you have 80 minutes to complete it

Python cluster evaluation:
[scikit-learn cluster evaluation metrics (many, including CH index)]
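
A minimal sketch (toy blob data; assumed setup) of using the CH index to compare different choices of k for k-means:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(k, calinski_harabasz_score(X, labels))  # larger CH index is better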

DP-means paper (technical):
["Revisiting k-means: New Algorithms via Bayesian Nonparametrics"]

Hierarchical clustering reading (technical):
[see Section 14.3.12 of the book "Elements of Statistical Learning"]
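
A minimal sketch (toy data; assumed setup) of agglomerative hierarchical clustering in scikit-learn:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print(labels[:10])  # cluster assignment for the first 10 points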

Week 5: Nov 23-27

Lecture 9 (Nov 23|24): Topic modeling
[slides]
[Jupyter notebook (topic modeling with LDA)]

Thanksgiving: no class Wednesday through Friday (note that to keep the two sections synced, there is no Wednesday class!)

Topic modeling reading:
[David Blei's general intro to topic modeling]
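
A minimal LDA sketch using scikit-learn (toy corpus made up for illustration; the lecture demo may use a different library):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts).shape)  # (4 documents, 2 topic proportions)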

Part II. Predictive data analysis
Week 6: Nov 30-Dec 4

HW2 due Monday Nov 30, 11:59pm

HW3 released early in the week

Instruction becomes purely remote at the start of this week — do not show up to HBH 1204

Lecture 10 (Nov 30|Dec 1): Intro to predictive data analytics (some terminology, k-NN classification, model evaluation)
[slides]
[Jupyter notebook (prediction and model validation)]

Lecture 11 (Dec 2|3): Wrap up predictive model evaluation, classical classifiers; intro to neural nets and deep learning
[slides]
[Jupyter notebook (prediction and model validation; same demo as last time)]

Lecture 12 during Dec 4 recitation slot: Wrap up intro to neural nets and deep learning; image analysis with convolutional neural nets
[slides]
For the neural net demo below to work, you will need to install some packages:
pip install torch torchvision torchaudio
pip install torchsummaryX
python -m spacy download en
pip install pytorch-nlp
[Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)]
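
A minimal sketch (assumed architecture; not the UDA_pytorch_utils.py demo) of a small convolutional net for 28x28 grayscale digit images in PyTorch:

import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 1 input channel -> 16 filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 28x28 -> 14x14
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),                 # logits for the 10 digit classes
)

x = torch.randn(8, 1, 28, 28)  # a batch of 8 fake digit images
print(model(x).shape)          # torch.Size([8, 10])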

Some nuanced details on cross-validation (technical):
[Andrew Ng's article Preventing "Overfitting" of Cross-Validation Data]
[Braga-Neto and Dougherty's article Is cross-validation valid for small-sample microarray classification? (this article applies more generally rather than only to microarray data from biology)]
[Bias and variance as we change the number of folds in k-fold cross-validation]
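
A minimal sketch (iris data as a stand-in; assumed setup) of evaluating a k-NN classifier with 5-fold cross-validation in scikit-learn:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=5)  # accuracy on each of the 5 folds
print(scores.mean())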

PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
[PyTorch tutorial]
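
A minimal sketch of the conversions the tutorial's first page covers: going between NumPy arrays and PyTorch tensors, and moving a tensor to a GPU when one is available:

import numpy as np
import torch

x_np = np.arange(6, dtype=np.float32).reshape(2, 3)
x = torch.from_numpy(x_np)  # NumPy array -> tensor (shares memory)
back = x.numpy()            # CPU tensor -> NumPy array

device = "cuda" if torch.cuda.is_available() else "cpu"
x_dev = x.to(device)        # move the tensor to the GPU if one is present
print(x_dev.device)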

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]
[(technical) Richard Zhang's fix for max pooling (presented at ICML 2019)]

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Mike Jordan's Medium article on where AI is at (April 2018):
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

Week 7: Dec 7-11

Lecture 13 (Dec 7|8): Time series analysis with recurrent neural nets
[slides]
For the demo below to work, be sure to install the prerequisite packages as mentioned for the lecture 12 demo.
[Jupyter notebook (sentiment analysis with IMDb reviews)]
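
A minimal sketch (made-up token IDs; not the IMDb demo) of an LSTM-based classifier of the kind used for sentiment analysis:

import torch
from torch import nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 2)    # 2 classes: negative/positive

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        embedded = self.embed(token_ids)
        _, (h_final, _) = self.lstm(embedded)  # final hidden state per sequence
        return self.out(h_final[-1])           # logits: (batch, 2)

batch = torch.randint(0, 1000, (4, 20))  # 4 fake reviews, 20 tokens each
print(SentimentLSTM()(batch).shape)      # torch.Size([4, 2])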

Lecture 14 (Dec 9|10): More on deep learning and course wrap-up
[slides]

Recitation (Dec 11): Quiz 2 review
Check Canvas for the Gradescope link (released at the start of recitation)

Additional reading:
[Christopher Olah's "Understanding LSTM Networks"]
[A tutorial on word2vec word embeddings]
[A tutorial on BERT word embeddings]

Some bonus reading (a student asked about image segmentation, and here's an introduction):
[A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN]

Final exam period: Dec 14-20

HW3 due Monday Dec 14, 11:59pm

Friday Dec 18: Quiz 2 — upon opening the quiz, you have 80 minutes to complete it