95-865: Unstructured Data Analytics (Spring 2020 Mini 3)

Unstructured Data Analytics

Lectures, time and location:

Recitations: Fridays 3:00pm-4:20pm, HBH A301

Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol

Teaching assistants: Tianyu Huang (tianyuhu ♣ andrew.cmu.edu), Xiaobin Shen (xiaobins ♣ andrew.cmu.edu), Sachin Kalayathankal Sunny (ssunny ♣ andrew.cmu.edu)

Office hours:

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

course description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know what structure underlies the data ahead of time, which is why such data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.
Many examples illustrate how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods for analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.

We will be coding lots of Python and working with Amazon Web Services (AWS) for cloud computing (including using GPUs).

Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading: Homework 20%, quiz 1 40%, quiz 2 40%

Syllabus: [pdf]

calendar (subject to revision)

Previous version of course (including lecture slides and demos): 95-865 Fall 2019 mini 2

Part I. Exploratory data analysis

Week 1: Jan 13-17

Lecture 1 (Jan 13): Course overview, analyzing text using frequencies
Lecture 2 (Jan 15): Text analysis demo, co-occurrence analysis
[slides for lectures 1 & 2]
[Jupyter notebook (basic text analysis)]
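To give a flavor of the frequency-based analysis in lectures 1 & 2, here is a minimal sketch (with a made-up two-document corpus, not the course's dataset) that counts word frequencies using only the Python standard library:

```python
from collections import Counter

# Toy corpus invented for illustration; the course demo uses real text data
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Tokenize by whitespace and tally raw term frequencies across all documents
counts = Counter()
for doc in docs:
    counts.update(doc.lower().split())

print(counts.most_common(3))  # most frequent tokens first
```

Note that raw counts like these tend to be dominated by very common words such as "the".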

Recitation 1 (Jan 17): Basic Python review
[Jupyter notebook]

HW1 released (check Canvas)!

Week 2: Jan 20-24

No class on Monday (MLK holiday)

Lecture 3 (Jan 22): Finding possibly related entities
[slides for lecture 3]
[Jupyter notebook (co-occurrence analysis)]

Recitation 2 (Jan 24): Python practice with sorting and NumPy
[Jupyter notebook]

What is the maximum value of the phi-square/chi-square statistic? (technical)
[stack exchange answer]
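To make the linked discussion concrete, the sketch below computes Pearson's chi-square and phi-square (chi-square divided by the number of observations) for an invented 2×2 co-occurrence table:

```python
# 2x2 co-occurrence table (counts invented for illustration):
# rows: entity A present / absent; columns: entity B present / absent
table = [[30, 10],
         [20, 40]]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Pearson's chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected counts assume the row and column variables are independent
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (table[i][j] - expected) ** 2 / expected

# phi-square normalizes chi-square by n; for a 2x2 table it is at most 1
phi2 = chi2 / n
```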

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]

Week 3: Jan 27-31

HW1 due Monday 11:59pm

HW2 released start of the week

Lecture 4 (Jan 27): Wrap up finding possibly related entities, visualizing high-dimensional data (PCA)
[slides for lecture 4]
[Jupyter notebook (PCA)]
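As a rough illustration of what PCA does (the course demo works with a real dataset), here is a NumPy-only sketch that projects synthetic 5-dimensional data onto its top two principal components via the SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 100 points in 5-D, with most variance along one coordinate
X = rng.normal(size=(100, 5))
X[:, 0] *= 10  # inflate the first coordinate's spread

# PCA via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the top 2 principal components for a 2-D visualization
X2 = Xc @ Vt[:2].T

# Fraction of total variance explained by each component
explained = S**2 / np.sum(S**2)
```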

Lecture 5 (Jan 29): Manifold learning (Isomap, t-SNE)
[slides for lecture 5]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]
[Jupyter notebook (manifold learning)]
[Jupyter notebook (t-SNE with images)]
[slides with some technical details for t-SNE]

Lecture 6 (during Jan 31 recitation slot): Wrap up manifold learning, begin clustering (k-means)
[slides for lecture 6]
[Jupyter notebook (PCA, t-SNE, clustering with drug data)]
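A bare-bones NumPy sketch of Lloyd's algorithm for k-means, run on two synthetic Gaussian blobs (the course demo instead analyzes real drug data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs of 50 points each
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

centers, labels = kmeans(X, k=2)
```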

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

Some details on t-SNE including code (from a past UDA recitation):
[some extra t-SNE slides]
[Jupyter notebook (t-SNE)]

Additional dimensionality reduction reading (technical):
[Abdi and Williams's PCA review]
[Isomap webpage]
[Simon Carbonnelle's t-SNE slides]
[t-SNE webpage]
["An Analysis of the t-SNE Algorithm for Data Visualization" (Arora et al, COLT 2018)]

Additional clustering reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

Week 4: Feb 3-7

Lecture 7 (Feb 3): Clustering (k-means, GMMs, CH index)
[slides for lecture 7]
[Jupyter notebook (continuation of previous drug data demo: cluster interpretation, automatically choosing k)]
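The CH (Calinski-Harabasz) index trades off between-cluster and within-cluster dispersion; here is an illustrative NumPy implementation on synthetic blobs (invented data, not the course's):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)),
               rng.normal(5, 0.5, size=(30, 2))])
labels = np.array([0] * 30 + [1] * 30)  # the "correct" blob assignment

def ch_index(X, labels):
    n, k = len(X), len(np.unique(labels))
    overall = X.mean(axis=0)
    between = 0.0  # dispersion of cluster means around the overall mean
    within = 0.0   # dispersion of points around their own cluster mean
    for j in np.unique(labels):
        Xj = X[labels == j]
        between += len(Xj) * np.sum((Xj.mean(axis=0) - overall) ** 2)
        within += np.sum((Xj - Xj.mean(axis=0)) ** 2)
    # Higher is better: tight clusters that are far apart score high
    return (between / (k - 1)) / (within / (n - k))

score = ch_index(X, labels)
```

A labeling that mixes the two blobs together scores far lower than the correct one, which is how the index can guide the choice of k.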

Lecture 8 (Feb 5): More clustering (DP-GMMs, DP-means), topic modeling (LDA)
[slides for lecture 8]
[Jupyter notebook (continuation of previous drug data demo: DP-GMMs)]
[Jupyter notebook (topic modeling with LDA)]

Quiz 1 during Feb 7 recitation slot

Python cluster evaluation:
[scikit-learn cluster evaluation metrics (many, including CH index)]

Topic modeling reading:
[David Blei's general intro to topic modeling]

Part II. Predictive data analysis
Week 5: Feb 10-14

HW2 due Monday 11:59pm

HW3 released early in the week

Lecture 9 (Feb 10): Wrap up topic modeling, intro to predictive data analytics (some terminology, k-NN classification, model evaluation)
[slides for lecture 9]
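A toy sketch of k-NN classification with Euclidean distance and majority vote (training points invented for illustration):

```python
import numpy as np

# Two tight groups of labeled training points (made up for this sketch)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(x, k=3):
    # Distances from the query point to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Labels of the k nearest neighbors
    nearest = y_train[np.argsort(dists)[:k]]
    # Majority vote among those labels
    return np.bincount(nearest).argmax()

pred = knn_predict(np.array([4.8, 5.1]))
```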

Lecture 10 (Feb 12): More on model evaluation (including confusion matrices, ROC curves), decision trees & forests
[slides for lecture 10]
[Jupyter notebook (prediction and model validation)]

Recitation 3 (Feb 14): Practice with ROC curves
[Jupyter notebook]
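To complement the recitation, here is an illustrative NumPy computation of an ROC curve and its AUC from made-up classifier scores, by sweeping the decision threshold from high to low:

```python
import numpy as np

# Hypothetical classifier scores and true binary labels
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3])

# Sort by decreasing score; each prefix corresponds to one threshold choice
order = np.argsort(-y_score)
y = y_true[order]
tpr = np.cumsum(y) / y.sum()            # true positive rate at each threshold
fpr = np.cumsum(1 - y) / (1 - y).sum()  # false positive rate at each threshold

# Area under the ROC curve via the trapezoidal rule, starting from (0, 0)
xs = np.concatenate([[0.0], fpr])
ys = np.concatenate([[0.0], tpr])
auc = np.sum(np.diff(xs) * (ys[1:] + ys[:-1]) / 2)
```

A perfect classifier would reach AUC 1.0; random scoring gives about 0.5.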

Some nuanced details on cross-validation (technical):
[Andrew Ng's article Preventing "Overfitting" of Cross-Validation Data]
[Braga-Neto and Dougherty's article Is cross-validation valid for small-sample microarray classification? (this article applies more generally rather than only to microarray data from biology)]
[Bias and variance as we change the number of folds in k-fold cross-validation]
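To make k-fold cross-validation concrete, a short sketch that builds the train/test index splits by hand (in practice a library helper such as scikit-learn's KFold automates this):

```python
import numpy as np

# 5-fold cross-validation indices over 20 examples
n, k = 20, 5
indices = np.arange(n)
folds = np.array_split(indices, k)

# Each fold takes one turn as the held-out test set;
# the remaining k-1 folds form the training set
splits = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    splits.append((train_idx, test_idx))
```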

Week 6: Feb 17-21

Lecture 11 (Feb 17): Intro to neural nets and deep learning
[slides for lecture 11]
For the lecture 11 demo to work, you will need to install some packages:
conda install -c pytorch pytorch
conda install -c pytorch torchvision
pip install torchsummaryX
python -m spacy download en
pip install pytorch-nlp
[Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)]
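The demo itself uses PyTorch, but the forward pass of a one-hidden-layer network can be sketched in plain NumPy (random, untrained weights; shapes chosen to mimic flattened 28×28 digit images):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    # Subtract the row-wise max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Fake "flattened image" inputs: batch of 4, 784 features (28x28 pixels)
X = rng.normal(size=(4, 784))

# Randomly initialized (untrained) weights for a 784 -> 32 -> 10 network
W1, b1 = rng.normal(size=(784, 32)) * 0.01, np.zeros(32)
W2, b2 = rng.normal(size=(32, 10)) * 0.01, np.zeros(10)

# Forward pass: linear layer, ReLU nonlinearity, linear layer, softmax
probs = softmax(relu(X @ W1 + b1) @ W2 + b2)  # one row of class probabilities per image
```

Training (which the PyTorch demo handles via backpropagation and gradient descent) would adjust W1, b1, W2, b2 to make these probabilities match the true labels.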

Mike Jordan's Medium article on where AI is at (April 2018):
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

Lecture 12 (Feb 19): Image analysis with convolutional neural nets
[slides for lecture 12]
[Jupyter notebook (continuation of previous demo with addition of two convnet models)]
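As a complement to the convnet demo, a from-scratch 2-D convolution (really cross-correlation, which is what deep learning libraries implement) applied with a simple horizontal-difference filter:

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over every valid position (no padding, stride 1)
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 4x4 "image" whose pixel values increase left to right
image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0]])  # responds to horizontal change
out = conv2d(image, edge_kernel)
```

A convnet layer learns many such kernels from data rather than hand-picking them; libraries also vectorize this loop heavily.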

Lecture 13 (during Feb 21 recitation slot): Time series analysis with recurrent neural nets
[slides for lecture 13]
[Jupyter notebook (sentiment analysis with IMDb reviews)]
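The lecture uses gated architectures (such as LSTMs) in PyTorch; the core recurrence they build on is the vanilla RNN update h_t = tanh(W_x x_t + W_h h_{t-1} + b), sketched here in NumPy with random (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hidden = 8, 16
W_x = rng.normal(size=(d_hidden, d_in)) * 0.1      # input-to-hidden weights
W_h = rng.normal(size=(d_hidden, d_hidden)) * 0.1  # hidden-to-hidden weights
b = np.zeros(d_hidden)

def rnn_step(x_t, h_prev):
    # One recurrence step: combine the new input with the previous state
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Run over a length-5 toy sequence of random input vectors;
# the final hidden state serves as a summary of the whole sequence
h = np.zeros(d_hidden)
for t in range(5):
    h = rnn_step(rng.normal(size=d_in), h)
```

For sentiment analysis, that final state would be fed to a classifier; gating (as in LSTMs) helps the state carry information across long sequences.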

PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
[PyTorch tutorial]

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]
[(technical) Richard Zhang's fix for max pooling (presented at ICML 2019)]
[Christopher Olah's "Understanding LSTM Networks"]

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Week 7: Feb 24-28

Lecture 14 (Feb 24): More deep learning, wrap-up
[slides for lecture 14]

Wednesday Feb 26 lecture slot: Quiz 2 review

Quiz 2 during Feb 28 recitation slot

Additional reading:
[A tutorial on word2vec word embeddings]
[A tutorial on BERT word embeddings]

Week 8: Mar 2-6

HW3 due Thursday Mar 5, 11:59pm