95-865: Unstructured Data Analytics (Fall 2019 Mini 2)

Lectures, time and location:

Recitations for Pittsburgh: Fridays 1:30pm-2:50pm Eastern Time, HBH A301

Recitations for Adelaide: Thursdays 5pm-7pm Australian Central Time, Classroom 1

Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol

Teaching assistants for Pittsburgh: Daniel Chen (dpchen ♣ andrew.cmu.edu), Emaad Manzoor (emaad ♣ cmu.edu)

Teaching assistants for Adelaide: Erick Rodriguez (erickger ♣ andrew.cmu.edu)

Office hours:

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

course description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.
Many examples are given for how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods in analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.

We will be coding lots of Python and working with Amazon Web Services (AWS) for cloud computing (including using GPUs).

Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 16-791 "Applied Data Science". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more

Grading: Homework 20%, Quiz 1 40%, Quiz 2 40%

Syllabus: [pdf]

calendar (subject to revision)

🔥 Previous version of course (including lecture slides and demos): 95-865 Spring 2019 Mini 3 🔥

Part I. Exploratory data analysis

Week 1: Oct 21-25

Reminder: Section A2 meets on Tuesdays and Thursdays, B2 meets on Mondays and Wednesdays, and K2 meets on Tuesdays (the Adelaide section gets all lectures for the week in a single ~3 hour session)

Lecture 1: Course overview, analyzing text using frequencies
Lecture 2: Text analysis demo, co-occurrence analysis
[slides for lectures 1 & 2]
[Jupyter notebook (basic text analysis)]
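
For a taste of the frequency-based text analysis in lectures 1 & 2, here is a minimal sketch using only Python's standard library (the text below is a made-up example, not taken from the course demo):

    from collections import Counter

    text = "the quick brown fox jumps over the lazy dog and the fox runs"
    # lowercase + whitespace split: a crude tokenizer, but enough for a sketch
    tokens = text.lower().split()
    counts = Counter(tokens)
    print(counts.most_common(3))  # the 3 most frequent words with their counts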

Recitation 1: Basic Python review
[Jupyter notebook (Python review)]
Pittsburgh students: there is no recitation on Friday Oct 25 due to CMU's Tartan Community Day. Instead, we will provide a video from the Adelaide recitation for you to view on your own (feel free to come to office hours with questions).

HW1 released (check Canvas)!

Week 2: Oct 28-Nov 1

Lecture 3: Finding possibly related entities
Lecture 4: Wrap-up finding possibly related entities, visualizing high-dimensional data (PCA, Isomap)
[slides for lectures 3 & 4]
[Jupyter notebook (co-occurrence analysis)]
[Jupyter notebook (PCA)]
[Jupyter notebook (Isomap)]
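
As a rough illustration of the co-occurrence counting idea from lectures 3 & 4 (a minimal sketch on toy documents; the course demo's actual approach may differ):

    from collections import Counter
    from itertools import combinations

    # toy "documents", each represented as its set of distinct terms
    docs = [
        {"apple", "banana", "cherry"},
        {"apple", "cherry"},
        {"banana", "cherry"},
    ]

    co_counts = Counter()
    for doc in docs:
        # every unordered pair of distinct terms in the same document co-occurs once
        for pair in combinations(sorted(doc), 2):
            co_counts[pair] += 1

    print(co_counts.most_common(2))  # most frequently co-occurring term pairs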

Recitation 2: More on PCA, bookkeeping with np.argsort
[Jupyter notebook (more PCA, argsort)]
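
For reference, the kind of bookkeeping np.argsort enables looks roughly like this (a minimal sketch; the scores and names are made up):

    import numpy as np

    scores = np.array([0.2, 0.9, 0.4, 0.7])
    names = ["alpha", "beta", "gamma", "delta"]

    # np.argsort gives the indices that would sort `scores` in ascending order;
    # reversing yields a highest-to-lowest ranking we can use to index `names`
    ranking = np.argsort(scores)[::-1]
    for idx in ranking:
        print(names[idx], scores[idx])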

HW1 due Thursday 11:59pm Eastern Time

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]

PCA additional reading (technical):
[Abdi and Williams's PCA review]

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]
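
To give a flavor of the scikit-learn API behind these examples, a minimal PCA sketch (standard scikit-learn calls; the random data is just a placeholder):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))  # placeholder: 100 points in 10 dimensions

    pca = PCA(n_components=2)       # project down to 2D for plotting
    X_2d = pca.fit_transform(X)
    print(X_2d.shape)                      # (100, 2)
    print(pca.explained_variance_ratio_)   # variance captured per component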

Additional dimensionality reduction reading (technical):
[Isomap webpage]

Week 3: Nov 4-8

Lecture 5: t-SNE
Lecture 6: Clustering
[slides for lectures 5 & 6]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]
[Jupyter notebook (Swiss roll t-SNE; extends Isomap demo)]
[Jupyter notebook (PCA and t-SNE with images)]
[Jupyter notebook (PCA, t-SNE, clustering with drug data)]
[slides with some technical details for t-SNE]
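
For orientation, a minimal t-SNE sketch with scikit-learn (standard API; the digits dataset stands in for the demos' data):

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, y = load_digits(return_X_y=True)  # 1797 handwritten digits, 64 features each

    # perplexity is the main knob discussed in "How to Use t-SNE Effectively"
    tsne = TSNE(n_components=2, perplexity=30, random_state=0)
    X_2d = tsne.fit_transform(X)
    print(X_2d.shape)  # (1797, 2), ready to scatter-plot colored by y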

Recitation 3: t-SNE, review session for quiz 1
[some extra t-SNE slides]
[Jupyter notebook (t-SNE)]

HW2 released at the start of the week

Additional dimensionality reduction reading (technical):
[Simon Carbonnelle's t-SNE slides]
[t-SNE webpage]
["An Analysis of the t-SNE Algorithm for Data Visualization" (Arora et al, COLT 2018)]

Additional clustering reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]
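
A minimal k-means sketch in the same spirit (standard scikit-learn KMeans on toy blob data; not the drug-data demo itself):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # toy data: two well-separated blobs in 2D
    X = np.vstack([rng.normal(0, 1, size=(50, 2)),
                   rng.normal(6, 1, size=(50, 2))])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    print(kmeans.cluster_centers_)  # one center per cluster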

Week 4: Nov 11-15

Lecture 7: More clustering, topic modeling
Lecture 8: Hierarchical clustering
[slides for lectures 7 & 8]
[Jupyter notebook (continuation of previous drug data demo: cluster interpretation, automatically choosing k)]
[Jupyter notebook (topic modeling with LDA)]
[Jupyter notebook (hierarchical clustering)]
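
For a flavor of the LDA workflow from lecture 7, a minimal sketch with scikit-learn (standard CountVectorizer and LatentDirichletAllocation calls; the four-document corpus is made up):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "the cat sat on the mat with another cat",
        "dogs and cats make friendly pets",
        "stocks fell sharply as markets tumbled",
        "investors bought bonds and other stocks",
    ]  # toy corpus

    # LDA works on raw word counts, not tf-idf
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(counts)
    print(doc_topic.round(2))  # per-document topic proportions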

HW2 due Thursday 11:59pm Eastern Time

Recitation 4: Quiz 1

Python cluster evaluation:
[scikit-learn cluster evaluation metrics (many, including CH index)]
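
For instance, the CH index mentioned above is a one-liner in scikit-learn; a minimal sketch of using it to choose the number of clusters k (toy data; calinski_harabasz_score is the standard scikit-learn name):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(50, 2)),
                   rng.normal(6, 1, size=(50, 2))])

    # higher CH index = tighter, better-separated clusters;
    # a common heuristic is to try several k and keep the best-scoring one
    for k in range(2, 6):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, calinski_harabasz_score(X, labels))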

Additional reading (technical):
[see Section 14.3.12 "Hierarchical Clustering" of the book "Elements of Statistical Learning"]
[David Blei's general intro to topic modeling]
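
And a minimal hierarchical clustering sketch using SciPy (standard linkage/fcluster calls on toy data; Ward linkage chosen here just for illustration):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(10, 2)),
                   rng.normal(6, 1, size=(10, 2))])

    # agglomerative clustering: build the merge tree bottom-up with Ward linkage,
    # then cut it to obtain a flat assignment into 2 clusters
    Z = linkage(X, method="ward")
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)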

Part II. Predictive data analysis

Week 5: Nov 18-22

George is in Adelaide this week and will attempt to give Pittsburgh lectures remotely. His usual Pittsburgh office hours are cancelled (if you would like to meet via Skype, please email to schedule).

Lecture 9: Introduction to predictive analytics, model validation
Lecture 10: Introduction to neural nets and deep learning
[slides for lectures 9 & 10]
[Jupyter notebook (prediction and model validation)]
[Jupyter notebook (intro to neural nets and deep learning)]
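
As a minimal illustration of the train/held-out-test workflow from lecture 9 (standard scikit-learn calls; the digits dataset and k-NN classifier are stand-ins, not the demo's):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)

    # hold out data the model never trains on, so accuracy on it
    # estimates how well the model generalizes to new data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))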

Mike Jordan's Medium article on where AI is at (April 2018):
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

Recitation 5: More classical classification models, ROC curves
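
A minimal ROC sketch for reference (standard scikit-learn roc_curve/roc_auc_score; the dataset and classifier are stand-ins, not necessarily the recitation's):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]  # ROC needs scores, not hard labels

    fpr, tpr, thresholds = roc_curve(y_test, scores)  # points along the ROC curve
    print("AUC:", roc_auc_score(y_test, scores))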

HW3 released at the start of the week

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Week 6: Nov 25-29

Lecture 11: Image analysis with CNNs (also called convnets)
Lecture 12 (Adelaide only; Pittsburgh has Thanksgiving break): Time series analysis with RNNs, other deep learning topics
[slides for lectures 11 & 12]
[Jupyter notebook (CNN demo)]
[Jupyter notebook (RNN demo)]

In lecture 12 (i.e., the second half of Adelaide's week 6 lecture), I mentioned that I would post a demo on interpreting CNNs. The demo is available here:
[fashion_mnist_cnn_model.h5 (pre-trained convnet needed for the CNN interpretation demo)]
[Jupyter notebook (CNN interpretation demo)]
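
For orientation, a minimal convnet in Keras (the .h5 format above suggests the demos use Keras, though that is an assumption; this tiny architecture is purely illustrative, not the demo's model):

    from tensorflow import keras

    # Fashion-MNIST: 28x28 grayscale images in 10 classes (downloads on first use)
    (X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
    X_train = X_train[..., None] / 255.0  # add channel axis, scale to [0, 1]
    X_test = X_test[..., None] / 255.0

    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        keras.layers.Conv2D(32, kernel_size=3, activation="relu"),  # learn local image filters
        keras.layers.MaxPooling2D(pool_size=2),                     # downsample feature maps
        keras.layers.Flatten(),
        keras.layers.Dense(10, activation="softmax"),               # class probabilities
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=1, batch_size=128, validation_split=0.1)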

Recitation 6 (Adelaide only; Pittsburgh has Thanksgiving break): Word embeddings as self-supervised learning, review session for quiz 2

CNN reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]

LSTM reading:
[Christopher Olah's "Understanding LSTM Networks"]
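
Similarly, a minimal LSTM sketch in Keras (again assuming Keras; the random sequence data and shapes here are purely illustrative, not the course's RNN demo):

    import numpy as np
    from tensorflow import keras

    # toy data: 200 sequences, each 30 time steps of 8 features, binary labels
    X = np.random.rand(200, 30, 8).astype("float32")
    y = np.random.randint(0, 2, size=200)

    model = keras.Sequential([
        keras.Input(shape=(30, 8)),
        keras.layers.LSTM(16),                        # reads the sequence, outputs a 16-dim summary
        keras.layers.Dense(1, activation="sigmoid"),  # binary prediction from that summary
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=1, batch_size=32)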

Week 7: Dec 2-6

Lecture 12 (Pittsburgh only): same as Adelaide Lecture 12
Lecture 13 (Pittsburgh only): same as Adelaide Recitation 6

HW3 due Thursday 11:59pm Eastern Time

Recitation 7: Quiz 2