95-865: Unstructured Data Analytics (Spring 2018 Mini 4)


Class time and location:

Instructor: George Chen (georgechen [at symbol] cmu.edu)

Teaching assistants: Emaad Manzoor (emaad [at symbol] cmu.edu), Mallory Nobles (mnobles [at symbol] andrew.cmu.edu)

Office hours:

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why the data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.

Many examples are given of how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods for analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.

We will be writing a lot of Python code and working with Amazon Web Services (AWS) for cloud computing (including using GPUs).

Prerequisite: Python coding experience

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more

Grading: Homework 20%, mid-mini quiz 35%, final exam 45%. If you do better on the final exam than the mid-mini quiz, then your final exam score clobbers your mid-mini quiz score (thus, the quiz does not count for you, and instead your final exam counts for 80% of your grade).
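
As an illustration of the grading rule above, here is a minimal Python sketch. The function name `course_grade` and the 0-100 score scale are assumptions made for this example only; the syllabus remains the authoritative statement of the policy.

```python
def course_grade(homework, quiz, final):
    """Weighted course grade, assuming each score is on a 0-100 scale.

    Hypothetical helper for illustration; not an official calculator.
    """
    if final > quiz:
        # The final exam score replaces the quiz score, so the final
        # carries the quiz's 35% plus its own 45%, i.e. 80% in total.
        return 0.20 * homework + 0.80 * final
    # Otherwise the standard weights apply.
    return 0.20 * homework + 0.35 * quiz + 0.45 * final


# Final exam better than the quiz: the quiz score is dropped.
print(course_grade(homework=90, quiz=70, final=85))

# Quiz better than the final exam: the standard weights are used.
print(course_grade(homework=90, quiz=85, final=70))
```

When the quiz and final exam scores are equal, both branches give the same result, so it does not matter which one handles the tie.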

Syllabus: [pdf]

Calendar (tentative)

Warning: As this course is still relatively new, the lecture slides are a bit rough and may contain bugs. To provide feedback/bug reports, please directly contact the instructor, George (georgechen [at symbol] cmu.edu). The Spring 2018 mini-3 course website is available here.

Date Topic Supplemental Material
Part I. Exploratory data analysis
Mon Mar 19

Course overview, basic text processing and frequency analysis
[slides]

HW0 released! (Check Canvas)

Some Python resources:
[Dive into Python 3]
[Computational and Inferential Thinking, "Programming in Python" chapter]

Wed Mar 21

Basic text analysis demo, co-occurrence analysis
[slides]
[spaCy demo (Jupyter notebook)]

Fri Mar 23

HW0 due 11:59pm, HW1 released!

Mon Mar 26

Wrap up co-occurrence analysis, scatter plots, correlation, causation, visualizing high-dimensional data: PCA
[slides]
[PCA demo (Jupyter notebook)]

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]

PCA additional reading:
[Abdi and Williams's PCA review]

Wed Mar 28

Manifold learning (isomap, t-SNE)
[slides]
[Multidimensional Scaling demo (Jupyter notebook)]
[t-SNE demo (Jupyter notebook)]

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

Additional dimensionality reduction reading:
[Isomap webpage]
[Simon Carbonnelle's t-SNE slides]
[Wattenberg, et al., "How to Use t-SNE Effectively", Distill, 2016]
[t-SNE webpage]

Mon Apr 2

Introduction to clustering, k-means, Gaussian mixture models (GMMs)
[slides]
[k-means, GMM, DP-GMM demo (Jupyter notebook)]

HW1 due 10:30am, HW2 released!

Additional clustering reading:
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

Wed Apr 4

DP-GMMs, DP-means, CH index, hierarchical clustering
[slides]
[the DP-GMM demo is in the same Jupyter notebook as the previous lecture]

Python cluster evaluation:
[scikit-learn cluster evaluation metrics (many, including CH index)]

Additional reading:
[the original paper on DP-means (fairly technical)]
[see Section 14.3.12 "Hierarchical Clustering" of the book "Elements of Statistical Learning"]

Mon Apr 9

Clustering (wrap-up), topic modeling, intro to predictive data analysis
[slides]
[demo sketching how to interpret clusters (Jupyter notebook; builds off earlier clustering demo but with the DP-GMM part omitted)]
[Latent Dirichlet Allocation demo (Jupyter notebook)]

Additional reading:
[see Section 14.3.12 "Hierarchical Clustering" of the book "Elements of Statistical Learning"]
[David Blei's general intro to topic modeling]

Tue Apr 10

Just for this week, George's office hours are Tuesday 5pm-7pm, HBH 2216 (and not on Wednesday!)

Wed Apr 11 Mid-mini quiz
Part II. Predictive data analysis
Mon Apr 16

Introduction to predictive analytics, some classics of classification: nearest neighbors, evaluating prediction methods, naive Bayes
[slides]

Wed Apr 18

Support vector machines, decision trees and forests
[slides]
[model validation demo Jupyter notebook (includes SVM and random forest code snippets)]

HW3 released

Mon Apr 23

Intro to neural nets and deep learning
[slides]
[handwritten digit recognition single- and two-layer neural net demo]

Mike Jordan's Medium article (from just a few days ago!) on where AI is currently at:
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

HW2 due 10:30am

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Wed Apr 25

Image analysis with CNNs (also called convnets)
[slides]
[handwritten digit recognition convnet demo (builds off demo from last lecture and now uses a validation set during fitting)]

Additional reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]

Mon Apr 30

Time series analysis with RNNs, roughly how learning a deep net works (gradient descent and variants)
[slides]
[RNN demo]

LSTM reading:
[Christopher Olah's "Understanding LSTM Networks"]

Videos on learning neural nets:
["Gradient descent, how neural networks learn | Chapter 2, deep learning" by 3Blue1Brown]
["What is backpropagation really doing? | Chapter 3, deep learning" by 3Blue1Brown]

Wed May 2

Interpreting what a deep net is learning, other deep learning topics, wrap-up
[slides]
[fashion_mnist_cnn_model.h5 (pre-trained convnet needed for demo)]
[convnet interpretation demo]

HW3 due 10:30am

Tue May 8

Final exam 1pm, HBH 1002