95-865: Unstructured Data Analytics (Spring 2018 Mini 3)

Class time and location:

Instructor: George Chen (georgechen [at symbol] cmu.edu)

Teaching assistants: Emaad Manzoor (emaad [at symbol] cmu.edu), Mallory Nobles (mnobles [at symbol] andrew.cmu.edu)

Office hours:

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post publicly so that everyone can see your question. If you have a question, chances are that other people can benefit from the answer as well!

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data is often called "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.

We give many examples of how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods for analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.
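
As a rough illustration of the two-step approach, here is a minimal Python sketch that uses only the standard library. It is not taken from the course materials: the toy documents, the labels, and the helper names `tokenize` and `predict` are all made up for this example, and the scoring rule is a deliberately crude stand-in for the classification methods covered later in the course.

```python
from collections import Counter

# Toy labeled documents (hypothetical data, made up for illustration).
docs = [
    ("the battery life of this phone is great", "electronics"),
    ("this phone has a great screen and battery", "electronics"),
    ("the soup was warm and the bread was fresh", "food"),
    ("fresh bread and great soup at this cafe", "food"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


# Step 1 (explore): count how often each word appears under each label,
# then look at the most frequent words to get clues about structure.
counts_by_label = {}
for text, label in docs:
    counts_by_label.setdefault(label, Counter()).update(tokenize(text))

for label, counts in counts_by_label.items():
    print(label, counts.most_common(3))


# Step 2 (exploit): score a new document against each label by summing
# how often its words appeared under that label, and pick the best label.
def predict(text):
    words = tokenize(text)
    scores = {
        label: sum(counts[word] for word in words)
        for label, counts in counts_by_label.items()
    }
    return max(scores, key=scores.get)


print(predict("battery and phone"))
print(predict("warm bread"))
```

Note that common words such as "the" and "this" appear under both labels, which is the kind of observation the exploratory step is meant to surface before any prediction method is chosen.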

We will be writing a lot of Python code and working with Amazon Web Services (AWS) for cloud computing (including using GPUs).

Prerequisite: Python coding experience

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading: Homework 25%, mid-mini quiz 35%, final exam 40%

Syllabus (last updated Jan 18, 2018 with TA and OH info): [pdf]

Calendar (tentative)

Warning: As this course is still relatively new, the lecture slides are a bit rough and may contain bugs. To provide feedback/bug reports, please directly contact the instructor, George (georgechen [at symbol] cmu.edu). The Fall 2017 mini-2 course website is available here.

Date Topic Supplemental Material
Part 1. Exploratory data analysis
Wed Jan 17

Course introduction, basic text processing, frequency analysis
[course introduction slides]
[basic text processing and frequency analysis slides]
[spaCy demo (Jupyter notebook)]

HW0 released! (Check Canvas)

Some Python resources:
[Dive into Python 3]
[Computational and Inferential Thinking, "Programming in Python" chapter]

Mon Jan 22

Finding possibly related features: co-occurrence analysis, scatter plots, correlation, causation
[administrivia slide]
[slides]

HW0 due, HW1 released!

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]

Wed Jan 24

Visualizing high-dimensional data: PCA, introduction to manifold learning, Isomap, t-SNE
[slides]
[PCA demo (Jupyter notebook)]
[Multidimensional Scaling demo (Jupyter notebook)]

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

Additional reading:
[Abdi and Williams's PCA review]
[Isomap webpage]
[Simon Carbonnelle's t-SNE slides]
[Wattenberg et al., "How to Use t-SNE Effectively", Distill, 2016]
[t-SNE webpage]

Mon Jan 29

Visualization (wrap-up), introduction to clustering, k-means, Gaussian mixture models (GMMs)
[slides]
[t-SNE demo (Jupyter notebook)]

Additional reading:
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

Wed Jan 31

GMMs, DP-GMMs
[slides]
[k-means, GMM, DP-GMM demo (Jupyter notebook)]

HW1 due, HW2 released!

Mon Feb 5

DP-means, CH index (see also: gap statistic), hierarchical clustering
[slides]

In class I mentioned that scraping the web can be a helpful way to gather data to analyze. While we won't be covering web scraping in this course, if you're interested, you can look at tutorials for Scrapy and Selenium.

Python cluster evaluation:
[scikit-learn cluster evaluation metrics (many, including CH index)]
[gap statistic]

Additional reading:
[the original paper on DP-means (fairly technical)]
[paper on the gap statistic (fairly technical)]
[see Section 14.3.12 "Hierarchical Clustering" of the book "Elements of Statistical Learning"]

Wed Feb 7

Clustering (wrap-up), topic modeling, intro to predictive data analysis
[slides]
[Latent Dirichlet Allocation demo (Jupyter notebook)]

Additional reading:
[see Section 14.3.12 "Hierarchical Clustering" of the book "Elements of Statistical Learning"]
[David Blei's general intro to topic modeling]

Part 2. Predictive data analysis
Mon Feb 12

Some classics of classification: nearest neighbors, evaluating prediction methods, naive Bayes
[slides]

Wed Feb 14

Mid-mini quiz

Mon Feb 19

No class

HW2 due

Wed Feb 21

From classical to modern classification methods: SVMs, decision trees and forests, intro to neural nets and deep learning
[slides]

HW3 released

Mon Feb 26

Neural nets and deep learning
[slides]
[demo (Jupyter notebook)]

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Wed Feb 28

Image analysis with CNNs (also called convnets), time series analysis with RNNs
[slides]
[for the CNN demo, see last lecture's demo (the end of it is on CNNs)]
[RNN demo (Jupyter notebook)]

[Stanford CS231n Convolutional Neural Networks for Visual Recognition]
[Christopher Olah's "Understanding LSTM Networks"]

Mon Mar 5

Wrap-up of deep learning and of 95-865
[slides]

HW3 due

Videos on learning neural nets:
["Gradient descent, how neural networks learn | Chapter 2, deep learning" by 3Blue1Brown]
["What is backpropagation really doing? | Chapter 3, deep learning" by 3Blue1Brown]

Wed Mar 7

Final exam, 10:30am-11:50am, HBH 1202