95-865: Unstructured Data Analytics (Spring 2019 Mini 3)

Unstructured Data Analytics

Lectures, time and location:

Recitations for all three sections: Fridays 3pm-4:20pm HBH A301

Instructor: George Chen (georgechen [at symbol] cmu.edu)

Teaching assistants: Emaad Manzoor (emaad [at symbol] cmu.edu), Yucheng Huang (huangyucheng [at symbol] cmu.edu)

Office hours (starting second week of class):

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know what structure underlies the data ahead of time, which is why the data are often referred to as "unstructured". This course takes a practical approach to unstructured data analysis via a two-step approach:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.
Many examples are given of how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods for analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.

We will be coding lots of Python and working with Amazon Web Services (AWS) for cloud computing (including using GPUs).

Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 16-791 "Applied Data Science". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading: Homework 20%, mid-mini quiz 35%, final exam 45%. If you do better on the final exam than the mid-mini quiz, then your final exam score clobbers your mid-mini quiz score (thus, the quiz does not count for you, and instead your final exam counts for 80% of your grade).

Syllabus: [pdf]

Calendar (tentative)

🔥 Previous version of course (including lecture slides and demos): 95-865 Fall 2018 mini 2 🔥

Date Topic Supplemental Material
Part I. Exploratory data analysis

Mon-Tue Jan 14-15

Reminder: Sections A3 and B3 meet Mondays and Wednesdays; Section C3 meets Tuesdays and Thursdays

Lecture 1: Course overview, basic text processing, and frequency analysis
[slides]

HW1 released (check Canvas)!
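As a taste of the frequency analysis covered in Lecture 1, here is a minimal sketch (not course material; the tokenizer and example sentence are made up for illustration) of counting word frequencies in raw text with Python's standard library:

```python
from collections import Counter
import re

def term_frequencies(text):
    # Lowercase and split on runs of letters/apostrophes: a crude tokenizer,
    # just for illustration (real text processing needs more care)
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

freqs = term_frequencies("The cat sat on the mat. The cat slept.")
print(freqs.most_common(2))  # [('the', 3), ('cat', 2)]
```

`Counter.most_common` is a convenient way to pull out the top terms once tokenization is done; the course demos use fancier tokenizers, but the counting idea is the same.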

Wed-Thur Jan 16-17

Lecture 2: Basic text analysis demo, co-occurrence analysis
[slides]
[Jupyter notebook (basic text analysis)]
[Jupyter notebook (co-occurrence analysis)]
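The co-occurrence idea from Lecture 2 can be sketched in a few lines (an illustrative toy, not the course demo; the documents and pairing-by-document choice are assumptions for the example):

```python
from collections import Counter
from itertools import combinations

def cooccurrences(docs):
    # Count unordered word pairs that appear together in the same document
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))  # sort so pairs are canonical
        counts.update(combinations(words, 2))
    return counts

docs = ["apple banana", "apple banana cherry", "banana cherry"]
pairs = cooccurrences(docs)
print(pairs[("apple", "banana")])  # 2
```

In practice one would also normalize these raw counts (e.g., into probabilities or PMI scores) before interpreting them.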

Fri Jan 18

Recitation 1: Basic Python review
[Jupyter notebook]

Mon-Tue Jan 21-22

No class due to MLK Jr. Day (even though Tuesday is not a holiday, to keep the three sections synchronized, there will be no class on Tuesday for Section C3)

Wed-Thur Jan 23-24

Lecture 3: Finding possibly related entities, PCA, Isomap

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]

PCA additional reading (technical):
[Abdi and Williams's PCA review]
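A minimal sketch of running PCA in scikit-learn (toy synthetic data, made up for illustration): the points are stretched along one direction, so the first principal component should capture nearly all the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy 2-D data stretched along the line y = x, so the first principal
# component should explain almost all of the variance
rng = np.random.RandomState(0)
base = rng.randn(100, 1)
X = np.hstack([base, base + 0.1 * rng.randn(100, 1)])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # first entry is close to 1
```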

Fri Jan 25

Recitation 2: TBA

HW1 due 11:59pm, HW2 released

Mon-Tue Jan 28-29

Lecture 4: t-SNE

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

Additional dimensionality reduction reading (technical):
[Isomap webpage]
[Simon Carbonnelle's t-SNE slides]
[t-SNE webpage]
["An Analysis of the t-SNE Algorithm for Data Visualization" (Arora et al, COLT 2018)]
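For context alongside the scikit-learn example linked above, here is a minimal t-SNE sketch (toy data made up for illustration): two well-separated blobs in 10 dimensions get embedded into 2 dimensions for visualization.

```python
import numpy as np
from sklearn.manifold import TSNE

# Two well-separated Gaussian blobs in 10-D; t-SNE embeds them in 2-D
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 10), rng.randn(50, 10) + 10])

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (100, 2)
```

Keep in mind (as the lecture discusses) that t-SNE output depends heavily on perplexity and the random seed, so it is worth re-running with several settings.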

Wed-Thur Jan 30-31

Lecture 5: Introduction to clustering, k-means, Gaussian mixture models

Additional clustering reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]
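A minimal k-means sketch in scikit-learn (toy blobs made up for illustration), showing the basic fit-then-inspect-labels workflow from Lecture 5:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated 2-D blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(len(set(km.labels_)))  # 2 clusters found
```

Swapping `KMeans` for `sklearn.mixture.GaussianMixture` gives soft cluster assignments instead of hard ones, which is the Gaussian-mixture view covered in the same lecture.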

Fri Feb 1

Recitation 3: t-SNE

Mon-Tue Feb 4-5

Lecture 6: Clustering and clustering interpretation demo, automatic selection of k with CH index

Additional clustering reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

Python cluster evaluation:
[scikit-learn cluster evaluation metrics (many, including CH index)]
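Using the scikit-learn metrics linked above, choosing k by maximizing the CH index can be sketched as follows (toy data and the candidate range of k are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Three well-separated planted clusters
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2),
               rng.randn(50, 2) + [5, 5],
               rng.randn(50, 2) + [5, -5]])

# Pick k by maximizing the CH index over a small range of candidates
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k)  # the CH index typically peaks at the true number of clusters
```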

Wed-Thur Feb 6-7

Lecture 7: Hierarchical clustering, topic modeling

Additional reading (technical):
[see Section 14.3.12 "Hierarchical Clustering" of the book "Elements of Statistical Learning"]
[David Blei's general intro to topic modeling]
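A minimal hierarchical-clustering sketch using SciPy (toy blobs made up for illustration): build the linkage, then cut the dendrogram at a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated 2-D blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 8])

# Agglomerative clustering with complete linkage, then cut the
# dendrogram so that exactly 2 clusters remain
Z = linkage(X, method="complete")
labels = fcluster(Z, t=2, criterion="maxclust")
print(sorted(set(labels)))  # [1, 2]
```

The dendrogram itself (via `scipy.cluster.hierarchy.dendrogram`) is often more informative than any single cut, since it shows the whole merge sequence.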

Fri Feb 8

Recitation 4: Quiz review session

HW2 due 11:59pm, HW3 released

Part II. Predictive data analysis
Mon-Tue Feb 11-12

Lecture 8: Introduction to predictive analytics, nearest neighbors, evaluating prediction methods, decision trees
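A minimal sketch of the prediction-and-evaluation workflow from Lecture 8 (the dataset choice and split fraction are assumptions for illustration): hold out a test set, fit a nearest-neighbor classifier, and evaluate on the held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Hold out 30% of the data for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
acc = knn.score(X_te, y_te)
print(round(acc, 2))  # test-set accuracy
```

Replacing `KNeighborsClassifier` with `sklearn.tree.DecisionTreeClassifier` exercises the decision-tree half of the lecture with the same evaluation pattern.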

Wed-Thur Feb 13-14

Lecture 9: Support vector machines, decision boundaries, ROC curves
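A minimal SVM-plus-ROC sketch (toy data made up for illustration): fit a classifier, score points by signed distance to the decision boundary, and summarize with the area under the ROC curve.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

# Two mostly separable 2-D classes
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 3])
y = np.array([0] * 50 + [1] * 50)

clf = SVC().fit(X, y)
scores = clf.decision_function(X)      # signed distances to the boundary
fpr, tpr, _ = roc_curve(y, scores)
roc_auc = auc(fpr, tpr)
print(round(roc_auc, 2))  # area under the ROC curve
```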

Fri Feb 15

Mid-mini quiz (same time/place as recitation); in case of space issues, we do have an overflow room booked (HBH 1002)
Mon-Tue Feb 18-19

Lecture 10: Introduction to neural nets and deep learning

Mike Jordan's Medium article on where AI is at (April 2018):
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Wed-Thur Feb 20-21

Lecture 11: Image analysis with CNNs (also called convnets)

CNN reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]

Fri Feb 22

Recitation 5: Final exam review

Mon-Tue Feb 25-26

Lecture 12: Time series analysis with RNNs, roughly how learning a deep net works (gradient descent and variants)

LSTM reading:
[Christopher Olah's "Understanding LSTM Networks"]

Videos on learning neural nets (warning: the loss function used is not the same as what we are using in 95-865):
["Gradient descent, how neural networks learn | Chapter 2, deep learning" by 3Blue1Brown]
["What is backpropagation really doing? | Chapter 3, deep learning" by 3Blue1Brown]

Recent heuristics/theory on gradient descent variants for deep nets (technical):
["Don't Decay the Learning Rate, Increase the Batch Size" by Smith et al (ICLR 2018)]
["Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification" by Jain et al (JMLR 2018)]

Wed-Thur Feb 27-28

Lecture 13: Interpreting what a deep net is learning, other deep learning topics, wrap-up

Gary Marcus's Medium article on limitations of deep learning and his heated debate with Yann LeCun (December 2018):
["The deepest problem with deep learning"]

Some interesting reads (technical):
["Understanding deep learning requires rethinking generalization" by Zhang et al (ICLR 2017)]
["Relational inductive biases, deep learning, and graph networks" by Battaglia et al (2018)]

Fri Mar 1

Final exam (same time/place as recitation); in case of space issues, we do have an overflow room booked (HBH 1002)

HW3 due 11:59pm

Mini-3 final exam week Mar 4-7

No class