95-865: Unstructured Data Analytics (Fall 2018 Mini 2)

Class time and location:

Times are given in each location's local time zone.

Instructor: George Chen (georgechen [at symbol] cmu.edu)

Teaching assistants:

Office hours:

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know what structure underlies the data ahead of time, which is why the data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.

Many examples are given of how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods for analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.

We will be writing a lot of Python code and working with Amazon Web Services (AWS) for cloud computing (including using GPUs).

Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken a Heinz introductory Python course (90-812, 95-880, 95-888) or the Applied Data Science course (16-791). If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

Helpful but not required: math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading: Homework 20%, mid-mini quiz 35%, final exam 45%. If you do better on the final exam than the mid-mini quiz, then your final exam score clobbers your mid-mini quiz score (thus, the quiz does not count for you, and instead your final exam counts for 80% of your grade).

Syllabus: [Pittsburgh] [Adelaide]

Calendar (tentative)

Warning: As this course is still relatively new, the lecture slides are a bit rough and may contain bugs. To provide feedback/bug reports, please directly contact the instructor, George (georgechen [at symbol] cmu.edu). The Spring 2018 mini 4 course website is available here.

Pittsburgh

Date Topic Supplemental Material
Part I. Exploratory data analysis
Tue Oct 23

Lecture 1: Course overview, basic text processing and frequency analysis
[slides]

HW1 released (check Canvas)!

Thur Oct 25

Lecture 2: Basic text analysis demo, co-occurrence analysis
[slides]
[Jupyter notebook (spaCy)]
[Jupyter notebook (co-occurrences)]
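
To get a rough feel for the idea behind co-occurrence analysis before diving into the notebooks, here is a minimal sketch (this is illustrative only and not the course's notebook code): count how often each pair of distinct words appears in the same document.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how often each pair of distinct words appears in the same document."""
    counts = Counter()
    for doc in documents:
        # deduplicate and sort so each pair is counted once per document
        words = sorted(set(doc.lower().split()))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts

docs = ["apple banana cherry", "apple banana", "banana cherry"]
counts = cooccurrence_counts(docs)
print(counts[("apple", "banana")])  # 2
```

Real text would first be tokenized properly (e.g., with spaCy, as in the notebook above) rather than split on whitespace.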

Fri Oct 26

Recitation 1: Basic Python review
[Jupyter notebook]

Tue Oct 30

Lecture 3: Finding possibly related entities, PCA, Isomap
[slides]
[Jupyter notebook (PCA)]
[Jupyter notebook (multidimensional scaling)]

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]

PCA additional reading (technical):
[Abdi and Williams's PCA review]
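
For a concrete sense of what PCA computes, here is a minimal NumPy sketch (an illustration only, not the course's notebook code): center the data and project onto the top right singular vectors, which are the directions of largest variance.

```python
import numpy as np

def pca(X, num_components):
    """Project data onto the top principal components (directions of largest variance)."""
    X_centered = X - X.mean(axis=0)
    # right singular vectors of the centered data matrix are the principal directions
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:num_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 points in 5 dimensions
Z = pca(X, 2)                  # reduce to 2 dimensions
print(Z.shape)  # (100, 2)
```

The resulting coordinates are uncorrelated, with the first component carrying the most variance.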

Thur Nov 1

Lecture 4: t-SNE
[slides]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]
[Jupyter notebook (t-SNE)]

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

Additional dimensionality reduction reading (technical):
[Isomap webpage]
[Simon Carbonnelle's t-SNE slides]
[t-SNE webpage]
["An Analysis of the t-SNE Algorithm for Data Visualization" (Arora et al, COLT 2018)]

Fri Nov 2

Recitation 2: t-SNE
[Jupyter notebook]

Tue Nov 6

Lecture 5: Introduction to clustering, k-means, Gaussian mixture models
[slides]

HW1 due 4:30pm (start of class), HW2 released!

Additional clustering reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

Python cluster evaluation:
[scikit-learn cluster evaluation metrics (many, including CH index)]
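
As a preview of how the CH index can be used to choose the number of clusters k (a NumPy-only sketch for illustration; scikit-learn, linked above, provides production versions of both k-means and this metric):

```python
import numpy as np

def kmeans(X, k, num_iters=50, seed=0):
    """Plain Lloyd's algorithm: alternately assign points to the nearest center and update centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(num_iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        # keep a center unchanged if its cluster becomes empty
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

def ch_index(X, labels, centers):
    """Calinski-Harabasz index: ratio of between- to within-cluster dispersion."""
    n, k = len(X), len(centers)
    overall_mean = X.mean(axis=0)
    between = sum((labels == j).sum() * ((centers[j] - overall_mean) ** 2).sum() for j in range(k))
    within = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return (between / (k - 1)) / (within / (n - k))

# two well-separated blobs: the CH index should be largest at k = 2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])
scores = {k: ch_index(X, *kmeans(X, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
```

Picking the k with the highest CH score is the "automatic selection of k" idea covered in Lecture 6.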

Thur Nov 8

Lecture 6: Clustering and clustering interpretation demo, automatic selection of k with CH index
[slides]
[Jupyter notebook (clustering)]

Fri Nov 9

Recitation 3: Quiz review session

Tue Nov 13 Mid-mini quiz
Wed Nov 14 George's regular office hours are cancelled for this week
Thur Nov 15

Lecture 7: Clustering (wrap-up), topic modeling

Additional reading:
[see Section 14.3.12 "Hierarchical Clustering" of the book "Elements of Statistical Learning"]
[David Blei's general intro to topic modeling]

Part II. Predictive data analysis
Fri Nov 16

Lecture 8 (this is not a typo): Introduction to predictive analytics, nearest neighbors, evaluating prediction methods
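
Nearest neighbor prediction is simple enough to sketch in a few lines of NumPy (an illustration of the idea only, not the course's code): each test point receives the label of its closest training point.

```python
import numpy as np

def predict_1nn(X_train, y_train, X_test):
    """1-nearest-neighbor classifier: label each test point by its closest training point."""
    # squared Euclidean distances between every test point and every training point
    dists = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[np.argmin(dists, axis=1)]

X_train = np.array([[0.0], [1.0], [10.0], [11.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.5], [10.5]])
print(predict_1nn(X_train, y_train, X_test))  # [0 1]
```

Evaluating such a method fairly (held-out data, cross-validation) is a main theme of this lecture.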

Tue Nov 20

Lecture 9: Neural nets and deep learning

HW2 due 4:30pm (start of class); HW3 released

Mike Jordan's Medium article on where AI is at (April 2018):
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

The Atlantic's article about Judea Pearl's concerns with AI not focusing on causal reasoning (May 2018):
["How a Pioneer of Machine Learning Became One of Its Sharpest Critics"]

Thur Nov 22

Thanksgiving: no class

Tue Nov 27

Lecture 10: Image analysis with CNNs (also called convnets)

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]

Thur Nov 29

Lecture 11: Time series analysis with RNNs, roughly how learning a deep net works (gradient descent and variants)

LSTM reading:
[Christopher Olah's "Understanding LSTM Networks"]

Videos on learning neural nets (warning: the loss function used is not the same as what we are using):
["Gradient descent, how neural networks learn | Chapter 2, deep learning" by 3Blue1Brown]
["What is backpropagation really doing? | Chapter 3, deep learning" by 3Blue1Brown]
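
The core of gradient descent fits in a few lines (a toy sketch on a one-dimensional quadratic, not the losses used for deep nets): repeatedly step opposite the gradient to reduce the function's value.

```python
def gradient_descent(grad, x0, step_size=0.1, num_steps=100):
    """Repeatedly step in the direction opposite the gradient to (locally) minimize a function."""
    x = x0
    for _ in range(num_steps):
        x = x - step_size * grad(x)
    return x

# minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3); the minimum is at w = 3
w = gradient_descent(lambda w: 2 * (w - 3.0), x0=0.0)
print(round(w, 4))  # 3.0
```

Training a deep net uses the same loop, with backpropagation computing the gradient and variants (minibatches, momentum, Adam) speeding things up.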

Fri Nov 30

Recitation 4

HW3 due 8:29am (end of fall classes for Adelaide)

Tue Dec 4

Lecture 12: Interpreting what a deep net is learning, other deep learning topics, wrap-up

Some interesting reads:
["Understanding deep learning requires rethinking generalization" by Zhang et al (ICLR 2017)]
["Relational inductive biases, deep learning, and graph networks" by Battaglia et al (2018)]

Thur Dec 6

No class

Fri Dec 7

Recitation 5: Final exam review session

Fri Dec 14

Final exam 1pm HBH 1002

Adelaide

Date Topic Supplemental Material
Part I. Exploratory data analysis
Tue Oct 23

Lecture 1: Course overview, basic text processing, co-occurrence analysis (finding possibly related entities with discrete outcomes)
[slides]
[Jupyter notebook (spaCy)]
[Jupyter notebook (co-occurrences)]

HW1 released (check Canvas)!

Pittsburgh Recitation 1: Basic Python review
[Jupyter notebook]

Tue Oct 30

Lecture 2: Finding possibly related entities, PCA, manifold learning (Isomap, t-SNE)
[slides]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]
[Jupyter notebook (PCA)]
[Jupyter notebook (multidimensional scaling)]
[Jupyter notebook (t-SNE)]

Pittsburgh Recitation 2: t-SNE
[Jupyter notebook]

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

PCA additional reading (technical):
[Abdi and Williams's PCA review]

Additional dimensionality reduction reading (technical):
[Isomap webpage]
[Simon Carbonnelle's t-SNE slides]
[t-SNE webpage]
["An Analysis of the t-SNE Algorithm for Data Visualization" (Arora et al, COLT 2018)]

Tue Nov 6

Lecture 3: Introduction to clustering, k-means, Gaussian mixture models, automatic selection of k with CH index
[slides]
[Jupyter notebook (clustering)]

Additional clustering reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

Python cluster evaluation:
[scikit-learn cluster evaluation metrics (many, including CH index)]

Wed Nov 7

HW1 due at 8am (start of Pittsburgh class); HW2 released!

Tue Nov 13

Lecture 4: Clustering (wrap-up), topic modeling
Mid-mini quiz (80 minutes)

Additional reading:
[see Section 14.3.12 "Hierarchical Clustering" of the book "Elements of Statistical Learning"]
[David Blei's general intro to topic modeling]
[the original paper on DP-means (fairly technical)]

Part II. Predictive data analysis
Tue Nov 20

Lecture 5: Introduction to predictive analytics, nearest neighbors, evaluating prediction methods, neural nets and deep learning

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]

Mike Jordan's Medium article on where AI is at (April 2018):
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

The Atlantic's article about Judea Pearl's concerns with AI not focusing on causal reasoning (May 2018):
["How a Pioneer of Machine Learning Became One of Its Sharpest Critics"]

Wed Nov 21

HW2 due at 8am (start of Pittsburgh class); HW3 released!

Tue Nov 27

Lecture 6: Image analysis with CNNs (also called convnets), time series analysis with RNNs, roughly how learning a deep net works (gradient descent and variants)

LSTM reading:
[Christopher Olah's "Understanding LSTM Networks"]

Videos on learning neural nets (warning: the loss function used is not the same as what we are using):
["Gradient descent, how neural networks learn | Chapter 2, deep learning" by 3Blue1Brown]
["What is backpropagation really doing? | Chapter 3, deep learning" by 3Blue1Brown]

Some interesting reads:
["Understanding deep learning requires rethinking generalization" by Zhang et al (ICLR 2017)]
["Relational inductive biases, deep learning, and graph networks" by Battaglia et al (2018)]

Fri Nov 30

HW3 due 11:59pm

Tue Dec 4

Final exam