95-865: Unstructured Data Analytics (Fall 2018 Mini 2)

Class time and location:

Times are given in the local time zone of each location.

Instructor: George Chen (georgechen [at symbol] cmu.edu)

Teaching assistants:

Office hours:

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why the data is often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.

Many examples are given of how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods in unstructured data analysis, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.

We will be writing a lot of Python and working with Amazon Web Services (AWS) for cloud computing (including using GPUs).

Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 16-791 "Applied Data Science". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

Helpful but not required: math at the level of calculus and linear algebra may help you appreciate some of the material more fully.

Grading: Homework 20%, mid-mini quiz 35%, final exam 45%. If you do better on the final exam than the mid-mini quiz, then your final exam score clobbers your mid-mini quiz score (thus, the quiz does not count for you, and instead your final exam counts for 80% of your grade).

Syllabus: [Pittsburgh] [Adelaide]

Calendar (tentative)

Warning: As this course is still relatively new, the lecture slides are a bit rough and may contain bugs. To provide feedback/bug reports, please directly contact the instructor, George (georgechen [at symbol] cmu.edu). The Spring 2018 mini 4 course website is available here.

Pittsburgh

Date Topic Supplemental Material
Part I. Exploratory data analysis
Tue Oct 23

Lecture 1: Course overview, basic text processing and frequency analysis
[slides]

HW1 released (check Canvas)!
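
For a quick taste of what Lecture 1 covers, here is a minimal word-frequency sketch using spaCy and Python's collections.Counter; it is not the lecture demo itself, and the toy text is made up for illustration:

from collections import Counter

import spacy

# Assumes the small English model has been installed via:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup. Apple declined to comment.")

# Count lemmas of alphabetic, non-stopword tokens.
counts = Counter(token.lemma_.lower()
                 for token in doc
                 if token.is_alpha and not token.is_stop)
print(counts.most_common(5))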

Thur Oct 25

Lecture 2: Basic text analysis demo, co-occurrence analysis
[slides]
[Jupyter notebook (spaCy)]
[Jupyter notebook (co-occurrences)]
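
In the same spirit as the co-occurrence notebook (but not taken from it), here is a hypothetical sketch that counts how often pairs of named entities appear in the same sentence; the toy text is made up:

from collections import Counter
from itertools import combinations

import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Alice met Bob in Pittsburgh. Bob emailed Carol. "
        "Alice and Carol visited Pittsburgh.")

co_counts = Counter()
for sent in nlp(text).sents:
    # Treat two entities as co-occurring if they share a sentence.
    entities = sorted({ent.text for ent in sent.ents})
    for pair in combinations(entities, 2):
        co_counts[pair] += 1
print(co_counts.most_common())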

Fri Oct 26

Recitation 1: Basic Python review
[Jupyter notebook]

Tue Oct 30

Lecture 3: Finding possibly related entities, PCA, Isomap
[slides]
[Jupyter notebook (PCA)]
[Jupyter notebook (multidimensional scaling)]

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]

PCA additional reading (technical):
[Abdi and Williams's PCA review]
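
As a rough illustration of this lecture's dimensionality reduction methods (not the lecture notebooks themselves), here is a minimal scikit-learn sketch; the handwritten digits dataset is just a stand-in for real data:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, y = load_digits(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)     # linear projection
X_iso = Isomap(n_components=2).fit_transform(X)  # nonlinear manifold learning
print(X_pca.shape, X_iso.shape)                  # both are (1797, 2)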

Thur Nov 1

Lecture 4: t-SNE
[slides]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]
[Jupyter notebook (t-SNE)]

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

Additional dimensionality reduction reading (technical):
[Isomap webpage]
[Simon Carbonnelle's t-SNE slides]
[t-SNE webpage]
["An Analysis of the t-SNE Algorithm for Data Visualization" (Arora et al, COLT 2018)]

Fri Nov 2

Recitation 2: t-SNE
[Jupyter notebook]

Tue Nov 6

Lecture 5: Introduction to clustering, k-means, Gaussian mixture models
[slides]

HW1 due 4:30pm (start of class); HW2 released!

Additional clustering reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

Python cluster evaluation:
[scikit-learn cluster evaluation metrics (many, including CH index)]
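
For reference, here is a minimal scikit-learn sketch of the two clustering models from this lecture, run on synthetic blob data rather than anything from the course:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans_labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)
print(kmeans_labels[:10])
print(gmm_labels[:10])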

Thur Nov 8

Lecture 6: Clustering and clustering interpretation demo, automatic selection of k with CH index
[slides]
[Jupyter notebook (clustering)]
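
Here is a minimal sketch of automatically choosing k with the CH index via scikit-learn (note that older scikit-learn versions spell the function calinski_harabaz_score); synthetic blobs stand in for real data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)  # higher is better
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)  # should be 3 for this toy example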

Fri Nov 9

Recitation 3: Quiz review session

Tue Nov 13

Mid-mini quiz

Wed Nov 14

George's regular office hours are cancelled for this week

Thur Nov 15

Lecture 7: Hierarchical clustering, topic modeling
[slides]
[Jupyter notebook (hierarchical clustering)]
[Jupyter notebook (Latent Dirichlet Allocation)]

Additional reading (technical):
[see Section 14.3.12 "Hierarchical Clustering" of the book "Elements of Statistical Learning"]
[David Blei's general intro to topic modeling]
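
As a rough companion to the two notebooks (not their actual contents), here is a minimal sketch of agglomerative clustering with SciPy and LDA topic modeling with scikit-learn; the toy documents are made up:

from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hierarchical clustering: build a dendrogram, then cut it into 3 clusters.
X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
Z = linkage(X, method="ward")
cluster_labels = fcluster(Z, t=3, criterion="maxclust")

# LDA: fit 2 topics to bag-of-words counts of toy documents.
docs = ["cats and dogs and pets", "dogs chase cats",
        "stocks and bonds", "bonds markets stocks trading"]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(cluster_labels[:10], lda.components_.shape)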

Part II. Predictive data analysis
Fri Nov 16

Lecture 8 (yes, a lecture on a Friday; this is not a typo): Introduction to predictive analytics, nearest neighbors, evaluating prediction methods, decision trees
[slides]
[Jupyter notebook (prediction and model validation)]
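
Here is a minimal sketch of the prediction-and-validation workflow from this lecture, using scikit-learn with the digits dataset as a stand-in:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
# Hold out a test set so we can estimate out-of-sample accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in [KNeighborsClassifier(n_neighbors=5),
              DecisionTreeClassifier(random_state=0)]:
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))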

Tue Nov 20

Lecture 9: Intro to neural nets and deep learning
[slides]
[Jupyter notebook (intro to neural nets and deep learning)]

Mike Jordan's Medium article on where AI is at (April 2018):
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

The Atlantic's article about Judea Pearl's concerns with AI not focusing on causal reasoning (May 2018):
["How a Pioneer of Machine Learning Became One of Its Sharpest Critics"]

HW2 due 4:30pm (start of class); HW3 released!

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Wed Nov 21

Adelaide Recitation 5: Support vector machines (SVMs), cross validation, decision boundaries, ROC curves
[slides (support vector machines)]
[Jupyter notebook]
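
For reference, here is a minimal scikit-learn sketch of an SVM with cross validation and an ROC curve, on a made-up binary version of the digits data (not the recitation notebook itself):

from sklearn.datasets import load_digits
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
y_binary = (y == 0).astype(int)  # "is this digit a 0?" as a binary task

clf = SVC(kernel='rbf', gamma='scale')
print(cross_val_score(clf, X, y_binary, cv=5))  # 5-fold CV accuracies

# ROC curves need real-valued scores, not just 0/1 predictions.
scores = clf.fit(X, y_binary).decision_function(X)
fpr, tpr, thresholds = roc_curve(y_binary, scores)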

Thur Nov 22

Thanksgiving: no class

Tue Nov 27

Lecture 10: Image analysis with CNNs (also called convnets)
[slides]
[Jupyter notebook (CNN demo)]

CNN reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]
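
Here is a minimal Keras convnet sketch for 28x28 grayscale images; the specific layer sizes are assumptions for illustration and may differ from the demo:

from keras.layers import Conv2D, Dense, Flatten, MaxPooling2D
from keras.models import Sequential

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D((2, 2)))  # downsample the feature maps
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()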

Thur Nov 29

Lecture 11: Time series analysis with RNNs, roughly how learning a deep net works (gradient descent and variants)
[slides]
[Jupyter notebook (RNN demo)]

LSTM reading:
[Christopher Olah's "Understanding LSTM Networks"]

Videos on learning neural nets (warning: the loss function used is not the same as what we are using in 95-865):
["Gradient descent, how neural networks learn | Chapter 2, deep learning" by 3Blue1Brown]
["What is backpropagation really doing? | Chapter 3, deep learning" by 3Blue1Brown]

Recent heuristics/theory on gradient descent variants for deep nets (technical):
["Don't Decay the Learning Rate, Increase the Batch Size" by Smith et al (ICLR 2018)]
["Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification" by Jain et al (JMLR 2018)]

Fri Nov 30

Recitation 4: Word embeddings as an example of self-supervised learning
[slides]
[Jupyter notebook]
Important: You do not need to know the technical details of the demo (also, for this class you do not need to know Pandas)
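
Here is a minimal word embedding sketch using gensim's word2vec implementation (not necessarily what the recitation notebook uses; older gensim versions call the dimension argument size rather than vector_size), with made-up toy sentences:

from gensim.models import Word2Vec

sentences = [["cats", "chase", "mice"],
             ["dogs", "chase", "cats"],
             ["mice", "eat", "cheese"]]
# Learn 50-dimensional vectors from (tiny, toy) co-occurrence statistics.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv.most_similar("cats"))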

Tue Dec 4

Lecture 12: Interpreting what a deep net is learning, other deep learning topics, wrap-up
[slides]
[fashion_mnist_cnn_model.h5 (pre-trained convnet needed for demo)]
[Jupyter notebook (interpreting a CNN)]

Gary Marcus's Medium article on limitations of deep learning and his heated debate with Yann LeCun (December 2018):
["The deepest problem with deep learning"]

HW3 due 4:30pm (start of class)

Some interesting reads (technical):
["Understanding deep learning requires rethinking generalization" by Zhang et al (ICLR 2017)]
["Relational inductive biases, deep learning, and graph networks" by Battaglia et al (2018)]

Thur Dec 6

No class

Fri Dec 7

Recitation 5: Final exam review session

Fri Dec 14

Final exam, 1pm-4pm, HBH 1002

Adelaide

Date Topic Supplemental Material
Part I. Exploratory data analysis
Tue Oct 23

Lecture 1: Course overview, basic text processing, co-occurrence analysis (finding possibly related entities with discrete outcomes)
[slides]
[Jupyter notebook (spaCy)]
[Jupyter notebook (co-occurrences)]

HW1 released (check Canvas)!

Pittsburgh Recitation 1: Basic Python review
[Jupyter notebook]

Tue Oct 30

Lecture 2: Finding possibly related entities, PCA, manifold learning (Isomap, t-SNE)
[slides]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]
[Jupyter notebook (PCA)]
[Jupyter notebook (multidimensional scaling)]
[Jupyter notebook (t-SNE)]

Pittsburgh Recitation 2: t-SNE
[Jupyter notebook]

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

PCA additional reading (technical):
[Abdi and Williams's PCA review]

Additional dimensionality reduction reading (technical):
[Isomap webpage]
[Simon Carbonnelle's t-SNE slides]
[t-SNE webpage]
["An Analysis of the t-SNE Algorithm for Data Visualization" (Arora et al, COLT 2018)]

Tue Nov 6

Lecture 3: Introduction to clustering, k-means, Gaussian mixture models, automatic selection of k with CH index
[slides]
[Jupyter notebook (clustering)]

Additional clustering reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

Python cluster evaluation:
[scikit-learn cluster evaluation metrics (many, including CH index)]

Wed Nov 7

HW1 due at 8am (start of Pittsburgh class); HW2 released!

Tue Nov 13

Lecture 4: Hierarchical clustering, topic modeling
Mid-mini quiz (80 minutes)
[slides]
[Jupyter notebook (hierarchical clustering)]
[Jupyter notebook (Latent Dirichlet Allocation)]

Additional reading (technical):
[see Section 14.3.12 "Hierarchical Clustering" of the book "Elements of Statistical Learning"]
[David Blei's general intro to topic modeling]

Part II. Predictive data analysis
Tue Nov 20

Lecture 5: Introduction to predictive data analytics, neural nets, and deep learning
[slides]
[Jupyter notebook (intro to predictive analytics)]
[Jupyter notebook (intro to neural nets and deep learning)]

Mike Jordan's Medium article on where AI is at (April 2018):
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

The Atlantic's article about Judea Pearl's concerns with AI not focusing on causal reasoning (May 2018):
["How a Pioneer of Machine Learning Became One of Its Sharpest Critics"]

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Wed Nov 21

Recitation 5: Support vector machines (SVMs), cross validation, decision boundaries, ROC curves
[slides (support vector machines)]
[Jupyter notebook]

HW2 due at 8am (start of Pittsburgh class); HW3 released!

Tue Nov 27

Lecture 6: Image analysis with CNNs (also called convnets), time series analysis with RNNs, roughly how learning a deep net works (gradient descent and variants)
[slides]
[Jupyter notebook (CNN demo)]
[Jupyter notebook (RNN demo)]

Pittsburgh Recitation 4: Word embeddings as an example of self-supervised learning
[slides]
[Jupyter notebook]
Important: You do not need to know the technical details of the demo (also, for this class you do not need to know Pandas)

CNN reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]

LSTM reading:
[Christopher Olah's "Understanding LSTM Networks"]

Videos on learning neural nets (warning: the loss function used is not the same as what we are using in 95-865):
["Gradient descent, how neural networks learn | Chapter 2, deep learning" by 3Blue1Brown]
["What is backpropagation really doing? | Chapter 3, deep learning" by 3Blue1Brown]

Recent heuristics/theory on gradient descent variants for deep nets (technical):
["Don't Decay the Learning Rate, Increase the Batch Size" by Smith et al (ICLR 2018)]
["Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification" by Jain et al (JMLR 2018)]

Some interesting reads (technical):
["Understanding deep learning requires rethinking generalization" by Zhang et al (ICLR 2017)]
["Relational inductive biases, deep learning, and graph networks" by Battaglia et al (2018)]

Wed Dec 5

HW3 due 8am (start of Pittsburgh class)

Thur Dec 6

Final exam 9am-12pm