95-865: Unstructured Data Analytics (Fall 2017)

Class time and location:

Instructor: George Chen (contact info on website)

Teaching assistants: Castiel Huang (Adelaide), Emaad Manzoor, Rashmi Raghunandan, Runshan Fu, Yoonjung Kim

Office hours:

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post publicly so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!). If your question is specific to you and would not benefit others, you can reach all of the course staff (TAs and instructor) by emailing:

uda-course-f17 [at symbol here!] lists.andrew.cmu.edu

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data is often called "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.

Many examples are given for how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods in analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.

We will be coding lots of Python and working with Amazon AWS for cloud computing (including using GPUs).
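
As a rough illustration of the two-step approach described above, here is a minimal sketch using only the Python standard library. It is not taken from the course materials: the toy corpus, the labels, and the helper names (`tokenize`, `term_counts_by_label`, `predict`) are all hypothetical, and the word-overlap scoring is a deliberately simple stand-in for the prediction methods covered in the course.

```python
from collections import Counter

# Hypothetical toy corpus: (text, label) pairs invented for illustration.
DOCS = [
    ("the battery died after one day", "complaint"),
    ("battery life is terrible", "complaint"),
    ("great screen and great sound", "praise"),
    ("love the screen", "praise"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def term_counts_by_label(docs):
    """Step 1 (explore): count how often each word appears under each label."""
    counts = {}
    for text, label in docs:
        counts.setdefault(label, Counter()).update(tokenize(text))
    return counts


def predict(text, counts):
    """Step 2 (predict): pick the label whose word counts overlap the text most."""
    tokens = tokenize(text)
    scores = {label: sum(c[t] for t in tokens) for label, c in counts.items()}
    return max(scores, key=scores.get)


counts = term_counts_by_label(DOCS)

# Exploration: inspect the most frequent words for one label.
print(counts["complaint"].most_common(2))

# Prediction: use the structure found above to label a new piece of text.
print(predict("the battery is terrible", counts))
```

In this sketch, the per-label word counts play the role of the "structure" found during exploration, and the prediction step simply reuses those counts. Real analyses would likely use richer text processing and a proper classifier.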

Prerequisite: Python coding experience

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more

Grading: Equal weights on HW1, HW2, HW3, final exam

Calendar

Warning: As this is the first offering of this course, the slides are a bit rough and may contain bugs. To provide feedback/bug reports, please directly contact the instructor, George (email info is on his homepage).

Pittsburgh

Date Topic Supplemental Material
Part 1. Exploratory data analysis
Tuesday Oct 24, 2017

Course introduction, basic text processing, frequency analysis
[course introduction slides (updated 11/3)]
[basic text processing and frequency analysis slides (updated 11/3)]
[spaCy demo (Jupyter notebook)]

Python review by Emaad:
[github]

Some Python resources:
[Dive into Python 3]
[Computational and Inferential Thinking, "Programming in Python" chapter]

Thursday Oct 26, 2017

Finding possibly related features: co-occurrence analysis, scatter plots, correlation, causation
[slides (updated 11/3)]

HW1 released! (Check Canvas)

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]

Tuesday Oct 31, 2017

Visualizing high-dimensional vectors: PCA, introduction to manifold learning, Isomap, t-SNE
[slides (updated 11/3)]
[slides without animation (truncated by Castiel)]
[PCA demo (Jupyter notebook)]
[Multidimensional Scaling demo (Jupyter notebook)]
[t-SNE demo (Jupyter notebook)]

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

Additional reading:
[Abdi and Williams's PCA review]
[Isomap webpage]
[Simon Carbonnelle's t-SNE slides]
[Wattenberg, et al., "How to Use t-SNE Effectively", Distill, 2016]
[t-SNE webpage]

Thursday Nov 2, 2017

Clustering: introduction, k-means, Gaussian mixture models
[Isomap + t-SNE recap: see updated slides + multidimensional scaling demo for Oct 31 lecture above]
[slides (updated 11/7)]
[slides without animation (truncated by Castiel)]

Additional reading:
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

Tuesday Nov 7, 2017

Automatically choosing the number of clusters: DP-GMMs, DP-means, CH index (see also gap statistic)
[slides (updated 11/9)]
[slides without animation (truncated by Castiel)]
[k-means, GMM, DP-GMM demo (Jupyter notebook)]

HW1 due at 4:30pm, HW2 released!

Python cluster evaluation:
[scikit-learn cluster evaluation metrics (many, including CH index)]
[gap statistic]

Additional reading:
[the original paper on DP-means (fairly technical)]
[paper on the gap statistic (fairly technical)]

Thursday Nov 9, 2017

Hierarchical clustering, topic modeling
[slides]
[slides without animation (truncated by Castiel)]
[Latent Dirichlet Allocation demo (Jupyter notebook)]

Additional reading:
[see Section 14.3.12 "Hierarchical Clustering" of the book "Elements of Statistical Learning"]
[David Blei's general intro to topic modeling]

Part 2. Predictive data analysis
Tuesday Nov 14, 2017

Introduction to classification
[slides]
[demo]

Thursday Nov 16, 2017

No class: optional AWS tutorial instead, at the same time as class (led by Yoonjung)

Tuesday Nov 21, 2017

Adaptive nearest neighbor methods: decision trees and their use in ensembles (such as in random forests, AdaBoost, gradient tree boosting), and why they're nearest neighbor methods
[slides]
[slides (truncated)]

HW2 due at 11:59pm Wed Nov 22 (extended from 4:30pm), HW3 released!

Python code example:
[scikit-learn example with many classifiers]

Additional reading:
[Chapter 8 "Tree-Based Methods" of the book "Introduction to Statistical Learning"]

Thursday Nov 23, 2017

Thanksgiving: no class

Tuesday Nov 28, 2017

Introduction to deep learning
[slides]
[slides (truncated)]
[demo]
[video (lecture during Thanksgiving weekend for Adelaide)]

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Thursday Nov 30, 2017

Deep learning for analyzing images and time series
[slides]
[slides (truncated)]
[CNN demo (continuation of handwritten digit demo by George)]
[CNN demo (more closely related to HW3; put together by Rashmi)]
[RNN demo (put together by Runshan)]

Additional reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]
[Christopher Olah's "Understanding LSTM Networks"]

Tuesday Dec 5, 2017

Wrap-up of deep learning and of 95-865
[some important announcements]
[slides]
[slides (truncated)]

HW3 due at 4:30pm

Videos on learning neural nets:
["Gradient descent, how neural networks learn | Chapter 2, deep learning" by 3Blue1Brown]
["What is backpropagation really doing? | Chapter 3, deep learning" by 3Blue1Brown]

Thursday Dec 7, 2017

Review session

Friday Dec 15, 2017

Final exam: 1pm-4pm, HBH 1202

Adelaide

Date Topic Supplemental Material
Part 1. Exploratory data analysis
Friday Oct 27, 2017

Course introduction, basic text processing, frequency analysis
[course introduction slides (updated 11/3)]
[basic text processing and frequency analysis slides (updated 11/3)]
[spaCy demo (Jupyter notebook)]

Finding possibly related features: co-occurrence analysis, scatter plots, correlation, causation
[slides (updated 11/3)]

HW1 released! (Check Canvas)

Python review by Emaad:
[github]

Some Python resources:
[Dive into Python 3]
[Computational and Inferential Thinking, "Programming in Python" chapter]

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]

Friday Nov 3, 2017

Visualizing high-dimensional vectors: PCA, introduction to manifold learning, Isomap, t-SNE
[slides (updated 11/3)]
[slides without animation (truncated by Castiel)]
[PCA demo (Jupyter notebook)]
[Multidimensional Scaling demo (Jupyter notebook)]
[t-SNE demo (Jupyter notebook)]

Clustering: introduction, k-means, Gaussian mixture models
[slides (updated 11/7)]
[slides without animation (truncated by Castiel)]

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

Additional reading on dimensionality reduction:
[Abdi and Williams's PCA review]
[Isomap webpage]
[Simon Carbonnelle's t-SNE slides]
[Wattenberg, et al., "How to Use t-SNE Effectively", Distill, 2016]
[t-SNE webpage]

Additional reading on clustering:
[see Section 14.3 of the book "Elements of Statistical Learning"]

Wednesday Nov 8, 2017

HW1 due at 8am, HW2 released!

Friday Nov 10, 2017

Automatically choosing the number of clusters: DP-GMMs, DP-means, CH index (see also gap statistic)
[slides (updated 11/9)]
[slides without animation (truncated by Castiel)]
[k-means, GMM, DP-GMM demo (Jupyter notebook)]

Hierarchical clustering, topic modeling
[slides]
[slides without animation (truncated by Castiel)]
[Latent Dirichlet Allocation demo (Jupyter notebook)]

Python cluster evaluation:
[scikit-learn cluster evaluation metrics (many, including CH index)]
[gap statistic]

Additional reading:
[the original paper on DP-means (fairly technical)]
[paper on the gap statistic (fairly technical)]
[see Section 14.3.12 "Hierarchical Clustering" of the book "Elements of Statistical Learning"]
[David Blei's general intro to topic modeling]

Part 2. Predictive data analysis
Week of Friday Nov 17, 2017

Introduction to classification
[slides]
[demo]

Wednesday Nov 22, 2017

HW2 due at 3:29pm Thu Nov 23 (extended from 8am), HW3 released!

Week of Friday Nov 24, 2017

Adaptive nearest neighbor methods: decision trees and their use in ensembles (such as in random forests, AdaBoost, gradient tree boosting), and why they're nearest neighbor methods
[slides]
[slides (truncated)]

Introduction to deep learning
[video]
[slides]
[slides (truncated)]
[demo]

Python code example that includes adaptive nearest neighbor methods (not deep learning):
[scikit-learn example with many classifiers]

Additional reading:
[Chapter 8 "Tree-Based Methods" of the book "Introduction to Statistical Learning"]
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Friday Dec 1, 2017

Deep learning for analyzing images and time series, wrap-up of deep learning and 95-865
[slides]
[slides (truncated)]
[CNN demo (continuation of handwritten digit demo by George)]
[CNN demo (more closely related to HW3; put together by Rashmi)]
[RNN demo (put together by Runshan)]

Additional reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]
[Christopher Olah's "Understanding LSTM Networks"]

Videos on learning neural nets:
["Gradient descent, how neural networks learn | Chapter 2, deep learning" by 3Blue1Brown]
["What is backpropagation really doing? | Chapter 3, deep learning" by 3Blue1Brown]

Wednesday Dec 6, 2017

HW3 due at 8am

Friday Dec 8, 2017

Final exam: 9am-12pm, classroom 1