94-775: Unstructured Data Analytics for Policy
(Spring 2019 Mini 3)


Class time and location:

Instructor: George Chen (georgechen [at symbol] cmu.edu)

Teaching assistants: David Pinski (dpinski [at symbol] andrew.cmu.edu), Emaad Manzoor (emaad [at symbol] cmu.edu)

Office hours (starting second week of class):

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why the data are often referred to as "unstructured". This course takes a practical two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.
Many examples illustrate how these methods help solve real problems faced by organizations. The course includes a final project, which must address a policy question.

Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken 90-812 "Introduction to Programming with Python", 95-888 "Data-Focused Python", or 16-791 "Applied Data Science". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading: HW1 8%, HW2 8%, HW3 4%, mid-mini quiz 35%, final project proposal 10%, final project 35%. HW3 is shorter than HW1 and HW2. Letter grades are determined based on a curve.

Syllabus: [pdf]

Calendar (tentative)

Previous lecture slides and demos: The spring 2017 version of 94-775 (click to see course webpage) was quite different and did not have Python as a prerequisite. At this point, the closest material coverage is last mini's 95-865 (click to see course webpage).

Date Topic Supplemental Material
Part I. Exploratory data analysis
Tue Jan 15

Lecture 1: Course overview, basic text processing and frequency analysis
[slides]

HW1 released (check Canvas)!
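For a taste of what frequency analysis looks like before the demo, here is a minimal pure-Python sketch (not the course notebook; the toy text is made up) that counts raw term frequencies with `collections.Counter`:

```python
from collections import Counter

# Toy "corpus"; in practice you would tokenize real documents
text = ("the quick brown fox jumps over the lazy dog "
        "the dog barks at the fox")

tokens = text.split()        # crude whitespace tokenization
counts = Counter(tokens)     # raw term frequencies

print(counts.most_common(3))  # most frequent terms first
```

Real text needs more careful tokenization (punctuation, casing, stopwords), which is exactly what the lecture demo covers.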

Thur Jan 17

Lecture 2: Basic text analysis demo, co-occurrence analysis
[slides]
[Jupyter notebook (basic text analysis)]
[Jupyter notebook (co-occurrence analysis)]
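The idea behind co-occurrence analysis can be sketched in a few lines of pure Python (a toy version, not the lecture notebook): count, for every pair of terms, how many documents contain both.

```python
from collections import Counter
from itertools import combinations

# Toy documents; the lecture demo works with real text instead
docs = ["the fox chased the dog",
        "the dog slept",
        "the fox ate"]

co_occurrence = Counter()
for doc in docs:
    terms = sorted(set(doc.split()))              # unique terms per document
    co_occurrence.update(combinations(terms, 2))  # count pairs that co-occur

# e.g. ("fox", "the") co-occurs in two of the three documents
```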

Fri Jan 18

Recitation 1: Basic Python review
[Jupyter notebook 1]
[Jupyter notebook 2 (94-775 lecture 1 from spring 2018)]

Tue Jan 22

Lecture 3: Finding possibly related entities, PCA
[slides]
[Jupyter notebook (PCA)]

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]

PCA additional reading (technical):
[Abdi and Williams's PCA review]
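A minimal numpy sketch of the idea behind PCA (not the course notebook): center the data, take an SVD, and project onto the top principal directions. The toy data here is contrived so that a single component explains all the variance.

```python
import numpy as np

# Toy data: 5 points in 3-D that all lie along a single line,
# so the first principal component captures all of the variance
X = np.array([[ 2.0, 0.0, 1.0],
              [ 4.0, 1.0, 3.0],
              [ 6.0, 2.0, 5.0],
              [ 8.0, 3.0, 7.0],
              [10.0, 4.0, 9.0]])

X_centered = X - X.mean(axis=0)        # PCA starts by centering the data
U, S, Vt = np.linalg.svd(X_centered)   # rows of Vt are the principal directions
X_2d = X_centered @ Vt[:2].T           # project onto the top 2 components

# fraction of variance explained by the first component
explained = S[0] ** 2 / np.sum(S ** 2)
```

In the lecture demo this is done with scikit-learn's `PCA`, which wraps the same centering + SVD computation.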

Thur Jan 24

Lecture 4: Manifold learning with Isomap and t-SNE
[slides]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]
[Jupyter notebook (MDS, Isomap, t-SNE)]

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

Additional dimensionality reduction reading (technical):
[Isomap webpage]
[Simon Carbonnelle's t-SNE slides]
[t-SNE webpage]
["An Analysis of the t-SNE Algorithm for Data Visualization" (Arora et al, COLT 2018)]

Fri Jan 25

Recitation 2: t-SNE
[slides]
[Jupyter notebook]

HW1 due 11:59pm, HW2 released

Tue Jan 29

Lecture 5: Introduction to clustering, k-means, Gaussian mixture models
[slides]
[Jupyter notebook (dimensionality reduction: images)]
[Jupyter notebook (dimensionality reduction: drug consumption data)]

Additional clustering reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]
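The core of k-means (Lloyd's algorithm) fits in a few lines of numpy. This is a toy sketch, not the course demo: the two blobs are synthetic, and we hand-pick one seed point per region, whereas real implementations such as scikit-learn's `KMeans` use k-means++ initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs (50 points each)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

k = 2
# For this toy example, seed one center in each region
centers = X[[0, 50]].copy()
for _ in range(10):  # Lloyd's algorithm: alternate assignment and update
    # assign each point to its nearest center (squared Euclidean distance)
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # move each center to the mean of its assigned points
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
```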

Thur Jan 31

Class cancelled due to polar vortex

Fri Feb 1

Lecture 6 (this is not a typo): Clustering and clustering interpretation demo, automatic selection of k with CH index
[slides]
[Jupyter notebook (clustering)]

Additional clustering reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

Python cluster evaluation:
[scikit-learn cluster evaluation metrics (many, including CH index)]
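To make the CH index concrete, here is a short numpy version of the same quantity that scikit-learn's `calinski_harabasz_score` computes (a sketch on made-up blobs, not the course demo): between-cluster dispersion over within-cluster dispersion, each scaled by its degrees of freedom.

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz index. Higher is better; to select k
    automatically, compute it for several k and pick the peak."""
    n, clusters = len(X), np.unique(labels)
    overall_mean = X.mean(axis=0)
    between = sum((labels == c).sum()
                  * ((X[labels == c].mean(axis=0) - overall_mean) ** 2).sum()
                  for c in clusters)
    within = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                 for c in clusters)
    return (between / (len(clusters) - 1)) / (within / (n - len(clusters)))

# Two tight, well-separated blobs: the "true" 2-cluster labeling
# should score far higher than an arbitrary alternating labeling
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
good = ch_index(X, np.repeat([0, 1], 50))
bad = ch_index(X, np.tile([0, 1], 50))
```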

Tue Feb 5

Recitation 3: Quiz review session

Thur Feb 7

Mid-mini quiz (same time/place as lecture)

Fri Feb 8

Recitation 4: Final project work session

Tue Feb 12

Lecture 7: Hierarchical clustering, topic modeling
[slides]
[Jupyter notebook (hierarchical clustering)]
[Jupyter notebook (topic modeling)]

Additional reading (technical):
[see Section 14.3.12 "Hierarchical Clustering" of the book "Elements of Statistical Learning"]
[David Blei's general intro to topic modeling]

Part II. Predictive data analysis
Thur Feb 14

Lecture 8: Topic modeling (wrap-up), introduction to predictive analytics, nearest neighbors, evaluating prediction methods
[slides]
(the topic modeling demo is the same as the one from the previous lecture)
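Nearest-neighbor prediction needs almost no machinery, as this toy 1-NN sketch shows (hypothetical data, not the course demo; the lecture treats k-NN more generally):

```python
import numpy as np

# Tiny labeled training set: class 0 near x=0, class 1 near x=5
X_train = np.array([[0.0], [1.0], [5.0], [6.0]])
y_train = np.array([0, 0, 1, 1])

def predict_1nn(x):
    # 1-nearest-neighbor: copy the label of the closest training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    return int(y_train[dists.argmin()])
```

For example, `predict_1nn(np.array([0.4]))` returns class 0, since the closest training point is at x=0.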

Fri Feb 15

Recitation 5: Support vector machines, decision boundaries, ROC curves
[slides]
[Jupyter notebook]

HW2 and final project proposal due 11:59pm, HW3 released

Tue Feb 19

Lecture 9: Prediction and model validation demo, decision trees/forests
[slides]
[Jupyter notebook (prediction and model validation)]

Thur Feb 21

Lecture 10: Introduction to neural nets and deep learning
[slides]
[Jupyter notebook (intro to neural nets and deep learning)]

Mike Jordan's Medium article on where AI is at (April 2018):
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Fri Feb 22

Recitation 6: Sentiment analysis, extensions to topic modeling
[Jupyter notebook]

Mon Feb 25

HW3 due 11:59pm

Tue Feb 26

Lecture 11: Image analysis with CNNs (also called convnets)
[slides]
[Jupyter notebook (CNN demo)]

CNN reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]

Thur Feb 28

Lecture 12: Time series analysis with RNNs, other deep learning topics, wrap-up
[slides]
[Jupyter notebook (RNN demo)]
[fashion_mnist_cnn_model.h5 (pre-trained convnet needed for CNN interpretation demo)]
[Jupyter notebook (interpreting a CNN)]

Gary Marcus's Medium article on limitations of deep learning and his heated debate with Yann LeCun (December 2018):
["The deepest problem with deep learning"]

LSTM reading:
[Christopher Olah's "Understanding LSTM Networks"]

Videos on how neural nets are trained (warning: the loss function used is not the same as what we are using in class):
["Gradient descent, how neural networks learn | Chapter 2, deep learning" by 3Blue1Brown]
["What is backpropagation really doing? | Chapter 3, deep learning" by 3Blue1Brown]

Recent heuristics/theory on gradient descent variants for deep nets (technical):
["Don't Decay the Learning Rate, Increase the Batch Size" by Smith et al (ICLR 2018)]
["Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification" by Jain et al (JMLR 2018)]

Some interesting reads (technical):
["Understanding deep learning requires rethinking generalization" by Zhang et al (ICLR 2017)]
["Relational inductive biases, deep learning, and graph networks" by Battaglia et al (2018)]

Fri Mar 1

Recitation 7: Final project work session

Tue Mar 5

Final project presentations in-class

Thur Mar 7

No class

Final projects due 11:59pm (slide deck + Jupyter notebook)