94-775: Unstructured Data Analytics for Policy
(Spring 2020 Mini 3)

Unstructured Data Analytics

Class time and location:

Instructor: George Chen (email: georgechen ♣ cmu.edu) - replace "♣" with the "at" symbol

Teaching assistants: Georgia Fu (qiaoyaf ♣ andrew.cmu.edu), Cyndi Wang (xiangyu4 ♣ andrew.cmu.edu)

Office hours (starting second week of class):

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

course description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why the data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.
Many examples are given of how these methods help solve real problems faced by organizations. The course includes a final project, which must address a policy question.

Prerequisite: If you are a Heinz student, then you must have already taken 95-791 "Data Mining" and also either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading: HW1 8%, HW2 8%, HW3 4%, mid-mini quiz 35%, final project proposal 10%, final project 35%. HW3 is shorter than HW1 and HW2. Letter grades are determined based on a curve.

Syllabus: [pdf]

calendar (tentative)

Previous lecture slides and demos: spring 2019 version of 94-775 (click to see course webpage)

Date Topic Supplemental Material
Part I. Exploratory data analysis
Tue Jan 14

Lecture 1: Course overview, basic text processing and frequency analysis
[slides]

HW1 released (check Canvas)!
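
For a rough sense of the kind of frequency analysis this lecture introduces, here is a minimal sketch (not the course demo) that tokenizes a made-up snippet of text with spaCy and counts word frequencies; it assumes the small English spaCy model has been downloaded.

    import spacy
    from collections import Counter

    # assumes you have run: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    text = "Policy analysts collect lots of text. Text data can be messy, and text is everywhere."
    doc = nlp(text)

    # keep alphabetic tokens that are not stop words, lowercased via their lemmas
    words = [token.lemma_.lower() for token in doc
             if token.is_alpha and not token.is_stop]
    print(Counter(words).most_common(5))  # e.g., "text" should come out on top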

Thur Jan 16

Lecture 2: Basic text analysis demo, co-occurrence analysis
[slides]
[Jupyter notebook (basic text analysis)]

Fri Jan 17

Recitation 1: Basic Python review
[Jupyter notebook]

Tue Jan 21

Lecture 3: Finding possibly related entities
[slides]
[Jupyter notebook (co-occurrence analysis)]

What is the maximum value of phi-square/chi-square? (technical)
[stack exchange answer]

Causality additional reading:
[Computational and Inferential Thinking, "Causality and Experiments" chapter]
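
For a concrete sense of the chi-square/phi-square computation behind finding possibly related entities, here is a minimal sketch with a made-up 2x2 co-occurrence table; the counts are purely illustrative.

    import numpy as np
    from scipy.stats import chi2_contingency

    # made-up counts: rows = entity A mentioned / not mentioned,
    #                 columns = entity B mentioned / not mentioned
    table = np.array([[30, 10],
                      [20, 40]])

    # correction=False gives the plain (uncorrected) chi-square statistic
    chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
    phi_square = chi2 / table.sum()  # chi-square normalized by the total count
    print(chi2, p_value, phi_square)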

Thur Jan 23

Lecture 4: Visualizing high-dimensional data with PCA and Isomap
[slides]
[Jupyter notebook (PCA)]

Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

Additional dimensionality reduction reading (technical):
[Abdi and Williams's PCA review]
[Isomap webpage]
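
As a rough companion to the scikit-learn example linked above, here is a minimal sketch that reduces a stand-in dataset (scikit-learn's digits) to 2 dimensions with PCA and with Isomap; the dataset and parameter choices are illustrative only.

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.manifold import Isomap

    X, y = load_digits(return_X_y=True)  # 64-dimensional feature vectors

    X_pca = PCA(n_components=2).fit_transform(X)        # linear projection
    X_isomap = Isomap(n_components=2).fit_transform(X)  # nonlinear, based on geodesic distances

    print(X_pca.shape, X_isomap.shape)  # both (n_samples, 2), ready for a 2D scatter plot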

Fri Jan 24

Recitation 2: More on PCA, bookkeeping with np.argsort
[Jupyter notebook]

HW1 due 11:59pm, HW2 released
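
As a quick illustration of the np.argsort bookkeeping covered in Recitation 2 (with made-up numbers):

    import numpy as np

    scores = np.array([0.2, 0.9, 0.5])
    order = np.argsort(scores)      # indices that would sort scores ascending: [0, 2, 1]
    descending = order[::-1]        # reverse for descending order: [1, 2, 0]
    print(scores[descending])       # [0.9 0.5 0.2]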

Tue Jan 28

Lecture 5: Manifold learning (Isomap, t-SNE)
[slides]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]
[Jupyter notebook (manifold learning)]
[Jupyter notebook (t-SNE with images)]
[slides with some technical details for t-SNE]

See supplementary materials from the previous lecture; in addition, here's some reading for t-SNE (technical):
[Simon Carbonnelle's t-SNE slides]
[t-SNE webpage]
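
As a rough companion to the t-SNE readings above, here is a minimal scikit-learn sketch on a stand-in dataset; the perplexity value is just an illustrative choice (the Distill article explains why this knob matters so much).

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, _ = load_digits(return_X_y=True)
    X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
    print(X_2d.shape)  # (n_samples, 2); the picture can change a lot with perplexity and random_state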

Thur Jan 30

Lecture 6: Clustering
[slides]
[Jupyter notebook (PCA, t-SNE, clustering with drug data)]

Additional clustering reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]
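
For a concrete (toy) example of the kind of clustering covered here, separate from the drug-data demo, here is a minimal k-means sketch on synthetic 2D data; the number of clusters is assumed known for illustration.

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans

    # synthetic 2D data drawn from 3 made-up clusters
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(kmeans.cluster_centers_)  # one 2D center per cluster
    print(kmeans.labels_[:10])      # cluster assignments for the first 10 points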

Fri Jan 31

Lecture 7: More clustering, topic modeling
[slides]
[Jupyter notebook (continuation of previous drug data demo: cluster interpretation, automatically choosing k)]
[Jupyter notebook (topic modeling with LDA)]

Python cluster evaluation:
[scikit-learn cluster evaluation metrics (many, including CH index)]

Additional reading on topic modeling:
[David Blei's general intro to topic modeling]
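
As a rough illustration of LDA topic modeling (not the course demo), here is a minimal scikit-learn sketch on a handful of made-up documents; with so little data the topics are noisy, but the workflow is the same.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["budget tax revenue policy",
            "hospital doctor patient care",
            "tax policy budget reform",
            "patient hospital treatment care"]

    vectorizer = CountVectorizer().fit(docs)
    X = vectorizer.transform(docs)  # document-word count matrix
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    # print the top 3 words per topic
    vocab = vectorizer.get_feature_names_out()
    for topic_idx, word_weights in enumerate(lda.components_):
        top = word_weights.argsort()[::-1][:3]
        print(topic_idx, [vocab[i] for i in top])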

Tue Feb 4

Quiz review held by TA

Thur Feb 6

Quiz (same time/place as lecture)

Fri Feb 7

No recitation - please schedule time with the TAs to get feedback on final project ideas

Part II. Predictive data analysis
Mon Feb 10

HW2 and final project proposal due 11:59pm, HW3 released

Tue Feb 11

Lecture 8: Wrap up topic modeling, introduction to predictive analytics, nearest neighbors, evaluating prediction methods
[slides]

Some nuanced details on cross-validation:
[Andrew Ng's article Preventing "Overfitting" of Cross-Validation Data]
[Braga-Neto and Dougherty's article Is cross-validation valid for small-sample microarray classification? (this article applies more generally rather than only to microarray data from biology)]
[Bias and variance as we change the number of folds in k-fold cross-validation]
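
For a concrete sense of evaluating a nearest-neighbor classifier with k-fold cross-validation, here is a minimal sketch on a stand-in dataset; the choice of 5 neighbors and 5 folds is illustrative only.

    from sklearn.datasets import load_digits
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)
    knn = KNeighborsClassifier(n_neighbors=5)

    # 5-fold cross-validation: train on 4 folds, evaluate on the held-out fold, rotate
    scores = cross_val_score(knn, X, y, cv=5)
    print(scores, scores.mean())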

Thur Feb 13

Lecture 9: More on model evaluation (including confusion matrices, ROC curves), decision trees & forests
[slides]
[Jupyter notebook (prediction and model validation)]
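
As a rough companion to this lecture's demo, here is a minimal sketch (on a stand-in binary classification dataset) that fits a random forest, prints a confusion matrix, and computes the area under the ROC curve.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix, roc_auc_score

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    print(confusion_matrix(y_test, forest.predict(X_test)))           # rows: true class, columns: predicted class
    print(roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1]))  # area under the ROC curve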

Fri Feb 14

Recitation 3: ROC curves

Tue Feb 18

Lecture 10: Introduction to neural nets and deep learning
[slides]
For the lecture demo below to work, you will need to install some packages:
conda install -c pytorch pytorch
conda install -c pytorch torchvision
pip install torchsummaryX
python -m spacy download en
pip install pytorch-nlp
[Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)]

Mike Jordan's Medium article on where AI is at (April 2018):
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
[PyTorch tutorial]

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Thur Feb 20

Lecture 11: Image analysis with convolutional neural nets
[slides]
[Jupyter notebook (continuation of previous demo with addition of two convnet models)]

Additional reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]
[(technical) Richard Zhang's fix for max pooling (presented at ICML 2019)]
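
In the same spirit as the previous sketch (and again not the course demo), here is a minimal convolutional network for 28x28 images: a convolution layer, ReLU, and max pooling before a final linear layer; all sizes are illustrative.

    import torch
    import torch.nn as nn

    convnet = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 1 input channel -> 16 feature maps, still 28x28
        nn.ReLU(),
        nn.MaxPool2d(2),                             # downsample 28x28 -> 14x14
        nn.Flatten(),
        nn.Linear(16 * 14 * 14, 10),                 # 10 output classes
    )

    fake_batch = torch.randn(32, 1, 28, 28)  # made-up batch of 32 images
    print(convnet(fake_batch).shape)         # torch.Size([32, 10])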

Fri Feb 21

Recitation 4: Sentiment analysis, extensions to topic modeling

HW3 due 11:59pm

Tue Feb 25

Lecture 12: Time series analysis with recurrent neural nets (application: sentiment analysis in IMDb reviews)
[slides]
[Jupyter notebook (sentiment analysis with IMDb reviews)]

LSTM reading:
[Christopher Olah's "Understanding LSTM Networks"]
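
For a rough sense of the recurrent building block behind this lecture's demo (not the IMDb demo itself), here is a minimal LSTM sketch on made-up, already-embedded word sequences; the dimensions and the final linear layer are illustrative and untrained.

    import torch
    import torch.nn as nn

    # made-up batch: 4 reviews, each a sequence of 20 words embedded in 50 dimensions
    embedded_words = torch.randn(4, 20, 50)

    lstm = nn.LSTM(input_size=50, hidden_size=32, batch_first=True)
    outputs, (h_n, c_n) = lstm(embedded_words)

    final_hidden = h_n[-1]                             # one 32-dim summary vector per review
    sentiment_logits = nn.Linear(32, 2)(final_hidden)  # positive vs negative (untrained)
    print(sentiment_logits.shape)                      # torch.Size([4, 2])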

Thur Feb 27

Lecture 13: Other deep learning topics, wrap-up
[slides]

Additional reading:
[A tutorial on word2vec word embeddings]
[A tutorial on BERT word embeddings]
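
As a rough illustration of word embeddings (the linked tutorials may use different tools), here is a minimal word2vec sketch using gensim on a tiny made-up corpus; with so little text the resulting vectors are essentially noise, but the API is the same on real data.

    from gensim.models import Word2Vec  # gensim 4.x API

    # tiny made-up corpus: each document is a list of tokens
    corpus = [["tax", "policy", "reform"],
              ["health", "policy", "care"],
              ["tax", "reform", "budget"]]

    model = Word2Vec(sentences=corpus, vector_size=25, window=2, min_count=1, seed=0)
    print(model.wv["policy"].shape)         # a 25-dimensional embedding for "policy"
    print(model.wv.most_similar("policy"))  # nearest words in embedding space (noisy on a toy corpus)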

Fri Feb 28

No recitation - please schedule time with the TAs to get feedback on final projects

Tue Mar 3

Final project presentations in-class

Thur Mar 5

No class

Final projects due 11:59pm (slide deck + Jupyter notebook)