94-775: Unstructured Data Analytics for Policy
(Spring 2018 Mini 4)

Class time and location:

Instructor: George Chen (georgechen [at symbol] cmu.edu)

Teaching assistants: Dylan Fitzpatrick (djfitzpa [at symbol] cmu.edu), Runshan Fu (runshanf [at symbol] andrew.cmu.edu)

Office hours:

Contact: Please use Piazza (follow the link to it within Canvas). Whenever possible, post publicly so that everyone can see: if you have a question, chances are other people can benefit from the answer as well!

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data is often called "unstructured". This course provides a practical introduction to unstructured data analysis and is composed of three parts:

  1. Basic Python programming, especially as it pertains to working with data
  2. Exploratory data analysis: identifying possible structure present in the data via visualization and other exploratory methods
  3. Predictive data analysis: once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions

Many examples are given for how these methods help solve real problems faced by organizations. There is a final project in this course which must address a policy question.
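
To give a flavor of the exploratory part of the course, here is a minimal sketch of one common first step with text data: counting how often each word appears. The documents and the `tokenize` helper below are made up for illustration and are not taken from the course materials, which use their own datasets and tools.

```python
from collections import Counter
import re

# Hypothetical toy documents (assumed for illustration only).
documents = [
    "The city council voted to expand bus service.",
    "Bus ridership rose after the council expanded service.",
    "The school board approved a new budget.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Count how often each word appears across all documents.
word_counts = Counter()
for doc in documents:
    word_counts.update(tokenize(doc))

# Print the five most frequent words with their counts.
for word, count in word_counts.most_common(5):
    print(word, count)
```

Even on three sentences, the most frequent content words ("council", "bus", "service") hint at a shared topic in two of the documents, which is the kind of clue that exploratory analysis looks for before any prediction is attempted.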

How this course differs from 95-865 "Unstructured Data Analysis": 95-865 has Python programming as a prerequisite, places more emphasis on technical skill development (assessed through two in-class exams involving coding), and does not have any sort of policy focus. On the other hand, 94-775 does not assume any Python experience and has a policy-focused final project instead of a final exam. 94-775 also does not require cloud computing (part of 95-865 requires the use of Amazon Web Services). Despite these differences, there is heavy material overlap between 94-775 and 95-865.

Grading: HW1 8%, HW2 8%, HW3 4%, mid-mini quiz 35%, final project proposal 10%, final project 35%. HW3 is shorter than HW1 and HW2. Letter grades are determined based on a curve.

Syllabus: [pdf]

Calendar

Warning: As this course is new, the lecture slides are a bit rough and may contain bugs. To provide feedback/bug reports, please directly contact the instructor, George (georgechen [at symbol] cmu.edu).

Date Topic Supplemental Material
Part I. Python for data analysis
Tue Mar 20

Lecture 1: Course overview, basic Python
[slides]
[Jupyter notebook and accompanying CSV file]

Some Python resources:
[Dive into Python 3]
["Programming in Python" chapter of the book "Computational and Inferential Thinking"]

Thur Mar 22

Lecture 2: Basic Python, continued
[Jupyter notebook and accompanying CSV file]

Fri Mar 23

Recitation 1: Python 3 installation, Jupyter notebooks, Python basics
[Jupyter notebook and accompanying CSV file]
Note: The dataset used is from: https://archive.ics.uci.edu/ml/datasets/iris

HW1 released

Tue Mar 27

Lecture 3: Basic Python, continued
[Jupyter notebook and accompanying CSV file]

Part II. Exploratory data analysis
Thur Mar 29

Lecture 4: Basic text analysis
[slides]
[Jupyter notebook (spaCy demo) and accompanying text file]

HW1 due 3pm

Fri Mar 30

Recitation 2: numpy and spaCy basics
[Jupyter notebook]

HW2 released

Tue Apr 3

Lecture 5: Finding possibly related entities
[slides]
[Jupyter notebook]

Thur Apr 5

Lecture 6: Visualizing high-dimensional feature vectors, intro to clustering
[slides]
[Jupyter notebook (first scikit-learn examples) and accompanying CSV file]

Python examples for dimensionality reduction:
[scikit-learn example (PCA, t-SNE, and many other methods)]

Additional reading:
[Sections 10.2 "Principal Components Analysis" and 10.3 "Clustering Methods" of the book "An Introduction to Statistical Learning"]
[Wattenberg, et al., "How to Use t-SNE Effectively", Distill, 2016]

Fri Apr 6

Recitation 3: HW2 tips, quiz review
[Jupyter notebook and accompanying CSV file]

Tue Apr 10

Lecture 7: Interpreting clusters, Gaussian mixture models, automatically choosing k
[slides]
[Jupyter notebook (builds off of previous lecture's demo) and accompanying CSV file]

HW2 and final project proposals due 3pm

Just for this week, George's office hours are Tuesday 5pm-7pm, HBH 2216 (and not on Wednesday!)

Additional reading:
[see Section 14.3 "Cluster Analysis" of the book "Elements of Statistical Learning"]

Thur Apr 12

Mid-mini quiz

Tue Apr 17

Lecture 8: Topic modeling with latent Dirichlet allocation, preview of predictive data analysis
[slides]
[Jupyter notebook]

Additional reading:
[David Blei's general intro to topic modeling]

Part III. Predictive data analysis
Thur Apr 19

Lecture 9: Prediction and validation illustrated using support vector classification
[slides]
[Jupyter notebook]

HW3 released

Additional reading:
[Section 5.1 "Cross-Validation" and Chapter 9 "Support Vector Machines" of the book "An Introduction to Statistical Learning"]

Tue Apr 24

Lecture 10: Introduction to neural nets and deep learning
[slides]
[Jupyter notebook]

Mike Jordan's Medium article (from just a few days ago!) on where AI is currently at:
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Thur Apr 26

Lecture 11: Wrap-up of deep learning and 94-775
[slides]

HW3 due 3pm

Final project presentations
Tue May 1

Final project presentations:

  1. Arnav Choudhry, James Fasone, Nitin Kumar
  2. Rachita Vaidya, Alison Siegel, Eileen Patten, Wei Zhu, Vicky Mei
  3. Nattaphat Buddharee, Matthew Jannetti, Angela Wang
  4. Hikaru Murase, Nidhi Shree
  5. Nicholas Elan, Ben Simmons, Ada Tso, Michael Turner

Thur May 3

Final project presentations:

  1. Hyung-Gwan Bae, Taimur Farooq, Alvaro Gonzalez, Osama Mansoor, Ben Silliman
  2. Quitong Dong, Jun Zhang, Na Su, Wei Huang, Xinlu Yao
  3. Anhvinh Doanvo, Wilson Mui, David Pinski, Vinay Srinivasan
  4. Jenny Keyt, Natasha Gonzalez, Olga Graves
  5. Sicheng Liu, Xi Wang, Jing Zhao

Fri May the 4th

Final project report (slide deck + Jupyter notebook) due 11:59pm

If you have HW2 or HW3 regrade requests, please submit them by 11:59pm