All times are listed in Pittsburgh time unless otherwise stated.
Class time and location:
Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol
Teaching assistants:
Office hours (starting second week of class); these are all over Zoom (see Canvas for Zoom links):
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:
Prerequisite: If you are a Heinz student, then you must have already taken 95-791 "Data Mining" and also either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.
Grading: HW1 8%, HW2 8%, HW3 4%, mid-mini quiz 35%, final project proposal 10%, final project 35%. HW3 is shorter than HW1 and HW2. Letter grades are determined based on a curve. The students with the most instructor-endorsed posts on Piazza will receive a quiz bonus worth 5% of the quiz score, so it is possible to score over 100% on the quiz.
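As a rough illustration of how the weights above combine, here is a minimal Python sketch. The function name `weighted_score` and the component keys are made up for this example, and the treatment of the Piazza bonus (adding 5 percentage points to the quiz score) is an assumption about how the bonus is applied, not the official rule. Because letter grades are curved, the number this produces is only the raw weighted score, not a letter grade.

```python
# Weights copied from the grading breakdown above; they sum to 100%.
WEIGHTS = {
    "hw1": 0.08,
    "hw2": 0.08,
    "hw3": 0.04,
    "quiz": 0.35,
    "proposal": 0.10,
    "final_project": 0.35,
}


def weighted_score(scores, piazza_bonus=False):
    """Return the raw weighted course score as a percentage.

    `scores` maps each component name in WEIGHTS to a percentage (0-100).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")

    quiz = scores["quiz"]
    if piazza_bonus:
        # ASSUMPTION: the bonus adds 5 percentage points to the quiz score,
        # which is how a quiz score above 100% could arise.
        quiz += 5.0

    adjusted = {**scores, "quiz": quiz}
    return sum(WEIGHTS[name] * adjusted[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {
        "hw1": 90, "hw2": 85, "hw3": 100,
        "quiz": 80, "proposal": 95, "final_project": 88,
    }
    print(weighted_score(example))
    print(weighted_score(example, piazza_bonus=True))
```

Under this reading, the bonus can raise the overall weighted score by at most 0.35 × 5 = 1.75 percentage points.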
Syllabus (updated 4/9 to mention Piazza quiz bonus incentive): [pdf]
Previous lecture slides and demos: spring 2020 version of 94-775 (click to see course webpage)
Date | Topic | Supplemental Material |
---|---|---|
Part I. Exploratory data analysis | | |
Tue Mar 23 | Lecture 1: Course overview | |
Thur Mar 25 | Lecture 2: Basic text analysis, co-occurrence analysis<br>For the basic text analysis demo to work, please install Anaconda Python 3, Jupyter, and spaCy first | |
Fri Mar 26 | Recitation: Basic Python review | |
Tue Mar 30 | Lecture 3: Finding possibly related entities | What is the maximum phi-squared/chi-squared value? (technical) |
Thur Apr 1 | Lecture 4: Visualizing high-dimensional data with PCA | Causality additional reading:<br>PCA additional reading (technical): |
Fri Apr 2 | Recitation slot (Lecture 5): Manifold learning with Isomap | Python examples for dimensionality reduction: |
Tue Apr 6 | Lecture 6: Wrap up manifold learning (t-SNE), a first look at analyzing images, and an introduction to clustering phenomena<br>HW1 due 11:59pm | See supplementary materials from the previous lecture; in addition, here's some reading for t-SNE (technical): |
Thur Apr 8 | Lecture 7: Distance and similarity functions, clustering (k-means, GMMs) | Additional clustering reading (technical): |
Fri Apr 9 | Recitation slot (Lecture 8): More on clustering (interpreting clustering results, automatically choosing the number of clusters for GMM-related models, density-based clustering with DBSCAN) | Reading on DP-means, DP mixture models for which a DP-GMM is a special case (technical): |
Tue Apr 13 | Lecture 9: Wrap up clustering; topic modeling | Topic modeling reading: |
Thur Apr 15 | No class (CMU break day) | |
Fri Apr 16 | No class (CMU break day) | |
Tue Apr 20 | Lecture slot: Quiz review | |
Thur Apr 22 | Lecture slot: Quiz | |
Part II. Predictive data analysis | | |
Fri Apr 23 | Recitation slot (Lecture 10): Introduction to predictive data analytics | Some nuanced details on cross-validation (technical): |
Mon Apr 26 | HW2 and final project proposal due 11:59pm | |
Tue Apr 27 | Lecture 11: Wrap up predictive model evaluation and classical classifiers | |
Thur Apr 29 | Lecture 12: Intro to neural nets and deep learning | PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):<br>Additional reading:<br>Video introduction on neural nets:<br>Mike Jordan's Medium article on where AI is at (April 2018): |
Fri Apr 30 | No recitation; please schedule with your TA Jingbo for a project checkup | |
Mon May 3 | HW3 due 11:59pm | |
Tue May 4 | Lecture 13: Image analysis with convolutional neural nets | Additional reading: |
Thur May 6 | Lecture 14: Time series analysis with recurrent neural nets; some other deep learning topics; course wrap-up<br>Supplemental demo for lecture (that isn't actually covered in lecture due to time constraints): | Additional reading: |
Fri May 7 | Recitation slot: final project presentations<br>Final projects due 11:59pm (slide deck + Jupyter notebook) | |