Class time and location:
Instructor: George Chen (georgechen [at symbol] cmu.edu)
Teaching assistants: David Pinski (dpinski [at symbol] andrew.cmu.edu), Emaad Manzoor (emaad [at symbol] cmu.edu)
Office hours (starting second week of class):
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data is often called "unstructured". This course takes a practical, two-step approach to unstructured data analysis:
Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken 90-812 "Introduction to Programming with Python", 95-888 "Data-Focused Python", or 16-791 "Applied Data Science". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.
Grading: HW1 8%, HW2 8%, HW3 4%, mid-mini quiz 35%, final project proposal 10%, final project 35%. HW3 is shorter than HW1 and HW2. Letter grades are determined based on a curve.
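As a rough illustration of how the weights above combine, here is a minimal Python sketch. The names `WEIGHTS` and `weighted_score` and the example scores are hypothetical, made up for illustration. The result is only a raw weighted percentage: since letter grades are curved, it does not map directly to a letter grade.

```python
# Grading weights from the course description (percent of the final grade).
WEIGHTS = {
    "HW1": 8,
    "HW2": 8,
    "HW3": 4,
    "mid-mini quiz": 35,
    "final project proposal": 10,
    "final project": 35,
}


def weighted_score(scores):
    """Return the raw weighted percentage for per-component scores in [0, 100].

    This is an illustrative helper, not the instructor's actual grading code.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())  # 100 for the weights listed above
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# Hypothetical scores, for illustration only.
example_scores = {
    "HW1": 90,
    "HW2": 85,
    "HW3": 100,
    "mid-mini quiz": 78,
    "final project proposal": 95,
    "final project": 88,
}
print(f"Raw weighted score: {weighted_score(example_scores):.1f}%")
```

Note that the quiz and the final project together account for 70% of the raw score, so they dominate the homework components.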
Syllabus: [pdf]
Previous lecture slides and demos: The spring 2017 version of 94-775 (click to see course webpage) was quite different and did not have Python as a prerequisite. The closest match in material coverage is last mini's 95-865 (click to see course webpage).
Date | Topic | Supplemental Material |
---|---|---|
Part I. Exploratory data analysis | ||
Tue Jan 15 |
Lecture 1: Course overview, basic text processing and frequency analysis
HW1 released (check Canvas)! |
|
Thur Jan 17 |
Lecture 2: Basic text analysis demo, co-occurrence analysis
|
|
Fri Jan 18 |
Recitation 1: Basic Python review
|
|
Tue Jan 22 |
Lecture 3: Finding possibly related entities, PCA
|
Causality additional reading:
PCA additional reading (technical): |
Thur Jan 24 |
Lecture 4: Manifold learning with Isomap and t-SNE
|
Python examples for dimensionality reduction:
Additional dimensionality reduction reading (technical):
|
Fri Jan 25 |
Recitation 2: t-SNE
HW1 due 11:59pm, HW2 released |
|
Tue Jan 29 |
Lecture 5: Introduction to clustering, k-means, Gaussian mixture models
|
Additional clustering reading (technical): |
Thur Jan 31 |
Class cancelled due to polar vortex |
|
Fri Feb 1 |
Lecture 6 (this is not a typo): Clustering and clustering interpretation demo, automatic selection of k with CH index
|
Additional clustering reading (technical):
Python cluster evaluation: |
Tue Feb 5 |
Recitation 3: Quiz review session |
|
Thur Feb 7 | Mid-mini quiz (same time/place as lecture) | |
Fri Feb 8 |
Recitation 4: Final project work section |
|
Tue Feb 12 |
Lecture 7: Hierarchical clustering, topic modeling
|
Additional reading (technical): |
Part II. Predictive data analysis | ||
Thur Feb 14 |
Lecture 8: Topic modeling (wrap-up), introduction to predictive analytics, nearest neighbors, evaluating prediction methods
|
|
Fri Feb 15 |
Recitation 5: Support vector machines, decision boundaries, ROC curves
HW2 and final project proposal due 11:59pm, HW3 released |
|
Tue Feb 19 |
Lecture 9: Prediction and model validation demo, decision trees/forests
|
|
Thur Feb 21 |
Lecture 10: Introduction to neural nets and deep learning
Mike Jordan's Medium article on where AI is at (April 2018): |
Video introduction on neural nets:
Additional reading: |
Fri Feb 22 |
Recitation 6: Sentiment analysis, extensions to topic modeling
|
|
Mon Feb 25 | HW3 due 11:59pm |
|
Tue Feb 26 |
Lecture 11: Image analysis with CNNs (also called convnets)
|
CNN reading:
|
Thur Feb 28 |
Lecture 12: Time series analysis with RNNs, other deep learning topics, wrap-up
Gary Marcus's Medium article on limitations of deep learning and his heated debate with Yann LeCun (December 2018): |
LSTM reading:
Videos on learning neural nets (warning: the loss function used is not the same as what we are using in class):
Recent heuristics/theory on gradient descent variants for deep nets (technical):
Some interesting reads (technical): |
Fri Mar 1 |
Recitation 7: Final project work section |
|
Tue Mar 5 |
Final project presentations in class |
|
Thur Mar 7 |
No class; final projects due 11:59pm (slide deck + Jupyter notebook) |