Class time and location:
Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol
Teaching assistants: Georgia Fu (qiaoyaf ♣ andrew.cmu.edu), Cyndi Wang (xiangyu4 ♣ andrew.cmu.edu)
Office hours (starting second week of class):
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data is often called "unstructured". This course takes a practical, two-step approach to unstructured data analysis:
Prerequisite: If you are a Heinz student, then you must have already taken 95-791 "Data Mining" and also either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.
Grading: HW1 8%, HW2 8%, HW3 4%, mid-mini quiz 35%, final project proposal 10%, final project 35%. HW3 is shorter than HW1 and HW2. Letter grades are determined based on a curve.
Syllabus: [pdf]
Previous lecture slides and demos: spring 2019 version of 94-775 (click to see course webpage)
Date | Topic | Supplemental Material |
---|---|---|
Part I. Exploratory data analysis | ||
Tue Jan 14 |
Lecture 1: Course overview, basic text processing and frequency analysis
HW1 released (check Canvas)! |
|
Thur Jan 16 |
Lecture 2: Basic text analysis demo, co-occurrence analysis
|
|
Fri Jan 17 |
Recitation 1: Basic Python review
|
|
Tue Jan 21 |
Lecture 3: Finding possibly related entities
|
What is the maximum value of phi-square/chi-square? (technical)
Causality additional reading: |
Thur Jan 23 |
Lecture 4: Visualizing high-dimensional data with PCA and Isomap
|
Python examples for dimensionality reduction:
Additional dimensionality reduction reading (technical):
|
Fri Jan 24 |
Recitation 2: More on PCA, bookkeeping with np.argsort
HW1 due 11:59pm, HW2 released |
|
Tue Jan 28 |
Lecture 5: Manifold learning (Isomap, t-SNE)
|
See supplementary materials from the previous lecture; in addition, here's some reading for t-SNE (technical):
|
Thur Jan 30 |
Lecture 6: Clustering
|
Additional clustering reading (technical): |
Fri Jan 31 |
Lecture 7: More clustering, topic modeling
|
Python cluster evaluation:
Additional reading on topic modeling: |
Tue Feb 4 |
Quiz review held by TA |
|
Thur Feb 6 | Quiz (same time/place as lecture) | |
Fri Feb 7 |
No recitation - please schedule time with the TAs to get feedback on final project ideas |
|
Part II. Predictive data analysis | ||
Mon Feb 10 |
HW2 and final project proposal due 11:59pm, HW3 released |
|
Tue Feb 11 |
Lecture 8: Wrap up topic modeling, introduction to predictive analytics, nearest neighbors, evaluating prediction methods
|
Some nuanced details on cross-validation:
|
Thur Feb 13 |
Lecture 9: More on model evaluation (including confusion matrices, ROC curves), decision trees & forests
|
|
Fri Feb 14 |
Recitation 3: ROC curves |
|
Tue Feb 18 |
Lecture 10: Introduction to neural nets and deep learning
Mike Jordan's Medium article on where AI is at (April 2018): |
PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
Additional reading:
Video introduction on neural nets: |
Thur Feb 20 |
Lecture 11: Image analysis with convolutional neural nets
|
Additional reading:
|
Fri Feb 21 |
Recitation 4: Sentiment analysis, extensions to topic modeling
HW3 due 11:59pm |
|
Tue Feb 25 |
Lecture 12: Time series analysis with recurrent neural nets (application: sentiment analysis in IMDb reviews)
|
LSTM reading: |
Thur Feb 27 |
Lecture 13: Other deep learning topics, wrap-up
|
Additional reading:
|
Fri Feb 28 |
No recitation - please schedule time with the TAs to get feedback on final projects |
|
Tue Mar 3 |
Final project presentations in class |
|
Thur Mar 5 |
No class
Final projects due 11:59pm (slide deck + Jupyter notebook) |