Class time and location:
Instructor: George Chen (georgechen [at symbol] cmu.edu)
Teaching assistants: Dylan Fitzpatrick (djfitzpa [at symbol] cmu.edu), Runshan Fu (runshanf [at symbol] andrew.cmu.edu)
Office hours:
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often called "unstructured". This course provides a practical introduction to unstructured data analysis and is composed of three parts: Python for data analysis, exploratory data analysis, and predictive data analysis.
How this course differs from 95-865 "Unstructured Data Analysis": 95-865 has Python programming as a prerequisite, places more emphasis on technical skill development (assessed through two in-class exams involving coding), and does not have a policy focus. In contrast, 94-775 does not assume any Python experience and has a policy-focused final project instead of a final exam. 94-775 also does not require cloud computing (part of 95-865 requires the use of Amazon Web Services). Despite these differences, there is heavy material overlap between 94-775 and 95-865.
Grading: HW1 8%, HW2 8%, HW3 4%, mid-mini quiz 35%, final project proposal 10%, final project 35%. HW3 is shorter than HW1 and HW2. Letter grades are determined based on a curve.
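As a rough illustration of how the weights above combine, here is a minimal Python sketch that computes a raw weighted score. The weights are the ones listed above; the function name `weighted_score` and the example scores are hypothetical, and the sketch does not model the curve used to assign letter grades.

```python
# Grading weights as listed above (percent of the final grade).
WEIGHTS = {
    "HW1": 8,
    "HW2": 8,
    "HW3": 4,
    "mid-mini quiz": 35,
    "final project proposal": 10,
    "final project": 35,
}

# Sanity check: the listed weights should add up to 100%.
assert sum(WEIGHTS.values()) == 100


def weighted_score(scores):
    """Return the raw weighted score (0 to 100).

    `scores` maps each component name to a score between 0 and 100.
    This is only the raw score; the curve is not modeled here.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, purely for illustration.
example_scores = {
    "HW1": 90,
    "HW2": 80,
    "HW3": 100,
    "mid-mini quiz": 70,
    "final project proposal": 100,
    "final project": 85,
}

raw = weighted_score(example_scores)
print(f"Raw weighted score: {raw:.2f}")
```

Because the quiz and the final project each count for 35%, they dominate the raw score; the three homework assignments together account for 20%.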
Syllabus: [pdf]
Warning: As this course is new, the lecture slides are a bit rough and may contain bugs. To provide feedback/bug reports, please directly contact the instructor, George (georgechen [at symbol] cmu.edu).
Date | Topic | Supplemental Material |
---|---|---|
Part I. Python for data analysis | ||
Tue Mar 20 | Lecture 1: Course overview, basic Python | Some Python resources: |
Thur Mar 22 | Lecture 2: Basic Python, continued | |
Fri Mar 23 | Recitation 1: Python 3 installation, Jupyter notebooks, Python basics (HW1 released) | |
Tue Mar 27 | Lecture 3: Basic Python, continued | |
Part II. Exploratory data analysis | ||
Thur Mar 29 | Lecture 4: Basic text analysis (HW1 due 3pm) | |
Fri Mar 30 | Recitation 2: numpy and spaCy basics (HW2 released) | |
Tue Apr 3 | Lecture 5: Finding possibly related entities | |
Thur Apr 5 | Lecture 6: Visualizing high-dimensional feature vectors, intro to clustering | Python examples for dimensionality reduction and additional reading: |
Fri Apr 6 | Recitation 3: HW2 tips, quiz review | |
Tue Apr 10 | Lecture 7: Interpreting clusters, Gaussian mixture models, automatically choosing k (HW2 and final project proposals due 3pm). Just for this week, George's office hours are Tuesday 5pm-7pm, HBH 2216 (and not on Wednesday!) | Additional reading: |
Thur Apr 12 | Mid-mini quiz | |
Tue Apr 17 | Lecture 8: Topic modeling with latent Dirichlet allocation, preview of predictive data analysis | Additional reading: |
Part III. Predictive data analysis | ||
Thur Apr 19 | Lecture 9: Prediction and validation illustrated using support vector classification (HW3 released) | Additional reading: |
Tue Apr 24 | Lecture 10: Introduction to neural nets and deep learning. Mike Jordan's Medium article (from just a few days ago!) on where AI is currently at: | Video introduction on neural nets and additional reading: |
Thur Apr 26 | Lecture 11: Wrap-up of deep learning and 94-775 (HW3 due 3pm) | |
Final project presentations | ||
Tue May 1 | Final project presentations: | |
Thur May 3 | Final project presentations: | |
Fri May the 4th | Final project report (slide deck + Jupyter notebook) due 11:59pm. If you have HW2 or HW3 regrade requests, please submit them by 11:59pm | |