Class time and location:
Instructor: George Chen (georgechen [at symbol] cmu.edu)
Teaching assistants: Emaad Manzoor (emaad [at symbol] cmu.edu), Mallory Nobles (mnobles [at symbol] andrew.cmu.edu)
Office hours:
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often referred to as "unstructured". This course takes a practical two-step approach to unstructured data analysis:
We will be coding a lot in Python and working with Amazon Web Services (AWS) for cloud computing (including using GPUs).
Prerequisite: Python coding experience
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more
Grading: Homework 25%, mid-mini quiz 35%, final exam 40%
Syllabus (last updated Jan 18, 2018 with TA and OH info): [pdf]
Warning: As this course is still relatively new, the lecture slides are a bit rough and may contain bugs. To provide feedback/bug reports, please directly contact the instructor, George (georgechen [at symbol] cmu.edu). The Fall 2017 mini-2 course website is available here.
Date | Topic | Supplemental Material |
---|---|---|
Part 1. Exploratory data analysis | ||
Wed Jan 17 |
Course introduction, basic text processing, frequency analysis. HW0 released! (Check Canvas) |
Some Python resources: |
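To give a flavor of the frequency analysis we start with, here is a rough sketch using only the Python standard library (the function name and example sentence are made up for illustration):

```python
from collections import Counter
import re

def term_frequencies(text):
    """Lowercase the text, tokenize on runs of letters, and count occurrences."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

freqs = term_frequencies("The cat sat on the mat. The mat was flat.")
print(freqs.most_common(2))  # [('the', 3), ('mat', 2)]
```

Counting how often terms appear is the simplest version of turning raw text into numbers we can analyze.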
Mon Jan 22 |
Finding possibly related features:
co-occurrence analysis, scatter plots, correlation, causation. HW0 due, HW1 released! |
Causality additional reading: |
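As a rough from-scratch illustration of what correlation measures (in practice you would use numpy.corrcoef or scipy.stats.pearsonr; the example data are made up):

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Perfectly linearly related features have correlation exactly 1
print(pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```

Keep in mind that a high correlation alone does not establish causation, which is the point of the causality reading above.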
Wed Jan 24 |
Visualizing high-dimensional data: PCA, introduction to manifold learning, Isomap, t-SNE |
Python examples for dimensionality reduction:
Additional reading: |
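For a flavor of dimensionality reduction in code, here is a rough PCA-via-SVD sketch in NumPy (for Isomap and t-SNE you would typically use scikit-learn's implementations; the data here are random and purely illustrative):

```python
import numpy as np

def pca_project(X, num_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:num_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))       # 100 points in 5 dimensions
X_2d = pca_project(X, num_components=2)
print(X_2d.shape)  # (100, 2)
```

The first projected coordinate captures at least as much variance as the second, which is what makes PCA useful for 2-D visualization.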
Mon Jan 29 |
Visualization (wrap-up), introduction to clustering, k-means, Gaussian mixture models (GMMs) |
Additional reading: |
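A bare-bones sketch of Lloyd's algorithm for k-means (the toy blobs are made up; in practice use scikit-learn's KMeans, and GaussianMixture for GMMs):

```python
import numpy as np

def kmeans(X, k, num_iters=50, seed=0):
    """Lloyd's algorithm: alternate assigning points and recomputing centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(num_iters):
        # Assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated blobs should end up in two different clusters
X = np.vstack([np.zeros((20, 2)), np.full((20, 2), 10.0)])
labels, centers = kmeans(X, k=2)
```

A GMM generalizes this picture by giving each cluster a covariance and soft (probabilistic) assignments rather than hard ones.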
Wed Jan 31 |
GMMs, DP-GMMs. HW1 due, HW2 released! |
|
Mon Feb 5 |
DP-means, CH index (see also: gap statistic), hierarchical clustering. In class I mentioned that scraping the web can be a helpful tool to gather data to analyze. While we won't be talking about web scraping in the course, if you're interested you can look at tutorials for scrapy and Selenium. |
Python cluster evaluation:
Additional reading: |
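As a cluster-evaluation example, here is a rough from-scratch sketch of the CH (Calinski-Harabasz) index, which rewards labelings whose between-cluster dispersion is large relative to their within-cluster dispersion (the toy data and labelings are made up; scikit-learn also provides this metric):

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz index: between- vs within-cluster dispersion."""
    n = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in clusters:
        members = X[labels == c]
        center = members.mean(axis=0)
        between += len(members) * np.sum((center - overall_mean) ** 2)
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))

# A good labeling of two tight blobs scores higher than a bad one
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
good = np.array([0, 0, 1, 1])
bad = np.array([0, 1, 0, 1])
print(ch_index(X, good) > ch_index(X, bad))  # True
```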
Wed Feb 7 |
Clustering (wrap-up), topic modeling, intro to predictive data analysis |
Additional reading: |
Part 2. Predictive data analysis | ||
Mon Feb 12 |
Some classics of classification: nearest neighbors, evaluating prediction methods, naive Bayes |
|
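A minimal nearest-neighbor classifier sketch, using only the Python standard library (the points and labels are made up for illustration):

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_points, train_labels, (0.5, 0.5)))  # a
```

Evaluating such a method honestly requires held-out data (e.g., a train/test split or cross-validation), which is part of this lecture.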
Wed Feb 14 |
Mid-mini quiz |
|
Mon Feb 19 |
No class. HW2 due |
|
Wed Feb 21 |
From classical to modern classification methods: SVMs, decision trees and forests, intro to neural nets and deep learning. HW3 released |
|
Mon Feb 26 |
Neural nets and deep learning |
Video introduction on neural nets:
Additional reading: |
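To make backpropagation concrete, here's a rough NumPy sketch of a one-hidden-layer network trained by gradient descent on XOR (the layer sizes, learning rate, and iteration count are illustrative choices, not a recipe from lecture):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: XOR, the classic function a linear model cannot fit
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 16)), np.zeros(16)   # hidden layer
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)    # output layer

lr = 0.5
for _ in range(10000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: cross-entropy gradients via the chain rule
    g_out = (out - y) / len(X)            # gradient w.r.t. pre-sigmoid output
    g_W2, g_b2 = h.T @ g_out, g_out.sum(axis=0)
    g_h = (g_out @ W2.T) * (1 - h ** 2)   # tanh derivative
    g_W1, g_b1 = X.T @ g_h, g_h.sum(axis=0)
    W2 -= lr * g_W2; b2 -= lr * g_b2
    W1 -= lr * g_W1; b1 -= lr * g_b1

preds = (sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int)
```

Frameworks like TensorFlow and PyTorch compute these gradients automatically; writing them out once by hand is a good way to see what "training" actually does.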
Wed Feb 28 |
Image analysis with CNNs (also called convnets), time series analysis with RNNs |
[Stanford CS231n Convolutional Neural Networks for Visual Recognition] |
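The core operation inside a CNN layer is sliding a small filter over an image; here is a rough from-scratch sketch of that operation (the toy image and edge filter are made up):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D cross-correlation, the core operation of a conv layer."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Filter response at this position: elementwise product, summed
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter responds where intensity changes left to right
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)  # dark left, bright right
edge_filter = np.array([[-1.0, 1.0]])
response = conv2d(image, edge_filter)  # peaks at the dark-to-bright boundary
```

A real convnet learns many such filters at once and stacks these layers; this sketch just shows what a single filter computes.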
Mon Mar 5 |
Wrap-up of deep learning and of 95-865. HW3 due |
Videos on learning neural nets: |
Wed Mar 7 |
Final exam, 10:30am-11:50am, HBH 1202 |