95-865: Unstructured Data Analytics (Spring 2021 Mini 4)

Lectures:
Note that the current plan is for Section A4 and K4 lectures to be recorded.

Recitations:

  • Sections A4/B4/Z4: Fridays 3:10pm-4:30pm Pittsburgh time, live over Zoom
  • Section K4: Fridays 5:30pm-7pm Adelaide time, live over Zoom

    Instructor: George Chen (email: georgechen ♣ cmu.edu; replace "♣" with the "at" symbol)

    Teaching assistants:

    Office hours (starting second week of class):
    Regardless of which section you are in, you are welcome to attend the office hours of any member of the course staff. We have deliberately scattered the office hours across the day to cover as many of your time zones as possible. I suggest adding all of the times below to Google Calendar and using its time zone feature so that they are automatically converted to your local time (Pittsburgh time is labeled "Eastern Time - New York" and Adelaide time is listed as "Central Australia Time - Adelaide"). All office hours are held remotely over Zoom; the Zoom links are posted in Canvas.
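
If you would like to double-check a conversion by hand, the following is a minimal Python sketch rather than an official course tool. It converts the HW1 deadline listed in the calendar (11:59pm Wed Apr 7 Pittsburgh time) to Adelaide time. The fixed UTC offsets are assumptions that hold only for the dates of this mini after Adelaide's daylight saving ended on Apr 4, 2021 (Pittsburgh on EDT, UTC-4; Adelaide on ACST, UTC+9:30), and the variable names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets (valid for April 2021 after Apr 4 only):
# Pittsburgh observes EDT (UTC-4) and Adelaide observes ACST (UTC+9:30).
PITTSBURGH_EDT = timezone(timedelta(hours=-4), "EDT")
ADELAIDE_ACST = timezone(timedelta(hours=9, minutes=30), "ACST")

# HW1 deadline as listed in the Pittsburgh calendar.
hw1_due_pittsburgh = datetime(2021, 4, 7, 23, 59, tzinfo=PITTSBURGH_EDT)

# Same instant expressed in Adelaide time.
hw1_due_adelaide = hw1_due_pittsburgh.astimezone(ADELAIDE_ACST)

print(hw1_due_adelaide.isoformat())  # 2021-04-08T13:29:00+09:30
```

The result, 1:29pm on Thu Apr 8, matches the HW1 deadline shown in the Adelaide calendar. For dates where daylight saving rules differ, a time zone database (for example Python's `zoneinfo` module) is a safer choice than fixed offsets.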

    Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

    course description

    Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

    1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
    2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.
    Many examples are given for how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods in analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.

    We will write a lot of Python code and dabble a bit in GPU computing (Google Colab).

    Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

    Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

    Grading: Homework 20%, quiz 1 40%, quiz 2 40%*

    *Students with the most instructor-endorsed answers on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their quiz 2 score (a maximum of 5 bonus points; quiz 2 is out of 100 points prior to any bonus points being added).
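
As a rough illustration of how the weights and the Piazza bonus combine, here is a minimal sketch. It assumes homework and both quizzes are each scored out of 100 (the page states this only for quiz 2) and that no cap is applied after the bonus. The function and parameter names are hypothetical; the syllabus, not this sketch, defines the actual grade computation.

```python
def course_score(homework, quiz1, quiz2, piazza_bonus=0.0):
    """Weighted course score: homework 20%, quiz 1 40%, quiz 2 40%.

    All three inputs are assumed to be on a 0-100 scale. The bonus
    (at most 5 points) is added directly to the quiz 2 score before
    the 40% weight is applied.
    """
    if not 0 <= piazza_bonus <= 5:
        raise ValueError("bonus must be between 0 and 5 points")
    quiz2_with_bonus = quiz2 + piazza_bonus
    return 0.20 * homework + 0.40 * quiz1 + 0.40 * quiz2_with_bonus


# Example: homework 90, quiz 1 80, quiz 2 85 with the full 5-point bonus.
print(round(course_score(90, 80, 85, piazza_bonus=5), 2))  # 86.0
```

Because quiz 2 carries a 40% weight, the full 5-point bonus raises the overall score by at most 2 points under these assumptions.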

    Syllabus (updated 3/22 11:24pm Pittsburgh time): [pdf]

    calendar (subject to revision)

    Previous version of course (including lecture slides and demos): 95-865 Fall 2020 mini 2

    Pittsburgh

    Date Topic Supplemental Material
    Part I. Exploratory data analysis
    Mon Mar 22

    Lecture 1: Course overview
    [slides]

    Wed Mar 24

    Lecture 2: Basic text analysis, co-occurrence analysis
    [slides]

    For the basic text analysis demo to work, please install Anaconda Python 3, Jupyter, and spaCy first
    [Jupyter notebook (basic text analysis)]

    Fri Mar 26

    Recitation: Basic Python review
    [Jupyter notebook]

    Mon Mar 29

    Lecture 3: Finding possibly related entities
    [slides]
    [Jupyter notebook (co-occurrence analysis)]

    What is the maximum phi-squared/chi-squared value? (technical)
    [stack exchange answer]

    Wed Mar 31

    Lecture 4: Visualizing high-dimensional data with PCA
    [slides]
    [Jupyter notebook (PCA)]

    Causality additional reading:
    [Computational and Inferential Thinking, "Causality and Experiments" chapter]

    PCA additional reading (technical):
    [Abdi and Williams's PCA review]

    Fri Apr 2

    Recitation slot — Lecture 5: Manifold learning with Isomap
    [slides]
    [Jupyter notebook (manifold learning)]

    Python examples for dimensionality reduction:
    [scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

    Mon Apr 5

    No class (CMU break day)

    Wed Apr 7

    Lecture 6: Wrap up manifold learning (t-SNE), a first look at analyzing images, and an introduction to clustering phenomena
    [slides]
    [Jupyter notebook (manifold learning); same demo as previous lecture]
    [required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]
    For the demo below to work (on t-SNE with images), you will need to install some packages:
    pip install torch torchvision
    [Jupyter notebook (t-SNE with images)]
    [slides with some technical details for t-SNE]

    HW1 due 11:59pm Pittsburgh time

    See supplementary materials from the previous lecture; in addition, here's some reading for t-SNE (technical):
    [Simon Carbonnelle's t-SNE slides]
    [t-SNE webpage]

    Fri Apr 9

    Recitation: More on PCA, practice with argsort
    [Jupyter notebook]

    Mon Apr 12

    Lecture 7: Distance and similarity functions, clustering (k-means, GMMs)
    [slides]
    [Jupyter notebook (PCA, t-SNE, clustering with drug data)]

    Clustering additional reading (technical):
    [see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

    Wed Apr 14

    Lecture 8: More on clustering (interpreting clustering results, automatically choosing the number of clusters for GMM-related models)
    [slides]
    [Jupyter notebook (PCA, t-SNE, clustering on UCI drug consumption data); same demo as previous lecture]

    Reading on DP-means, DP mixture models for which a DP-GMM is a special case (technical):
    [Revisiting k-means: New Algorithms via Bayesian Nonparametrics]

    Fri Apr 16

    No class (CMU break day)

    Mon Apr 19

    Lecture 9: Topic modeling
    [slides]
    [Jupyter notebook (topic modeling with LDA)]

    Topic modeling reading:
    [David Blei's general intro to topic modeling]
    [(technical) Topic Modelling Meets Deep Neural Networks: A Survey]

    Wed Apr 21

    Lecture 10: Wrap up topic modeling; wrap up clustering; a glimpse of predictive data analytics
    [slides]
    [Jupyter notebook (topic modeling with LDA); same demo as previous lecture]
    [Jupyter notebook (PCA, t-SNE, clustering on UCI drug consumption data); same demo as previous lecture]

    Thur Apr 22

    HW2 due 11:59pm Pittsburgh time

    Fri Apr 23

    Quiz 1:

    • Recitation slot (3:10pm-4:30pm Pittsburgh time) for Sections A4/B4
    • 6:30pm-7:50pm Pittsburgh time for Section Z4

    Part II. Predictive data analysis
    Mon Apr 26

    Lecture 11: Intro to predictive data analytics
    [slides]

    Some nuanced details on cross-validation (technical):
    [Andrew Ng's article Preventing "Overfitting" of Cross-Validation Data]
    [Braga-Neto and Dougherty's article Is cross-validation valid for small-sample microarray classification? (this article applies more generally rather than only to microarray data from biology)]
    [Bias and variance as we change the number of folds in k-fold cross-validation]

    Wed Apr 28

    Lecture 12: Wrap up basic prediction concepts
    [slides]
    [Jupyter notebook (prediction and model validation; same demo as last time)]

    Fri Apr 30

    Recitation: More practice on model evaluation
    [Jupyter notebook]

    Mon May 3

    Lecture 13: Intro to neural nets and deep learning
    [slides]
    For the neural net demo below to work, you will need to install some packages:
    pip install torch torchvision torchaudio
    pip install torchsummaryX
    python -m spacy download en
    pip install pytorch-nlp
    [Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)]

    PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
    [PyTorch tutorial]

    Additional reading:
    [Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

    Video introduction on neural nets:
    ["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

    Mike Jordan's Medium article on where AI is at (April 2018):
    ["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

    Wed May 5

    Lecture 14: Image analysis with convolutional neural nets
    [slides]
    [Jupyter notebook (handwritten digit recognition with neural nets; same demo as previous lecture)]

    Additional reading:
    [Stanford CS231n Convolutional Neural Networks for Visual Recognition]
    [(technical) Richard Zhang's fix for max pooling (presented at ICML 2019)]

    Fri May 7

    Recitation slot: Lecture 15 on time series analysis with recurrent neural nets; more deep learning topics; course wrap-up
    [slides]

    The demo will not be covered during the recitation slot and is instead covered in the April 30 Section K4 recitation Zoom recording by your TA Erick (check Canvas Zoom recordings)
    For the demo below to work, be sure to install the prerequisite packages as mentioned for the lecture 13 demo.
    [Jupyter notebook (sentiment analysis with IMDb reviews)]

    Additional reading:
    [Christopher Olah's "Understanding LSTM Networks"]
    [A tutorial on word2vec word embeddings]
    [A tutorial on BERT word embeddings]

    Mon May 10

    HW3 due 11:59pm Pittsburgh time

    Thur May 13

    Quiz 2:

    • 1pm-2:20pm Pittsburgh time for Sections A4/B4
    • 5:30pm-6:50pm Pittsburgh time for students in A4/B4 taking the alternate Z4 time
    • Students officially in Section Z4 have been sent separate instructions for their Quiz 2 (check Canvas)

    Adelaide

    Date Topic Supplemental Material
    Part I. Exploratory data analysis
    Wed Mar 24

    Lecture 1: Course overview
    [slides]

    Fri Mar 26

    Lecture 2: Basic text analysis, co-occurrence analysis
    [slides]

    For the basic text analysis demo to work, please install Anaconda Python 3, Jupyter, and spaCy first
    [Jupyter notebook (basic text analysis)]

    Recitation: Basic Python review
    [Jupyter notebook]

    Wed Mar 31

    Lecture 3: Finding possibly related entities
    [slides]
    [Jupyter notebook (co-occurrence analysis)]

    What is the maximum phi-squared/chi-squared value? (technical)
    [stack exchange answer]

    Fri Apr 2

    No class (Good Friday)

    Wed Apr 7

    Lecture 4: Visualizing high-dimensional data with PCA
    [slides]
    [Jupyter notebook (PCA)]

    Causality additional reading:
    [Computational and Inferential Thinking, "Causality and Experiments" chapter]

    PCA additional reading (technical):
    [Abdi and Williams's PCA review]

    Thur Apr 8

    HW1 due 1:29pm Adelaide time (corresponds to 11:59pm Wed Apr 7 Pittsburgh time)

    Fri Apr 9

    Lecture 5: Manifold learning with Isomap
    [slides]
    [Jupyter notebook (manifold learning)]

    Extended recitation slot (5:30pm-8:30pm Adelaide time): Lectures 6 and 7 on wrapping up manifold learning (t-SNE), a first look at analyzing images, and an introduction to clustering (k-means, GMMs)
    [slides]
    [Jupyter notebook (manifold learning); same demo as previous lecture]
    [required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]
    For the demo below to work (on t-SNE with images), you will need to install some packages:
    pip install torch torchvision
    [Jupyter notebook (t-SNE with images)]
    [Jupyter notebook (PCA, t-SNE, clustering with drug data)]
    [slides with some technical details for t-SNE]

    Python examples for dimensionality reduction:
    [scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

    t-SNE additional reading (technical):
    [Simon Carbonnelle's t-SNE slides]
    [t-SNE webpage]

    Clustering additional reading (technical):
    [see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

    Wed Apr 14

    Lecture 8: More on clustering (interpreting clustering results, automatically choosing the number of clusters for GMM-related models)
    [slides]
    [Jupyter notebook (PCA, t-SNE, clustering on UCI drug consumption data); same demo as previous lecture]

    Quiz 1 review session (7pm-8:30pm Adelaide time)

    Reading on DP-means, DP mixture models for which a DP-GMM is a special case (technical):
    [Revisiting k-means: New Algorithms via Bayesian Nonparametrics]

    Fri Apr 16

    Lecture 9: Wrap up clustering (density-based clustering with DBSCAN, final remarks); topic modeling
    [slides]
    [Jupyter notebook (PCA, t-SNE, clustering on UCI drug consumption data); same demo as previous lecture]
    [Jupyter notebook (topic modeling with LDA)]

    Recitation slot: Quiz 1 (80 minutes to match amount of time that will be given to Pittsburgh students)

    Topic modeling reading:
    [David Blei's general intro to topic modeling]
    [(technical) Topic Modelling Meets Deep Neural Networks: A Survey]

    Part II. Predictive data analysis
    Wed Apr 21

    Lecture 10: Intro to predictive data analytics
    [slides]
    For the demo below to work, you will need to install some packages:
    pip install torch torchvision
    [Jupyter notebook (prediction and model validation)]

    Some nuanced details on cross-validation (technical):
    [Andrew Ng's article Preventing "Overfitting" of Cross-Validation Data]
    [Braga-Neto and Dougherty's article Is cross-validation valid for small-sample microarray classification? (this article applies more generally rather than only to microarray data from biology)]
    [Bias and variance as we change the number of folds in k-fold cross-validation]

    Fri Apr 23

    HW2 due 1:29pm Adelaide time (corresponds to 11:59pm Thur Apr 22 Pittsburgh time)

    Lecture 11: Wrap up basic prediction concepts; intro to neural nets and deep learning
    [slides]
    [Jupyter notebook (prediction and model validation; same demo as last time)]

    Recitation slot — Lecture 12: Intro to neural nets and deep learning
    [slides]
    For the neural net demo below to work, you will need to install some packages:
    pip install torch torchvision torchaudio
    pip install torchsummaryX
    python -m spacy download en
    pip install pytorch-nlp
    [Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)]

    PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
    [PyTorch tutorial]

    Additional reading:
    [Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

    Video introduction on neural nets:
    ["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

    Mike Jordan's Medium article on where AI is at (April 2018):
    ["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

    Wed Apr 28

    Lecture 13: Image analysis with convolutional neural nets
    [slides]
    [Jupyter notebook (handwritten digit recognition with neural nets; same demo as previous lecture)]

    Additional reading:
    [Stanford CS231n Convolutional Neural Networks for Visual Recognition]
    [(technical) Richard Zhang's fix for max pooling (presented at ICML 2019)]

    Fri Apr 30

    Lecture 14: Time series analysis with recurrent neural nets; some other deep learning topics; course wrap-up
    [slides]

    Recitation: Sentiment analysis with IMDb reviews; more on word embeddings and fine-tuning; some PyTorch code examples
    For the demo below to work, be sure to install the prerequisite packages as mentioned for the lecture 12 demo.
    [Jupyter notebook (sentiment analysis with IMDb reviews)]

    Additional reading:
    [Christopher Olah's "Understanding LSTM Networks"]
    [A tutorial on word2vec word embeddings]
    [A tutorial on BERT word embeddings]

    Final exam period May 3-7

    HW3 due date May 6, 11:59pm Adelaide time

    Quiz 2, May 7 10:30am-11:50am Adelaide time