95-865: Unstructured Data Analytics
(Fall 2023 Mini 2)


Lectures:
Note that the current plan is for Section C2 to be recorded.

Recitations (shared across Sections A4/B4): Fridays 2pm-3:20pm, HBH A301

Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol

Teaching assistants:

Office hours (starting second week of class): Check the course Canvas homepage for the office hour times and locations.

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know what structure underlies the data ahead of time, which is why such data is often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.
Many examples are given for how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods in analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.

We will be coding lots of Python and dabble a bit with GPU computing (Google Colab).

Note regarding foundation models (such as Large Language Models): As likely all of you are aware, there are now technologies like (Chat)GPT, Bard, Llama, etc., all of which will keep improving over time. If you use any of these in your homework, please cite them. For the purposes of the class, I will view these as external resources/collaborators. For exams, I want to make sure that you actually understand the material and are not just telling me what someone else or GPT/Bard/etc. knows. This is important so that in the future, if you use AI technologies to assist you in your data analysis, you have enough background knowledge to check for yourself whether you think the AI is giving you a correct solution. For this reason, exams this semester will explicitly not allow electronics.

Prerequisite: If you are a Heinz student, then you must have taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading: Homework (30%), Quiz 1 (35%), Quiz 2 (35%*)

*Students with the most instructor-endorsed posts on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their Quiz 2 score (a maximum of 10 bonus points, so that it is possible to get 110 out of 100 points on Quiz 2).

Letter grades are determined based on a curve.

Syllabus: [handout]

Calendar (tentative)

Previous version of course (including lecture slides and demos): 95-865 Spring 2023 mini 4

Date Topic Supplemental Material
Part I. Exploratory data analysis
Tue Oct 24 Lecture 1: Course overview, analyzing text using frequencies
[slides]

Please install Anaconda Python 3 and spaCy by following this tutorial (needed for HW1 and the demo next lecture):

[slides]
Note: Anaconda Python 3 includes support for Jupyter notebooks, which we use extensively in this class
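As a taste of the frequency-based text analysis in lecture 1, here is a minimal sketch using only the Python standard library (illustrative only; the course demos use spaCy for proper tokenization):

```python
from collections import Counter

def word_frequencies(text):
    """Lowercase, split on whitespace, strip punctuation, and count words."""
    tokens = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    return Counter(t for t in tokens if t)

text = "The cat sat on the mat. The mat was flat."
freqs = word_frequencies(text)
print(freqs.most_common(2))  # -> [('the', 3), ('mat', 2)]
```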

Wed Oct 25 HW1 released (check Canvas)
Thur Oct 26 No lecture (the instructor has a scheduling conflict)

Optional Python review session at 6pm-7pm over Zoom with your TA Zekai (check Canvas for the Zoom link)

[Jupyter notebook]
Fri Oct 27 Recitation slot: Lecture 2 (delivered remotely): Basic text analysis demo (requires Anaconda Python 3 & spaCy)
[slides]
[Jupyter notebook (basic text analysis)]
Tue Oct 31 Lecture 3: Wrap-up basic text analysis, co-occurrence analysis
[slides]
[Jupyter notebook (basic text analysis using arrays)]
[Jupyter notebook (co-occurrence analysis toy example)]
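The core idea behind co-occurrence analysis can be sketched in a few lines of plain Python (a toy version with made-up documents; the lecture's notebooks are the authoritative treatment): count how often each pair of words appears in the same document.

```python
from collections import Counter
from itertools import combinations

# Toy "documents", each already tokenized into words.
docs = [
    ["apple", "banana", "cherry"],
    ["apple", "banana"],
    ["banana", "cherry"],
]

# Count unordered word pairs that co-occur within the same document.
co_counts = Counter()
for doc in docs:
    for pair in combinations(sorted(set(doc)), 2):
        co_counts[pair] += 1

print(co_counts[("apple", "banana")])   # -> 2
print(co_counts[("banana", "cherry")])  # -> 2
print(co_counts[("apple", "cherry")])   # -> 1
```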
Thur Nov 2 Lecture 4: Co-occurrence analysis (cont'd), visualizing high-dimensional data with PCA
[slides]
[Jupyter notebook (text generation using n-grams)]
[Jupyter notebook (PCA)]
Additional reading (technical):
[Abdi and Williams's PCA review]
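The n-gram text generation idea from the notebook above can be sketched with a bigram model in plain Python (a toy version with my own variable names, not the notebook's code):

```python
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the cat ran".split()

# Build a bigram table: word -> list of words observed to follow it.
followers = defaultdict(list)
for w1, w2 in zip(corpus, corpus[1:]):
    followers[w1].append(w2)

def generate(start, length, seed=0):
    """Random walk through the bigram table, starting from `start`."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        nxt = followers.get(words[-1])
        if not nxt:  # dead end: no observed continuation
            break
        words.append(rng.choice(nxt))
    return " ".join(words)

print(generate("the", 5))
```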
Fri Nov 3 Recitation slot: Lecture 5: PCA (cont'd), manifold learning (Isomap, MDS)
[slides]
Python examples for manifold learning:
[scikit-learn example (Isomap, t-SNE, and many other methods)]

Additional reading (technical):

[The original Isomap paper (Tenenbaum et al 2000)]
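Under the hood, PCA amounts to centering the data and taking a singular value decomposition; a minimal NumPy sketch of projecting data to 2D (illustrative, assuming NumPy is installed):

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the top two principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions,
    # already sorted by how much variance they explain.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 points in 5 dimensions
X_2d = pca_2d(X)
print(X_2d.shape)  # -> (100, 2)
```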
Mon Nov 6 HW1 due 11:59pm
Tue Nov 7 No class (Democracy Day)
Thur Nov 9 Lecture 6: Manifold learning (cont'd)
[slides]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]
[Jupyter notebook (manifold learning)]
[Jupyter notebook (dimensionality reduction with images)***]
***For the demo on t-SNE with images to work, you will need to install some packages:
pip install torch torchvision

HW2 released (check Canvas)
Additional reading (technical):
[some technical slides on t-SNE by George for 95-865]
[Simon Carbonnelle's much more technical t-SNE slides]
[t-SNE webpage]

New manifold learning method that is promising (PaCMAP):

[paper (Wang et al 2021) (technical)]
[code (github repo)]
Fri Nov 10 Recitation slot: Lecture 7: Clustering
[slides]
[Jupyter notebook (dimensionality reduction and clustering with drug data)]
Clustering additional reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]
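The core k-means loop (Lloyd's algorithm) covered in the clustering lectures can be sketched in a few lines of NumPy (a simplified version with fixed initial centers, not the course demo's code):

```python
import numpy as np

def kmeans(X, centers, n_iters=10):
    """Lloyd's algorithm: alternate assigning points and recomputing means."""
    for _ in range(n_iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(len(centers))])
    return labels, centers

# Two well-separated blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(5, 0.5, size=(20, 2))])
# Initialize with one point from each blob.
labels, centers = kmeans(X, centers=X[[0, -1]])
print(labels)
```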
Tue Nov 14 Lecture 8: Clustering (cont'd)
[slides]
We continue using the same demo from last time:
[Jupyter notebook (dimensionality reduction and clustering with drug data)]
[Jupyter notebook (clustering with images)]
Wed Nov 15 Quiz 1 review session: 7:30pm-8:30pm over Zoom with your TA Omar (check Canvas for the Zoom link)
Thur Nov 16 Lecture 9: Topic modeling
[slides]
[Jupyter notebook (topic modeling with LDA)]
Topic modeling reading:
[David Blei's general intro to topic modeling]
[Maria Antoniak's practical guide for using LDA]
Fri Nov 17 Quiz 1 (80-minute exam)
Part II. Predictive data analysis
Tue Nov 21 Lecture 10: Wrap up topic modeling, intro to predictive data analysis
[slides]
[Jupyter notebook (prediction and model validation)]
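The basic idea of model validation from this lecture is to hold out part of the data, fit on the rest, and score on the held-out part. A minimal NumPy sketch using a nearest-centroid classifier (illustrative only; not the notebook's code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two labeled classes of 2D points.
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(4, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Shuffle, then hold out 25% of the data for validation.
idx = rng.permutation(len(X))
split = int(0.75 * len(X))
train, val = idx[:split], idx[split:]

# "Train" a nearest-centroid classifier on the training portion only.
centroids = np.array([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])

# Predict on the validation set and compute accuracy.
dists = np.linalg.norm(X[val][:, None, :] - centroids[None, :, :], axis=2)
accuracy = (dists.argmin(axis=1) == y[val]).mean()
print(f"validation accuracy: {accuracy:.2f}")
```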
Thur Nov 23, Fri Nov 24 No class (Thanksgiving)
Mon Nov 27 HW2 due 11:59pm
Tue Nov 28 Lecture 11: Intro to neural nets & deep learning
[slides]
For the neural net demo below to work, you will need to install some packages:
pip install torch torchvision torchaudio torchtext torchinfo
[Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)]


PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):

[PyTorch tutorial]

Additional reading:

[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Video introduction on neural nets:

["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]
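The NumPy-to-tensor round trip that the PyTorch tutorial above covers looks like this (a minimal sketch; requires the `torch` install from this lecture's note):

```python
import numpy as np
import torch

# NumPy array -> PyTorch tensor (shares memory with the array on CPU).
arr = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(arr)

# Tensor -> NumPy array (only works for tensors residing on the CPU).
back = t.numpy()
assert np.array_equal(arr, back)

# Tensors live on a device; move to the GPU only if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
t = t.to(device)
print(t.device)
```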
Thur Nov 30 Lecture 12: Wrap up neural net basics; image analysis with convolutional neural nets (also called CNNs or convnets)
[slides]
We continue using the demo from last lecture
Additional reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]
[(technical) Richard Zhang's fix for max pooling]
Fri Dec 1 Recitation: More details on hyperparameter tuning, model evaluation, neural network training
[slides]
[Jupyter notebook]
Tue Dec 5 Lecture 13: Time series analysis with recurrent neural nets (RNNs)
[slides]
[Jupyter notebook (sentiment analysis with IMDb reviews; requires UDA_pytorch_utils.py from the previous lecture's demo)]
Thur Dec 7 Lecture 14: Text generation with RNNs and generative pre-trained transformers (GPTs); course wrap-up
[slides]
[Jupyter notebook (text generation with neural nets)]
Additional reading/videos:
[Andrej Karpathy's "Neural Networks: Zero to Hero" lecture series (including a more detailed GPT lecture)]
[A tutorial on BERT word embeddings]

Software for explaining neural nets:

[Captum]

Some articles on being careful with explanation methods (technical):

["The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective" (Krishna et al 2022)]
["Do Feature Attribution Methods Correctly Attribute Features?" (Zhou et al 2022)]
["The false hope of current approaches to explainable artificial intelligence in health care" (Ghassemi et al 2021)]
Fri Dec 8 Recitation slot: More details on RNNs, transformers, some other deep learning topics
[slides]
Additional reading:
[Christopher Olah's "Understanding LSTM Networks"]
Mon Dec 11 HW3 due 11:59pm
Fri Dec 15 Quiz 2 (80-minute exam): 1pm-2:20pm at HBH A301 (note: the official schedule also lists another room but please only show up to A301)