94-775 & 94-475 Practical Unstructured Data Analytics

Course Information

Section B3: TR 3:30-4:50 and F 5:00-6:20
Room: HBH 1204

Course Description

Companies, governments, and other organizations now collect huge amounts of data in many forms, such as text, images, audio, and video. The question is how to convert this diverse and disorganized data into useful information. One common issue is that the underlying structure of the data is not always known before analyzing it, which is why it is called "unstructured." This course takes a hands-on approach to analyzing unstructured data. We first investigate how to identify structure that may be present in the data, using visualization and other exploratory techniques. Once we have indications of what structure is present, we can use it to make predictions. Throughout the course, we will encounter several widely used techniques for analyzing unstructured data, including established methods such as manifold learning, clustering, and topic modeling, as well as newer approaches such as deep neural networks for analyzing text, images, and time series. Programming in Python using tools like Jupyter Notebook or Colab will be a significant component of the course. The use of ChatGPT is also encouraged throughout. See the course syllabus for more details.

Course Schedule

Tue, Jan 16
Lecture 1: Introduction [Slides]
Course overview and introduction to unstructured data
Thu, Jan 18
Lecture 2: Unstructured data modeling [Slides] HW1 out
This lecture discusses traditional techniques for modeling unstructured data, including images, graphs, and text. 
Fri, Jan 19
Recitation: Tutorials for Colab, spaCy, and sklearn
Tue, Jan 23
Lecture 3: Text analysis and PCA [Slides]
Covers basic text analysis techniques and begins the discussion of dimensionality reduction, including one of its most commonly used methods, Principal Component Analysis (PCA).
Thu, Jan 25
Lecture 4: Manifold learning [Slides]
Focuses on manifold learning, exploring two specific techniques: Isomap and t-SNE. 
Fri, Jan 26
Recitation: Demo for text modeling and analysis
Tue, Jan 30
Lecture 5: Clustering part 1 [Slides]
Discusses clustering algorithms in general and takes a closer look at k-Means and Gaussian mixture models (GMMs).
Thu, Feb 1
Lecture 6: Clustering part 2 [Slides] HW2 out
Goes into more detail on GMMs and draws the connection between GMMs and k-Means. Also discusses how to select their hyper-parameters.
Fri, Feb 2
Case study: Police 911 calls-for-service analysis
Tue, Feb 6
Lecture 7: Clustering part 3 and topic modeling [Slides]
Discusses two other clustering algorithms and gives a brief introduction to topic modeling.
Thu, Feb 8
Lecture 8: LDA and Intro to predictive analysis [Slides]
Focuses on one topic modeling method, Latent Dirichlet Allocation (LDA), and gives an introduction to predictive data analysis.
Fri, Feb 9
Quiz 1
Tue, Feb 13
Lecture 9: Classification [Slides]
Introduces commonly used classification models: decision trees and random forests. Also covers how to select hyper-parameters through k-fold cross-validation.
Thu, Feb 15
Lecture 10: Regression [Slides] HW3 out
Focuses on linear regression and continues the discussion of how to select hyper-parameters. Also covers other commonly used model evaluation metrics.
Fri, Feb 16
Case study: COVID-19 prediction and analysis
Tue, Feb 20
Lecture 11: Spatio-temporal modeling [Slides]
Introduces spatio-temporal data and associated modeling techniques, including generalized least squares, covariance functions, generalized linear models, and autoregressive models.
Thu, Feb 22
Lecture 12: Deep learning part 1 [Slides]
Provides an overview of deep learning and neural networks. Also briefly introduces widely used deep learning frameworks, such as PyTorch.
Fri, Feb 23
Review session
Tue, Feb 27
Lecture 13: Deep learning part 2 [Slides]
This lecture centers on two specific types of deep neural networks—recurrent neural networks (RNNs) and convolutional neural networks (CNNs). 
Thu, Feb 29
Lecture 14: Other advanced topics [Slides]
Explores the concepts of generative models, including VAEs, diffusion models, and Large Language Models, highlighting their evolution, applications, and significant contributions to multi-modality in AI.
Fri, Mar 1
Quiz 2