Stony Brook University
Machine Learning
CSE512 - Spring 2014


Project Ideas and Datasets

Below are descriptions of several data sets, and some suggested projects. The first few are spelled out in greater detail. You are encouraged to select and flesh out one of these projects, or make up your own well-specified project using these datasets. If you have other data sets you would like to work on, we would consider that as well, provided you already have access to this data and a good idea of what to do with it.

Several of the project ideas are compiled from similar courses online. See here

A0. Object Detection


You can download the dataset from here. There are 20 categories of objects, ranging from car and bike to cat and dog. It is also the most widely used benchmark dataset for the object detection task.

Project Ideas:

There are two classic approaches to object detection: one is based on exhaustive sliding-window search, such as the Deformable Part Model (DPM), and the other is based on the Selective Search method. Selective search has great potential to scale to larger datasets with more categories. You can try the selective search idea and compare its performance to DPM's (results available online).


  • “Object Detection with Discriminatively Trained Part Based Models.” P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan. PAMI 2010.
  • “Segmentation As Selective Search for Object Recognition.” Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, and Arnold W. M. Smeulders. ICCV 2011.

A. fMRI Brain Imaging


Available here

This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 40 trials of a sentence-picture comparison task (reading a sentence, observing a picture, and determining whether the sentence correctly described the picture). Each of the 40 trials lasts approximately 30 seconds. Each image contains approximately 5,000 voxels (3D pixels), across a large portion of the brain. Data is available for 12 different human subjects.

Project A1: Bayes network classifiers for fMRI

Gaussian Naïve Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both of these classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al., 2004]. This project will explore going beyond the Gaussian Naïve Bayes classifier (which assumes voxel activities are conditionally independent) by training a Bayes network, in particular a TAN tree [Friedman et al., 1997]. Issues you'll need to confront include which features to include (5000 voxels times 8 seconds of images is a lot of features) for classifier input, whether to train brain-specific or brain-independent classifiers, and a number of issues about efficient computation with this fairly large data set.
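As a starting point, the Gaussian Naïve Bayes baseline that the TAN model would extend can be sketched in a few lines of numpy. This is a minimal sketch: the synthetic data below is a hypothetical stand-in for real 8-second voxel windows, and the class means/sizes are made up for illustration.

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian Naive Bayes: each feature is modeled as an
    independent Gaussian, conditioned on the class label."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-6 for c in self.classes])
        self.logprior = np.log(np.array([np.mean(y == c) for c in self.classes]))
        return self

    def predict(self, X):
        # log P(x | c) + log P(c), summing per-feature Gaussian log-densities
        ll = -0.5 * (((X[:, None, :] - self.mu) ** 2) / self.var
                     + np.log(2 * np.pi * self.var)).sum(axis=2)
        return self.classes[np.argmax(ll + self.logprior, axis=1)]

# Toy stand-in for "sentence" vs. "picture" voxel windows
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 20)), rng.normal(2, 1, (50, 20))])
y = np.array([0] * 50 + [1] * 50)
clf = GaussianNB().fit(X, y)
print((clf.predict(X) == y).mean())  # training accuracy
```

A TAN tree would replace the conditional-independence assumption here by allowing each voxel feature to additionally depend on one parent feature.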


Project A2: Dimensionality reduction for fMRI data

Explore the use of dimensionality-reduction methods to improve classification accuracy with this data. Given the extremely high dimension of the input (5000 voxels times 8 images) to the classifier, it is sensible to explore methods for reducing this to a small number of dimensions. For example, consider PCA, hidden layers of neural nets, or other relevant dimensionality-reduction methods. PCA is an example of a method that finds lower-dimensional representations that minimize error in reconstructing the data. In contrast, neural network hidden layers are lower-dimensional representations of the inputs that minimize classification error (but only find a local minimum). Does one of these work better? Does it depend on parameters such as the number of training examples?
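The PCA option can be sketched via the SVD of the centered data matrix. This is a hedged sketch: the matrix sizes below are toy stand-ins, not real fMRI dimensions.

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components (computed via SVD
    of the centered data matrix)."""
    Xc = X - X.mean(axis=0)                    # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                       # (n_samples, k) scores

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 40))                 # toy stand-in for voxel windows
Z = pca(X, 10)
print(Z.shape)  # (100, 10)
```

The columns of `Z` are ordered by decreasing explained variance, so truncating to the first `k` keeps the best rank-`k` reconstruction of the data.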


Project A3: Feature selection/feature invention for fMRI classification

As in many high dimensional data sets, automatic selection of a subset of features can have a strong positive impact on classifier accuracy. It has been found that selecting features by the difference in their activity when the subject performs the task, relative to their activity while the subject is resting, is one useful strategy [Mitchell et al., 2004]. In this project you could suggest, implement, and test alternative feature selection strategies (e.g., consider the incremental value of adding a new feature to the current feature set, instead of scoring each feature independently of the other features being selected), and see whether you can obtain higher classification accuracies. Alternatively, you could consider methods for synthesizing new features (e.g., define the 'smoothed value' of a voxel in terms of a spatial Gaussian kernel function applied to it and its neighbors, or define features by averaging voxels whose time series are highly correlated).
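The "incremental value" idea above amounts to greedy forward selection. A minimal sketch, assuming a simple class-separation score (the Fisher-style score and the toy data here are illustrative placeholders for whatever scoring function you design):

```python
import numpy as np

def forward_select(X, y, score, k):
    """Greedy forward selection: repeatedly add the single feature that
    most improves score(X[:, subset], y), instead of ranking features
    independently of one another."""
    chosen = []
    while len(chosen) < k:
        best, best_s = None, -np.inf
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            s = score(X[:, chosen + [j]], y)
            if s > best_s:
                best, best_s = j, s
        chosen.append(best)
    return chosen

def fisher_score(Xs, y):
    # crude class-separation score on the selected columns
    m0, m1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
    return np.sum((m0 - m1) ** 2 / (Xs.var(axis=0) + 1e-6))

rng = np.random.default_rng(2)
y = np.array([0] * 30 + [1] * 30)
X = rng.normal(size=(60, 10))
X[:, 3] += 3 * y                 # make feature 3 informative
sel = forward_select(X, y, fisher_score, 2)
print(sel)                       # feature 3 should be picked first
```

With a score that is evaluated on the whole current subset (e.g., cross-validated accuracy), this naturally penalizes adding features that are redundant with ones already chosen.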


B. Brain network (Connectome)-based classification

This project involves classifying human subjects by their brain connectivity structure (or brain network, the connectome).


This dataset contains the brain connectivity graphs of 114 human subjects. Each brain is segmented into 70 regions (or supervoxels). The network depicts the connectivity among these regions, where weights on links represent the strength of the connection. Meta-data on the human subjects includes gender, age, IQ, etc., as well as scores from tests evaluating the subjects' math capability or creativity.
Available here.

Project suggestions:

  • Classify human-subjects into (1) male vs. female, (2) high-math capable vs. normal, (3) creative vs. normal

  • Dimensionality reduction and feature construction for improving accuracy (see A2 and A3 above).
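One simple way to feed these graphs to a standard classifier is to vectorize each subject's weighted adjacency matrix into its upper-triangle edge weights. A sketch, assuming a symmetric 70x70 connectivity matrix per the dataset description (the random matrix below is a hypothetical stand-in for a real subject):

```python
import numpy as np

def connectome_features(W):
    """Vectorize a symmetric 70x70 connectivity matrix into its
    70*69/2 = 2415 upper-triangle edge weights."""
    iu = np.triu_indices_from(W, k=1)   # skip the diagonal (self-loops)
    return W[iu]

W = np.abs(np.random.default_rng(3).normal(size=(70, 70)))
W = (W + W.T) / 2                        # make it symmetric
x = connectome_features(W)
print(x.shape)  # (2415,)
```

Each subject then becomes a 2415-dimensional feature vector, which is exactly the situation where the dimensionality-reduction and feature-selection ideas from A2 and A3 apply.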


C. NBA statistics


This dataset contains 2004-2005 NBA and ABA stats for
  • Player regular season stats
  • Player regular season career totals
  • Player playoff stats
  • Player playoff career totals
  • Player all-star game stats
  • Team regular season stats
  • Complete draft history
  • coaches_season.txt - nba coaching records by season
  • coaches_career.txt - nba career coaching records
(currently all of the regular season)
Available here.

Project suggestions:

  • You can try to predict the outcome of a given game.

  • Detect groups of similar players, and perform outlier detection on the players to find out who the outstanding ones are.

D. Physiological Data Modeling (BodyMedia)

Physiological data offers many challenges to the machine learning community including dealing with large amounts of data, sequential data, issues of sensor fusion, and a rich domain complete with noise, hidden variables, and significant effects of context.


1. Which sensors correspond to each column?
  • characteristic1 age
  • characteristic2 handedness
  • sensor1 gsr_low_average
  • sensor2 heat_flux_high_average
  • sensor3 near_body_temp_average
  • sensor4 pedometer
  • sensor5 skin_temp_average
  • sensor6 longitudinal_accelerometer_SAD
  • sensor7 longitudinal_accelerometer_average
  • sensor8 transverse_accelerometer_SAD
  • sensor9 transverse_accelerometer_average
2. What are the activities behind each annotation?
The annotations for the contest were:
  • 5102 = sleep
  • 3104 = watching TV
Available here (external link broken, use the internal link).

Project suggestions:

  • Behavior classification: to classify the person's activity based on the sensor measurements.

  • Train a classifier to identify subjects as men or women (this information is given in the training data sequences)

E. Face Recognition


There are two data sets for this type of problem.
  • The first dataset (CMU Machine Learning Faces) contains 640 images of faces. The faces themselves are images of 20 former Machine Learning students and instructors, with about 32 images of each person. Images vary by the pose (direction the person is looking), expression (happy/sad), face jewelry (sun glasses or not), etc. This gives you a chance to consider a variety of classification problems ranging from person identification to sunglass detection. The data, documentation, and associated code are available at the link.
    * The same website provides an implementation of a neural network classifier for this image data. The code is quite robust, and pretty well documented.

  • The second dataset (Facial Attractiveness Images) consists of 2253 female and 1745 male rectified frontal face images scraped from the website by Ryan White along with user ratings of attractiveness.

Project suggestions:

  • Try SVM's on this data, and compare their performance to that of the provided neural networks.

  • Apply a clustering algorithm to find "similar" faces.

  • Learn a facial attractiveness classifier. A paper on the topic of predicting facial attractiveness can be found here.

F. Character recognition (digits/letters)

Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research.


We have three datasets on this topic.
  • The first dataset tackles the more general OCR task, on a small vocabulary of words. (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)

  • The second dataset is the now "classic" digit recognition task for outgoing mail zip codes.

  • The third (and most challenging) data set consists of scrambled text known as CAPTCHAs (which stands for Completely Automated Public Turing test to tell Computers and Humans Apart), which were designed by Luis von Ahn at CMU to be difficult to recognize automatically. (For more about CAPTCHAs, see the Wikipedia article, where you will find several papers.)

Project suggestions:

  • Learn a classifier to recognize the digit/letter

  • Use an HMM to exploit correlations between neighboring letters in the general OCR case to improve accuracy. (Since ZIP codes don't have such constraints between neighboring digits, HMMs will probably not help in the digit case.)

  • Apply a clustering/dimensionality reduction algorithm on this data, see if you get better classification on this lower dimensional space.

  • Learn a classifier to decipher CAPTCHAs. You may want to begin by building a classifier to segment the image into separate letters.
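The HMM suggestion above boils down to Viterbi decoding: combine per-letter classifier scores with letter-bigram transition probabilities and find the most likely letter sequence. A minimal sketch over a hypothetical 2-letter alphabet (the scores below are made up to keep the example small):

```python
import math

def viterbi(emit_logp, trans_logp, init_logp):
    """emit_logp[t][s]: log-score of letter s at position t (from the
    per-letter classifier); trans_logp[s][s2]: log P(next=s2 | cur=s)."""
    n_states = len(init_logp)
    score = [init_logp[s] + emit_logp[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(emit_logp)):
        new, ptr = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: score[p] + trans_logp[p][s])
            new.append(score[best] + trans_logp[best][s] + emit_logp[t][s])
            ptr.append(best)
        score, back = new, back + [ptr]
    path = [max(range(n_states), key=lambda s: score[s])]
    for ptr in reversed(back):        # follow backpointers from the end
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy 2-letter alphabet: noisy emissions, strong bigram prior
lp = math.log
emits = [[lp(0.6), lp(0.4)], [lp(0.45), lp(0.55)], [lp(0.6), lp(0.4)]]
trans = [[lp(0.9), lp(0.1)], [lp(0.1), lp(0.9)]]
print(viterbi(emits, trans, [lp(0.5), lp(0.5)]))  # [0, 0, 0]
```

Note that per-letter argmax would output [0, 1, 0] here; the transition model overrides the weakly misclassified middle letter, which is exactly the gain an HMM buys in the word-OCR case.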

G. Image Segmentation

The main goal of this project is to segment given images in a meaningful way.


Berkeley collected three hundred images and paid students to hand-segment each one (usually each image has multiple hand-segmentations). Two-hundred of these images are training images, and the remaining 100 are test images. The dataset includes code for reading the images and ground-truth labels, computing the benchmark scores, and some other utility functions. It also includes code for a segmentation example.
Available resources can be found here.

A newer (and bigger) dataset of manually labeled images is here (images, ground-truth data and benchmarks).

Project G1: Region-Based Segmentation

Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture. The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments. Come up with a set of features to represent the superpixels (probably based on color and texture), a classifier/regression algorithm (suggestion: boosted decision trees) that allows you to estimate the likelihood that two superpixels are in the same segment, and an algorithm for segmentation based on those pairwise likelihoods. Since this project idea is fairly time-consuming, focusing on a specific part of the project may also be acceptable.
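The pairwise "same segment" estimator can be sketched with logistic regression on superpixel-pair features (a simpler stand-in for the boosted trees suggested above; the single "color difference" feature and the toy labels here are hypothetical):

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, iters=500):
    """Tiny batch-gradient-descent logistic regression: predicts
    p(same segment) from the feature difference of a superpixel pair."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)   # gradient of the log-loss
    return w

# Toy pair features: small mean-color difference => same segment (y=1)
rng = np.random.default_rng(4)
diff = np.abs(rng.normal(size=(200, 1)))
y = (diff[:, 0] < 0.7).astype(float)
X = np.hstack([diff, np.ones((200, 1))])   # feature + bias column
w = fit_logreg(X, y)
p = 1.0 / (1.0 + np.exp(-X @ w))
print(((p > 0.5) == y).mean())             # training accuracy
```

In the real project the feature vector would hold color- and texture-statistic differences between the two superpixels, and the resulting pairwise probabilities would drive a merging algorithm.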

For the midway report, you should be able to estimate the likelihood that two superpixels are in the same segment and have a quantitative measure of how good your estimator is. You should also have an outline of how to use the likelihood estimates to form the final segmentation. The rest of the project will involve improving your likelihood estimation and your grouping algorithm, and generating final results.


  • Some segmentation papers from Berkeley are available here

Project G2: Supervised vs. Unsupervised Segmentation Methods

Write two segmentation algorithms (these may be simpler than the one above): a supervised method (such as logistic regression) and an unsupervised method (such as K-means). Compare the results of the two algorithms. For your write-up, describe the two classification methods that you plan to use.

For the midway report, you should have completed at least one of your segmentation algorithms and have results for that algorithm.


  • Some segmentation papers from Berkeley are available here

H. Object Recognition


The Caltech 256 dataset contains images of 256 object categories taken at varying orientations, varying lighting conditions, and with different backgrounds.
Available here.

Project suggestions:

  • You can try to create an object recognition system which can identify which object category is the best match for a given test image.

  • Apply clustering to learn object categories without supervision.


  • See link above.

I. Election Contributions


This dataset represents federal electoral campaign donations in the United States for the election years 1980 through 2006. The data, fully built, forms a tripartite, directed graph. Donors (individuals and corporations) make contributions to Committees, which in turn make contributions to Candidates. There is a many-to-many relationship between Donors and Committees, and also a many-to-many relationship between Committees and Candidates.
Available here (data and documentation)

Project suggestions:

  • Predict a committee's contribution rate, or preferred candidates, based on its past contribution rate. Which features best indicate who donates to it?

  • Predict how much a donor will contribute based on zip code, or whether an occupation is listed (or, if you can analyze the text, what occupation is listed).

  • Predict how much money a candidate will receive based on party, state, or whether s/he is an incumbent/challenger/open seat.

  • Discover clusters of donors/committees/candidates.

J. Sensor networks


This dataset contains temperature, humidity, and light data measurements, along with the voltage level of the batteries at each node, using this 54-node sensor network deployment. The data was collected every 30 seconds, starting around 1am on February 28th 2004.
This is a "real" dataset, with lots of missing data, noise, and failed sensors giving outlier values, especially when battery levels are low.
Available here.

Project suggestions:

  • Compare various regression algorithms.

  • Automatically detect failed sensors.

  • Learn graphical models (e.g. Bayes nets) representing the correlations between measurements at different nodes.
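The failed-sensor suggestion above has a simple baseline: flag readings whose modified z-score (based on the median and the median absolute deviation, which are robust to the outliers themselves) is extreme. A sketch with made-up temperature readings and a commonly used but still arbitrary threshold:

```python
import numpy as np

def flag_outliers(readings, thresh=3.5):
    """Flag readings whose modified z-score exceeds thresh --
    a crude failed-sensor test robust to the outliers themselves."""
    med = np.median(readings)
    mad = np.median(np.abs(readings - med)) + 1e-9   # avoid divide-by-zero
    z = 0.6745 * (readings - med) / mad
    return np.abs(z) > thresh

temps = np.array([19.8, 20.1, 20.0, 19.9, 20.2, 85.3, 20.0])  # one failing node
print(np.flatnonzero(flag_outliers(temps)))  # [5]
```

A more interesting project direction is to exploit the spatial structure: predict each node's reading from its neighbors' readings and flag nodes whose residuals stay large, which also catches sensors that fail by drifting rather than spiking.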


K. Twenty Newsgroups text classification


This data set contains 1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles. This data is useful for a variety of text classification and/or clustering projects. The "label" of each article is which of the 20 newsgroups it belongs to. The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").
Available here.
* The same website provides an implementation of a Naive Bayes classifier for this text data. The code is quite robust, and some documentation is available.

Project suggestions:

  • EM for text classification in the case where you have labels for some documents, but not for others (see Nigam et al, and come up with your own suggestions).

  • Make up your own text learning problem/approach.
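The EM idea above (in the style of Nigam et al.) can be sketched with a multinomial Naive Bayes model: fit on the labeled documents, then alternate between soft-labeling the unlabeled documents (E-step) and refitting on everything weighted by those soft labels (M-step). The tiny bag-of-words counts below are hypothetical:

```python
import numpy as np

def em_nb(Xl, yl, Xu, n_classes, iters=10):
    """Semi-supervised multinomial Naive Bayes via EM. Xl/yl are labeled
    word-count vectors and labels; Xu is unlabeled. Returns hard labels
    for the unlabeled documents."""
    X = np.vstack([Xl, Xu])
    R = np.zeros((len(X), n_classes))          # per-doc class responsibilities
    R[np.arange(len(Xl)), yl] = 1.0            # labeled docs stay fixed
    R[len(Xl):] = 1.0 / n_classes
    for _ in range(iters):
        # M-step: class priors and word probabilities (Laplace-smoothed)
        prior = R.sum(axis=0) / R.sum()
        wc = R.T @ X + 1.0
        theta = wc / wc.sum(axis=1, keepdims=True)
        # E-step: recompute soft labels for the unlabeled docs only
        logp = Xu @ np.log(theta.T) + np.log(prior)
        logp -= logp.max(axis=1, keepdims=True)
        P = np.exp(logp)
        R[len(Xl):] = P / P.sum(axis=1, keepdims=True)
    return R[len(Xl):].argmax(axis=1)

# Toy bag-of-words: class 0 favors word 0, class 1 favors word 1
Xl = np.array([[5, 0, 1], [0, 5, 1]], dtype=float)
yl = np.array([0, 1])
Xu = np.array([[4, 1, 0], [1, 6, 2]], dtype=float)
print(em_nb(Xl, yl, Xu, 2))  # [0 1]
```

The interesting experimental question is how accuracy varies as you shrink the labeled set and grow the unlabeled one.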


L. WebKB webpage classification


This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.
Available here.

Project suggestions:

  • You can try to learn classifiers to predict the type of a webpage from the text.

  • Try to improve accuracy by (1) exploiting correlations between pages that point to each other, and/or (2) segmenting the pages into meaningful parts (bio, publications, etc.)


  • See link above.

Final note:

Kaggle has a long list of (machine learning) problems! It connects people who have such problems with people who have the know-how to solve them. The problems are cast as open competitions (with dollar prizes).

You can consider picking up a problem from Kaggle (e.g. salary prediction, predicting which new questions asked on Stack Overflow will be closed, diabetes classification, etc.) (they often have the data available) and maybe even win a prize!

Additional datasets & problems

From Prof. Jitendra Malik's talk: '3 R's of Computer Vision'
  • Semantic reconstruction: relatively hard problem, involves semantically labeling objects in the reconstructed image such as doors, walls, etc.
  • Semantic segmentation (telling a story about the image): involves, for example, (i) attribute classification (e.g., elderly white man with a baseball hat), (ii) orientation (e.g., next to, behind, face to face)
  • Face expression classification (smiling, angry, worried, suspicious, ...)
  • Person pose (so-called poselets) classification (standing, arms crossed, hand raised, ...); one can also do action detection (e.g., dancing, running, ...) using probabilities of poselets as features
For feature construction one can use RGBD (D for depth) in contrast to historical RGB.
For labeling, one can use AMT where humans mark joints, arms, shoulders, etc. on example images. Alternatively, gazing patterns of camera-recorded people can be used.

There are many many other datasets and machine learning problems out there. You can choose to work with any of these datasets and define your own ML problems to solve that are interesting to you.
  • UC Irvine has a ML repository that could be useful for your project. Many of these data sets have been used extensively in ML research (although they are often small).

  • Sam Roweis also has a link to several datasets.

  • Many online media datasets by Jure Leskovec (mostly network/graph data, but also tweets, reviews, etc.) as well as more data here.
    For a nice read on several interesting prediction tasks on StackOverflow, see this.
    For a nice read on several interesting prediction tasks on Facebook and Wikipedia, see this.

  • arXiv Preprints: A collection of preprints in the field of high-energy physics. Includes the raw LaTeX source of each paper (so you can extract either structured sentences or a bag-of-words) along with the graph of citations between papers.

  • TRECVID: A competition for multimedia information retrieval. They keep a fairly large archive of video data sets, along with featurizations of the data.

  • Activity Modelling data: Activity modelling is the task of inferring what the user is doing from observations (eg, motion sensors, microphones). This data set consists of GPS motion data for two subjects tagged with labels like car, working, athome, shopping. A related paper using a Bayes net for this problem is here.

  • Record Deduplication data: The datasets provided below comprise lists of records, and the goal is to identify, for any dataset, the set of records which refer to unique entities. This problem is known by the varied names of deduplication, identity uncertainty, and record linkage. One common approach is to cast the deduplication problem as a classification problem: consider the set of record pairs, and classify each as either "unique" or "not-unique". Some papers on record deduplication include this and this.

  • Enron e-mail data: Consists of ~500K e-mails collected from Enron employees. It has been used for research into information extraction, social network analysis, and topic modeling. For a possible project, (1) you can try to classify the text of an e-mail message to decide who sent it, or (2) you can try to predict the length of an email given the past emailing history of the sender and recipients.

  • NIPS Corpus data: A data set based on papers from a machine learning conference (NIPS volumes 1-12). The data can be viewed as a tripartite graph on authors, papers, and words. Links represent authorship and the words used in a paper. Additionally, papers are tagged with topics and we know which year each paper was written. Potential projects include authorship prediction, document clustering, and topic tracking.

  • Precipitation data: This dataset includes 45 years of daily precipitation data from the Northwestern US. Ideas for projects include predicting rain levels and deciding where to place sensors to best predict rainfall. See this for the latter, and the citations therein.
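The record-pair classification framing of deduplication (above) can be sketched with the simplest possible "classifier": a string-similarity score with a threshold, which a learned model would later replace. The records and the threshold below are hypothetical:

```python
from difflib import SequenceMatcher
from itertools import combinations

def duplicate_pairs(records, thresh=0.8):
    """Score every record pair by string similarity and flag pairs above
    a threshold as likely duplicates -- the baseline that a trained
    pair classifier (with richer features) would replace."""
    dupes = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= thresh:
            dupes.append((i, j))
    return dupes

records = ["John A. Smith, 12 Oak St",
           "Jon A Smith, 12 Oak Street",
           "Mary Jones, 4 Elm Rd"]
print(duplicate_pairs(records))  # [(0, 1)]
```

A full project would extract per-field features (name edit distance, address token overlap, etc.) for each pair and train a classifier on labeled duplicate/non-duplicate pairs instead of thresholding a single score.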

Last modified: 2014, by Leman Akoglu