Note: "♣" denotes an author list that is alphabetical by last name, as is customary in fields like math and theoretical computer science.

You can also find my papers listed on Google Scholar.

Work in progress

  • "Consumer Behavior in the Online Classroom: Using Video Analytics and Machine Learning to Understand the Consumption of Video Courseware"
    Mi Zhou, George H. Chen, Pedro Ferreira, Michael D. Smith
    INFORMS Conference on Information Systems & Technology (CIST), October 2019
    Workshop on Information Systems & Economics (WISE), December 2019
  • "Persuasion via Credibility Signals in Argumentative Dialogue"
    Emaad Ahmed Manzoor, George H. Chen, Dokyun Lee, Michael D. Smith
    INFORMS Conference on Information Systems & Technology (CIST), October 2019
    INFORMS Workshop on Data Science, October 2019
    ISMS Marketing Science Conference, June 2019

2019

  • "Missing Not at Random in Matrix Completion: The Effectiveness of Estimating Missingness Probabilities Under a Low Nuclear Norm Assumption"
    Wei Ma*, George H. Chen* (* = equal contribution)
    Neural Information Processing Systems (NeurIPS), December 2019
    [arXiv] [code] [poster] [slides]
    Note: We have a longer version in preparation analyzing a collection of missingness probability estimators, with more debiasing guarantees.
    Best paper (theoretical track) at INFORMS Data Mining and Decision Analytics Workshop 2019
  • "Truck Traffic Monitoring with Satellite Images"
    Lynn H. Kaack, George H. Chen, M. Granger Morgan
    ACM Conference on Computing and Sustainable Societies (COMPASS), July 2019
    [arXiv]
    (Also presented at the International Conference on Machine Learning (ICML) Workshop on Climate Change, June 2019)
  • "Nearest Neighbor and Kernel Survival Analysis: Nonasymptotic Error Bounds and Strong Consistency Rates"
    George H. Chen
    International Conference on Machine Learning (ICML), June 2019
    [arXiv] [code] [talk] [poster]
  • "An Interpretable Produce Price Forecasting System for Small Farmers in India using Collaborative Filtering and Adaptive Nearest Neighbors"
    Wei Ma, Kendall Nowocin, Niraj Marathe, George H. Chen
    Information and Communication Technologies and Development (ICTD), January 2019
    [arXiv]

2018

  • "Explaining the Success of Nearest Neighbor Methods in Prediction"
    George H. Chen, Devavrat Shah
    Foundations and Trends in Machine Learning, May 2018
    [DOI]

2017

  • "Survival-Supervised Topic Modeling with Anchor Words: Characterizing Pancreatitis Outcomes"
    George H. Chen, Jeremy C. Weiss
    Neural Information Processing Systems (NeurIPS) Workshop on Machine Learning for Health, December 2017
    [arXiv (short workshop version)]
    (Also presented at Society for Medical Decision Making North American Meeting, October 2017)
  • "Toward Reducing Crop Spoilage and Increasing Small Farmer Profits in India: a Simultaneous Hardware and Software Solution"
    George H. Chen, Kendall Nowocin, Niraj Marathe
    Information and Communication Technologies and Development (ICTD), November 2017
    [arXiv]

2015

  • "A Latent Source Model for Patch-Based Image Segmentation"
    George H. Chen, Devavrat Shah, Polina Golland
    Medical Image Computing and Computer-Assisted Intervention (MICCAI), October 2015
    [arXiv] [paper] [poster]
    Note: For a more comprehensive exposition of this paper, consider reading Chapter 5 of my Ph.D. thesis.
  • "Latent Source Models for Nonparametric Inference"
    George H. Chen
    Ph.D. thesis, MIT, May 2015
    [paper]
    Received the George M. Sprowls award for best Ph.D. thesis in Computer Science at MIT
  • "Targeting Villages for Rural Development Using Satellite Image Analysis"
    Kush R. Varshney, George H. Chen, Brian Abelson, Kendall Nowocin, Vivek Sakhrani, Ling Xu, Brian L. Spatocco
    Big Data, March 2015
    [paper]

2014

  • "A Latent Source Model for Online Collaborative Filtering"
    ♣ Guy Bresler, George H. Chen, Devavrat Shah
    Neural Information Processing Systems (NeurIPS), December 2014
    [arXiv - longer version] [paper - short conference version] [poster]
    Selected for spotlight (one of 62/1678 submissions)
    Note: An expanded version including intuition for how collaborative filtering relates to an MAP item recommender and derivations for the examples is in Chapter 4 of my Ph.D. thesis; the notation has also been changed to be more similar to the other two papers that went toward my thesis.

2013

  • "A Latent Source Model for Nonparametric Time Series Classification"
    George H. Chen, Stanislav Nikolov, Devavrat Shah
    Neural Information Processing Systems (NeurIPS), December 2013
    [arXiv - longer version] [paper - short conference version] [poster]
    Note: An expanded version with a lower bound on the misclassification rate and further discussion is in Chapter 3 of my Ph.D. thesis.
  • "Sparse Projections of Medical Images onto Manifolds"
    George H. Chen, Christian Wachinger, Polina Golland
    Information Processing in Medical Imaging (IPMI), June-July 2013
    [arXiv] [paper] [poster]

2012

  • "Deformation-Invariant Sparse Coding"
    George H. Chen
    Master's thesis, MIT, May 2012
    [paper] [poster]

2011

  • "Deformation-Invariant Sparse Coding for Modeling Spatial Variability of Functional Patterns in the Brain"
    George H. Chen, Evelina G. Fedorenko, Nancy G. Kanwisher, Polina Golland
    Neural Information Processing Systems (NeurIPS) Workshop on Machine Learning and Interpretation in Neuroimaging, December 2011
    [paper] [talk slides]

2010

  • "Indoor Localization and Visualization Using a Human-Operated Backpack System"
    Timothy Liu, Matthew Carlberg, George Chen, Jacky Chen, John Kua, Avideh Zakhor
    International Conference on Indoor Positioning and Indoor Navigation (IPIN), September 2010
    [paper]
  • "Indoor Localization Algorithms for a Human-Operated Backpack System"
    George Chen, John Kua, Stephen Shum, Nikhil Naikal, Matthew Carlberg, Avideh Zakhor
    International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), May 2010
    [paper]

2009

  • "Classifying Urban Landscape in Aerial LIDAR Using 3D Shape Analysis"
    Matthew Carlberg, Peiran Gao, George Chen, Avideh Zakhor
    International Conference on Image Processing (ICIP), November 2009
    [paper]
  • "2D Tree Detection in Large Urban Landscapes Using Aerial LIDAR Data"
    George Chen, Avideh Zakhor
    International Conference on Image Processing (ICIP), November 2009
    [paper]
  • "Image Augmented Laser Scan Matching for Indoor Dead Reckoning"
    Nikhil Naikal, John Kua, George Chen, Avideh Zakhor
    International Conference on Intelligent Robots and Systems (IROS), October 2009
    [paper]

Nonasymptotic theory for nearest neighbor methods in prediction

Although nearest neighbor methods appear in text as early as the 11th century in Alhazen's "Book of Optics", it was not until fairly recently that arguably the most general nonasymptotic theory for nearest neighbor classification was developed by Chaudhuri and Dasgupta (2014). I've worked on a book that goes over some of the latest nonasymptotic theoretical guarantees for nearest neighbor and related kernel regression and classification methods, both in general metric spaces and in contemporary applications where clustering structure appears (time series forecasting, recommendation systems, medical image segmentation). The book also covers some recent advances in approximate nearest neighbor search, explains why decision trees and related ensemble methods are nearest neighbor methods, and discusses the potential for far-away neighbors to help in prediction. I have also developed theory for nearest neighbor and kernel survival analysis (ICML 2019) and helped organize a related workshop at NeurIPS 2017 (slides are available for all the talks).

  • (Monograph/Book) "Explaining the Success of Nearest Neighbor Methods in Prediction"
    George H. Chen, Devavrat Shah
    Foundations and Trends in Machine Learning, May 2018
    [DOI]
  • (Survival analysis) "Nearest Neighbor and Kernel Survival Analysis: Nonasymptotic Error Bounds and Strong Consistency Rates"
    George H. Chen
    International Conference on Machine Learning (ICML), June 2019
    [arXiv] [code] [talk] [poster]
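
As a rough illustration of what a nearest neighbor survival estimate looks like, here is a minimal sketch of a k-nearest-neighbor Kaplan-Meier estimator. It is not the code released with the paper. The function name `knn_kaplan_meier`, its arguments, and the choice of Euclidean distance are assumptions made for this example.

```python
import numpy as np

def knn_kaplan_meier(X_train, times, events, x_query, k):
    """Estimate the survival curve S(t | x_query) from the k nearest training points.

    times:  observed durations (time of event or time of censoring)
    events: 1 if the critical event was observed, 0 if the point was censored
    Returns (event_times, survival_probabilities), both 1D arrays.
    """
    X_train = np.asarray(X_train, dtype=float)
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)

    # Euclidean distance is an assumption here; any metric could be swapped in.
    dists = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    t_nn, e_nn = times[nearest], events[nearest]

    # Standard Kaplan-Meier product, restricted to the k nearest neighbors.
    event_times = np.unique(t_nn[e_nn == 1])
    surv = np.empty(len(event_times))
    s = 1.0
    for i, t in enumerate(event_times):
        at_risk = np.sum(t_nn >= t)                 # neighbors still under observation at t
        died = np.sum((t_nn == t) & (e_nn == 1))    # neighbors with an observed event at t
        s *= 1.0 - died / at_risk
        surv[i] = s
    return event_times, surv
```

Calling `knn_kaplan_meier(X_train, times, events, x_query, k=3)` returns the distinct event times among the 3 nearest neighbors of `x_query`, together with the estimated probability of surviving past each of them. Censored neighbors only contribute to the at-risk counts.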

Chapter 5 of the above monograph is on theoretical results using clustering structure. This chapter is based on my Ph.D. thesis and provides a better overview than my thesis does. Proofs for the chapter are deferred to my thesis:

  • "Latent Source Models for Nonparametric Inference"
    George H. Chen
    Ph.D. thesis, MIT, May 2015
    [paper]
    Received the George M. Sprowls award for best Ph.D. thesis in Computer Science at MIT

My thesis unifies and builds on the following trilogy of papers:

  • "A Latent Source Model for Patch-Based Image Segmentation"
    George H. Chen, Devavrat Shah, Polina Golland
    Medical Image Computing and Computer-Assisted Intervention (MICCAI), October 2015
    [arXiv] [paper] [poster]
    Note: For a more comprehensive exposition of this paper, consider reading Chapter 5 of my Ph.D. thesis.
  • "A Latent Source Model for Online Collaborative Filtering"
    ♣ Guy Bresler, George H. Chen, Devavrat Shah
    Neural Information Processing Systems (NeurIPS), December 2014
    [arXiv - longer version] [paper - short conference version] [poster]
    Selected for spotlight (one of 62/1678 submissions)
    Note: An expanded version including intuition for how collaborative filtering relates to an MAP item recommender and derivations for the examples is in Chapter 4 of my Ph.D. thesis; the notation has also been changed to be more similar to the rest of the trilogy of papers.
  • "A Latent Source Model for Nonparametric Time Series Classification"
    George H. Chen, Stanislav Nikolov, Devavrat Shah
    Neural Information Processing Systems (NeurIPS), December 2013
    [arXiv - longer version] [paper - short conference version] [poster]
    Note: An expanded version with a lower bound on the misclassification rate and further discussion is in Chapter 3 of my Ph.D. thesis.

Handling missing not at random data

In many prediction problems, feature vectors have entries that are missing not at random. A standard approach is to impute the missing entries (i.e., solve a matrix completion problem) before solving the prediction task with the imputed features. Even when there is no follow-up prediction task and we are only doing matrix completion, debiasing guarantees are incompletely understood. With Wei Ma, I am developing a new approach to estimating the probabilities of entries being missing, for use in debiasing matrix completion. Some preliminary progress is in our NeurIPS 2019 paper:

  • "Missing Not at Random in Matrix Completion: The Effectiveness of Estimating Missingness Probabilities Under a Low Nuclear Norm Assumption"
    Wei Ma*, George H. Chen* (* = equal contribution)
    Neural Information Processing Systems (NeurIPS), December 2019
    [arXiv] [code] [poster] [slides]
    Note: We have a longer version in preparation analyzing a collection of missingness probability estimators, with more debiasing guarantees.
    Best paper (theoretical track) at INFORMS Data Mining and Decision Analytics Workshop 2019
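
To give a sense of how estimated missingness probabilities can be used for debiasing, here is a minimal sketch of an inverse-propensity-weighted squared error for matrix completion. It is an illustration only, not the code released with the paper. The function names, the `min_propensity` clipping threshold, and the truncated-SVD propensity estimate are all assumptions for this example; in particular, the truncated SVD is a crude stand-in and not the estimator analyzed in the paper.

```python
import numpy as np

def low_rank_propensities(mask, rank=1, min_propensity=0.05):
    """Crude stand-in propensity estimate: truncated SVD of the 0/1 observation mask.

    This is a hypothetical helper for illustration only.
    """
    U, s, Vt = np.linalg.svd(np.asarray(mask, dtype=float), full_matrices=False)
    P = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    # Clip so that the weights 1 / P stay bounded.
    return np.clip(P, min_propensity, 1.0)

def ipw_squared_error(X_obs, mask, X_hat, propensities, min_propensity=0.05):
    """Inverse-propensity-weighted mean squared error.

    Each observed entry's squared error is divided by its estimated probability
    of being observed, then the sum is averaged over all entries of the matrix.
    """
    mask = np.asarray(mask, dtype=bool)
    p = np.clip(np.asarray(propensities, dtype=float), min_propensity, 1.0)
    sq_err = (np.asarray(X_obs, dtype=float) - np.asarray(X_hat, dtype=float)) ** 2
    # Unobserved entries contribute nothing, whatever placeholder value they hold.
    return np.sum(np.where(mask, sq_err / p, 0.0)) / mask.size
```

If the propensities were exactly the true observation probabilities and all of them were positive, this weighted error would be an unbiased estimate of the mean squared error over the full matrix. That is the reason for upweighting entries that are rarely observed.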

Machine learning for sustainable development

Automatically finding trucks in satellite images to help estimate truck traffic, with an application to freight activity estimation in developing countries:

  • "Truck Traffic Monitoring with Satellite Images"
    Lynn H. Kaack, George H. Chen, M. Granger Morgan
    ACM Conference on Computing and Sustainable Societies (COMPASS), July 2019
    [arXiv]
    (Also presented at the International Conference on Machine Learning (ICML) Workshop on Climate Change, June 2019)

With a startup called CoolCrop, I am working on providing small and marginal farmers in rural India with access to cost-effective refrigeration and predictive analytics:

  • "An Interpretable Produce Price Forecasting System for Small Farmers in India using Collaborative Filtering and Adaptive Nearest Neighbors"
    Wei Ma, Kendall Nowocin, Niraj Marathe, George H. Chen
    Information and Communication Technologies and Development (ICTD), January 2019
    [arXiv]
  • "Toward Reducing Crop Spoilage and Increasing Small Farmer Profits in India: a Simultaneous Hardware and Software Solution"
    George H. Chen, Kendall Nowocin, Niraj Marathe
    Information and Communication Technologies and Development (ICTD), November 2017
    [arXiv]

Previously, as part of a startup called GridForm, I analyzed satellite images of enormous tracts of land to help plan development projects. We focused on helping renewable energy companies bring electricity to rural India. We won the $10,000 grand prize at the 2014 MIT IDEAS Global Challenge. Here's a joint paper with Kush Varshney and Brian Abelson of DataKind:

  • "Targeting Villages for Rural Development Using Satellite Image Analysis"
    Kush R. Varshney, George H. Chen, Brian Abelson, Kendall Nowocin, Vivek Sakhrani, Ling Xu, Brian L. Spatocco
    Big Data, March 2015
    [paper]

Quantifying persuasion

With Emaad Manzoor, Dokyun Lee, and Alan Montgomery, I'm working on quantifying what makes an argument persuasive by mining the ChangeMyView subreddit:

  • "Quantifying Strategic Persuasion — Measuring d(opinion)/d(argument) in Debates on Gun Control"
    Emaad Ahmed Manzoor, Dokyun Lee, George H. Chen, Alan Montgomery
    INFORMS Conference on Information Systems & Technology (CIST), October 2019
    INFORMS Workshop on Data Science, October 2019
    (Also previously presented at ISMS Marketing Science Conference, June 2019)

Online education videos

With Mi Zhou, Pedro Ferreira, and Michael D. Smith, I'm working on identifying which features of an online educational video help predict whether the video will be watched; we're using MasterClass data:

  • "Disrupting Class: Using Video Analytics and Machine Learning to Understand Student Behavior Online"
    Mi Zhou, George H. Chen, Pedro Ferreira, Michael D. Smith
    INFORMS Conference on Information Systems & Technology (CIST), October 2019
    Workshop on Information Systems & Economics (WISE), December 2019

Forecasting patient outcomes in electronic health records

I'm working on survival analysis (predicting time durations until critical events) for healthcare. Some preliminary results for predicting length of stay for pancreatitis patients admitted to an intensive care unit were presented in the ML for health workshop at NeurIPS:

  • "Survival-Supervised Topic Modeling with Anchor Words: Characterizing Pancreatitis Outcomes"
    George H. Chen, Jeremy C. Weiss
    Neural Information Processing Systems (NeurIPS) Workshop on Machine Learning for Health, December 2017
    [arXiv (short workshop version)]
    (Also presented at Society for Medical Decision Making North American Meeting, October 2017)

Real-time medical image analysis

Various real-time medical imaging applications could be enabled by speeding up dimensionality reduction, a subroutine used in many image analysis algorithms. To do this, we create a sparse description of a manifold; our work relates to sparse multivariate regression:

  • "Sparse Projections of Medical Images onto Manifolds"
    George H. Chen, Christian Wachinger, Polina Golland
    Information Processing in Medical Imaging (IPMI), June-July 2013
    [arXiv] [paper] [poster]

Modeling brain activation patterns

My master's thesis presented a probabilistic model of brain activation patterns evoked by functional stimuli such as reading sentences; the model combines sparse coding and image alignment:

  • "Deformation-Invariant Sparse Coding"
    George H. Chen
    Master's thesis, MIT, May 2012
    [paper] [poster]

Preliminary version:

  • "Deformation-Invariant Sparse Coding for Modeling Spatial Variability of Functional Patterns in the Brain"
    George H. Chen, Evelina G. Fedorenko, Nancy G. Kanwisher, Polina Golland
    Neural Information Processing Systems (NeurIPS) Workshop on Machine Learning and Interpretation in Neuroimaging, December 2011
    [paper] [talk slides]

Backpack with sensors for indoor modeling

I developed algorithms that track where this fancy backpack is indoors using laser scanners. After I graduated from Berkeley, this project progressed quite a bit! Be sure to check out the latest developments on the Video and Image Processing Lab's website. Preliminary results:

Photo of backpack with sensors
  • "Indoor Localization and Visualization Using a Human-Operated Backpack System"
    Timothy Liu, Matthew Carlberg, George Chen, Jacky Chen, John Kua, Avideh Zakhor
    International Conference on Indoor Positioning and Indoor Navigation (IPIN), September 2010
    [paper]
  • "Indoor Localization Algorithms for a Human-Operated Backpack System"
    George Chen, John Kua, Stephen Shum, Nikhil Naikal, Matthew Carlberg, Avideh Zakhor
    International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), May 2010
    [paper]
  • "Image Augmented Laser Scan Matching for Indoor Dead Reckoning"
    Nikhil Naikal, John Kua, George Chen, Avideh Zakhor
    International Conference on Intelligent Robots and Systems (IROS), October 2009
    [paper]

Analyzing aerial images of cities

How to automatically find buildings, trees, ground, and water in aerial LIDAR images:

  • "Classifying Urban Landscape in Aerial LIDAR Using 3D Shape Analysis"
    Matthew Carlberg, Peiran Gao, George Chen, Avideh Zakhor
    International Conference on Image Processing (ICIP), November 2009
    [paper]
  • "2D Tree Detection in Large Urban Landscapes Using Aerial LIDAR Data"
    George Chen, Avideh Zakhor
    International Conference on Image Processing (ICIP), November 2009
    [paper]