Yue Zhao

Ph.D. Candidate

Carnegie Mellon University

Unsupervised ML, ML Systems (MLSys), Automated ML, Anomaly Detection

Author of PyOD, PyGOD, ADBench

Biography

📰 I am on the market for tenure-track AP positions. I am broadly interested in machine learning (ML), data mining/science, and information systems positions. I am a US and Canadian permanent resident with full work authorization. See my latest CV.

Summary. In June 2023, I will finish my Ph.D. in 3.5 years at Carnegie Mellon University (CMU), with support from the CMU Presidential Fellowship and the Norton Graduate Fellowship. My research accelerates and automates unsupervised ML, focusing on (1) how to support large-scale learning tasks with ML systems and (2) how to automate unsupervised ML model selection and hyperparameter optimization. I build AI/ML applications in healthcare and security.

Research Overview

Mentors. At CMU, I work with Prof. Leman Akoglu on automated data mining, Prof. Zhihao Jia on machine learning systems, and Prof. George H. Chen on general ML. I am a member of the CMU automated learning systems group (Catalyst) and the Data Analytics Techniques Algorithms (DATA) Lab. I have collaborated with Prof. Jure Leskovec at Stanford and Prof. Philip S. Yu at UIC.


Open-source Contribution: I have led or contributed as a core member to more than 10 ML open-source initiatives, receiving 15,000 GitHub stars (top 0.002%: ranked 800 out of 40M GitHub users) and >12,000,000 total downloads.

I am a dedicated writer with more than 300 articles (in Chinese) and 200,000 followers on Zhihu (知乎), the Chinese counterpart of Quora (200 million+ registered users). I have been officially recognized as a “Top Writer” (优秀回答者) in four fields (AI, ML, DM, and STAT). My articles have been read more than 20,000,000 times. See my Zhihu page (微调).

Contact me by Email (zhaoy [AT] cmu.edu) or WeChat @ yzhao062.

Interests

  • Unsupervised ML
  • ML Systems (MLSys)
  • Automated ML (AutoML)
  • Anomaly Detection
  • Healthcare AI
  • AI + Security
  • Graph Neural Networks
  • Ensemble Learning

Education

  • Ph.D. Candidate in Information Systems, 2019-2023

    Carnegie Mellon University

  • M.S. in Applied Computing, 2015-2017

    University of Toronto

  • B.S. in Computer Engineering (Minor in Computer Science and Math), 2015

    University of Cincinnati

  • High School Diploma, 2010

    Shanxi Experimental Secondary School 山西省实验中学

Miscellaneous

Recent and Upcoming Talks

Mar 2023: Automated and Scalable Algorithms and Systems for Unsupervised ML @ USC

Mar 2023: Automated and Scalable Algorithms and Systems for Unsupervised ML @ UC Davis

Feb 2023: Automated and Scalable Algorithms and Systems for Unsupervised ML @ SBU

Feb 2023: Automated and Scalable Algorithms and Systems for Unsupervised ML @ U Chicago

Feb 2023: Automated and Scalable Algorithms and Systems for Unsupervised ML @ CMU (PDL)

Feb 2023: Automated and Scalable Algorithms and Systems for Unsupervised ML @ UCM

News & Travel

Feb 2023: Weakly Supervised Anomaly Detection: A Survey is out! [code]

Dec 2022: The Need for Unsupervised Outlier Model Selection: A Review and Evaluation of Internal Evaluation Strategies will appear in ACM SIGKDD Explorations Newsletter 2023 (joint work with Leman Akoglu).

Nov 2022: Happy to serve as the workflow co-chair for KDD 2023!

Nov 2022: ADMoE: Anomaly Detection with Mixture-of-Experts from Noisy Labels will appear in AAAI 2023. It is the first framework to use multiple sets of noisy labels for detection.

Oct 2022: A new systems paper is out: TOD: GPU-accelerated Outlier Detection via Tensor Operations, with George H. Chen and Zhihao Jia. VLDB paper, Code.

Oct 2022: Great news! Our proposal (led by Prof. Zhihao Jia) for AI-assisted systems has been funded via Meta 2022 AI4AI Research!


Publications

See my Google Scholar, DBLP, ORCID, and ResearchGate.

Research outcomes. I have published more than 30 papers in leading journals and conferences such as JMLR, NeurIPS, VLDB, and MLSys (primarily on unsupervised ML unless otherwise specified):

| Primary field | Secondary | Method | Year | Venue | Lead author |
|---|---|---|---|---|---|
| large-scale benchmark | tabular data | ADBench | 2022 | NeurIPS | Y |
| large-scale benchmark | graph learning | BOND | 2022 | NeurIPS | Y |
| large-scale benchmark | sequence data | TODS | 2021 | NeurIPS | |
| automated machine learning | model selection | MetaOD | 2021 | NeurIPS | Y |
| automated machine learning | model selection | ELECT | 2022 | ICDM | Y |
| automated machine learning | HP optimization | HPOD | 2022 | Preprint | Y |
| automated machine learning | evaluation metrics | IPM | 2023 | KDD Explor. | Y |
| machine learning systems | | PyOD | 2019 | JMLR | Y |
| machine learning systems | time series | TODS | 2020 | AAAI | |
| machine learning systems | | SUOD | 2021 | MLSys | Y |
| machine learning systems | distributed systems | TOD | 2022 | VLDB | Y |
| machine learning systems | graph neural networks | PyGOD | 2022 | Preprint | Y |
| robust ML | semi-supervised | XGBOD | 2018 | IJCNN | Y |
| robust ML | ensemble learning | LSCP | 2019 | SDM | Y |
| robust ML | ensemble learning | combo | 2020 | AAAI | Y |
| robust ML | ensemble learning | COPOD | 2020 | ICDM | Y |
| robust ML | ensemble learning | ECOD | 2022 | TKDE | Y |
| robust ML | noisy label learning | ADMoE | 2023 | AAAI | Y |
| graph mining | finance | AutoAudit | 2020 | BigData | |
| graph neural networks | contrastive learning | CONAD | 2022 | PAKDD | |
| Diffusion Models | | survey | 2022 | Preprint | |
| AI x Science | synthetic data | SynC | 2020 | ICDMW | |
| AI x Science | healthcare AI | PyHealth | 2020 | Preprint | Y |
| AI x Science | Datasets & Benchmark | TDC | 2021 | NeurIPS | |
| AI x Science | Datasets & Benchmark | TDC V2 | 2022 | NCHEMB | |


Preprints & Working Papers

[w23a] Weakly Supervised Anomaly Detection: A Survey, with Minqi Jiang, Chaochuan Hou, Ao Zheng, Xiyang Hu, Songqiao Han, Hailiang Huang, Xiangnan He, Philip S. Yu. Preprint.

[w22f] Diffusion Models: A Comprehensive Survey of Methods and Applications, with Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yingxia Shao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Preprint.

[w22e] Hyperparameter Optimization for Unsupervised Outlier Detection, with Leman Akoglu. Preprint.


Peer-reviewed Papers

(2023). TOD: GPU-accelerated Outlier Detection via Tensor Operations. International Conference on Very Large Data Bases (VLDB).

PDF Code DOI

(2022). ADBench: Anomaly Detection Benchmark. Advances in Neural Information Processing Systems (NeurIPS) (Equal contribution).

PDF Code

(2022). ELECT: Toward Unsupervised Outlier Model Selection. IEEE International Conference on Data Mining (ICDM).

PDF Code

(2022). ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions. IEEE Transactions on Knowledge and Data Engineering (TKDE) (Co-first author; equal contribution).

PDF Code DOI IEEE Xplore

(2021). Automatic Unsupervised Outlier Model Selection. Advances in Neural Information Processing Systems (NeurIPS).

PDF Code Project

(2020). Combining Machine Learning Models Using combo Library. Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), demo track.

PDF Code Video DOI

(2020). SynC: A Unified Framework for Generating Synthetic Population with Gaussian Copula. Workshops at the Thirty-Fourth AAAI Conference on Artificial Intelligence.

PDF Code PPAI Arxiv

(2018). DCSO: Dynamic Combination of Detector Scores for Outlier Ensembles. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Workshop on Outlier Detection De-constructed (ODD).

PDF Slides

(2017). An empirical study of touch-based authentication methods on smartwatches. Proceedings of the 2017 ACM International Symposium on Wearable Computers (ISWC) (Co-first author; equal contribution).

PDF DOI ACM DL

Awards and Funds

Meta 2022 AI4AI Research Award

To foster further innovation in this area, and to deepen our collaboration with academia, Meta is pleased to invite faculty to respond to this call for research proposals pertaining to the aforementioned topics.

The Norton Labs Graduate Fellowship

The Norton Labs Graduate Fellowship provides up to $20,000 USD that may be used to cover one year of the student's tuition fees and/or reimburse expenses incurred by the student during collaboration with Norton Labs. Selected as one of only two graduate students to receive the award.

CMU Presidential Fellowship

PwC Presidential Fellowship ($80,000).

Mitacs-Accelerate Research and Development Funding

Project IT07884 ($30,000): machine learning in HR analytics.

Mantei/Mae Award & Scholar

Awarded to the highest-performing students in Electrical Engineering, Computer Engineering, and Computer Science ($40,000 over four years).

University Global Award and Scholarship

Awarded to top-performing international students ($32,000 over four years).

Services

Conference Organizing Committee

External Reviewer for Funding Proposals

Journal Reviewer

Program Committee and/or Reviewer for Conferences and Workshops

Experience

Professional Positions


Machine Learning Research Intern

NortonLifeLock Research Group

May 2022 – Dec 2022
Supervised by Dr. Acar Tamersoy and Dr. Kevin Roundy.

Machine Learning Research Intern

Microsoft Research

Jan 2022 – Mar 2022

Designed weakly supervised anomaly detection algorithms.

Supervised by Dr. Guoqing Zheng and Dr. Subhabrata (Subho) Mukherjee.


Visiting Student Researcher

Stanford University, Computer Science Department

May 2021 – Aug 2021 Stanford, CA, USA

Designed new GNN systems and models.

Supervised by Prof. Jure Leskovec.


Machine Learning Research Intern

IQVIA, Analytics Center of Excellence

May 2020 – Aug 2020 Boston, MA, USA

Designed new machine learning systems and models in healthcare.

Supervised by Dr. Cao (Danica) Xiao (IQVIA) and Prof. Jimeng Sun (UIUC).


Senior Consultant

PwC Canada, Consulting & Deals

Feb 2017 – Jun 2019 Toronto, ON, Canada
I was a senior consultant with the following duties:

  • Designed fraud analytic solutions for major Canadian banks and insurance firms.
  • Led applied data analytics projects, e.g., client segmentation and churn analysis.
  • Developed multiple pricing optimization models with statistical methods.

Research Associate (Intern)

PwC Canada, Consulting & Deals

May 2016 – Dec 2016 Toronto, ON, Canada

Applied research in people analytics with machine learning.

Supervised by Prof. Anthony Bonner. The project was partly supported by Mitacs-Accelerate Research and Development Funding (IT07884).


Software Engineer (Contract & Intern)

Siemens PLM Software USA

Mar 2012 – Dec 2014 Cincinnati, Ohio, USA
As a co-op student and contractor, my work included:

  • Managed a Java project to transition the LabManager system to vCloud Director.
  • Refactored outdated automation code and added new modules and JUnit test cases.
  • Led a C++ code coverage project on the Teamcenter platform to strengthen its stability.

Teaching Positions


Teaching Assistant for multiple courses

Carnegie Mellon University, Heinz College of Information Systems

Feb 2020 – May 2022 Pittsburgh, PA, United States

I was a teaching assistant for the following courses:

  • Intro to Artificial Intelligence taught by Prof. David Steier (Fall 2020, Spring 2021, Fall 2021, Spring 2022).
  • Digital Transformation taught by Prof. James Riel (Spring 2022).
  • Statistics for IT Managers taught by Prof. Daniel Nagin (Fall 2021).

The main duties included grading assignments and giving lectures on selected topics.


Teaching Assistant

University of Toronto, Department of Computer Science

Sep 2015 – Dec 2015 Toronto, ON, Canada
I was a teaching assistant for Embedded Systems taught by Prof. Philip Anderson.

Teaching Assistant

University of Cincinnati, Department of Electrical Engineering & Computer Science

Sep 2014 – Dec 2014 Cincinnati, OH, USA
I was a teaching assistant for Introduction to Programming taught by Prof. George Purdy.

Open-source Initiatives

To find more of my open-source initiatives, see my GitHub. Popular ones:

  • PyOD: A Python Toolbox for Scalable Outlier Detection (Anomaly Detection).
  • Therapeutics Data Commons (TDC): Machine learning for drug discovery.
  • ADBench: The most comprehensive tabular anomaly detection benchmark (30 anomaly detection algorithms on 57 benchmark datasets).
  • TOD: Tensor-based Outlier Detection. The first large-scale GPU-based system for accelerating outlier detection!
  • PyTorch Geometric (PyG): Graph Neural Network Library for PyTorch. Contributed to profiler & benchmarking, and heterogeneous data transformation.
  • SUOD: An Acceleration System for Large-scale Heterogeneous Outlier Detection.
  • Python Graph Outlier Detection (PyGOD): A Python Library for Graph Outlier Detection.
  • combo: A Python Toolbox for ML Model Combination (Ensemble Learning).
  • TODS: Time-series Outlier Detection. Contributed to core detection models.

ADBench

ADBench: Anomaly Detection Benchmark

PyGOD (Python Graph Outlier Detection)

A Python Library for Graph Outlier Detection (Anomaly Detection) and Its Benchmark

Therapeutics Data Commons (TDC)

Machine Learning Datasets and Tasks for Drug Discovery and Development

SUOD

SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection

combo

A Python Toolbox for Machine Learning Model Combination.

Python Outlier Detection Toolbox

PyOD: A Python Toolbox for Scalable Outlier Detection (Anomaly Detection).

Talks

Previous Talks

Contact

[WeChat (微信) @ yzhao062 | WeChat @ 加群小助手 (group-joining assistant)]