Biography

My name is Yue ZHAO (赵越 in Chinese). I am a third-year Ph.D. student at Carnegie Mellon University (CMU) and a former management consultant at PwC Canada. I have led or participated in more than 10 ML open-source initiatives, which have received 10,000 GitHub stars (top 0.002%: ranked 900 out of 40M GitHub users) and >400,0000 total downloads. Popular ones include:

  • [JMLR] PyOD: A Python Toolbox for Scalable Outlier Detection (Anomaly Detection).
  • [MLSys] SUOD: An Acceleration System for Large-scale Heterogeneous Outlier Detection (Anomaly Detection).
  • TDC: An extensive machine learning data hub for therapeutic tasks (ML for Therapeutics).
  • MetaOD: Automatic Unsupervised Outlier Model Selection (Outlier Detection and AutoML).
  • PyHealth: A Python Library for Healthcare AI (ML for Healthcare).
  • [AAAI] combo: A Python Toolbox for ML Model Combination (Ensemble Learning).
  • [AAAI] TODS: Time-series Outlier Detection System. Contributed to core detection model.

I specialize in designing and building machine learning systems (MLSys), with implementations and applications in outlier detection, healthcare, graph neural networks, and ensemble learning. My research focuses on the intersection of two fields:

  • data mining topics related to outlier detection (anomaly detection)
  • machine learning systems that can speed/scale up and automate data mining and machine learning algorithms

At CMU, I work with Prof. Leman Akoglu from DATA Lab on outlier detection, Prof. George H. Chen on general ML and statistics, and Prof. Zhihao Jia from Catalyst on machine learning systems. I am currently visiting Prof. Jure Leskovec at SNAP, Stanford University.

Startup and VC: I am interested in capitalizing on my expertise in anomaly detection. Let's connect!

General Notes: I am open to ML/DM internships (2022). Please reach out :)

Contact me by Email (zhaoy [AT] cmu.edu) or WeChat (微信) @ yzhao062.

[#1] I am open to collaboration opportunities (anytime & anywhere) and research internships (summer 2022). I can legally work in the United States (CPT), Canada (permanent residency), and China (permanent residency). I have worked with professionals from both industry and academia (e.g., Stanford, Harvard, Facebook).

[#2] Call for review opportunities. I am looking for paper review, tutorial, workshop, and talk opportunities (in anomaly detection, scalable ML, machine learning systems, and AutoML).

[#3] I host a WeChat group on anomaly detection (异常检测微信讨论组) with more than three hundred researchers (e.g., Berkeley, MIT, Tsinghua) and industry practitioners (e.g., Alibaba, IBM, Facebook) for collaboration and intern/full-time opportunities. Ping me to join!

[#4] I am a dedicated writer with more than 300 articles (in Chinese) and 160,000 followers on Zhihu (知乎), the Chinese Quora (200 million+ registered users). Since 2018, I have been officially recognized as a “Top Zhihu Writer” (优秀回答者) in four fields (AI, ML, DM, and STAT). My articles have been read more than 20,000,000 times. See my Zhihu page (微调).

Interests

  • Outlier & Anomaly Detection
  • Machine Learning Systems (MLSys)
  • Automated Machine Learning
  • Scalable Machine Learning
  • Parallel Computing
  • Healthcare AI & ML for Therapeutics
  • Graph Neural Networks
  • Ensemble Learning
  • Information Systems

Education

  • Ph.D. Student in Information Systems and Management, 2019-2024

    Carnegie Mellon University

  • M.S. in Applied Computing, 2015-2017

    University of Toronto

  • B.S. in Computer Engineering (Minor in Computer Science and Math), 2015

    University of Cincinnati

  • High School Diploma, 2010

    Shanxi Experimental Secondary School 山西省实验中学

Miscellaneous

News & Travel

Jun 2021: Two impactful large-scale ML initiatives are under submission at NeurIPS 2021 (Datasets and Benchmarks). Please check out and follow them on OpenReview: (1) Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development and (2) Revisiting Time Series Outlier Detection: Definitions and Benchmarks.

May-Aug 2021: Visiting SNAP at Stanford University, hosted by Prof. Jure Leskovec.

May 2021: We have a new journal paper titled Copula-Based Outlier Detection under review. It extends our ICDM '20 paper with more theoretical analysis. See the extended journal version!

Apr 2021: How to evaluate/select outlier detection models without any external information (e.g., ground truth)? We have a new preprint on using internal strategies for model selection. Do they suffice? Check out our paper!


Profile & Casual Pictures


Resources

Services

I am open to peer review and organizing opportunities in the fields of outlier & anomaly detection, ensemble learning, clustering, ML libraries & systems, and information systems.

Journal/Conference Reviewer

Journal:

Conference:

Program Committee

Publications

See my Google Scholar, DBLP, ORCID, and ResearchGate.

Preprints & Working Papers

[w21f] Revisiting Time Series Outlier Detection: Definitions and Benchmarks, with Kwei-Herng Lai, Daochen Zha, Junjie Xu, Guanchu Wang, Xia Hu. Submitted to a major CS conference, under review. Preprint.

[w21e] Automatic Unsupervised Outlier Model Selection, with Ryan A. Rossi and Leman Akoglu. Submitted to a major CS conference, under review. Preprint.

[w21d] Copula-Based Outlier Detection, with Zheng Li, Xiyang Hu, Nicola Botta, Cezar Ionescu, and George H. Chen. Submitted to a key ML journal, under review. Preprint.

[w21c] A Large-scale Study on Unsupervised Outlier Model Selection: Do Internal Strategies Suffice? with Martin Q. Ma (equal contribution), Xiaorong Zhang, and Leman Akoglu. Preprint.

[w21b] Therapeutics Data Commons: Machine Learning Datasets and Tasks for Therapeutics, with Kexin Huang, Tianfan Fu, Wenhao Gao, Yusuf Roohani, Jure Leskovec, Connor W. Coley, Cao Xiao, Jimeng Sun, Marinka Zitnik. Submitted to a major CS conference, under review. Preprint.

[w21a] PyHealth: A Python Library for Health Predictive Models, with Zhi Qiao (equal contribution), Cao (Danica) Xiao, Lucas M. Glass, and Jimeng Sun. Preprint.


Peer-reviewed Papers

(2020). Combining Machine Learning Models Using combo Library. Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), demo track.

PDF Code Video DOI

(2020). SynC: A Unified Framework for Generating Synthetic Population with Gaussian Copula. Workshops at the Thirty-Fourth AAAI Conference on Artificial Intelligence.

PDF Code PPAI Arxiv

(2018). DCSO: Dynamic Combination of Detector Scores for Outlier Ensembles. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Workshop on Outlier Detection De-constructed (ODD).

PDF Poster Slides

(2017). An empirical study of touch-based authentication methods on smartwatches. Proceedings of the 2017 ACM International Symposium on Wearable Computers (Equal contribution).

PDF DOI ACM DL

Experience

Professional Positions


Visiting Student Researcher

Stanford University, Computer Science Department

May 2021 – Aug 2021 Stanford, CA, USA

Designed new GNN systems and models.

Supervised by Prof. Jure Leskovec.


Machine Learning Research Intern

IQVIA, Analytics Center of Excellence

May 2020 – Aug 2020 Boston, MA, USA

Designed new machine learning systems and models in healthcare.

Supervised by Dr. Cao (Danica) Xiao (IQVIA) and Prof. Jimeng Sun (UIUC).


Senior Consultant

PwC Canada, Consulting & Deals

Feb 2017 – Jun 2019 Toronto, ON, Canada
I was a senior consultant with the following duties:

  • Designed fraud analytic solutions for major Canadian banks and insurance firms.
  • Led applied data analytics projects, e.g., client segmentation and churn analysis.
  • Developed multiple pricing optimization models with statistical methods.

Research Associate (Intern)

PwC Canada, Consulting & Deals

May 2016 – Dec 2016 Toronto, ON, Canada

Applied research in people analytics with machine learning.

Supervised by Prof. Anthony Bonner. The project was partly supported by Mitacs-Accelerate Research and Development Funding (IT07884).


Software Engineer (Contract & Intern)

Siemens PLM Software USA

Mar 2012 – Dec 2014 Cincinnati, Ohio, USA
As a co-op student and contractor, my work included:

  • Managed a Java project to transition the LabManager system to vCloud Director.
  • Refactored outdated automation code and added new modules and JUnit test cases.
  • Led a C++ Code Coverage project on Teamcenter platform to strengthen its stability.

Teaching Positions


Teaching Assistant

Carnegie Mellon University, Heinz College of Information Systems and Public Policy

Feb 2021 – May 2021 Pittsburgh, PA, USA
I was a teaching assistant for Intro to Artificial Intelligence taught by Prof. David Steier. I graded assignments and gave lectures on selected topics.

Teaching Assistant

Carnegie Mellon University, Heinz College of Information Systems and Public Policy

Sep 2020 – Dec 2020 Pittsburgh, PA, USA
I was a teaching assistant for Intro to Artificial Intelligence taught by Prof. David Steier. I graded assignments and gave lectures on selected topics.

Teaching Assistant

University of Toronto, Department of Computer Science

Sep 2015 – Dec 2015 Toronto, ON, Canada
I was a teaching assistant for Embedded Systems taught by Prof. Philip Anderson.

Teaching Assistant

University of Cincinnati, Department of Electrical Engineering & Computer Science

Sep 2014 – Dec 2014 Cincinnati, OH, USA
I was a teaching assistant for Introduction to Programming taught by Prof. George Purdy.

Funds and Awards

CMU GSA/Provost Conference Funding

Part of the travel grant for attending ICDM 2020.

AAAI Student Travel Grant & CMU GSA/Provost Conference Funding

Part of the travel grant for attending AAAI 2020.

Mitacs-Accelerate Research and Development Funding

Project IT07884 ($30,000): machine learning in HR analytics.

Mantei/Mae Award & Scholar

Awarded to highest-performing students in Electrical Engineering, Computer Engineering, and Computer Science ($40,000 in four years).

University Global Award and Scholarship

Awarded to top performing international students ($32,000 in four years).

Open-source Initiatives

I am happy to give talks on the tools I have built, e.g., PyOD, combo, and SUOD. I am also willing to share my experience as an ML developer and researcher, especially on how to build ML tools from the design stage. Please drop me a line to invite me :)

I am an enthusiastic open-source developer: I build machine learning libraries and systems. Specifically, I initiated the Python Outlier Detection library (PyOD) in 2018, which has become the most popular Python outlier detection toolkit. I also initiated combo: A Python Toolbox for Machine Learning Model Combination in July 2019; it is currently under active development.

I am currently working on a new ML system called SUOD (Scalable Unsupervised Outlier Detection), which accelerates model training and prediction when a large number of outlier detectors are applied to large, high-dimensional datasets. Watch/Star/Follow welcome!

SUOD

SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection

combo

A Python Toolbox for Machine Learning Model Combination.

Python Outlier Detection Toolbox

PyOD: A Python Toolbox for Scalable Outlier Detection (Anomaly Detection).

Talks

Anomaly Detection Algorithms, Applications, and Systems (in Chinese)

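This video introduces a variety of anomaly detection algorithms, related applications, and practical tips, and looks ahead to future research.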

Contact

WeChat (微信) @ yzhao062