Yue Zhao

Ph.D. Student in Information Systems

Carnegie Mellon University

Expert with 7 Years of Professional Experience in Anomaly Detection

Author of PyOD, PyGOD, ADBench

Biography

📰 I am on the market with expected graduation in Summer 2023. I am broadly interested in machine learning, data mining and science, and information science and systems positions. I can work in the U.S., Canada, and China without sponsorship; please reach out if you have an open opportunity in either academia or industry! Contact me by Email (zhaoy [AT] cmu.edu) or WeChat (微信) @ yzhao062.

Who am I (赵越)? I am a fourth-year Ph.D. student at Carnegie Mellon University (CMU). Before joining CMU, I earned my master's degree from the University of Toronto (2016) and my bachelor's degree from the University of Cincinnati (2015), and worked as a senior consultant at PwC Canada (2016-19). I am an expert on anomaly detection (a.k.a. outlier detection) algorithms, systems, and their applications in security, healthcare, and finance, with more than 7 years of professional experience and 20+ papers (in JMLR, TKDE, NeurIPS, etc.). I appreciate the support from the Norton Labs Graduate Fellowship.

At CMU, I work with Prof. Leman Akoglu, Prof. Zhihao Jia, and Prof. George H. Chen. I am a member of the Data Analytics Techniques Algorithms (DATA) Lab and the CMU automated learning systems group (Catalyst). Externally, I collaborate with Prof. Jure Leskovec at Stanford, Prof. Xia “Ben” Hu at Rice University, and Prof. Philip S. Yu at UIC.

Contributions to outlier detection systems, benchmarks, and applications: I build automated, scalable, and accelerated machine learning systems (MLSys) to support large-scale, real-world outlier detection applications in security, finance, and healthcare; these systems have millions of downloads. I designed CPU-based (PyOD), GPU-based (TOD), and distributed (SUOD) detection systems for tabular (PyOD), time-series (TODS), and graph data (PyGOD). To understand the characteristics of OD algorithms, I co-authored large-scale benchmarks for tabular data (ADBench), time-series data (paper), and graph data (UNOD). My work has been used by thousands of projects and applications, including at firms like IBM, Morgan Stanley, and Tesla. See more applications.

Research outcomes (primarily for outlier detection if not specified):

Primary field | Secondary | Method | Year | Venue | Lead author
large-scale Benchmark | tabular anomaly detection | ADBench | 2022 | NeurIPS | Y
large-scale Benchmark | graph anomaly detection | UNOD | 2022 | NeurIPS | Y
large-scale Benchmark | sequence anomaly detection | TODS | 2021 | NeurIPS | -
automated machine learning | outlier model selection | MetaOD | 2021 | NeurIPS | Y
automated machine learning | outlier model selection | ELECT | 2022 | ICDM | Y
automated machine learning | outlier HP optimization | HPOD | 2022 | Preprint | Y
automated machine learning | outlier evaluation | IPM | 2021 | Preprint | Y
machine learning systems | - | PyOD | 2019 | JMLR | Y
machine learning systems | time series | TODS | 2020 | AAAI | -
machine learning systems | - | SUOD | 2021 | MLSys | Y
machine learning systems | distributed systems | TOD | 2022 | Preprint | Y
machine learning systems | graph neural networks | PyGOD | 2022 | Preprint | Y
ensemble learning | semi-supervised | XGBOD | 2018 | IJCNN | Y
ensemble learning | - | LSCP | 2019 | SDM | Y
ensemble learning | machine learning systems | combo | 2020 | AAAI | Y
ensemble learning | interpretable ML | COPOD | 2020 | ICDM | Y
ensemble learning | interpretable ML | ECOD | 2022 | TKDE | Y
graph mining | finance | AutoAudit | 2020 | BigData | -
graph neural networks | contrastive learning | CONAD | 2022 | PAKDD | -
Diffusion Models | survey | - | 2022 | Preprint | -
AI x Science | synthetic data | SynC | 2020 | ICDMW | -
AI x Science | healthcare AI | PyHealth | 2020 | Preprint | Y
AI x Science | Datasets & Benchmark | TDC | 2021 | NeurIPS | -
AI x Science | Datasets & Benchmark | TDC V2 | 2022 | NCHEMB | -

Open-source Contribution: I have led or contributed as a core member to more than 10 ML open-source initiatives, receiving 14,000 GitHub stars (top 0.002%: ranked 800 out of 40M GitHub users) and >10,000,000 total downloads. Popular ones:

  • PyOD: A Python Toolbox for Scalable Outlier Detection (Anomaly Detection).
  • ADBench: The most comprehensive tabular anomaly detection benchmark (30 anomaly detection algorithms on 57 benchmark datasets).
  • TOD: Tensor-based outlier detection, the first large-scale GPU-based system for acceleration!
  • SUOD: An Acceleration System for Large-scale Heterogeneous Outlier Detection.
  • anomaly-detection-resources: The most starred resources (books, courses, etc.)!
  • Python Graph Outlier Detection (PyGOD): A Python Library for Graph Outlier Detection.
  • Therapeutics Data Commons (TDC): Machine learning for drug discovery.
  • PyTorch Geometric (PyG): Graph Neural Network Library for PyTorch. Contributed to profiler & benchmarking, and heterogeneous data transformation.
  • combo: A Python Toolbox for ML Model Combination (Ensemble Learning).
  • TODS: Time-series Outlier Detection. Contributed to core detection models.
  • MetaOD: Automatic Unsupervised Outlier Model Selection (AutoML).

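[#1] I organize and maintain several WeChat groups for machine learning research, including:

  • anomaly detection (异常检测微信讨论组),
  • machine learning systems (机器学习系统讨论组),
  • ML Ph.D. (北美ML博士求职分享群), where we share postdoc, intern, and full-time jobs for ML Ph.D.s and Ph.D. students,
  • groups for other machine learning research directions (其他机器学习研究方向群).

Join them by scanning WeChat (微信) @ 加群小助手!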

[#2] I am a dedicated writer with more than 300 articles (in Chinese) and 190,000 followers on Zhihu (知乎), the Chinese Quora (200 million+ registered users). I have been officially recognized as a “Top Writer” (优秀回答者) in four fields (AI, ML, DM, and STAT). My articles have been read more than 20,000,000 times. See my Zhihu page (微调).

Contact me by Email (zhaoy [AT] cmu.edu) or WeChat (微信) @ yzhao062.

Interests

  • Outlier & Anomaly Detection
  • AI + Security
  • Machine Learning Systems (MLSys)
  • Automated Machine Learning
  • Outlier Detection Systems (ODSys)
  • Graph Neural Networks
  • Ensemble Learning
  • Healthcare AI & ML for Therapeutics
  • Information Systems

Education

  • Ph.D. Student in Information Systems, 2019-2023

    Carnegie Mellon University

  • M.S. in Applied Computing, 2015-2017

    University of Toronto

  • B.S. in Computer Engineering (Minor in Computer Science and Math), 2015

    University of Cincinnati

  • High School Diploma, 2010

    Shanxi Experimental Secondary School 山西省实验中学

Miscellaneous

News & Travel

Sep 2022: Artificial Intelligence Foundation for Therapeutic Science was published in Nature Chemical Biology. The paper describes Therapeutics Data Commons (TDC) and its various use cases, laying a foundation for therapeutic science.

Sep 2022: Two large-scale anomaly detection benchmarks for tabular data (ADBench) and graph data (UNOD) were accepted at NeurIPS 2022.

  • ADBench is arguably my most important work: this 45-page paper analyzes 30 algorithms on 57 datasets, with around 100,000 experiments. If you are doing anomaly detection, I believe this is a must-read.

Sep 2022: Check out our comprehensive survey on diffusion models. Star the code repo!

Aug 2022: ELECT: Toward Unsupervised Outlier Model Selection was accepted to the IEEE International Conference on Data Mining (ICDM) as a regular paper!

Jul 2022: 🌟 Reached 1000 citations on Google Scholar!



Publications

See my Google Scholar, DBLP, ORCID, and ResearchGate.

Preprints & Working Papers

[w22f] Diffusion Models: A Comprehensive Survey of Methods and Applications, with Ling Yang, Zhilong Zhang, Shenda Hong, Runsheng Xu, Yingxia Shao, Wentao Zhang, Ming-Hsuan Yang, and Bin Cui. Preprint.

[w22e] Towards Unsupervised HPO for Outlier Detection, with Leman Akoglu. Preprint.

[w22d] ADMoE: Anomaly Detection with Mixture-of-Experts from Noisy Labels, with Guoqing Zheng, Subhabrata Mukherjee, Robert McCann, Ahmed Awadallah. Preprint.

[w22a] TOD: GPU-accelerated Outlier Detection via Tensor Operations, with George H. Chen and Zhihao Jia. Under revision at VLDB 2022. Preprint, code.

[w21c] A Large-scale Study on Unsupervised Outlier Model Selection: Do Internal Strategies Suffice? with Martin Q. Ma (equal contribution), Xiaorong Zhang, and Leman Akoglu. Preprint.


Peer-reviewed Papers

(2022). ADBench: Anomaly Detection Benchmark. Advances in Neural Information Processing Systems (NeurIPS) (Equal contribution).

(2022). Benchmarking Node Outlier Detection on Graphs. Advances in Neural Information Processing Systems (NeurIPS) (First three authors contributed equally).

(2022). ELECT: Toward Unsupervised Outlier Model Selection. IEEE International Conference on Data Mining (ICDM).

(2022). ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions. IEEE Transactions on Knowledge and Data Engineering (TKDE) (Co-first author; equal contribution).

(2021). Automatic Unsupervised Outlier Model Selection. Advances in Neural Information Processing Systems (NeurIPS).

(2020). Combining Machine Learning Models Using combo Library. Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), demo track.

(2020). SynC: A Unified Framework for Generating Synthetic Population with Gaussian Copula. Workshops at the Thirty-Fourth AAAI Conference on Artificial Intelligence.

(2018). DCSO: Dynamic Combination of Detector Scores for Outlier Ensembles. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Workshop on Outlier Detection De-constructed (ODD).

(2017). An empirical study of touch-based authentication methods on smartwatches. Proceedings of the 2017 ACM International Symposium on Wearable Computers (ISWC) (Co-first author; equal contribution).

Services

I am open to peer-review and organizing opportunities in the fields of outlier & anomaly detection, ensemble learning, clustering, ML libraries & systems, and information systems.

Journal Reviewer

Program Committee and/or Reviewer for Conferences and Workshops

Awards and Funds

The Norton Labs Graduate Fellowship

The Norton Labs Graduate Fellowship provides up to $20,000 USD that may be used to cover one year of the student’s tuition fees and/or reimburse expenses incurred by the student during collaboration with Norton Labs. I was selected as one of only two graduate students to receive the award.

CMU GSA/Provost Conference Funding

Part of the travel grant for attending ICDM 2020.

AAAI Student Travel Grant & CMU GSA/Provost Conference Funding

Part of the travel grant for attending AAAI 2020.

Mitacs-Accelerate Research and Development Funding

Project IT07884 ($30,000): machine learning in HR analytics.

Mantei/Mae Award & Scholar

Awarded to the highest-performing students in Electrical Engineering, Computer Engineering, and Computer Science ($40,000 over four years).

University Global Award and Scholarship

Awarded to top-performing international students ($32,000 over four years).

Experience

Professional Positions

Machine Learning Research Intern

NortonLifeLock Research Group

May 2022 – Present
Supervised by Dr. Acar Tamersoy and Dr. Kevin Roundy.

Machine Learning Research Intern

Microsoft Research

Jan 2022 – Mar 2022

Designed weakly supervised anomaly detection algorithms.

Supervised by Dr. Guoqing Zheng and Dr. Subhabrata (Subho) Mukherjee.

Visiting Student Researcher

Stanford University, Computer Science Department

May 2021 – Aug 2021 Stanford, CA, USA

Designed new GNN systems and models.

Supervised by Prof. Jure Leskovec.

Machine Learning Research Intern

IQVIA, Analytics Center of Excellence

May 2020 – Aug 2020 Boston, MA, USA

Designed new machine learning systems and models in healthcare.

Supervised by Dr. Cao (Danica) Xiao (IQVIA) and Prof. Jimeng Sun (UIUC).

Senior Consultant

PwC Canada, Consulting & Deals

Feb 2017 – Jun 2019 Toronto, ON, Canada

I was a senior consultant with the following duties:

  • Designed fraud analytic solutions for major Canadian banks and insurance firms.
  • Led applied data analytics projects, e.g., client segmentation and churn analysis.
  • Developed multiple pricing optimization models with statistical methods.

Research Associate (Intern)

PwC Canada, Consulting & Deals

May 2016 – Dec 2016 Toronto, ON, Canada

Applied research in people analytics with machine learning.

Supervised by Prof. Anthony Bonner; the project was partly supported by Mitacs-Accelerate Research and Development Funding (IT07884).

Software Engineer (Contract & Intern)

Siemens PLM Software USA

Mar 2012 – Dec 2014 Cincinnati, Ohio, USA

As a co-op student and contractor, my work included:

  • Managed a Java project to transition the LabManager system to vCloud Director.
  • Refactored outdated automation code and added new modules and JUnit test cases.
  • Led a C++ Code Coverage project on Teamcenter platform to strengthen its stability.

Teaching Positions

Teaching Assistant

Carnegie Mellon University, Heinz College of Information Systems and Public Policy

Feb 2020 – Present Pittsburgh, PA, United States

I am a teaching assistant for the following courses:

  • Intro to Artificial Intelligence taught by Prof. David Steier (Fall 2020, Spring 2021, Fall 2021, Spring 2022).
  • Digital Transformation taught by Prof. James Riel (Spring 2022).
  • Statistics for IT Managers taught by Prof. Daniel Nagin (Fall 2021).

The main duties include grading assignments and giving lectures on selected topics.

Teaching Assistant

University of Toronto, Department of Computer Science

Sep 2015 – Dec 2015 Toronto, ON, Canada
I was a teaching assistant for Embedded Systems taught by Prof. Philip Anderson.

Teaching Assistant

University of Cincinnati, Department of Electrical Engineering & Computer Science

Sep 2014 – Dec 2014 Cincinnati, OH, USA
I was a teaching assistant for Introduction to Programming taught by Prof. George Purdy.

Open-source Initiatives

To find more of my open-source initiatives, see my GitHub.

ADBench

ADBench: Anomaly Detection Benchmark

PyGOD (Python Graph Outlier Detection)

A Python Library for Graph Outlier Detection (Anomaly Detection) and Its Benchmark

Therapeutics Data Commons (TDC)

Machine Learning Datasets and Tasks for Drug Discovery and Development

SUOD

SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection

combo

A Python Toolbox for Machine Learning Model Combination.

Python Outlier Detection Toolbox

PyOD: A Python Toolbox for Scalable Outlier Detection (Anomaly Detection).

Talks

Recent Talks

Contact

[WeChat (微信) @ yzhao062 | 微信 @ 加群小助手]