SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection


Outlier detection (OD) is a key data mining task for identifying abnormal objects from general samples with numerous high-stake applications including fraud detection and intrusion detection. Due to the lack of ground truth labels, practitioners often have to build a large number of unsupervised models that are heterogeneous (i.e., different algorithms and hyperparameters) for further combination and analysis with ensemble learning, rather than relying on a single model. However, this yields severe scalability issues on high-dimensional, large datasets. How to accelerate the training and predicting with a large number of heterogeneous unsupervised OD models? How to ensure the acceleration does not deteriorate detection models’ accuracy? How to accommodate the acceleration need for both a single worker setting and a distributed system with multiple workers? In this study, we propose a three-module acceleration system called SUOD (scalable unsupervised outlier detection) to address these questions. It focuses on three complementary aspects to accelerate (dimensionality reduction for high-dimensional data, model approximation for complex models, and execution efficiency improvement for taskload imbalance within distributed systems), while controlling detection performance degradation. Extensive experiments on more than 20 benchmark datasets demonstrate SUOD's effectiveness in heterogeneous OD acceleration. By the submission time, the released open-source system has been widely used with more than 700,000 times downloads. A real-world deployment case on fraudulent claim analysis at IQVIA, a leading healthcare firm, is also provided.

Conference on Machine Learning and Systems (MLSys)
Yue Zhao

Machine Learning System and Information Systems Researcher.