We propose MetricSound, a scalable and interpretable ML-based monitoring framework for data center incident detection. It distills a representation of service health status and scales to high volumes of real-time system metrics, with complexity linear in the number of counter dimensions. Our key insight is to bridge the gap between non-parametric statistical methods and ML model-based methods. Specifically, the proposed approach uses unsupervised outlier detection algorithms to extract useful representations of time series and project them onto an ECOD anomaly space. It then stacks anomalies at different granularities and builds a supervised classifier with focal loss to handle imbalanced labels, using Bayesian optimization and recursive feature elimination based on Shapley values for more robust service failure detection. Furthermore, because the method establishes a one-to-one mapping from raw metrics to anomaly scores, once the model predicts a failure, the Shapley values are used to interpret the outcome and the correlation between metrics, and to pinpoint the low-level counters or resources that contribute to the incident. MetricSound has been tested on incidents from a commercial cloud provider’s data and achieves more than 92% precision, 89% accuracy, and a 90% F1 score on one month of metric data. Because it provides an interpretation of health status for each diagnosis, we conducted case studies (OS upgrade failures and high-availability database backup issues) and showed that it can identify the right set of root causes. Compared with existing solutions that target a specific system, it is faster, more interpretable, and more general. While challenges may vary across cloud providers and services, we foresee that this general model for machine health can lead to cost savings, reduced human effort, and a better customer experience.
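To make the one-to-one mapping from raw metric to anomaly score concrete, the following is a minimal sketch of an ECOD-style projection, not the MetricSound implementation. It scores each sample of a metric by the negative log of its empirical tail probability, independently per metric, so the cost grows linearly with the number of metrics. The function names `ecdf_tail_scores` and `anomaly_space` are hypothetical, and the sketch is a simplification: the published ECOD method additionally uses skewness to choose which tail to score and aggregates the per-dimension scores into a single outlier score.

```python
import math
from bisect import bisect_left, bisect_right


def ecdf_tail_scores(values):
    """Score each sample of one metric by its empirical tail probability.

    Simplified ECOD-style scoring (an assumption of this sketch): the score
    is the larger of -log P(X <= v) and -log P(X >= v), so samples in either
    tail of the empirical distribution receive high scores.
    """
    n = len(values)
    ordered = sorted(values)
    scores = []
    for v in values:
        left = bisect_right(ordered, v) / n          # empirical P(X <= v)
        right = (n - bisect_left(ordered, v)) / n    # empirical P(X >= v)
        # Both probabilities are at least 1/n, so the logarithms are finite.
        scores.append(max(-math.log(left), -math.log(right)))
    return scores


def anomaly_space(metrics):
    """Map {metric name: series} to {metric name: anomaly score series}.

    Each raw metric yields exactly one score column, which is what lets a
    feature attribution on the scores be read back as an attribution on the
    raw counters.
    """
    return {name: ecdf_tail_scores(series) for name, series in metrics.items()}
```

Because every score column keeps the name of its raw metric, a per-feature attribution such as a Shapley value computed by a downstream classifier can be traced directly to a specific counter.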