DOI: 10.1145/3580305.3599392
research-article
Public Access

Incremental Causal Graph Learning for Online Root Cause Analysis

Published: 04 August 2023

Abstract

The task of root cause analysis (RCA) is to identify the root causes of system faults/failures by analyzing system monitoring data. Efficient RCA can greatly accelerate system failure recovery and mitigate system damages or financial losses. However, previous research has mostly focused on developing offline RCA algorithms, which often require the RCA process to be initiated manually, need a significant amount of time and data to train a robust model, and must be retrained from scratch for each new system fault.
In this paper, we propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model. CORAL consists of Trigger Point Detection, Incremental Disentangled Causal Graph Learning, and Network Propagation-based Root Cause Localization. The Trigger Point Detection component aims to detect system state transitions automatically and in near-real-time. To achieve this, we develop an online trigger point detection approach based on multivariate singular spectrum analysis and cumulative sum statistics. To efficiently update the RCA model, we propose an incremental disentangled causal graph learning approach to decouple the state-invariant and state-dependent information. After that, CORAL applies a random walk with restarts to the updated causal graph to accurately identify root causes. The online RCA process terminates when the causal graph and the generated root cause list converge. Extensive experiments on three real-world datasets demonstrate the effectiveness and superiority of the proposed framework.
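
The root cause localization step described above can be illustrated with a minimal sketch of a random walk with restart. This is not CORAL's implementation: the toy graph, the edge weights, the node names, the restart probability of 0.3, and the helper names `random_walk_with_restart` and `rank_candidates` are all hypothetical. The sketch assumes the walker starts at the node where the fault is observed and moves against the causal direction (from effect to cause), so that nodes it visits often are treated as more likely root causes.

```python
# Hypothetical causal graph: (cause, effect) -> non-negative edge weight.
EXAMPLE_EDGES = {
    ("disk", "db"): 0.9,
    ("db", "api"): 0.8,
    ("cache", "api"): 0.2,
    ("api", "latency_kpi"): 1.0,
}


def random_walk_with_restart(edges, start, restart_prob=0.3,
                             tol=1e-9, max_iter=1000):
    """Return the visiting probability of every node for a walker
    that starts at `start` and walks from effects back to causes."""
    nodes = sorted({n for edge in edges for n in edge} | {start})

    # Reverse the edges: from each effect, the walker may step to its causes.
    back = {n: {} for n in nodes}
    for (cause, effect), weight in edges.items():
        if weight > 0:
            back[effect][cause] = back[effect].get(cause, 0.0) + weight

    prob = {n: 1.0 if n == start else 0.0 for n in nodes}
    for _ in range(max_iter):
        nxt = {n: 0.0 for n in nodes}
        for node, p in prob.items():
            if p == 0.0:
                continue
            # With probability restart_prob the walker jumps back to start.
            nxt[start] += restart_prob * p
            move = (1.0 - restart_prob) * p
            total = sum(back[node].values())
            if total == 0.0:
                # No known cause to step to: return to the start node.
                nxt[start] += move
            else:
                for cause, weight in back[node].items():
                    nxt[cause] += move * weight / total
        delta = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if delta < tol:
            break
    return prob


def rank_candidates(prob, exclude=()):
    """Sort candidate nodes by descending visiting probability."""
    candidates = [n for n in prob if n not in exclude]
    return sorted(candidates, key=lambda n: prob[n], reverse=True)


if __name__ == "__main__":
    scores = random_walk_with_restart(EXAMPLE_EDGES, "latency_kpi")
    for name in rank_candidates(scores, exclude={"latency_kpi"}):
        print(f"{name}: {scores[name]:.4f}")
```

In an online setting such as the one the abstract describes, a ranking like this would be recomputed each time the causal graph is updated, and the process would stop once both the graph and the ranked list stop changing. Note that a plain random walk with restart favors nodes close to the start node, so the edge weights of the learned graph largely determine how useful the ranking is.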

Supplementary Material

MP4 File (rtfp0919-2min-promo.mp4)
Step into the future of system failure recovery with Dongjie Wang in this video presenting 'Incremental Causal Graph Learning for Online Root Cause Analysis'. Discover the innovative CORAL framework that revolutionizes root cause analysis, making it automatic and real-time while continuously learning from new data. Dive into how this method has demonstrated its effectiveness in real-world applications and what it means for the future of system failure recovery.

Cited By

  • (2025) A Survey of Change Point Detection in Dynamic Graphs. IEEE Transactions on Knowledge and Data Engineering 37:3 (1030-1048). DOI: 10.1109/TKDE.2024.3523857. Online publication date: Mar-2025
  • (2024) Causal Discovery from Temporal Data: An Overview and New Perspectives. ACM Computing Surveys 57:4 (1-38). DOI: 10.1145/3705297. Online publication date: 23-Nov-2024
  • (2024) Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (126-137). DOI: 10.1145/3663529.3663834. Online publication date: 10-Jul-2024

Information & Contributors

Information

Published In

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2023
5996 pages
ISBN:9798400701030
DOI:10.1145/3580305
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. causal structure learning
  2. disentangled graph learning
  3. incremental learning
  4. root cause analysis
  5. trigger point detection

Qualifiers

  • Research-article

Conference

KDD '23

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 1,022
  • Downloads (Last 6 weeks): 118
Reflects downloads up to 15 Feb 2025

Citations

Cited By

  • (2024) POND: Multi-Source Time Series Domain Adaptation with Information-Aware Prompt Tuning. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (3140-3151). DOI: 10.1145/3637528.3671721. Online publication date: 25-Aug-2024
  • (2024) MARLP: Time-series Forecasting Control for Agricultural Managed Aquifer Recharge. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (4862-4872). DOI: 10.1145/3637528.3671533. Online publication date: 25-Aug-2024
  • (2024) Multi-view Causal Graph Fusion Based Anomaly Detection in Cyber-Physical Infrastructures. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (4760-4767). DOI: 10.1145/3627673.3680096. Online publication date: 21-Oct-2024
  • (2024) RealTCD: Temporal Causal Discovery from Interventional Data with Large Language Model. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (4669-4677). DOI: 10.1145/3627673.3680042. Online publication date: 21-Oct-2024
  • (2024) A Knowledge-Enhanced Transformer-FL Method for Fault Root Cause Localization. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (1607-1616). DOI: 10.1145/3627673.3679816. Online publication date: 21-Oct-2024
  • (2024) MULAN: Multi-modal Causal Structure Learning and Root Cause Analysis for Microservice Systems. Proceedings of the ACM Web Conference 2024 (4107-4116). DOI: 10.1145/3589334.3645442. Online publication date: 13-May-2024
  • (2024) KGroot: A knowledge graph-enhanced method for root cause analysis. Expert Systems with Applications 255 (124679). DOI: 10.1016/j.eswa.2024.124679. Online publication date: Dec-2024
