Incremental Causal Graph Learning for Online Root Cause Analysis

Authors:

Haifeng Chen

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 2269 - 2278

https://doi.org/10.1145/3580305.3599392

Published: 04 August 2023

Abstract

The task of root cause analysis (RCA) is to identify the root causes of system faults or failures by analyzing system monitoring data. Efficient RCA can greatly accelerate system failure recovery and mitigate system damage and financial losses. However, previous research has mostly focused on offline RCA algorithms, which often require the RCA process to be initiated manually, need a significant amount of time and data to train a robust model, and must be retrained from scratch for each new system fault.

In this paper, we propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model. CORAL consists of three components: Trigger Point Detection, Incremental Disentangled Causal Graph Learning, and Network Propagation-based Root Cause Localization. The Trigger Point Detection component detects system state transitions automatically and in near real time, using an online approach based on multivariate singular spectrum analysis and cumulative sum statistics. To update the RCA model efficiently, we propose an incremental disentangled causal graph learning approach that decouples state-invariant information from state-dependent information. CORAL then applies a random walk with restarts to the updated causal graph to identify root causes. The online RCA process terminates when the causal graph and the generated root cause list converge. Extensive experiments on three real-world datasets demonstrate the effectiveness and superiority of the proposed framework.
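To make the root cause localization step more concrete, the following is a minimal sketch of a generic random walk with restart over a weighted causal graph. It is not the authors' implementation. The function names, the example graph, the restart probability of 0.3, the choice to walk against the causal edge direction (from the symptom node toward its causes), and the self-loop on nodes without causes are all assumptions made for illustration.

```python
import numpy as np


def random_walk_with_restart(weights, start, restart_prob=0.3,
                             tol=1e-12, max_iter=10_000):
    """Stationary visiting probabilities of a walker that restarts at `start`.

    Assumption: weights[i, j] > 0 means "node i causes node j". The walker
    moves from an effect j back to a cause i with probability proportional
    to weights[i, j].
    """
    w = np.asarray(weights, dtype=float)
    n = w.shape[0]

    # Build a column-stochastic transition matrix: column j holds the
    # probabilities of stepping from node j to each of its causes.
    transition = w.copy()
    col_sums = transition.sum(axis=0)
    for j in np.flatnonzero(col_sums == 0):
        # Assumption: a node with no causes keeps the walker (self-loop).
        transition[j, j] = 1.0
    transition /= transition.sum(axis=0, keepdims=True)

    restart = np.zeros(n)
    restart[start] = 1.0

    p = restart.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart_prob) * (transition @ p) + restart_prob * restart
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


def rank_root_causes(weights, kpi, restart_prob=0.3):
    """Rank candidate nodes (all nodes except `kpi`) by visiting probability."""
    scores = random_walk_with_restart(weights, kpi, restart_prob)
    order = np.argsort(-scores, kind="stable")
    return [int(i) for i in order if i != kpi]


# Hypothetical 4-node graph: 0 -> 1 -> 3 (strong path) and 2 -> 3 (weak edge).
# Node 3 plays the role of the monitored symptom (KPI).
example_weights = np.zeros((4, 4))
example_weights[0, 1] = 1.0
example_weights[1, 3] = 0.8
example_weights[2, 3] = 0.2

example_ranking = rank_root_causes(example_weights, kpi=3)
```

In this toy graph the walker starts at the symptom node and drifts toward upstream causes, so node 0, which sits at the head of the strongest causal path, receives the highest score. The restart probability controls how far from the symptom the walker tends to travel before returning.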

Supplementary Material

MP4 File (rtfp0919-2min-promo.mp4)

Step into the future of system failure recovery with Dongjie Wang in this video presenting 'Incremental Causal Graph Learning for Online Root Cause Analysis'. Discover the innovative CORAL framework that revolutionizes root cause analysis, making it automatic and real-time while continuously learning from new data. Dive into how this method has demonstrated its effectiveness in real-world applications and what it means for the future of system failure recovery.


