Interdependent Causal Networks for Root Cause Localization

Authors:
Dongjie Wang, University of Central Florida, Orlando, FL, USA (ORCID 0000-0003-3948-0059)
Zhengzhang Chen, NEC Laboratories America Inc, Princeton, NJ, USA (ORCID 0000-0002-6803-0535)
Jingchao Ni, AWS AI Labs, Amazon, Seattle, WA, USA (ORCID 0000-0002-2986-6612)
Liang Tong, NEC Laboratories America Inc, Princeton, NJ, USA (ORCID 0000-0003-3971-6949)
Zheng Wang, University of Utah, Salt Lake City, UT, USA (ORCID 0000-0002-5596-9375)
Yanjie Fu, University of Central Florida, Orlando, FL, USA (ORCID 0000-0002-1767-8024)
Haifeng Chen, NEC Laboratories America Inc, Princeton, NJ, USA (ORCID 0000-0002-9363-738X)

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 2023, Pages 5051–5060. https://doi.org/10.1145/3580305.3599849

Published: 04 August 2023

ABSTRACT

The goal of root cause analysis is to identify the underlying causes of system problems by discovering and analyzing the causal structure from system monitoring data. It is indispensable for maintaining the stability and robustness of large-scale complex systems. Existing methods mainly focus on the construction of a single effective isolated causal network, whereas many real-world systems are complex and exhibit interdependent structures (i.e., multiple networks of a system are interconnected by cross-network links). In interdependent networks, the malfunctioning effects of problematic system entities can propagate to other networks or different levels of system entities. Consequently, ignoring the interdependency results in suboptimal root cause analysis outcomes.

In this paper, we propose REASON, a novel framework that enables the automatic discovery of both intra-level (i.e., within-network) and inter-level (i.e., across-network) causal relationships for root cause localization. REASON consists of Topological Causal Discovery (TCD) and Individual Causal Discovery (ICD). The TCD component aims to model fault propagation in order to trace back to the root causes. To achieve this, we propose novel hierarchical graph neural networks to construct interdependent causal networks by modeling both intra-level and inter-level non-linear causal relations. Based on the learned interdependent causal networks, we then leverage random walk with restarts to model the network propagation of a system fault. The ICD component focuses on capturing abrupt change patterns of a single system entity. This component examines the temporal patterns of each entity's metric data (i.e., time series) and estimates its likelihood of being a root cause based on extreme value theory. Combining the topological and individual causal scores, the top K system entities are identified as root causes. Extensive experiments on three real-world datasets validate the effectiveness of the proposed framework.
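To make the propagation and ranking steps concrete, here is a minimal sketch of a random walk with restart on a toy graph, followed by a weighted combination of topological and individual scores. It is not the authors' implementation. The graph, the edge weights, the individual scores, the restart probability, the mixing weight `gamma`, and the function names `random_walk_with_restart` and `top_k_root_causes` are all hypothetical choices for illustration. The sketch also assumes that the walker starts at an anomalous KPI node and moves along reversed causal edges (from effect toward candidate causes), and that a node with no outgoing edge sends the walker back to the start.

```python
def random_walk_with_restart(weights, start, restart_prob=0.3,
                             tol=1e-12, max_iter=10_000):
    """Return stationary visiting probabilities of a restarting random walker.

    weights[i][j] > 0 means the walker may step from node i to node j.
    """
    n = len(weights)
    # Row-normalise the weights into transition probabilities. A node with
    # no outgoing edge sends the walker back to the start node (assumption).
    trans = []
    for row in weights:
        total = sum(row)
        if total > 0:
            trans.append([w / total for w in row])
        else:
            trans.append([1.0 if j == start else 0.0 for j in range(n)])

    probs = [1.0 if i == start else 0.0 for i in range(n)]
    for _ in range(max_iter):
        # With probability restart_prob the walker jumps back to the start.
        nxt = [restart_prob if j == start else 0.0 for j in range(n)]
        for i in range(n):
            for j in range(n):
                nxt[j] += (1.0 - restart_prob) * probs[i] * trans[i][j]
        change = sum(abs(a - b) for a, b in zip(nxt, probs))
        probs = nxt
        if change < tol:
            break
    return probs


def top_k_root_causes(topological, individual, k, gamma=0.5):
    """Rank candidates by a weighted sum of the two scores (hypothetical rule)."""
    combined = {
        name: gamma * topological[name] + (1.0 - gamma) * individual[name]
        for name in individual
    }
    return sorted(combined, key=combined.get, reverse=True)[:k]


# Toy interdependent system: one server hosts pod_a; both pods affect the KPI.
# Rows give walk directions, i.e. reversed causal edges (hypothetical weights).
nodes = ["kpi", "pod_a", "pod_b", "server"]
walk_weights = [
    [0.0, 0.8, 0.2, 0.0],  # kpi    -> pod_a, pod_b
    [0.0, 0.0, 0.0, 1.0],  # pod_a  -> server (cross-level link)
    [0.0, 0.0, 0.0, 0.0],  # pod_b  has no known cause
    [0.0, 0.0, 0.0, 0.0],  # server has no known cause
]

visit = random_walk_with_restart(walk_weights, start=0)
topological = dict(zip(nodes, visit))

# Placeholder individual scores; in the paper these come from the ICD component.
individual = {"pod_a": 0.2, "pod_b": 0.1, "server": 0.9}

ranking = top_k_root_causes(topological, individual, k=2)
print(ranking)
```

In this toy setting the server ranks first even though the walker visits `pod_a` more often, because its placeholder individual score is high. This illustrates why the two kinds of evidence are combined rather than used separately.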

Supplemental Material

apfp246-2min-promo.mp4 (MP4, 5.2 MB)

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
      1. Causal reasoning and diagnostics
  2. Machine learning
    1. Learning paradigms
      1. Unsupervised learning

Published in
KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2023
5996 pages
ISBN: 9798400701030
DOI: 10.1145/3580305
General Chairs: Ambuj Singh (UC Santa Barbara, USA), Yizhou Sun (UC Los Angeles, USA)
Program Chairs: Leman Akoglu (Carnegie Mellon University, USA), Dimitrios Gunopulos (University of Athens, Greece), Xifeng Yan (UC Santa Barbara, USA), Ravi Kumar (Google, USA), Fatma Ozcan (Google, USA), Jieping Ye (Alibaba DAMO Academy)
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
causal structure learning
graph neural networks
interdependent networks
network propagation
root cause analysis
Qualifiers
- research-article
Acceptance Rates
Overall acceptance rate: 1,133 of 8,635 submissions (13%)