DOI: 10.1145/3580305.3599849
Research article

Interdependent Causal Networks for Root Cause Localization

Published: 04 August 2023

ABSTRACT

The goal of root cause analysis is to identify the underlying causes of system problems by discovering and analyzing the causal structure from system monitoring data. It is indispensable for maintaining the stability and robustness of large-scale complex systems. Existing methods mainly focus on the construction of a single effective isolated causal network, whereas many real-world systems are complex and exhibit interdependent structures (i.e., multiple networks of a system are interconnected by cross-network links). In interdependent networks, the malfunctioning effects of problematic system entities can propagate to other networks or different levels of system entities. Consequently, ignoring the interdependency results in suboptimal root cause analysis outcomes.

In this paper, we propose REASON, a novel framework that enables the automatic discovery of both intra-level (i.e., within-network) and inter-level (i.e., across-network) causal relationships for root cause localization. REASON consists of Topological Causal Discovery (TCD) and Individual Causal Discovery (ICD). The TCD component models fault propagation in order to trace back to the root causes. To achieve this, we propose novel hierarchical graph neural networks that construct interdependent causal networks by modeling both intra-level and inter-level non-linear causal relations. Based on the learned interdependent causal networks, we then leverage random walk with restarts to model the network propagation of a system fault. The ICD component focuses on capturing abrupt change patterns of a single system entity. It examines the temporal patterns of each entity's metric data (i.e., time series) and estimates the entity's likelihood of being a root cause based on extreme value theory. The topological and individual causal scores are then combined, and the top K system entities are identified as root causes. Extensive experiments on three real-world datasets validate the effectiveness of the proposed framework.
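The scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the transition matrix, the restart probability, or how the two scores are combined, so the reversed-edge walk, the `restart_prob` value, the linear `weight` mix, the toy graph, and the individual scores below are all assumptions made for illustration. In REASON the causal graph would come from the hierarchical graph neural networks and the individual scores from the extreme-value analysis; here both are hard-coded.

```python
import numpy as np


def random_walk_with_restart(adj, start, restart_prob=0.3, tol=1e-10, max_iter=1000):
    """Visiting probabilities of a walker that keeps restarting at `start`.

    adj[i, j] > 0 means the walker may step from node i to node j.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    restart = np.zeros(n)
    restart[start] = 1.0

    # Row-normalise; a node with no outgoing edge sends the walker back to start.
    row_sums = adj.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    trans = np.where(row_sums > 0, adj / safe_sums, restart)

    p = restart.copy()
    for _ in range(max_iter):
        p_next = (1 - restart_prob) * (trans.T @ p) + restart_prob * restart
        delta = np.abs(p_next - p).sum()
        p = p_next
        if delta < tol:
            break
    return p


def top_k_root_causes(topological, individual, k, weight=0.5, exclude=()):
    """Rank entities by a weighted sum of the two scores (assumed combination)."""
    combined = weight * np.asarray(topological, dtype=float) \
        + (1 - weight) * np.asarray(individual, dtype=float)
    for i in exclude:
        combined[i] = -np.inf  # e.g. the symptom node itself
    order = np.argsort(-combined)
    return [int(i) for i in order[:k]]


# Hypothetical causal graph, causal[i, j] = 1 meaning "i causes j".
# Node 0 is the observed symptom; nodes 1-3 are candidate entities.
causal = np.array([
    [0, 0, 0, 0],
    [1, 0, 0, 0],  # 1 -> 0
    [1, 0, 0, 0],  # 2 -> 0
    [0, 1, 0, 0],  # 3 -> 1
], dtype=float)

# Hypothetical individual scores standing in for the ICD output.
individual = np.array([0.0, 0.2, 0.1, 0.9])

# Walk against the causal direction, from the symptom back toward its causes.
topological = random_walk_with_restart(causal.T, start=0)
print(top_k_root_causes(topological, individual, k=2, exclude=[0]))
```

Walking on the transposed matrix is one simple way to "trace back" from a symptom; the restart term keeps the probability mass concentrated near the symptom, so distant entities score high only when many backward paths lead to them.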

      • Published in

        KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
        August 2023
        5996 pages
        ISBN: 9798400701030
        DOI: 10.1145/3580305

        Copyright © 2023 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
