ABSTRACT
Root cause analysis aims to identify the underlying causes of system problems by discovering and analyzing the causal structure in system monitoring data. It is indispensable for maintaining the stability and robustness of large-scale complex systems. Existing methods mainly focus on constructing a single, isolated causal network, whereas many real-world systems are complex and exhibit interdependent structures (i.e., multiple networks of a system are interconnected by cross-network links). In interdependent networks, the effects of a malfunctioning system entity can propagate to other networks or to system entities at different levels. Consequently, ignoring the interdependency leads to suboptimal root cause analysis results.
In this paper, we propose REASON, a novel framework that enables the automatic discovery of both intra-level (i.e., within-network) and inter-level (i.e., across-network) causal relationships for root cause localization. REASON consists of two components: Topological Causal Discovery (TCD) and Individual Causal Discovery (ICD). The TCD component models fault propagation in order to trace a fault back to its root causes. To achieve this, we propose novel hierarchical graph neural networks that construct interdependent causal networks by modeling both intra-level and inter-level non-linear causal relations. Based on the learned interdependent causal networks, we then leverage random walk with restarts to model the network propagation of a system fault. The ICD component focuses on capturing abrupt change patterns of a single system entity. It examines the temporal patterns of each entity's metric data (i.e., time series) and estimates the entity's likelihood of being a root cause based on extreme value theory. By combining the topological and individual causal scores, the top K system entities are identified as root causes. Extensive experiments on three real-world datasets validate the effectiveness of the proposed framework.
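The scoring idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a causal graph has already been learned, and it takes the individual (ICD-style) scores as given numbers rather than deriving them from extreme value theory. The function names, the toy graph, the restart probability, and the mixing weight `gamma` are all hypothetical choices made for illustration.

```python
def random_walk_with_restart(causes, start, restart=0.3, iters=200, tol=1e-10):
    """Visiting probabilities of a walker that keeps restarting at `start`.

    causes[i][j] > 0 means "entity j causes entity i", so a walker standing
    on i steps toward the candidate causes of i (tracing the fault backward).
    """
    n = len(causes)
    # Row-normalize weights into transition probabilities.
    # An entity with no known cause sends the walker back to `start`.
    trans = []
    for row in causes:
        total = sum(row)
        if total > 0:
            trans.append([w / total for w in row])
        else:
            trans.append([1.0 if j == start else 0.0 for j in range(n)])

    p = [1.0 if i == start else 0.0 for i in range(n)]
    for _ in range(iters):
        nxt = [0.0] * n
        for i in range(n):
            for j in range(n):
                nxt[j] += (1.0 - restart) * p[i] * trans[i][j]
        nxt[start] += restart  # restart mass returns to the symptom node
        converged = max(abs(a - b) for a, b in zip(nxt, p)) < tol
        p = nxt
        if converged:
            break
    return p


def top_k_root_causes(topological, individual, k, gamma=0.5, exclude=()):
    """Rank entities by a weighted sum of the two scores (an assumed rule)."""
    combined = [gamma * t + (1.0 - gamma) * s
                for t, s in zip(topological, individual)]
    candidates = [i for i in range(len(combined)) if i not in exclude]
    candidates.sort(key=lambda i: combined[i], reverse=True)
    return candidates[:k]


# Toy graph (hypothetical): entity 0 is the observed symptom.
# Entity 1 strongly causes 0, entity 3 weakly causes 0, entity 2 causes 1.
causes = [
    [0.0, 0.8, 0.0, 0.2],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
scores = random_walk_with_restart(causes, start=0)

# Made-up individual scores standing in for the output of an anomaly scorer.
individual = [0.0, 0.1, 0.9, 0.2]
ranking = top_k_root_causes(scores, individual, k=2, exclude=(0,))
print(ranking)
```

In this toy setting the walk gives more mass to entities that are closer to the symptom along strong causal links, while the individual score can promote an entity further upstream whose own metrics changed abruptly. How the two scores are actually weighted in REASON is not stated in the abstract, so the linear mix here is only one plausible choice.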
Index Terms
- Interdependent Causal Networks for Root Cause Localization