ABSTRACT
The complexity and distributed nature of microservices necessitate continuous monitoring and tracing of their execution to identify performance problems and their underlying root causes. However, the large volume of collected data and the complexity of distributed communication make it difficult to identify and locate abnormal services. In this paper, we propose a novel approach that accounts for the role of execution contexts in propagating and localizing performance root causes. We achieve this by integrating social network analysis techniques with spectrum analysis. We evaluated the approach in an experiment on a real-world benchmark and observed promising preliminary results: the true root cause was ranked first (top-1) in 91.3% of cases and appeared among the top three candidates (top-3) in 100% of cases.
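The top-1 and top-3 figures above are top-k success rates: the fraction of fault cases in which the true root cause appears among the first k ranked candidates. The following is a minimal sketch of that metric, not the authors' evaluation code. The function name `top_k_success_rate`, the service names, and the toy rankings are all hypothetical and chosen only for illustration; they do not reproduce the reported results.

```python
from typing import Sequence


def top_k_success_rate(
    rankings: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int,
) -> float:
    """Return the fraction of cases whose true root cause is in the top k.

    rankings[i] is the ranked candidate list for fault case i (best first);
    truths[i] is the known root-cause service for that case.
    """
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if not rankings:
        raise ValueError("at least one fault case is required")
    hits = sum(truth in ranked[:k] for ranked, truth in zip(rankings, truths))
    return hits / len(rankings)


# Hypothetical toy data: four fault cases with made-up service names.
rankings = [
    ["cart", "payment", "db"],
    ["payment", "cart", "db"],
    ["db", "auth", "cart"],
    ["auth", "db", "payment"],
]
truths = ["cart", "cart", "cart", "auth"]

top1 = top_k_success_rate(rankings, truths, k=1)
top3 = top_k_success_rate(rankings, truths, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

A top-3 rate of 100% therefore means that every fault case had its true root cause somewhere in the first three candidates, while the top-1 rate counts only the cases where it was ranked first.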
Index Terms
- Context-aware Root Cause Localization in Distributed Traces Using Social Network Analysis (Work In Progress paper)