skip to main content
10.1145/3663529.3663833acmconferencesArticle/Chapter ViewAbstractPublication PagesfseConference Proceedingsconference-collections
research-article

Fault Diagnosis for Test Alarms in Microservices through Multi-source Data

Published: 10 July 2024 Publication History

Abstract

Nowadays, the testing of large-scale microservices could produce an enormous number of test alarms daily. Manually diagnosing these alarms is time-consuming and laborious for the testers. Automatic fault diagnosis with fault classification and localization can help testers efficiently handle the increasing volume of failed test cases. However, the current methods for diagnosing test alarms struggle to deal with the complex and frequently updated microservices. In this paper, we introduce SynthoDiag, a novel fault diagnosis framework for test alarms in microservices through multi-source logs (execution logs, trace logs, and test case information) organized with a knowledge graph. An Entity Fault Association and Position Value (EFA-PV) algorithm is proposed to localize the fault-indicative log entries. Additionally, an efficient block-based differentiation approach is used to filter out fault-irrelevant entries in the test cases, significantly improving the overall performance of fault diagnosis. At last, SynthoDiag is systematically evaluated with a large-scale real-world dataset from a top-tier global cloud service provider, Huawei Cloud, which provides services for more than three million users. The results show the Micro-F1 and Macro-F1 scores improvement of SynthoDiag over baseline methods in fault classification are 21% and 30%, respectively, and its top-5 accuracy of fault localization is 81.9%, significantly surpassing the previous methods.

References

[1]
Nuha Alshuqayran, Nour Ali, and Roger Evans. 2016. A systematic mapping study in microservice architecture. In 2016 IEEE 9th International Conference on Service-Oriented Computing and Applications (SOCA). 44–51.
[2]
Anunay Amar and Peter C Rigby. 2019. Mining historical test logs to predict bugs and localize faults in the test logs. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). 140–151.
[3]
Mihael Ankerst, Markus M Breunig, Hans-Peter Kriegel, and Jörg Sander. 1999. OPTICS: Ordering points to identify the clustering structure. ACM Sigmod record, 28, 2 (1999), 49–60.
[4]
Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung, and Darko Marinov. 2018. DeFlaker: Automatically detecting flaky tests. In Proceedings of the 40th international conference on software engineering. 433–444.
[5]
Zhichao Chen, Junjie Chen, Weijing Wang, Jianyi Zhou, Meng Wang, Xiang Chen, Shan Zhou, and Jianmin Wang. 2023. Exploring better black-Box test case prioritization via log analysis. ACM Transactions on Software Engineering and Methodology, 32, 3 (2023), 1–32.
[6]
Toby Clemson. 2014. Testing strategies in a microservice architecture. Cited, 23 (2014), 2017.
[7]
Yuanfei Dai, Shiping Wang, Neal N Xiong, and Wenzhong Guo. 2020. A survey on knowledge graph embedding: Approaches, applications and benchmarks. Electronics, 9, 5 (2020), 750.
[8]
Cheng Deng, Yuting Jia, Hui Xu, Chong Zhang, Jingyao Tang, Luoyi Fu, Weinan Zhang, Haisong Zhang, Xinbing Wang, and Chenghu Zhou. 2021. Gakg: A multimodal geoscience academic knowledge graph. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4445–4454.
[9]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[10]
Wojciech Dobrowolski, Maciej Nikodem, and Olgierd Unold. 2023. Software Failure Log Analysis for Engineers. Electronics, 12, 10 (2023), 2260.
[11]
Nicola Dragoni, Saverio Giallorenzo, Alberto Lluch Lafuente, Manuel Mazzara, Fabrizio Montesi, Ruslan Mustafin, and Larisa Safina. 2017. Microservices: yesterday, today, and tomorrow. Present and ulterior software engineering, 195–216.
[12]
Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel. 2008. A survey on automated dynamic malware-analysis techniques and tools. ACM computing surveys (CSUR), 44, 2 (2008), 1–42.
[13]
Dieter Fensel, Umutcan Şimşek, Kevin Angele, Elwin Huaman, Elias Kärle, Oleksandra Panasiuk, Ioan Toma, Jürgen Umbrich, Alexander Wahler, and Dieter Fensel. 2020. Introduction: what is a knowledge graph? Knowledge graphs: Methodology, tools and selected use cases, 1–10.
[14]
Luca Gazzola, Maayan Goldstein, Leonardo Mariani, Marco Mobilio, Itai Segall, Alessandro Tundo, and Luca Ussi. 2023. ExVivoMicroTest: ExVivo testing of microservices. Journal of Software: Evolution and Process, 35, 4 (2023), e2452.
[15]
Israr Ghani, Wan MN Wan-Kadir, Ahmad Mustafa, and Muhammad Imran Babir. 2019. Microservice testing approaches: A systematic literature review. International Journal of Integrated Engineering, 11, 8 (2019), 65–80.
[16]
Javad Ghofrani and Daniel Lübke. 2018. Challenges of Microservices Architecture: A Survey on the State of the Practice. ZEUS, 2018 (2018), 1–8.
[17]
Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R Lyu. 2017. Drain: An online log parsing approach with fixed depth tree. In 2017 IEEE international conference on web services (ICWS). 33–40.
[18]
Kim Herzig and Nachiappan Nagappan. 2015. Empirically detecting false test alarms using association rules. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering. 2, 39–48.
[19]
Kim Herzig and Nachiappan Nagappan. 2015. Empirically detecting false test alarms using association rules. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering. 2, 39–48.
[20]
Chuanjia Hou, Tong Jia, Yifan Wu, Ying Li, and Jing Han. 2021. Diagnosing performance issues in microservices with heterogeneous data source. In 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). 493–500.
[21]
Shaohan Huang, Yi Liu, Carol Fung, Rong He, Yining Zhao, Hailong Yang, and Zhongzhi Luan. 2020. Hitanomaly: Hierarchical transformers for anomaly detection in system log. IEEE transactions on network and service management, 17, 4 (2020), 2064–2076.
[22]
Xuqian Huang, Jiuyang Tang, Zhen Tan, Weixin Zeng, Ji Wang, and Xiang Zhao. 2021. Knowledge graph embedding by relational and entity rotation. Knowledge-Based Systems, 229 (2021), 107310.
[23]
He Jiang, Xiaochen Li, Zijiang Yang, and Jifeng Xuan. 2017. What causes my test alarm? Automatic cause analysis for test alarms in system and integration testing. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). 712–723.
[24]
Amar Viswanathan Kannan, Dmitriy Fradkin, Ioannis Akrotirianakis, Tugba Kulahcioglu, Arquimedes Canedo, Aditi Roy, Shih-Yuan Yu, Malawade Arnav, and Mohammad Abdullah Al Faruque. 2020. Multimodal knowledge graph for deep learning papers and code. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 3417–3420.
[25]
Kamran Khan, Saif Ur Rehman, Kamran Aziz, Simon Fong, and Sababady Sarasvady. 2014. DBSCAN: Past, present and future. In The fifth international conference on the applications of digital information and web technologies (ICADIWT 2014). 232–238.
[26]
Jinhan Kim, Valeriy Savchenko, Kihyuck Shin, Konstantin Sorokin, Hyunseok Jeon, Georgiy Pankratenko, Sergey Markov, and Chul-Joo Kim. 2020. Automatic abnormal log detection by analyzing log history for providing debugging insight. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice. 71–80.
[27]
Myunghwan Kim, Roshan Sumbaly, and Sam Shah. 2013. Root cause detection in a service-oriented architecture. ACM SIGMETRICS Performance Evaluation Review, 41, 1 (2013), 93–104.
[28]
Sander Klock, Jan Martijn EM Van Der Werf, Jan Pieter Guelen, and Slinger Jansen. 2017. Workload-based clustering of coherent feature sets in microservice architectures. In 2017 IEEE International Conference on Software Architecture (ICSA). 11–20.
[29]
Daniel Li. 2018. Building Enterprise JavaScript Applications: Learn to Build and Deploy Robust JavaScript Applications Using Cucumber, Mocha, Jenkins, Docker, and Kubernetes. Packt Publishing Ltd.
[30]
Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, and Xuewei Chen. 2016. Log clustering based problem identification for online service systems. In Proceedings of the 38th International Conference on Software Engineering Companion. 102–111.
[31]
Dewei Liu, Chuan He, Xin Peng, Fan Lin, Chenxi Zhang, Shengfang Gong, Ziang Li, Jiayu Ou, and Zheshun Wu. 2021. Microhecl: High-efficient root cause localization in large-scale microservice systems. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 338–347.
[32]
Siyang Lu, BingBing Rao, Xiang Wei, Byungchul Tak, Long Wang, and Liqiang Wang. 2017. Log-based abnormal task detection and root cause analysis for spark. In 2017 IEEE International Conference on Web Services (ICWS). 389–396.
[33]
Meng Ma, Jingmin Xu, Yuan Wang, Pengfei Chen, Zonghua Zhang, and Ping Wang. 2020. Automap: Diagnose your microservice-based web applications automatically. In Proceedings of The Web Conference 2020. 246–258.
[34]
Shang-Pin Ma, Chen-Yuan Fan, Yen Chuang, Wen-Tin Lee, Shin-Jie Lee, and Nien-Lin Hsueh. 2018. Using service dependency graph to analyze and test microservices. In 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC). 2, 81–86.
[35]
Weibin Meng, Ying Liu, Yichen Zhu, Shenglin Zhang, Dan Pei, Yuqing Liu, Yihao Chen, Ruizhi Zhang, Shimin Tao, and Pei Sun. 2019. Loganomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs. In IJCAI. 19, 4739–4745.
[36]
Fionn Murtagh and Pedro Contreras. 2012. Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2, 1 (2012), 86–97.
[37]
Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. 2017. graph2vec: Learning distributed representations of graphs. arXiv preprint arXiv:1707.05005.
[38]
Suman Ravuri and Oriol Vinyals. 2019. Classification accuracy score for conditional generative models. Advances in neural information processing systems, 32 (2019).
[39]
Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.
[40]
Aindrila Sarkar, Peter C Rigby, and Béla Bartalos. 2019. Improving bug triaging with high confidence predictions at ericsson. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). 81–91.
[41]
Yicheng Sui, Yuzhe Zhang, Jianjun Sun, Ting Xu, Shenglin Zhang, Zhengdan Li, Yongqian Sun, Fangrui Guo, Junyu Shen, and Yuzhi Zhang. 2023. LogKG: Log Failure Diagnosis through Knowledge Graph. IEEE Transactions on Services Computing.
[42]
Lu Wang, Yu Xuan Jiang, Zhan Wang, Qi En Huo, Jie Dai, Sheng Long Xie, Rui Li, Ming Tao Feng, Yue Shen Xu, and Zhi Ping Jiang. 2022. The operation and maintenance governance of microservices architecture systems: A systematic literature review. Journal of Software: Evolution and Process, e2433.
[43]
Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29, 12 (2017), 2724–2743.
[44]
Tao Wang, Wenbo Zhang, Jiwei Xu, and Zeyu Gu. 2020. Workflow-aware automatic fault diagnosis for microservice-based applications with statistics. IEEE Transactions on Network and Service Management, 17, 4 (2020), 2350–2363.
[45]
Muhammad Waseem, Peng Liang, Mojtaba Shahin, Amleto Di Salle, and Gastón Márquez. 2021. Design, monitoring, and testing of microservices systems: The practitioners’ perspective. Journal of Systems and Software, 182 (2021), 111061.
[46]
Eberhard Wolff. 2016. Microservices: flexible software architecture. Addison-Wesley Professional.
[47]
Zhe Xie, Haowen Xu, Wenxiao Chen, Wanxue Li, Huai Jiang, Liangfei Su, Hanzhang Wang, and Dan Pei. 2023. Unsupervised Anomaly Detection on Microservice Traces through Graph VAE. In Proceedings of the ACM Web Conference 2023. 2874–2884.
[48]
Guangba Yu, Pengfei Chen, Hongyang Chen, Zijie Guan, Zicheng Huang, Linxiao Jing, Tianjun Weng, Xinmeng Sun, and Xiaoyun Li. 2021. Microrank: End-to-end latency issue localization with extended spectrum analysis in microservice environments. In Proceedings of the Web Conference 2021. 3087–3098.
[49]
Yue Yuan, Wenchang Shi, Bin Liang, and Bo Qin. 2019. An approach to cloud execution failure diagnosis based on exception logs in openstack. In 2019 IEEE 12th International Conference on Cloud Computing (CLOUD). 124–131.
[50]
Chenxi Zhang, Xin Peng, Chaofeng Sha, Ke Zhang, Zhenqing Fu, Xiya Wu, Qingwei Lin, and Dongmei Zhang. 2022. DeepTraLog: Trace-log combined microservice anomaly detection through graph-based deep learning. In Proceedings of the 44th International Conference on Software Engineering. 623–634.
[51]
Shenglin Zhang, Pengxiang Jin, Zihan Lin, Yongqian Sun, Bicheng Zhang, Sibo Xia, Zhengdan Li, Zhenyu Zhong, Minghua Ma, and Wa Jin. 2023. Robust Failure Diagnosis of Microservice System through Multimodal Data. arXiv preprint arXiv:2302.10512.
[52]
Xu Zhang, Yong Xu, Si Qin, Shilin He, Bo Qiao, Ze Li, Hongyu Zhang, Xukun Li, Yingnong Dang, and Qingwei Lin. 2021. Onion: identifying incident-indicating logs for cloud systems. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1253–1263.
[53]
Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chao Ji, Dewei Liu, Qilin Xiang, and Chuan He. 2019. Latent error prediction and fault localization for microservice applications by learning from system trace logs. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 683–694.

Cited By

View all
  • (2024)SparseRCA: Unsupervised Root Cause Analysis in Sparse Microservice Testing Traces2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE)10.1109/ISSRE62328.2024.00045(391-402)Online publication date: 28-Oct-2024

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
FSE 2024: Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering
July 2024
715 pages
ISBN:9798400706585
DOI:10.1145/3663529
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 July 2024

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Execution Logs
  2. Fault Diagnosis
  3. Microservice
  4. Test Case
  5. Trace Logs

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China

Conference

FSE '24
Sponsor:

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)174
  • Downloads (Last 6 weeks)21
Reflects downloads up to 20 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)SparseRCA: Unsupervised Root Cause Analysis in Sparse Microservice Testing Traces2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE)10.1109/ISSRE62328.2024.00045(391-402)Online publication date: 28-Oct-2024

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media