DOI: 10.1145/3442381.3449905
research-article

MicroRank: End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments

Published: 03 June 2021

Abstract

With its advantages of flexible scalability and fast delivery, the microservice architecture has become popular in the modern IT industry. However, the explosion in the number of service instances and their complex dependencies makes troubleshooting extremely challenging in microservice environments. To help understand and troubleshoot a microservice system, end-to-end tracing technology has been widely applied to capture the execution path of each request. Nevertheless, cloud and application providers do not fully leverage the tracing data when localizing latency issues in the microservice environment.
This paper proposes a novel system, named MicroRank, which analyzes clues provided by normal and abnormal traces to locate the root causes of latency issues. Once the Anomaly Detector in MicroRank detects a latency issue, the cause localization procedure is triggered. MicroRank first distinguishes which traces are abnormal. Then, MicroRank’s PageRank Scorer module takes the abnormal and normal trace information as input and differentiates the importance of different traces for the extended spectrum techniques. Finally, the spectrum techniques calculate a ranking list from the weighted spectrum information produced by the PageRank Scorer to locate root causes more effectively. Experimental evaluations on a widely used open-source system and a production system show that MicroRank achieves excellent results not only when there is a single root cause but also when two issues occur at the same time. Moreover, MicroRank improves recall in localizing root causes by 6% to 22% compared to current state-of-the-art methods.
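To make the idea of a weighted spectrum concrete, the following is a minimal sketch and not MicroRank's actual implementation. It assumes each trace is reduced to the set of operations it covers, an abnormal/normal label, and a per-trace weight standing in for the output of the PageRank Scorer. The Ochiai formula is used only as one well-known spectrum formula; the abstract does not say which formulas MicroRank uses. The names `weighted_ochiai`, `traces`, and the service names are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def weighted_ochiai(traces):
    """Rank operations by a weighted Ochiai suspiciousness score.

    traces: iterable of (operations, is_abnormal, weight), where
    `weight` is a placeholder for a PageRank-derived trace importance.
    """
    ef = defaultdict(float)  # weighted abnormal traces covering each operation
    ep = defaultdict(float)  # weighted normal traces covering each operation
    total_abnormal = 0.0     # weighted count of all abnormal traces

    for operations, is_abnormal, weight in traces:
        if is_abnormal:
            total_abnormal += weight
        for op in operations:
            if is_abnormal:
                ef[op] += weight
            else:
                ep[op] += weight

    scores = {}
    for op in set(ef) | set(ep):
        # Ochiai: ef / sqrt((ef + nf) * (ef + ep)), with ef + nf = total_abnormal
        denom = sqrt(total_abnormal * (ef[op] + ep[op]))
        scores[op] = ef[op] / denom if denom > 0 else 0.0

    # Most suspicious operation first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical traces with uniform weights (illustrative data only).
traces = [
    ({"frontend", "cart", "db"}, True, 1.0),
    ({"frontend", "cart"}, False, 1.0),
    ({"frontend", "db"}, True, 1.0),
    ({"frontend", "search"}, False, 1.0),
]
ranking = weighted_ochiai(traces)
for op, score in ranking:
    print(f"{op}: {score:.3f}")
```

In this toy input, `db` appears only in abnormal traces and so ranks first. Replacing the uniform weights with importance scores lets some traces count more than others, which is the role the abstract assigns to the PageRank Scorer.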

Published In

WWW '21: Proceedings of the Web Conference 2021
April 2021
4054 pages
ISBN: 9781450383127
DOI: 10.1145/3442381
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Microservice
  2. PageRank
  3. end-to-end tracing
  4. root cause localization
  5. spectrum analysis

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '21: The Web Conference 2021
April 19 - 23, 2021
Ljubljana, Slovenia

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 330
  • Downloads (Last 6 weeks): 47
Reflects downloads up to 27 Feb 2025

Cited By

  • (2025) Mint: Cost-Efficient Tracing with All Requests Collection via Commonality and Variability Analysis. Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 683-697. DOI: 10.1145/3669940.3707287. Online publication date: 30-Mar-2025
  • (2025) Zoom-inRCL: Fine-grained root cause localization for B5G/6G network slicing. Computer Networks 256 (110893). DOI: 10.1016/j.comnet.2024.110893. Online publication date: Jan-2025
  • (2024) Interpretable Failure Localization for Microservice Systems Based on Graph Autoencoder. ACM Transactions on Software Engineering and Methodology 34:2 (1-28). DOI: 10.1145/3695999. Online publication date: 13-Sep-2024
  • (2024) ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 1183-1194. DOI: 10.1145/3691620.3695495. Online publication date: 27-Oct-2024
  • (2024) MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 1057-1068. DOI: 10.1145/3691620.3695485. Online publication date: 27-Oct-2024
  • (2024) The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 931-943. DOI: 10.1145/3691620.3695475. Online publication date: 27-Oct-2024
  • (2024) Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We? Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 706-715. DOI: 10.1145/3691620.3695065. Online publication date: 27-Oct-2024
  • (2024) HeMiRCA: Fine-Grained Root Cause Analysis for Microservices with Heterogeneous Data Sources. ACM Transactions on Software Engineering and Methodology 33:8 (1-25). DOI: 10.1145/3674726. Online publication date: 1-Jul-2024
  • (2024) A Bayesian LSTM Based Active Anomaly Detection Service for Large Online Systems. Proceedings of the 15th Asia-Pacific Symposium on Internetware, 407-416. DOI: 10.1145/3671016.3674818. Online publication date: 24-Jul-2024
  • (2024) Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 126-137. DOI: 10.1145/3663529.3663834. Online publication date: 10-Jul-2024
