DOI: 10.1145/3611643.3616249

Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data

Published: 30 November 2023

Abstract

Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging task. To understand and localize the root causes of unexpected faults, modern observability tools collect and preserve multi-modal observability data, including metrics, traces, and logs. Because system faults may manifest as anomalies in different data sources, existing RCA approaches that rely on single-modal data are limited in the granularity and interpretability of the root causes they report. In this study, we present Nezha, an interpretable and fine-grained RCA approach that pinpoints root causes at the code region and resource type level by jointly analyzing multi-modal data. Nezha transforms heterogeneous multi-modal data into a homogeneous event representation and extracts event patterns by constructing and mining event graphs. The core idea of Nezha is to compare event patterns in the fault-free phase with those in the fault-suffering phase to localize root causes in an interpretable way. An implementation and experimental evaluation on two microservice applications show that Nezha achieves a high top-1 accuracy (89.77% on average) at the code region and resource type level and outperforms state-of-the-art approaches by a large margin. Two ablation studies further confirm the contribution of incorporating multi-modal data.
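The core idea in the abstract, comparing event patterns between a fault-free and a fault-suffering phase, can be illustrated with a minimal sketch. This is not Nezha's published algorithm: the event names, the choice of adjacent event pairs as "patterns", and the support-difference score are all assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Tuple

Event = str                    # hypothetical unified event label, e.g. "log:charge_ok"
Trace = List[Event]            # one request, as a time-ordered list of events
Pattern = Tuple[Event, Event]  # assumed pattern shape: a pair of adjacent events


def pattern_support(traces: Iterable[Trace]) -> Dict[Pattern, float]:
    """Return the fraction of traces in which each adjacent event pair occurs."""
    trace_list = list(traces)
    counts: Dict[Pattern, int] = {}
    for trace in trace_list:
        # A set ensures each pattern is counted at most once per trace.
        for pattern in set(zip(trace, trace[1:])):
            counts[pattern] = counts.get(pattern, 0) + 1
    total = len(trace_list) or 1  # avoid division by zero on empty input
    return {pattern: count / total for pattern, count in counts.items()}


def rank_suspicious_patterns(
    normal: Iterable[Trace], faulty: Iterable[Trace]
) -> List[Tuple[Pattern, float]]:
    """Rank patterns by how much their support changes between the two phases."""
    normal_support = pattern_support(normal)
    faulty_support = pattern_support(faulty)
    patterns = set(normal_support) | set(faulty_support)
    scored = [
        (p, abs(faulty_support.get(p, 0.0) - normal_support.get(p, 0.0)))
        for p in patterns
    ]
    # Largest change first; ties are broken by pattern name for determinism.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

In this sketch, a pattern that appears only in the fault-suffering phase (for example, a hypothetical resource-alert event inserted between two log events) or one that disappears from it receives a high score, while patterns whose support is unchanged score zero. The ranked pattern itself serves as the explanation, which is the sense in which such a comparison is interpretable.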



Published In

ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2023
2215 pages
ISBN: 9798400703270
DOI: 10.1145/3611643

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Microservice
  2. Multi-modal Observability Data
  3. Root Cause Analysis

Qualifiers

  • Research-article


Conference

ESEC/FSE '23

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%
