Root cause detection in a service-oriented architecture

Authors: Roshan Sumbaly, Sam Shah

SIGMETRICS '13: Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems

Pages 93 - 104

https://doi.org/10.1145/2465529.2465753

Published: 17 June 2013

Abstract

Large-scale websites are predominantly built as a service-oriented architecture. Here, services are specialized for a certain task, run on multiple machines, and communicate with each other to serve a user's request. An anomalous change in a metric of one service can propagate to other services during this communication, degrading the request as a whole. Because any such degradation is revenue-impacting, maintaining correct functionality is of paramount concern: it is important to find the root cause of any anomaly as quickly as possible. This is challenging because there are numerous metrics or sensors for a given service, and a modern website is usually composed of hundreds of services running on thousands of machines in multiple data centers.

This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked list of possible root causes for monitoring teams to investigate. MonitorRank takes as input the historical and current time-series metrics of each sensor, along with the call graph generated between sensors, to build an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, show a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.
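The abstract does not spell out the ranking model, so the following is only an illustrative sketch of how time-series metrics and a call graph might be combined into a ranked list of candidates. It assumes Pearson correlation as the similarity measure and a random walk with restart over the call graph (in the spirit of personalized PageRank); neither choice is confirmed by the text above. The function names, the `restart` parameter, the sensor names, and the sample data are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_root_causes(call_graph, metrics, frontend, restart=0.15, iters=100):
    """Rank sensors as root-cause candidates (hypothetical sketch, not the paper's code).

    call_graph: dict mapping each sensor to the list of sensors it calls.
    metrics:    dict mapping each sensor to its metric time series.
    frontend:   the sensor on which the anomaly was observed.
    """
    nodes = list(metrics)
    # Assumption: similarity = |correlation| with the anomalous frontend metric.
    sim = {v: abs(pearson(metrics[v], metrics[frontend])) for v in nodes}

    # The walk moves from a caller to its callees in proportion to similarity.
    trans = {}
    for u in nodes:
        weights = {v: sim[v] for v in call_graph.get(u, [])}
        total = sum(weights.values())
        trans[u] = {v: w / total for v, w in weights.items()} if total > 0 else {}

    # Power iteration for the stationary visit frequencies of the walk.
    score = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            if trans[u]:
                nxt[frontend] += restart * score[u]
                for v, p in trans[u].items():
                    nxt[v] += (1 - restart) * p * score[u]
            else:
                # No usable outgoing edge: jump back to the frontend.
                nxt[frontend] += score[u]
        score = nxt

    candidates = [v for v in nodes if v != frontend]
    return sorted(candidates, key=lambda v: score[v], reverse=True)


# Made-up data: the frontend and "search" latencies spike together at the end.
call_graph = {"frontend": ["auth", "search", "profile"], "profile": ["db"]}
metrics = {
    "frontend": [10, 11, 10, 12, 50, 52],
    "search": [5, 6, 5, 6, 45, 46],
    "auth": [3, 4, 3, 4, 3, 4],
    "profile": [7, 6, 7, 6, 7, 6],
    "db": [2, 3, 2, 3, 2, 3],
}
ranked = rank_root_causes(call_graph, metrics, "frontend")
print(ranked)
```

A plain forward walk like this tends to favor services close to the frontend, because the score decays with every hop. It is meant only to show how metric similarity and the call graph can feed a single ranking.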

Index Terms

Root cause detection in a service-oriented architecture
1. Computing methodologies
  1. Machine learning
  2. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies
2. General and reference
  1. Cross-computing tools and techniques
    1. Metrics

Published In

SIGMETRICS '13: Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems

June 2013

406 pages

ISBN:9781450319003

DOI:10.1145/2465529

General Chair: Mor Harchol-Balter (Carnegie Mellon University, USA)
Program Chairs: John Douceur (Microsoft Research, USA), Jun Xu (Georgia Institute of Technology, USA)

ACM SIGMETRICS Performance Evaluation Review Volume 41, Issue 1
June 2013
385 pages
ISSN:0163-5999
DOI:10.1145/2494232

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMETRICS: ACM Special Interest Group on Measurement and Evaluation

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMETRICS '13

Sponsor:

SIGMETRICS

SIGMETRICS '13: ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems

June 17 - 21, 2013

Pittsburgh, PA, USA

Acceptance Rates

SIGMETRICS '13 Paper Acceptance Rate 54 of 196 submissions, 28%;

Overall Acceptance Rate 459 of 2,691 submissions, 17%
