DOI:10.1145/2465529.2465753
research-article

Root cause detection in a service-oriented architecture

Published: 17 June 2013

Abstract

Large-scale websites are predominantly built as a service-oriented architecture. Here, each service is specialized for a certain task, runs on multiple machines, and communicates with other services to serve a user's request. An anomalous change in a metric of one service can propagate to other services during this communication, degrading the request as a whole. Because any such degradation affects revenue, maintaining correct functionality is of paramount concern: it is important to find the root cause of any anomaly as quickly as possible. This is challenging because there are numerous metrics or sensors for a given service, and a modern website is usually composed of hundreds of services running on thousands of machines in multiple data centers.
This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked list of possible root causes for monitoring teams to investigate. MonitorRank takes as input the historical and current time-series metrics of each sensor, along with the call graph generated between sensors, and builds an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, show a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.
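
The abstract names the inputs (per-sensor time series and a call graph) and the output (a ranked list of candidate root causes) but does not spell out the ranking model. As a rough illustration of how such inputs could be combined, the sketch below scores services with a correlation-weighted random walk over the call graph, in the style of personalized PageRank. This is an assumption made for illustration, not the paper's stated method; the function names, the damping value, and the example services are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_root_causes(call_graph, metrics, frontend, damping=0.85, iterations=100):
    """Rank services other than `frontend` as root-cause candidates.

    call_graph: dict mapping a service to the list of services it calls.
    metrics:    dict mapping every service to its metric values over the
                anomaly window (all series the same length).
    frontend:   the service where the anomaly was observed.
    Assumption: every callee in call_graph also has an entry in metrics.
    """
    nodes = list(metrics)
    candidates = [s for s in nodes if s != frontend]
    if not candidates:
        return []

    # Similarity of each service's metric to the anomalous frontend metric.
    sim = {s: abs(pearson(metrics[s], metrics[frontend])) for s in nodes}

    # Teleport distribution: prefer services that move with the frontend.
    total = sum(sim[s] for s in candidates)
    if total == 0:
        teleport = {s: (1.0 / len(candidates) if s != frontend else 0.0) for s in nodes}
    else:
        teleport = {s: (sim[s] / total if s != frontend else 0.0) for s in nodes}

    score = {s: 1.0 / len(nodes) for s in nodes}
    for _ in range(iterations):
        nxt = {s: (1 - damping) * teleport[s] for s in nodes}
        for s in nodes:
            callees = call_graph.get(s, [])
            weights = [sim[c] for c in callees]
            w_total = sum(weights)
            if w_total == 0:
                # No usable outgoing edge: hand the mass to the teleport step.
                for t in nodes:
                    nxt[t] += damping * score[s] * teleport[t]
            else:
                # Follow calls in proportion to the callee's similarity.
                for c, w in zip(callees, weights):
                    nxt[c] += damping * score[s] * w / w_total
        score = nxt

    return sorted(candidates, key=lambda s: score[s], reverse=True)


# Hypothetical example: the frontend calls two services; only "search"
# shows a latency jump that matches the frontend's.
example_graph = {"frontend": ["search", "ads"]}
example_metrics = {
    "frontend": [1, 1, 1, 5, 5],
    "search": [2, 2, 2, 9, 9],
    "ads": [3, 3, 3, 3, 3],
}
print(rank_root_causes(example_graph, example_metrics, "frontend"))
# prints ['search', 'ads']
```

The walk follows call edges toward services whose metric correlates with the anomalous one, so a service deep in the call graph can still outrank a direct callee that shows no matching change. A production system would also need historical data to decide what counts as anomalous, which this sketch leaves out.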

Published In

SIGMETRICS '13: Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems
June 2013
406 pages
ISBN:9781450319003
DOI:10.1145/2465529
  • ACM SIGMETRICS Performance Evaluation Review, Volume 41, Issue 1
    June 2013
    385 pages
    ISSN:0163-5999
    DOI:10.1145/2494232
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. anomaly correlation
  2. call graph
  3. monitoring
  4. service-oriented architecture

Qualifiers

  • Research-article

Conference

SIGMETRICS '13
Acceptance Rates

SIGMETRICS '13 Paper Acceptance Rate 54 of 196 submissions, 28%;
Overall Acceptance Rate 459 of 2,691 submissions, 17%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 160
  • Downloads (last 6 weeks): 11
Reflects downloads up to 03 Mar 2025

Cited By

  • (2024) Microservice Root Cause Analysis With Limited Observability Through Intervention Recognition in the Latent Space. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 6049-6060. DOI:10.1145/3637528.3671530. Online publication date: 25-Aug-2024
  • (2024) Diagnosing Performance Issues for Large-Scale Microservice Systems With Heterogeneous Graph. IEEE Transactions on Services Computing 17:5, pp. 2223-2235. DOI:10.1109/TSC.2024.3402172. Online publication date: Sep-2024
  • (2024) Multilayered Fault Detection and Localization With Transformer for Microservice Systems. IEEE Transactions on Reliability 73:3, pp. 1502-1515. DOI:10.1109/TR.2024.3356717. Online publication date: Sep-2024
  • (2024) FaaSRCA: Full Lifecycle Root Cause Analysis for Serverless Applications. 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), pp. 415-426. DOI:10.1109/ISSRE62328.2024.00047. Online publication date: 28-Oct-2024
  • (2024) G-Cause: Parameter-free Global Diagnosis for Hyperscale Web Service Infrastructures. 2024 IEEE International Conference on Web Services (ICWS), pp. 1003-1014. DOI:10.1109/ICWS62655.2024.00119. Online publication date: 7-Jul-2024
  • (2024) Explaining Microservices' Cascading Failures From Their Logs. Software: Practice and Experience. DOI:10.1002/spe.3400. Online publication date: 17-Dec-2024
  • (2023) DyCause: Crowdsourcing to Diagnose Microservice Kernel Failure. IEEE Transactions on Dependable and Secure Computing 20:6, pp. 4763-4777. DOI:10.1109/TDSC.2022.3233915. Online publication date: Nov-2023
  • (2023) TraceStream: Anomalous Service Localization based on Trace Stream Clustering with Online Feedback. 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), pp. 601-611. DOI:10.1109/ISSRE59848.2023.00033. Online publication date: 9-Oct-2023
  • (2023) Efficient and Robust Trace Anomaly Detection for Large-Scale Microservice Systems. 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), pp. 69-79. DOI:10.1109/ISSRE59848.2023.00012. Online publication date: 9-Oct-2023
  • (2023) Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-Source Data. Proceedings of the 45th International Conference on Software Engineering, pp. 1750-1762. DOI:10.1109/ICSE48619.2023.00150. Online publication date: 14-May-2023
