
Localizing root causes of performance anomalies in cloud computing systems by analyzing request trace logs

  • Research Paper
  • Progress of Projects Supported by NSFC
  • Published in: Science China Information Sciences

Abstract

Localizing the root causes of performance anomalies in cloud computing systems is difficult because of the complex interactions among components. The hidden connections among the huge number of request execution paths in such systems usually carry useful information for diagnosing performance anomalies. We propose an approach that leverages request trace logs to localize anomalous invoked methods and their physical locations in two steps: (1) cluster requests according to their call sequences, identify anomalous requests with principal component analysis, and then pick out anomalous methods with the Mann-Whitney hypothesis test; (2) compare the behavior of all replicated instances of each anomalous method using Jensen-Shannon divergence, and select the instances whose behavior differs from that of their peers as the final culprits of the performance anomaly. We validate our approach with four real-world cases from Alibaba Cloud Computing Inc. The results demonstrate that it locates the root causes of performance anomalies with low false-positive and false-negative rates.
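
To make the pipeline in the abstract concrete, the following Python sketch illustrates the three statistical building blocks it names: PCA-based screening of anomalous requests, the Mann-Whitney U test for picking out anomalous methods, and Jensen-Shannon divergence for comparing replicated instances. This is a minimal illustration built on standard NumPy/SciPy/scikit-learn calls; the data layouts (a requests-by-methods call-count matrix, per-method latency samples, per-instance behavior histograms), the function names, and all thresholds are assumptions of this sketch, not the authors' implementation.

import numpy as np
from scipy.stats import mannwhitneyu
from scipy.spatial.distance import jensenshannon
from sklearn.decomposition import PCA

def find_anomalous_requests(call_counts, n_components=3, q=0.99):
    # Step 1a (one common PCA formulation, assumed here): project each
    # request's method-call-count vector onto the top principal components;
    # requests with an unusually large residual (squared prediction error)
    # are flagged as anomalous.
    X = call_counts - call_counts.mean(axis=0)
    pca = PCA(n_components=n_components).fit(X)
    residual = X - pca.inverse_transform(pca.transform(X))
    spe = (residual ** 2).sum(axis=1)
    return np.where(spe > np.quantile(spe, q))[0]

def find_anomalous_methods(baseline_latency, anomalous_latency, alpha=0.01):
    # Step 1b: a method is anomalous if its latency samples in anomalous
    # requests and in the baseline are unlikely to come from the same
    # distribution, per the nonparametric Mann-Whitney U test.
    flagged = []
    for method, baseline in baseline_latency.items():
        observed = anomalous_latency.get(method, [])
        if len(baseline) < 5 or len(observed) < 5:
            continue  # too few samples for a meaningful test
        _, p_value = mannwhitneyu(baseline, observed, alternative="two-sided")
        if p_value < alpha:
            flagged.append(method)
    return flagged

def find_culprit_instances(instance_histograms, threshold=0.3):
    # Step 2: among the replicated instances of one anomalous method, flag
    # each instance whose behavior histogram diverges from the average of
    # its peers. SciPy's jensenshannon returns the JS distance (the square
    # root of the divergence) and normalizes inputs to probability vectors.
    culprits = []
    for name, hist in instance_histograms.items():
        peers = [h for n, h in instance_histograms.items() if n != name]
        if not peers:
            continue  # a lone instance has no peers to compare against
        peer_avg = np.mean(peers, axis=0)
        if jensenshannon(hist, peer_avg) > threshold:
            culprits.append(name)
    return culprits

The clustering of requests by call sequence (the step preceding PCA) is omitted; any grouping that yields a per-cluster count matrix, such as hashing each request's call sequence, would slot in before find_anomalous_requests.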



Author information


Corresponding author

Correspondence to HaiBo Mi.


About this article

Cite this article

Mi, H., Wang, H., Zhou, Y. et al. Localizing root causes of performance anomalies in cloud computing systems by analyzing request trace logs. Sci. China Inf. Sci. 55, 2757–2773 (2012). https://doi.org/10.1007/s11432-012-4747-8


