Abstract
It is hard to localize the primary cause of performance anomalies in cloud computing systems because of the complexity of interactions between components. The hidden connections among the huge number of request execution paths in such systems usually contain useful information for diagnosing performance anomalies. We propose an approach that leverages request trace logs to localize anomalous invoked methods and their physical locations. The approach involves two steps: (1) cluster the requests according to their call sequences, identify anomalous requests with principal component analysis, and then pick out anomalous methods with the Mann-Whitney hypothesis test; (2) compare the behavior of all replicated instances of the anomalous methods using Jensen-Shannon divergence, and select the instances whose behavior differs from that of the others as the final culprits of the performance anomalies. We validate our approach with experiments on four real-world cases at Alibaba Cloud Computing Inc. The results demonstrate that our approach can locate the primary causes of performance anomalies with low false-positive and false-negative rates.
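Step (2) of the approach can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so everything below is an assumption for illustration: the function names, the fixed-width latency histogram, the rule of scoring each replica by its mean Jensen-Shannon divergence against its peers, and the threshold value are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def histogram(latencies, bin_width, n_bins):
    """Turn raw non-negative latencies into a discrete probability distribution.

    Values beyond the last bin are clamped into it (an assumed binning scheme).
    """
    counts = Counter(min(int(x // bin_width), n_bins - 1) for x in latencies)
    total = len(latencies)
    return [counts.get(i, 0) / total for i in range(n_bins)]


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    # The mixture m is positive wherever p or q is, so kl() never divides by 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def outlier_replicas(latencies_by_host, bin_width=10.0, n_bins=20, threshold=0.5):
    """Return hosts whose latency distribution diverges from their peers.

    A host is flagged when its mean JS divergence to all other replicas
    exceeds `threshold` (a hypothetical decision rule and cutoff).
    """
    if len(latencies_by_host) < 2:
        return []  # nothing to compare against
    dists = {h: histogram(v, bin_width, n_bins) for h, v in latencies_by_host.items()}
    suspects = []
    for host, p in dists.items():
        others = [q for h, q in dists.items() if h != host]
        mean_jsd = sum(js_divergence(p, q) for q in others) / len(others)
        if mean_jsd > threshold:
            suspects.append(host)
    return suspects


# Hypothetical latencies (ms) of one anomalous method on four replicas.
samples = {
    "host-a": [12, 15, 18, 22, 25],
    "host-b": [11, 16, 19, 21, 27],
    "host-c": [13, 14, 17, 23, 26],
    "host-d": [150, 160, 170, 180, 190],
}
print(outlier_replicas(samples))
```

In this made-up data set, three replicas share the same latency profile while host-d is far slower, so only host-d exceeds the threshold. Jensen-Shannon divergence suits this comparison because it is symmetric and stays finite even when two histograms do not overlap.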
Cite this article
Mi, H., Wang, H., Zhou, Y. et al. Localizing root causes of performance anomalies in cloud computing systems by analyzing request trace logs. Sci. China Inf. Sci. 55, 2757–2773 (2012). https://doi.org/10.1007/s11432-012-4747-8
DOI: https://doi.org/10.1007/s11432-012-4747-8