
Localizing root causes of performance anomalies in cloud computing systems by analyzing request trace logs

  • Research Paper
  • Progress of Projects Supported by NSFC
  • Published in: Science China Information Sciences

Abstract

Localizing the root causes of performance anomalies in cloud computing systems is difficult because of the complex interactions among components. The hidden connections among the huge number of request execution paths in such systems usually carry useful information for diagnosing performance anomalies. We propose an approach that leverages request trace logs to localize anomalous invoked methods and their physical locations in two steps: (1) cluster requests according to their call sequences, identify anomalous requests with principal component analysis, and then pick out anomalous methods with the Mann-Whitney hypothesis test; (2) compare the behavior of all replicated instances of each anomalous method using Jensen-Shannon divergence, and select the instances whose behavior differs from that of their peers as the final culprits of the performance anomaly. We validate our approach with four real-world cases from Alibaba Cloud Computing Inc. The results demonstrate that it locates the root causes of performance anomalies with low false-positive and false-negative rates.
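
To make the pipeline in the abstract concrete, the following Python sketch illustrates the three statistical building blocks it names: PCA-based screening of anomalous requests, the Mann-Whitney U test for picking out anomalous methods, and Jensen-Shannon divergence for comparing replicated instances. This is a minimal illustration built on standard NumPy/SciPy/scikit-learn calls; the data layouts (a requests-by-methods call-count matrix, per-method latency samples, per-instance behavior histograms), the function names, and all thresholds are assumptions of this sketch, not the authors' implementation.

import numpy as np
from scipy.stats import mannwhitneyu
from scipy.spatial.distance import jensenshannon
from sklearn.decomposition import PCA

def find_anomalous_requests(call_counts, n_components=3, q=0.99):
    # Step 1a (one common PCA formulation, assumed here): project each
    # request's method-call-count vector onto the top principal components;
    # requests with an unusually large residual (squared prediction error)
    # are flagged as anomalous.
    X = call_counts - call_counts.mean(axis=0)
    pca = PCA(n_components=n_components).fit(X)
    residual = X - pca.inverse_transform(pca.transform(X))
    spe = (residual ** 2).sum(axis=1)
    return np.where(spe > np.quantile(spe, q))[0]

def find_anomalous_methods(baseline_latency, anomalous_latency, alpha=0.01):
    # Step 1b: a method is anomalous if its latency samples in anomalous
    # requests and in the baseline are unlikely to come from the same
    # distribution, per the nonparametric Mann-Whitney U test.
    flagged = []
    for method, baseline in baseline_latency.items():
        observed = anomalous_latency.get(method, [])
        if len(baseline) < 5 or len(observed) < 5:
            continue  # too few samples for a meaningful test
        _, p_value = mannwhitneyu(baseline, observed, alternative="two-sided")
        if p_value < alpha:
            flagged.append(method)
    return flagged

def find_culprit_instances(instance_histograms, threshold=0.3):
    # Step 2: among the replicated instances of one anomalous method, flag
    # each instance whose behavior histogram diverges from the average of
    # its peers. SciPy's jensenshannon returns the JS distance (the square
    # root of the divergence) and normalizes inputs to probability vectors.
    culprits = []
    for name, hist in instance_histograms.items():
        peers = [h for n, h in instance_histograms.items() if n != name]
        if not peers:
            continue  # a lone instance has no peers to compare against
        peer_avg = np.mean(peers, axis=0)
        if jensenshannon(hist, peer_avg) > threshold:
            culprits.append(name)
    return culprits

The clustering of requests by call sequence (the step preceding PCA) is omitted; any grouping that yields a per-cluster count matrix, such as hashing each request's call sequence, would slot in before find_anomalous_requests.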



Author information


Corresponding author

Correspondence to HaiBo Mi.


About this article

Cite this article

Mi, H., Wang, H., Zhou, Y. et al. Localizing root causes of performance anomalies in cloud computing systems by analyzing request trace logs. Sci. China Inf. Sci. 55, 2757–2773 (2012). https://doi.org/10.1007/s11432-012-4747-8


