An online service-oriented performance profiling tool for cloud computing systems

  • Research Article
Frontiers of Computer Science

Abstract

The growing scale and complexity of component interactions in cloud computing systems pose great challenges for operators trying to understand the characteristics of system performance. Profiling has long been proven to be an effective approach to performance analysis; however, existing approaches confront new challenges that emerge in cloud computing systems. First, the efficiency of profiling becomes a critical concern; second, service-oriented profiling should be considered to support separation-of-concerns performance analysis. To address these issues, we present P-Tracer, an online performance profiling tool specifically tailored for cloud computing systems. First, P-Tracer constructs a dedicated search engine that proactively processes performance logs and generates a specialized index for fast queries; second, for each service, P-Tracer derives statistical insight into performance characteristics along multiple dimensions and provides operators with a suite of web-based interfaces for querying the critical information. We evaluate P-Tracer in terms of tracing overhead, data preprocessing scalability, and querying efficiency. Three real-world case studies from the Alibaba cloud computing platform demonstrate that P-Tracer can help operators understand software behaviors and localize the primary causes of performance anomalies effectively and efficiently.

Author information

Corresponding author

Correspondence to Haibo Mi.

Additional information

Haibo Mi received the BEng and MEng degrees from Communication Command University of Wu Han, in 2005 and 2008, respectively. He is currently working toward the PhD degree at the National Laboratory for Parallel & Distributed Processing, National University of Defense Technology (NUDT), Changsha, China. His thesis focuses on performance maintenance in large-scale distributed systems. He has worked with the engineers and operators of Alibaba Cloud Computing Company for two years. His research interests include distributed computing, cloud computing, performance monitoring, and fault localization.

Huaimin Wang received his PhD in computer science from NUDT in 1992. He is now a professor and chief engineer in the Department of Educational Affairs, NUDT. He has been named a “Chang Jiang Scholars Program” professor by the Ministry of Education of China and a Distinct Young Scholar by the National Natural Science Foundation of China (NSFC), among other honors. He has directed several grand research projects and has published more than 100 research papers in international conferences and journals. His current research interests include middleware, software agents, and trustworthy computing.

Yangfan Zhou is currently a research staff member with the Shenzhen Research Institute, The Chinese University of Hong Kong (CUHK) and Department of Computer Science and Engineering, CUHK. He received an MPhil and a PhD from CUHK in 2006 and 2009, respectively, and a BSc from Peking University in 2000. His current research is on software engineering issues (e.g., fault management, fault tolerance, reliability engineering, testing, and debugging) and their applications.

Michael Rung-Tsong Lyu received his PhD degree in computer science from the University of California, Los Angeles, in 1988. He is now a professor in the Department of Computer Science & Engineering. He initiated the 1st International Symposium on Software Reliability Engineering (ISSRE) in 1990. He was the program chair for ISSRE 1996, the general chair for ISSRE 2001, the program cochair for PRDC 1999, WWW 2010, SRDS 2005, and ICEBE 2007, the general cochair for PRDC 2005, and a program committee member for many other conferences. Dr. Lyu’s research interests include software reliability engineering, distributed systems, fault-tolerant computing, data mining, and machine learning. He has published over 400 refereed journal and conference papers in these areas. Dr. Lyu is an IEEE Fellow and an AAAS Fellow, and received the IEEE Reliability Society 2010 Engineer of the Year Award.

Hua Cai received the BS degree from the Shanghai Jiaotong University, Shanghai, China, in 1999, and the PhD degree from the Hong Kong University of Science and Technology (HKUST) in 2003, both in electrical and electronic engineering. He is a member of the IEEE and ACM. He joined Microsoft Research Asia, Beijing, China, in December 2003 and was an associate researcher in the Media Communication Group. He is now a senior expert at Alibaba Cloud Computing Company and leads the cloud monitoring and computing platform teams. His research interests include distributed systems, cloud computing, and mobile media computing.

About this article

Cite this article

Mi, H., Wang, H., Zhou, Y. et al. An online service-oriented performance profiling tool for cloud computing systems. Front. Comput. Sci. 7, 431–445 (2013). https://doi.org/10.1007/s11704-013-2193-4

  • DOI: https://doi.org/10.1007/s11704-013-2193-4
