Abstract
The growing scale and complexity of component interactions in cloud computing systems pose great challenges for operators trying to understand system performance characteristics. Profiling has long proved to be an effective approach to performance analysis; however, existing approaches face new challenges in cloud computing systems. First, profiling efficiency becomes a critical concern; second, profiling should be service-oriented to support separation-of-concerns performance analysis. To address these issues, we present P-Tracer, an online performance profiling tool specifically tailored for cloud computing systems. First, P-Tracer constructs a dedicated search engine that proactively processes performance logs and generates an index for fast queries; second, for each service, P-Tracer derives statistical insight into performance characteristics along multiple dimensions and provides operators with a suite of web-based interfaces for querying the critical information. We evaluate P-Tracer in terms of tracing overhead, data preprocessing scalability, and querying efficiency. Three real-world case studies from the Alibaba cloud computing platform demonstrate that P-Tracer can help operators understand software behaviors and localize the primary causes of performance anomalies effectively and efficiently.
Author information
Haibo Mi received the BEng and MEng degrees from Communication Command University of Wu Han in 2005 and 2008, respectively. He is currently working toward the PhD degree at the National Laboratory for Parallel & Distributed Processing, National University of Defense Technology (NUDT), Changsha, China. His thesis focuses on performance maintenance in large-scale distributed systems. He has worked with the engineers and operators of Alibaba Cloud Computing Company for two years. His research interests include distributed computing, cloud computing, performance monitoring, and fault localization.
Huaimin Wang received his PhD in computer science from NUDT in 1992. He is now a professor and chief engineer in the Department of Educational Affairs, NUDT. He has been named a “Chang Jiang Scholars Program” professor by the Ministry of Education of China and a Distinguished Young Scholar by the National Natural Science Foundation of China (NSFC), among other honors. He has directed several grand research projects and has published more than 100 research papers in international conferences and journals. His current research interests include middleware, software agents, and trustworthy computing.
Yangfan Zhou is currently a research staff member with the Shenzhen Research Institute, The Chinese University of Hong Kong (CUHK) and Department of Computer Science and Engineering, CUHK. He received an MPhil and a PhD from CUHK in 2006 and 2009, respectively, and a BSc from Peking University in 2000. His current research is on software engineering issues (e.g., fault management, fault tolerance, reliability engineering, testing, and debugging) and their applications.
Michael Rung-Tsong Lyu received his PhD degree in computer science from the University of California, Los Angeles, in 1988. He is now a professor in the Department of Computer Science & Engineering. He initiated the 1st International Symposium on Software Reliability Engineering (ISSRE) in 1990. He was the program chair for ISSRE 1996, the general chair for ISSRE 2001, the program cochair for PRDC 1999, WWW 2010, SRDS 2005, and ICEBE 2007, the general cochair for PRDC 2005, and a program committee member for many other conferences. Dr. Lyu’s research interests include software reliability engineering, distributed systems, fault-tolerant computing, data mining, and machine learning. He has published over 400 refereed journal and conference papers in these areas. Dr. Lyu is an IEEE Fellow and an AAAS Fellow, and received the IEEE Reliability Society 2010 Engineer of the Year Award.
Hua Cai received the BS degree from Shanghai Jiaotong University, Shanghai, China, in 1999, and the PhD degree from the Hong Kong University of Science and Technology (HKUST) in 2003, both in electrical and electronic engineering. He is a member of the IEEE and ACM. He joined Microsoft Research Asia, Beijing, China, in December 2003 and was an associate researcher in the Media Communication Group. He is now a senior expert at Alibaba Cloud Computing Company, where he leads the cloud monitoring and computing platform teams. His research interests include distributed systems, cloud computing, and mobile media computing.
About this article
Cite this article
Mi, H., Wang, H., Zhou, Y. et al. An online service-oriented performance profiling tool for cloud computing systems. Front. Comput. Sci. 7, 431–445 (2013). https://doi.org/10.1007/s11704-013-2193-4
DOI: https://doi.org/10.1007/s11704-013-2193-4