An online service-oriented performance profiling tool for cloud computing systems

Mi, Haibo; Wang, Huaimin; Zhou, Yangfan; Lyu, Michael Rung-Tsong; Cai, Hua; Yin, Gang

doi:10.1007/s11704-013-2193-4

Research Article
Published: 27 February 2013

Volume 7, pages 431–445, (2013)

Frontiers of Computer Science

Haibo Mi¹, Huaimin Wang¹, Yangfan Zhou², Michael Rung-Tsong Lyu², Hua Cai³ & Gang Yin¹

Abstract

The growing scale and complexity of component interactions in cloud computing systems pose great challenges for operators trying to understand the characteristics of system performance. Profiling has long been proven to be an effective approach to performance analysis; however, existing approaches face new challenges that emerge in cloud computing systems. First, the efficiency of profiling becomes a critical concern; second, service-oriented profiling should be considered to support separation-of-concerns performance analysis. To address these issues, we present P-Tracer, an online performance profiling tool specifically tailored for cloud computing systems. First, P-Tracer constructs a dedicated search engine that proactively processes performance logs and generates an index for fast queries; second, for each service, P-Tracer derives statistical insight into performance characteristics along multiple dimensions and provides operators with a suite of web-based interfaces for querying the critical information. We evaluate P-Tracer in terms of tracing overhead, data preprocessing scalability, and querying efficiency. Three real-world case studies from the Alibaba cloud computing platform demonstrate that P-Tracer can help operators understand software behaviors and localize the primary causes of performance anomalies effectively and efficiently.
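
The abstract does not specify P-Tracer's log format, index layout, or statistics, so the following is only an illustrative sketch of the general idea it describes: index performance log records by service ahead of time, then answer per-service statistical queries from the index. The record fields (`service`, `host`, `latency_ms`) and the function names `build_index` and `service_profile` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_index(records):
    """Group log records by service name so later queries avoid a full scan."""
    index = defaultdict(list)
    for rec in records:
        index[rec["service"]].append(rec)
    return index


def service_profile(index, service):
    """Summarize one service's latency: count, mean, coefficient of variation."""
    latencies = [rec["latency_ms"] for rec in index.get(service, [])]
    if not latencies:
        return None  # no records were indexed for this service
    avg = mean(latencies)
    # Coefficient of variation = population standard deviation / mean.
    cv = pstdev(latencies) / avg if avg else 0.0
    return {"count": len(latencies), "mean_ms": avg, "cv": cv}


# Hypothetical log records, made up purely for illustration.
records = [
    {"service": "storage", "host": "h1", "latency_ms": 10.0},
    {"service": "storage", "host": "h2", "latency_ms": 30.0},
    {"service": "scheduler", "host": "h1", "latency_ms": 5.0},
]
index = build_index(records)
profile = service_profile(index, "storage")
```

Building the index once, before any query arrives, is what makes each per-service lookup cheap; a production system would presumably persist and incrementally update such an index rather than hold it in memory.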

Author information

Authors and Affiliations

National Lab for Parallel & Distributed Processing, National University of Defense Technology, Changsha, 410073, China
Haibo Mi, Huaimin Wang & Gang Yin
Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen, 518000, China
Yangfan Zhou & Michael Rung-Tsong Lyu
Computing Platform, Alibaba Cloud Computing Company, Hangzhou, 310000, China
Hua Cai

Corresponding author

Correspondence to Haibo Mi.

Additional information

Haibo Mi received the BEng and MEng degrees from Communication Command University of Wu Han in 2005 and 2008, respectively. He is currently working toward the PhD degree at the National Laboratory for Parallel & Distributed Processing, National University of Defense Technology (NUDT), Changsha, China. His thesis focuses on performance maintenance in large-scale distributed systems. He has worked with the engineers and operators of Alibaba Cloud Computing Company for two years. His research interests include distributed computing, cloud computing, performance monitoring, and fault localization.

Huaimin Wang received his PhD in computer science from NUDT in 1992. He is now a professor and chief engineer in the Department of Educational Affairs, NUDT. He has been named a “Chang Jiang Scholars Program” professor by the Ministry of Education of China and a Distinguished Young Scholar by the National Natural Science Foundation of China (NSFC). He has served as the director of several grand research projects and has published more than 100 research papers in international conferences and journals. His current research interests include middleware, software agents, and trustworthy computing.

Yangfan Zhou is currently a research staff member with the Shenzhen Research Institute, The Chinese University of Hong Kong (CUHK) and Department of Computer Science and Engineering, CUHK. He received an MPhil and a PhD from CUHK in 2006 and 2009, respectively, and a BSc from Peking University in 2000. His current research is on software engineering issues (e.g., fault management, fault tolerance, reliability engineering, testing, and debugging) and their applications.

Michael Rung-Tsong Lyu received his PhD degree in computer science from the University of California, Los Angeles, in 1988. He is now a professor in the Department of Computer Science & Engineering. He initiated the 1st International Symposium on Software Reliability Engineering (ISSRE) in 1990. He was the program chair for ISSRE 1996, the general chair for ISSRE 2001, the program cochair for PRDC 1999, www 2010, SRDS 2005, and ICEBE 2007, the general cochair for PRDC 2005, and a program committee member for many other conferences. Dr. Lyu’s research interests include software reliability engineering, distributed systems, fault-tolerant computing, data mining, and machine learning. He has published over 400 refereed journal and conference papers in these areas. Dr. Lyu is an IEEE Fellow and an AAAS Fellow, and received the IEEE Reliability Society 2010 Engineer of the Year Award.

Hua Cai received the BS degree from Shanghai Jiaotong University, Shanghai, China, in 1999, and the PhD degree from the Hong Kong University of Science and Technology (HKUST) in 2003, both in electrical and electronic engineering. He is a member of the IEEE and ACM. He joined Microsoft Research Asia, Beijing, China, in December 2003 and was an associate researcher in the Media Communication Group. He is now a senior expert at Alibaba Cloud Computing Company and leads the cloud monitoring and computing platform teams. His research interests include distributed systems, cloud computing, and mobile media computing.

About this article

Cite this article

Mi, H., Wang, H., Zhou, Y. et al. An online service-oriented performance profiling tool for cloud computing systems. Front. Comput. Sci. 7, 431–445 (2013). https://doi.org/10.1007/s11704-013-2193-4

Received: 01 June 2012
Accepted: 03 November 2012
Published: 27 February 2013
Issue Date: June 2013
DOI: https://doi.org/10.1007/s11704-013-2193-4
