ASDF: An Automated, Online Framework for Diagnosing Performance Problems

Bare, Keith; Kavulya, Soila P.; Tan, Jiaqi; Pan, Xinghao; Marinelli, Eugene; Kasick, Michael; Gandhi, Rajeev; Narasimhan, Priya

doi:10.1007/978-3-642-17245-8_9

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6420)

Abstract

Performance problems account for a significant percentage of documented failures in large-scale distributed systems, such as Hadoop. Localizing the source of these performance problems can be frustrating due to the overwhelming amount of monitoring information available. We automate problem localization using ASDF, an online diagnostic framework that transparently monitors and analyzes different time-varying data sources (e.g., OS performance counters, Hadoop logs) and narrows down performance problems to a specific node or a set of nodes. ASDF’s flexible architecture allows system administrators to easily customize data sources and analysis modules for their unique operating environments. We demonstrate the effectiveness of ASDF’s diagnostics on documented performance problems in Hadoop; our results indicate that ASDF incurs an average monitoring overhead of 0.38% of CPU time and achieves a balanced accuracy of 80% at localizing problems to the culprit node.
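The balanced-accuracy figure can be read with the help of a small sketch. The code below assumes the standard definition of balanced accuracy, the mean of the true-positive rate and the true-negative rate; the abstract does not spell out the exact formula the authors used, and the function name and node counts are hypothetical, chosen only for illustration.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of true-positive rate and true-negative rate.

    Assumed standard definition; the abstract does not state the formula.
    tp/fn count faulty nodes that were / were not flagged,
    tn/fp count fault-free nodes that were not / were flagged.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one faulty and one fault-free sample")
    tpr = tp / (tp + fn)  # fraction of culprit nodes correctly indicted
    tnr = tn / (tn + fp)  # fraction of healthy nodes correctly left alone
    return (tpr + tnr) / 2


# Hypothetical counts, not results from the chapter.
score = balanced_accuracy(tp=7, fn=3, tn=81, fp=9)
print(f"balanced accuracy: {score:.2f}")
```

Averaging the two rates matters in this setting because fault-free nodes usually far outnumber culprit nodes; plain accuracy would then be dominated by the healthy majority, while balanced accuracy weights both classes equally.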

ASDF stands for Automated System for Diagnosing Failures.

This work is supported by the NSF CAREER Award CCR-0238381, and grant CNS-0326453.

Author information

Authors and Affiliations

Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Keith Bare, Soila P. Kavulya, Eugene Marinelli, Michael Kasick, Rajeev Gandhi & Priya Narasimhan
DSO National Laboratories, Singapore, 118230
Jiaqi Tan & Xinghao Pan

Editor information

Editors and Affiliations

Faculty of Science, University of Lisbon, Campo Grande, Bloco C6, Piso 3, 1749-016, Lisbon, Portugal
Antonio Casimiro
School of Computing, University of Kent, CT2 7NF, Canterbury, Kent, UK
Rogério de Lemos
Centre for Software Reliability, City University, London, Northampton Square, EC1V 0HB, London, UK
Cristina Gacek

About this chapter

Cite this chapter

Bare, K. et al. (2010). ASDF: An Automated, Online Framework for Diagnosing Performance Problems. In: Casimiro, A., de Lemos, R., Gacek, C. (eds) Architecting Dependable Systems VII. Lecture Notes in Computer Science, vol 6420. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17245-8_9

DOI: https://doi.org/10.1007/978-3-642-17245-8_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17244-1
Online ISBN: 978-3-642-17245-8
eBook Packages: Computer Science, Computer Science (R0)
