ASDF: An Automated, Online Framework for Diagnosing Performance Problems

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6420)

Abstract

Performance problems account for a significant percentage of documented failures in large-scale distributed systems, such as Hadoop. Localizing the source of these performance problems can be frustrating due to the overwhelming amount of monitoring information available. We automate problem localization using ASDF, an online diagnostic framework that transparently monitors and analyzes different time-varying data sources (e.g., OS performance counters, Hadoop logs) and narrows down performance problems to a specific node or a set of nodes. ASDF’s flexible architecture allows system administrators to easily customize data sources and analysis modules for their unique operating environments. We demonstrate the effectiveness of ASDF’s diagnostics on documented performance problems in Hadoop; our results indicate that ASDF incurs an average monitoring overhead of 0.38% of CPU time and achieves a balanced accuracy of 80% at localizing problems to the culprit node.
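The balanced-accuracy figure quoted above can be unpacked with a small worked example. The sketch below assumes the standard definition (the mean of the true-positive rate and the true-negative rate), which the abstract itself does not spell out. The function name and the node counts are hypothetical and chosen only for illustration; they are not results from the chapter.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of true-positive rate and true-negative rate.

    tp/fn: culprit nodes correctly flagged / missed.
    tn/fp: healthy nodes correctly cleared / wrongly flagged.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one culprit and one healthy node")
    tpr = tp / (tp + fn)  # fraction of culprit nodes that were localized
    tnr = tn / (tn + fp)  # fraction of healthy nodes left unflagged
    return (tpr + tnr) / 2


# Hypothetical counts, NOT taken from the chapter:
# 7 of 10 culprit nodes flagged, 90 of 100 healthy nodes cleared.
score = balanced_accuracy(tp=7, fn=3, tn=90, fp=10)
print(f"balanced accuracy: {score:.2f}")
```

Unlike plain accuracy, this measure is not inflated when healthy nodes greatly outnumber culprit nodes, which is the usual situation when a fault affects one node in a cluster.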

ASDF stands for Automated System for Diagnosing Failures.

This work is supported by the NSF CAREER Award CCR-0238381, and grant CNS-0326453.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Bare, K. et al. (2010). ASDF: An Automated, Online Framework for Diagnosing Performance Problems. In: Casimiro, A., de Lemos, R., Gacek, C. (eds) Architecting Dependable Systems VII. Lecture Notes in Computer Science, vol 6420. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17245-8_9

  • DOI: https://doi.org/10.1007/978-3-642-17245-8_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-17244-1

  • Online ISBN: 978-3-642-17245-8

  • eBook Packages: Computer Science, Computer Science (R0)
