An Architectural Framework for Detecting Process Hangs/Crashes

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 3463)

Abstract

This paper addresses the challenges faced in the practical implementation of heartbeat-based process crash and hang detection. We propose an in-processor hardware module to reduce error detection latency and instrumentation overhead. Three hardware techniques integrated into the main pipeline of a superscalar processor are presented: (i) Instruction Count Heartbeat (ICH), which detects process crashes and a class of hangs where the process exists but is not executing any instructions; (ii) Infinite Loop Hang Detector (ILHD), which captures process hangs in infinite execution of legitimate loops; and (iii) Sequential Code Hang Detector (SCHD), which detects process hangs in illegal loops. The proposed design has the following unique features: (1) operating-system-neutral detection techniques, (2) elimination of any instrumentation for detection of all application crashes and OS hangs, and (3) an automated and lightweight compile-time instrumentation methodology to detect all process hangs (including infinite loops), with the detection performed in the hardware module at runtime. The proposed techniques can support heartbeat protocols to detect operating system/process crashes and hangs in distributed systems. Evaluation of the hang detection techniques shows a low 1.6% performance overhead and 6% memory overhead for the instrumentation. The crash detection technique does not incur any performance overhead and has a latency of a few instructions.
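The idea behind an instruction-count heartbeat can be illustrated with a small software model. This is only a sketch under stated assumptions: the abstract does not describe the hardware design, so the class `InstructionCountHeartbeat`, the `timeout_cycles` parameter, and the per-cycle sampling of a cumulative retired-instruction count are all hypothetical names and choices made for illustration, not the authors' mechanism.

```python
from dataclasses import dataclass


@dataclass
class InstructionCountHeartbeat:
    """Hypothetical software model of an instruction-count heartbeat.

    Assumption: the monitor sees a cumulative count of retired
    instructions once per cycle and raises an alarm when that count
    has not advanced for `timeout_cycles` consecutive cycles.
    """
    timeout_cycles: int
    idle_cycles: int = 0
    last_count: int = 0

    def tick(self, retired_count: int) -> bool:
        """Process one cycle; return True if the alarm should fire."""
        if retired_count != self.last_count:
            # Progress was made: remember the new count, reset the timer.
            self.last_count = retired_count
            self.idle_cycles = 0
            return False
        self.idle_cycles += 1
        return self.idle_cycles >= self.timeout_cycles


def first_alarm_cycle(trace, timeout_cycles):
    """Return the index of the first cycle that triggers the alarm, or None."""
    monitor = InstructionCountHeartbeat(timeout_cycles)
    for cycle, count in enumerate(trace):
        if monitor.tick(count):
            return cycle
    return None


# A process that retires one instruction every cycle never trips the alarm.
healthy_trace = list(range(1, 21))
# A process that retires five instructions and then stops making progress.
stalled_trace = [1, 2, 3, 4, 5] + [5] * 10

healthy_result = first_alarm_cycle(healthy_trace, timeout_cycles=4)
stalled_result = first_alarm_cycle(stalled_trace, timeout_cycles=4)
```

In this model the stalled trace is flagged once the count has stayed unchanged for four consecutive cycles, while the healthy trace is never flagged. A model of this kind cannot catch a process that keeps retiring instructions inside an infinite loop, which is the case the abstract assigns to ILHD and SCHD.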

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nakka, N., Saggese, G.P., Kalbarczyk, Z., Iyer, R.K. (2005). An Architectural Framework for Detecting Process Hangs/Crashes. In: Dal Cin, M., Kaâniche, M., Pataricza, A. (eds) Dependable Computing - EDCC 5. EDCC 2005. Lecture Notes in Computer Science, vol 3463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408901_8

  • DOI: https://doi.org/10.1007/11408901_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25723-3

  • Online ISBN: 978-3-540-32019-7

  • eBook Packages: Computer Science, Computer Science (R0)
