ROS: The Rollback-One-Step Method to Minimize the Waiting Time during Debugging Long-Running Parallel Programs

  • Conference paper
High Performance Computing for Computational Science — VECPAR 2002 (VECPAR 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2565)

Abstract

Cyclic debugging is used to execute programs over and over again for tracking down and eliminating bugs. During re-execution, programmers may want to stop at breakpoints or apply step-by-step execution for inspecting the program’s state and detecting errors. For long-running parallel programs, the biggest drawback is the cost associated with restarting the program’s execution every time from the beginning. A solution is offered by combining checkpointing and debugging, which allows a program run to be initiated at any intermediate checkpoint. A problem is the selection of an appropriate recovery line for a given breakpoint. The temporal distance between these two points may be rather long if recovery lines are only chosen at consistent global checkpoints. The method described in this paper allows users to select an arbitrary checkpoint as a starting point for debugging and thus to shorten the temporal distance. In addition, a mechanism for reducing the amount of trace data (in terms of logged messages) is provided. The resulting technique is able to reduce the waiting time and the costs of cyclic debugging.
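The trade-off between consistent global checkpoints and arbitrary checkpoints can be illustrated with a small sketch. It is not the ROS method itself, which this page does not describe in detail; it only shows why a recovery line restricted to consistent global checkpoints may lie far before a breakpoint. Everything in it is an assumption made for illustration: events carry global timestamps (a real system would more likely use logical clocks), the names `Message`, `is_consistent`, `latest_consistent_cut`, and `replay_distance` are hypothetical, and the search is brute force.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Message:
    sender: int        # index of the sending process
    send_time: float   # assumed global timestamp of the send event
    receiver: int      # index of the receiving process
    recv_time: float   # assumed global timestamp of the receive event


def is_consistent(cut, messages):
    """cut[p] is the checkpoint time chosen for process p.

    The cut is inconsistent if some message is an orphan: received
    before the receiver's checkpoint but sent after the sender's.
    """
    return not any(
        m.recv_time <= cut[m.receiver] and m.send_time > cut[m.sender]
        for m in messages
    )


def latest_consistent_cut(checkpoints, messages, bp_times):
    """Brute-force search for the latest consistent cut before a breakpoint.

    checkpoints[p] lists the checkpoint times of process p and
    bp_times[p] is the time at which process p reaches the breakpoint.
    """
    candidates = [
        [t for t in cps if t <= bp_times[p]]
        for p, cps in enumerate(checkpoints)
    ]
    best = None
    for cut in product(*candidates):
        if is_consistent(cut, messages) and (best is None or sum(cut) > sum(best)):
            best = cut
    return best


def replay_distance(cut, bp_times):
    """Time each process must re-execute to get from the cut to the breakpoint."""
    return [b - c for b, c in zip(bp_times, cut)]


# Two processes, three local checkpoints each, and two messages.
checkpoints = [[0, 4, 8], [0, 5, 9]]
messages = [
    Message(sender=0, send_time=8.5, receiver=1, recv_time=8.8),
    Message(sender=1, send_time=6.0, receiver=0, recv_time=7.0),
]
bp_times = [10, 10]

consistent = latest_consistent_cut(checkpoints, messages, bp_times)
latest_local = tuple(cps[-1] for cps in checkpoints)

print(consistent, replay_distance(consistent, bp_times))
print(latest_local, replay_distance(latest_local, bp_times))
```

In this toy run the latest local checkpoints do not form a consistent cut, so a debugger limited to consistent global checkpoints has to restart both processes from earlier states and replay more of the execution. Starting from arbitrary checkpoints instead, as the abstract proposes, shortens that distance, but some messages then have to be available from a log during replay.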

© 2003 Springer-Verlag Berlin Heidelberg

Cite this paper

Thoai, N., Kranzlmüller, D., Volkert, J. (2003). ROS: The Rollback-One-Step Method to Minimize the Waiting Time during Debugging Long-Running Parallel Programs. In: Palma, J.M.L.M., Sousa, A.A., Dongarra, J., Hernández, V. (eds) High Performance Computing for Computational Science — VECPAR 2002. VECPAR 2002. Lecture Notes in Computer Science, vol 2565. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36569-9_45

  • DOI: https://doi.org/10.1007/3-540-36569-9_45

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00852-1

  • Online ISBN: 978-3-540-36569-3

  • eBook Packages: Springer Book Archive
