ROS: The Rollback-One-Step Method to Minimize the Waiting Time during Debugging Long-Running Parallel Programs

Thoai, Nam; Kranzlmüller, Dieter; Volkert, Jens

doi:10.1007/3-540-36569-9_45

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2565)

Included in the following conference series:

International Conference on High Performance Computing for Computational Science

Abstract

Cyclic debugging is used to execute programs over and over again for tracking down and eliminating bugs. During re-execution, programmers may want to stop at breakpoints or apply step-by-step execution for inspecting the program’s state and detecting errors. For long-running parallel programs, the biggest drawback is the cost associated with restarting the program’s execution every time from the beginning. A solution is offered by combining checkpointing and debugging, which allows a program run to be initiated at any intermediate checkpoint. A problem is the selection of an appropriate recovery line for a given breakpoint. The temporal distance between these two points may be rather long if recovery lines are only chosen at consistent global checkpoints. The method described in this paper allows users to select an arbitrary checkpoint as a starting point for debugging and thus to shorten the temporal distance. In addition, a mechanism for reducing the amount of trace data (in terms of logged messages) is provided. The resulting technique is able to reduce the waiting time and the costs of cyclic debugging.
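
The notion of a consistent global checkpoint that the abstract relies on can be illustrated with a minimal sketch. This is not the ROS method itself; it only shows the standard consistency test (no orphan messages) and identifies the in-transit messages that a replay from a given recovery line would need from a log. The `Message` type, the integer local timestamps, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    """Hypothetical message record; times are local event counters."""
    sender: int      # id of the sending process
    send_time: int   # local time of the send event on the sender
    receiver: int    # id of the receiving process
    recv_time: int   # local time of the receive event on the receiver


# A "cut" maps each process id to the local time of its chosen checkpoint.
Cut = Dict[int, int]


def orphan_messages(cut: Cut, messages: List[Message]) -> List[Message]:
    """Messages received before the receiver's checkpoint but sent after
    the sender's checkpoint. Any such message makes the cut inconsistent."""
    return [m for m in messages
            if m.send_time > cut[m.sender] and m.recv_time <= cut[m.receiver]]


def in_transit_messages(cut: Cut, messages: List[Message]) -> List[Message]:
    """Messages sent before the sender's checkpoint but received after the
    receiver's checkpoint. A replay from this cut needs them from a log."""
    return [m for m in messages
            if m.send_time <= cut[m.sender] and m.recv_time > cut[m.receiver]]


def is_consistent(cut: Cut, messages: List[Message]) -> bool:
    """A cut is a consistent global checkpoint if it has no orphan messages."""
    return not orphan_messages(cut, messages)


if __name__ == "__main__":
    # Process 0 sends at local time 5; process 1 receives at local time 7.
    m = Message(sender=0, send_time=5, receiver=1, recv_time=7)
    print(is_consistent({0: 4, 1: 8}, [m]))        # receive is inside the cut, send is not
    print(is_consistent({0: 6, 1: 6}, [m]))        # send is inside the cut, receive is not
    print(in_transit_messages({0: 6, 1: 6}, [m]))  # message that would have to be logged
```

In this sketch, restricting recovery lines to cuts that pass `is_consistent` is what can push the starting point far back from a breakpoint, which is the temporal distance the abstract refers to.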

Author information

Authors and Affiliations

GUP Linz, Johannes Kepler University Linz, Altenbergerstraße 69, A-4040, Linz, Austria/Europe
Nam Thoai, Dieter Kranzlmüller & Jens Volkert

Editor information

Editors and Affiliations

Faculdade de Engenharia da Universidade do Porto, Rua Dr. Roberto Frias, 4200-465, Porto, Portugal
José M. L. M. Palma & A. Augusto Sousa
Department of Computer Science, University of Tennessee, 37996-1301, Knoxville, TN, USA
Jack Dongarra
Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Camino de Vera, s/n, Apartado 22012, 46020, Valencia, Spain
Vicente Hernández

About this paper

Cite this paper

Thoai, N., Kranzlmüller, D., Volkert, J. (2003). ROS: The Rollback-One-Step Method to Minimize the Waiting Time during Debugging Long-Running Parallel Programs. In: Palma, J.M.L.M., Sousa, A.A., Dongarra, J., Hernández, V. (eds) High Performance Computing for Computational Science — VECPAR 2002. VECPAR 2002. Lecture Notes in Computer Science, vol 2565. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36569-9_45

DOI: https://doi.org/10.1007/3-540-36569-9_45
Published: 15 April 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00852-1
Online ISBN: 978-3-540-36569-3
eBook Packages: Springer Book Archive