Abstract
Cyclic debugging executes a program repeatedly in order to track down and eliminate bugs. During re-execution, programmers may stop at breakpoints or step through the code to inspect the program’s state and detect errors. For long-running parallel programs, the biggest drawback is the cost of restarting execution from the beginning every time. Combining checkpointing with debugging offers a solution, because a program run can then be started at any intermediate checkpoint. One difficulty is selecting an appropriate recovery line for a given breakpoint: the temporal distance between the two may be rather long if recovery lines are chosen only at consistent global checkpoints. The method described in this paper lets users select an arbitrary checkpoint as the starting point for debugging and thus shortens this temporal distance. In addition, it provides a mechanism for reducing the amount of trace data, measured in logged messages. The resulting technique reduces both the waiting time and the cost of cyclic debugging.
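To make the notion of a consistent global checkpoint concrete, the following minimal sketch checks a candidate recovery line for orphan messages (messages received before the line but sent after it) and lists the in-transit messages that would have to be logged for replay. It illustrates the standard consistency condition only and is not the ROS method of the paper; the `Message` type, the function names, and the event numbering are assumptions made for this example.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    """Hypothetical record of one message; events are numbered per process."""
    sender: str
    send_event: int   # position of the send in the sender's local event order
    receiver: str
    recv_event: int   # position of the receive in the receiver's local order


# A cut maps each process to the number of local events that happened
# before its checkpoint was taken (an assumption of this sketch).
Cut = Dict[str, int]


def orphan_messages(cut: Cut, messages: List[Message]) -> List[Message]:
    """Messages received before the cut although they were sent after it."""
    return [m for m in messages
            if m.recv_event <= cut[m.receiver] and m.send_event > cut[m.sender]]


def in_transit_messages(cut: Cut, messages: List[Message]) -> List[Message]:
    """Messages sent before the cut but received after it; replay needs them logged."""
    return [m for m in messages
            if m.send_event <= cut[m.sender] and m.recv_event > cut[m.receiver]]


def is_consistent(cut: Cut, messages: List[Message]) -> bool:
    """A global checkpoint is consistent if it has no orphan messages."""
    return not orphan_messages(cut, messages)


# Two processes exchanging two messages.
msgs = [Message("P0", 3, "P1", 2), Message("P1", 5, "P0", 6)]

cut_a = {"P0": 2, "P1": 4}   # P1 has already received a message P0 has not sent yet
cut_b = {"P0": 4, "P1": 1}   # the first message is merely in transit

print(is_consistent(cut_a, msgs), is_consistent(cut_b, msgs))
print(in_transit_messages(cut_b, msgs))
```

In this example `cut_a` is rejected because of an orphan message, while `cut_b` is accepted as long as the in-transit message is available from a log. Restricting recovery lines to cuts that pass such a check is what can push the starting point far back from the breakpoint.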
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Thoai, N., Kranzlmüller, D., Volkert, J. (2003). ROS: The Rollback-One-Step Method to Minimize the Waiting Time during Debugging Long-Running Parallel Programs. In: Palma, J.M.L.M., Sousa, A.A., Dongarra, J., Hernández, V. (eds) High Performance Computing for Computational Science — VECPAR 2002. VECPAR 2002. Lecture Notes in Computer Science, vol 2565. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36569-9_45
DOI: https://doi.org/10.1007/3-540-36569-9_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00852-1
Online ISBN: 978-3-540-36569-3
eBook Packages: Springer Book Archive