Abstract
Cyclic debugging describes error detection techniques in which a program is executed repeatedly to identify the original cause of incorrect runtime behavior. This repeated execution is especially problematic for large-scale, long-running parallel programs because of the time and processing resources required and the associated computing costs. A solution is offered by a combination of techniques that use the event graph model as the main representation of parallel program behavior. On the one hand, the number of deployed processes can be reduced with process isolation, where only a subset of the original processes is executed during debugging. On the other hand, an integrated checkpointing mechanism makes it possible to extract limited periods of execution time or to start subsequent program executions at intermediate points. Additionally, the event graph supports equivalent re-execution of nondeterministic programs and makes it possible to investigate the program perturbation induced by the observation functionality.
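To make the idea of process isolation on an event graph more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a recorded trace in which every message is an edge from a send event to a receive event; the names `Event`, `messages`, and `isolate` are hypothetical and chosen only for illustration. For a chosen subset of processes, the sketch separates messages that stay inside the subset from messages that would have to be supplied from the trace because their sender is not re-executed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    process: int  # rank of the process on which the event occurred
    index: int    # position of the event on that process
    kind: str     # "send" or "recv"


# Hypothetical trace: each message is an edge (send event, receive event).
messages = [
    (Event(0, 0, "send"), Event(1, 0, "recv")),
    (Event(1, 1, "send"), Event(2, 0, "recv")),
    (Event(2, 1, "send"), Event(0, 1, "recv")),
    (Event(0, 2, "send"), Event(1, 2, "recv")),
]


def isolate(messages, keep):
    """Classify message edges for a run restricted to the processes in `keep`.

    internal: sender and receiver are both re-executed
    replayed: receiver is re-executed, sender is not, so the message
              content would have to come from the recorded trace
    outgoing: sender is re-executed, receiver is not
    Messages between two excluded processes are ignored.
    """
    internal, replayed, outgoing = [], [], []
    for send, recv in messages:
        if send.process in keep and recv.process in keep:
            internal.append((send, recv))
        elif recv.process in keep:
            replayed.append((send, recv))
        elif send.process in keep:
            outgoing.append((send, recv))
    return internal, replayed, outgoing


internal, replayed, outgoing = isolate(messages, keep={0, 1})
print(len(internal), len(replayed), len(outgoing))
```

In this toy trace, debugging only processes 0 and 1 leaves a single incoming message from process 2 that would need to be replayed from the trace. How such messages are actually recorded and fed back is described in the paper itself, not in this sketch.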
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kranzlmüller, D., Thoai, N., Volkert, J. (2002). Debugging Large-Scale, Long-Running Parallel Programs. In: Sloot, P.M.A., Hoekstra, A.G., Tan, C.J.K., Dongarra, J.J. (eds) Computational Science — ICCS 2002. ICCS 2002. Lecture Notes in Computer Science, vol 2330. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46080-2_96
DOI: https://doi.org/10.1007/3-540-46080-2_96
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43593-8
Online ISBN: 978-3-540-46080-0
eBook Packages: Springer Book Archive