Abstract
Debugging parallel programs on MIMD machines is difficult because successive executions of the same program can behave differently. Execution replay was introduced to solve this problem: it guarantees that the re-execution of a program is equivalent to the initial execution. In this paper we present an execution replay technique for distributed memory architectures. In contrast to all other proposed approaches, our technique handles non-blocking message passing primitives and can be adapted to any form of message passing communication. Since the technique is based on event numbering, we show how to bound these numbers and then analyse the influence of this bound on the amount of recorded information. A prototype implemented on an Intel iPSC/2 shows that the overhead due to recording control information is extremely low (about 1%).
Project funded by the "Fonds national suisse" under contract number 20-5495.88
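The general idea of execution replay with numbered events can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the algorithm of the paper: the `Receiver` class, its method names, and the `run` helper are hypothetical, all messages are assumed to have arrived before the first receive, and neither non-blocking primitives nor the bounding of event numbers is modelled. During the initial execution the receiver logs only control information (sender and send number) for each receive event; during replay it delivers messages in the logged order, whatever the arrival order.

```python
import random


class Receiver:
    """Hypothetical receive-any endpoint that numbers its receive events."""

    def __init__(self, replay_log=None):
        self.replay_log = replay_log  # None means record mode
        self.log = []                 # (sender, send_no) logged per receive event
        self.pending = []             # messages that arrived but are not yet delivered
        self.event_no = 0             # number of the next receive event

    def arrive(self, sender, send_no, payload):
        self.pending.append((sender, send_no, payload))

    def receive(self):
        if self.replay_log is None:
            # Record mode: deliver in arrival order, log control information only.
            msg = self.pending.pop(0)
            self.log.append(msg[:2])
        else:
            # Replay mode: deliver the message logged for this event number.
            wanted = self.replay_log[self.event_no]
            msg = next(m for m in self.pending if m[:2] == wanted)
            self.pending.remove(msg)
        self.event_no += 1
        return msg[2]


def run(arrival_order, replay_log=None):
    """Feed messages to a receiver and return (delivered payloads, recorded log)."""
    rx = Receiver(replay_log)
    for m in arrival_order:
        rx.arrive(*m)
    delivered = [rx.receive() for _ in arrival_order]
    return delivered, rx.log


# Two senders, three messages each; the arrival order differs between runs.
messages = [(s, n, f"{s}:{n}") for s in ("A", "B") for n in range(3)]
first = random.sample(messages, len(messages))
original, log = run(first)                 # initial execution, recorded
second = random.sample(messages, len(messages))
replayed, _ = run(second, replay_log=log)  # re-execution, driven by the log
assert replayed == original
```

Note that the log holds no message payloads, only the identity of the message consumed at each receive event, which is why the recording cost of such schemes can stay small.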
© 1991 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Leu, E., Schiper, A., Zramdini, A. (1991). Efficient execution replay technique for distributed memory architectures. In: Bode, A. (ed.) Distributed Memory Computing. EDMCC 1991. Lecture Notes in Computer Science, vol 487. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0032948
DOI: https://doi.org/10.1007/BFb0032948
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-53951-3
Online ISBN: 978-3-540-46478-5
eBook Packages: Springer Book Archive