Efficient execution replay technique for distributed memory architectures

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 487))

Abstract

Debugging parallel programs on MIMD machines is difficult because successive executions of the same program can behave differently. To solve this problem, a method called execution replay has been introduced, which guarantees that the re-execution of a program is equivalent to the initial execution. In this paper we present an execution replay technique for distributed memory architectures. In contrast to all other proposed approaches, our technique can handle non-blocking message-passing primitives and can be adapted to any form of message-passing communication. Since the technique is based on event numbering, we show how to bound these numbers and then analyse the influence of this bound on the amount of recorded information. The prototype implemented on an Intel iPSC/2 shows that the overhead due to recording control information is extremely low (about 1%).
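
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general record/replay idea it alludes to, not the authors' technique. It assumes a toy in-memory model in which each process numbers its own send and receive events, and the receiver logs only control information (sender and send event number) rather than message contents. All names (`Process`, `receive_any`, `receive_replay`, `run`) are hypothetical.

```python
class Process:
    """Simulated process with a local event counter (hypothetical model)."""

    def __init__(self, pid):
        self.pid = pid
        self.events = 0     # local event number, bumped at each send/receive
        self.mailbox = []   # pending messages: (sender_pid, send_event_no, payload)
        self.log = []       # recorded control info: (sender_pid, send_event_no)
        self.received = []  # payloads in delivery order

    def send(self, dest, payload):
        self.events += 1
        dest.mailbox.append((self.pid, self.events, payload))

    def receive_any(self, pick):
        """Record mode: `pick(n)` models a nondeterministic choice among n messages."""
        self.events += 1
        sender, send_no, payload = self.mailbox.pop(pick(len(self.mailbox)))
        self.log.append((sender, send_no))  # control info only, no payload
        self.received.append(payload)
        return payload

    def receive_replay(self, entry):
        """Replay mode: deliver exactly the message named by the log entry."""
        self.events += 1
        for i, (sender, send_no, payload) in enumerate(self.mailbox):
            if (sender, send_no) == entry:
                del self.mailbox[i]
                self.received.append(payload)
                return payload
        raise LookupError("logged message has not arrived")


def run(arrival_order, pick=None, log=None):
    """Two senders (pids 0 and 1) send to one receiver; record if log is None."""
    senders = {0: Process(0), 1: Process(1)}
    receiver = Process(2)
    for s in arrival_order:
        p = senders[s]
        p.send(receiver, f"from{s}#{p.events + 1}")
    for k in range(len(arrival_order)):
        if log is None:
            receiver.receive_any(pick)
        else:
            receiver.receive_replay(log[k])
    return receiver.received, receiver.log


# Record one run, then replay it with a different message arrival order.
recorded, log = run([0, 1, 0], pick=lambda n: n - 1)
replayed, _ = run([1, 0, 0], log=log)
assert recorded == replayed
```

In this toy model the log grows by one small tuple per receive, which is why recording control information can be far cheaper than recording message contents; how the event numbers are bounded in the actual technique is described in the paper itself and is not modelled here.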

Project funded by the "Fonds national suisse" under contract number 20-5495.88

Editor information

Arndt Bode

Copyright information

© 1991 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Leu, E., Schiper, A., Zramdini, A. (1991). Efficient execution replay technique for distributed memory architectures. In: Bode, A. (eds) Distributed Memory Computing. EDMCC 1991. Lecture Notes in Computer Science, vol 487. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0032948

  • DOI: https://doi.org/10.1007/BFb0032948

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-53951-3

  • Online ISBN: 978-3-540-46478-5

  • eBook Packages: Springer Book Archive
