Software Fault Tolerance of Concurrent Programs Using Controlled Re-execution

  • Conference paper
Distributed Computing (DISC 1999)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1693))

Abstract

Concurrent programs often encounter failures, such as races, owing to the presence of synchronization faults (bugs). One existing technique to tolerate synchronization faults is to roll back the program to a previous state and re-execute, in the hope that the failure does not recur. Instead of relying on chance, our approach is to control the re-execution in order to avoid a recurrence of the synchronization failure. The control is achieved by tracing information during an execution and using this information to add synchronizations during the re-execution.

The approach gives rise to a general problem, called the off-line predicate control problem, which takes a computation and a property specified on the computation, and outputs a “controlled” computation that maintains the property. We solve the predicate control problem for the mutual exclusion property, which is especially important in synchronization fault tolerance.
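
As a rough illustration of the off-line predicate control problem for mutual exclusion, the sketch below takes a traced computation, reduced here to a set of critical sections and the orderings already forced between them, and outputs extra synchronizations that serialize the critical sections. It is a simplification under stated assumptions, not the algorithm of the paper: the function name `control_mutex`, the input encoding, and the use of a plain topological sort are all hypothetical choices made for this example.

```python
from graphlib import CycleError, TopologicalSorter


def control_mutex(sections, precedes):
    """Hypothetical sketch of off-line control for mutual exclusion.

    sections: ids of the critical sections seen in the traced execution.
    precedes: set of (a, b) pairs meaning the trace already forces
              critical section a to come before critical section b
              (assumed input, e.g. program order or message causality).

    Returns a list of added synchronizations (a, b), each read as
    "b may not be entered until a has been exited", or None when the
    forced orderings are cyclic and this sketch cannot serialize them.
    """
    # Map every section to the set of sections that must precede it.
    sorter = TopologicalSorter({s: set() for s in sections})
    for a, b in precedes:
        sorter.add(b, a)

    try:
        # Any total order consistent with the forced orderings will do.
        order = list(sorter.static_order())
    except CycleError:
        return None

    # Chain consecutive sections so that no two of them can overlap
    # during the controlled re-execution.
    return list(zip(order, order[1:]))
```

For example, `control_mutex(["a1", "a2", "b1"], {("a1", "a2")})` returns two synchronizations that place the three critical sections in a single chain with `a1` ahead of `a2`. A re-execution that honours those synchronizations cannot run two of the sections at the same time.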

Copyright information

© 1999 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tarafdar, A., Garg, V.K. (1999). Software Fault Tolerance of Concurrent Programs Using Controlled Re-execution. In: Jayanti, P. (eds) Distributed Computing. DISC 1999. Lecture Notes in Computer Science, vol 1693. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48169-9_15

  • DOI: https://doi.org/10.1007/3-540-48169-9_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-66531-1

  • Online ISBN: 978-3-540-48169-0

  • eBook Packages: Springer Book Archive
