Software Fault Tolerance of Concurrent Programs Using Controlled Re-execution

Tarafdar, Ashis; Garg, Vijay K.

doi:10.1007/3-540-48169-9_15

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1693)

Included in the following conference series:

International Symposium on Distributed Computing

Abstract

Concurrent programs often encounter failures, such as races, owing to the presence of synchronization faults (bugs). One existing technique to tolerate synchronization faults is to roll back the program to a previous state and re-execute, in the hope that the failure does not recur. Instead of relying on chance, our approach is to control the re-execution in order to avoid a recurrence of the synchronization failure. The control is achieved by tracing information during an execution and using this information to add synchronizations during the re-execution.

The approach gives rise to a general problem, called the off-line predicate control problem, which takes a computation and a property specified on the computation, and outputs a “controlled” computation that maintains the property. We solve the predicate control problem for the mutual exclusion property, which is especially important in synchronization fault tolerance.
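
To make the input and output of off-line predicate control for mutual exclusion concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the algorithm of the paper: the trace format (events as `(process, index)` pairs, messages as `(send, receive)` pairs, critical sections as `(first_event, last_event)` pairs) and the names `happened_before` and `control_edges` are hypothetical. The sketch orders the traced critical sections so that the order respects the causality already present in the trace, then chains them with added synchronization edges for the re-execution to enforce.

```python
from graphlib import TopologicalSorter, CycleError  # Python 3.9+


def happened_before(n_events, messages):
    """Build reach(a, b): True if event a equals or causally precedes event b.

    n_events[p] is the number of traced events on process p (assumed format).
    """
    succ = {}
    for p, n in enumerate(n_events):
        for i in range(n):
            # Process order: each event precedes the next one on its process.
            succ[(p, i)] = [(p, i + 1)] if i + 1 < n else []
    for send, recv in messages:
        succ[send].append(recv)  # a send precedes its receive

    def reach(a, b):
        stack, seen = [a], {a}
        while stack:
            e = stack.pop()
            if e == b:
                return True
            for nxt in succ[e]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return reach


def control_edges(n_events, messages, sections):
    """Return synchronization edges that serialize the critical sections.

    Each edge (x, y) means "event y must wait for event x" on re-execution.
    Returns None if no serialization is consistent with the trace.
    """
    reach = happened_before(n_events, messages)
    ts = TopologicalSorter()
    for b, (_, b_last) in enumerate(sections):
        # Section a must come before b if a's first event already
        # precedes b's last event: b can no longer finish before a starts.
        preds = [a for a, (a_first, _) in enumerate(sections)
                 if a != b and reach(a_first, b_last)]
        ts.add(b, *preds)
    try:
        order = list(ts.static_order())
    except CycleError:
        return None
    # Chain consecutive sections: last event of one -> first event of the next.
    return [(sections[a][1], sections[b][0]) for a, b in zip(order, order[1:])]
```

As a usage example, take two processes with three events each, one message from event `(0, 0)` to event `(1, 2)`, and critical sections `((0, 0), (0, 1))` and `((1, 1), (1, 2))`. In the trace the two sections can overlap; `control_edges([3, 3], [((0, 0), (1, 2))], sections)` returns `[((0, 1), (1, 1))]`, so the second process waits for the first to leave its critical section. If a second message from `(1, 1)` to `(0, 1)` is added, each section must precede the other and the function returns `None`.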

Author information

Authors and Affiliations

Dept. of Computer Sciences, The Univ. of Texas at Austin, Austin, TX, 78712, USA
Ashis Tarafdar
Dept. of Electr. and Comp. Engg., The Univ. of Texas at Austin, Austin, TX, 78712, USA
Vijay K. Garg

Editor information

Editors and Affiliations

Department for Computer Science 6211 Sudikoff Laboratory for Computer Science, Dartmouth College, Hanover, NH, 03755, USA
Prasad Jayanti

About this paper

Cite this paper

Tarafdar, A., Garg, V.K. (1999). Software Fault Tolerance of Concurrent Programs Using Controlled Re-execution. In: Jayanti, P. (eds) Distributed Computing. DISC 1999. Lecture Notes in Computer Science, vol 1693. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48169-9_15

DOI: https://doi.org/10.1007/3-540-48169-9_15
Published: 03 June 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66531-1
Online ISBN: 978-3-540-48169-0
eBook Packages: Springer Book Archive
