Abstract
Concurrent programs often encounter failures, such as races, owing to the presence of synchronization faults (bugs). One existing technique to tolerate synchronization faults is to roll back the program to a previous state and re-execute, in the hope that the failure does not recur. Instead of relying on chance, our approach is to control the re-execution in order to avoid a recurrence of the synchronization failure. The control is achieved by tracing information during an execution and using this information to add synchronizations during the re-execution.
The approach gives rise to a general problem, called the off-line predicate control problem, which takes a computation and a property specified on the computation, and outputs a “controlled” computation that maintains the property. We solve the predicate control problem for the mutual exclusion property, which is especially important in synchronization fault tolerance.
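As a rough illustration of off-line control for mutual exclusion, the following Python sketch takes a traced computation and proposes extra synchronizations for the re-execution. It is not the authors' algorithm: the names `CriticalSection`, `happened_before`, `control_order`, and `added_synchronizations` are hypothetical, the trace is assumed to record a vector clock at each critical-section entry and exit, and the ordering rule (a section whose entry causally precedes another section's exit must be scheduled first) is an assumption made for this sketch. It requires Python 3.9 or later for `graphlib`.

```python
from dataclasses import dataclass
from graphlib import CycleError, TopologicalSorter


@dataclass(frozen=True)
class CriticalSection:
    name: str
    enter: tuple  # vector clock traced at critical-section entry (assumed)
    leave: tuple  # vector clock traced at critical-section exit (assumed)


def happened_before(u, v):
    """Vector-clock test: u causally precedes v."""
    return u != v and all(a <= b for a, b in zip(u, v))


def control_order(sections):
    """Return a total order of critical sections, or None if none is found.

    If a.enter happened before b.leave, b cannot run entirely before a,
    so a is recorded as a predecessor of b. A cycle in these constraints
    means this sketch cannot serialize the sections.
    """
    preds = {cs.name: set() for cs in sections}
    for a in sections:
        for b in sections:
            if a is not b and happened_before(a.enter, b.leave):
                preds[b.name].add(a.name)
    try:
        return list(TopologicalSorter(preds).static_order())
    except CycleError:
        return None


def added_synchronizations(order):
    """Edges to enforce on re-execution: exit of one section before entry of the next."""
    return [(f"{x}.leave", f"{y}.enter") for x, y in zip(order, order[1:])]


# Hypothetical trace: A and C run on process 0, B on process 1.
# C is entered after process 0 receives a message sent after B's exit.
trace = [
    CriticalSection("A", (1, 0), (2, 0)),
    CriticalSection("B", (0, 1), (0, 2)),
    CriticalSection("C", (3, 3), (4, 3)),
]
order = control_order(trace)
if order is not None:
    for src, dst in added_synchronizations(order):
        print(f"add synchronization: {src} -> {dst}")
```

In this trace A and B are concurrent, so the uncontrolled execution may run them at the same time. The sketch picks one serialization and emits the edges that a re-execution would have to respect. When two sections are causally interleaved (each entry precedes the other's exit), `control_order` returns `None`.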
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tarafdar, A., Garg, V.K. (1999). Software Fault Tolerance of Concurrent Programs Using Controlled Re-execution. In: Jayanti, P. (eds) Distributed Computing. DISC 1999. Lecture Notes in Computer Science, vol 1693. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48169-9_15
DOI: https://doi.org/10.1007/3-540-48169-9_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66531-1
Online ISBN: 978-3-540-48169-0
eBook Packages: Springer Book Archive