Abstract
A distributed system can support fault-tolerant applications by replicating data and computation at nodes that have independent failure modes. We present a scheme called parallel execution threads (PET) that can be used to implement fault-tolerant computations in an object-based distributed system. In a system that replicates objects, the PET scheme replicates a computation by creating a number of parallel threads that execute with different replicas of the invoked objects. A computation completes successfully if at least one thread does not encounter any failed nodes and its completion preserves the consistency of the objects. The PET scheme can therefore tolerate failures that occur during the execution of the computation, provided that at least one thread is unaffected by them. We present the algorithms required to implement the PET scheme and also address some performance issues.
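The idea in the abstract can be illustrated with a small sketch. This is not the algorithm from the paper, only a toy model under stated assumptions: `Replica`, `NodeFailure`, and `run_pet` are hypothetical names, each thread works with a single replica of a single object, a failed node is modeled by a flag, and "preserving consistency" is reduced to letting exactly one surviving thread commit its result, which is then copied to the other reachable replicas.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class NodeFailure(Exception):
    """Hypothetical error: a thread touched a replica on a failed node."""


class Replica:
    """Hypothetical model of one copy of an object, hosted on one node."""

    def __init__(self, node, value=0, failed=False):
        self.node = node
        self.value = value
        self.failed = failed

    def invoke(self, delta):
        # A thread that reaches a failed node cannot complete.
        if self.failed:
            raise NodeFailure(self.node)
        # The result is tentative; nothing is installed until commit.
        return self.value + delta


def run_pet(replicas, delta):
    """Run one thread per replica and commit the result of exactly one survivor."""
    commit_lock = threading.Lock()
    committed = {}

    def thread_body(replica):
        result = replica.invoke(delta)
        with commit_lock:
            # Only the first thread to get here commits (assumed rule).
            if not committed:
                committed["node"] = replica.node
                committed["value"] = result

    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(thread_body, r) for r in replicas]
        for future in futures:
            try:
                future.result()
            except NodeFailure:
                pass  # this thread is lost; others may still succeed

    if not committed:
        raise RuntimeError("every thread encountered a failed node")

    # Assumed propagation step: install the committed value at live replicas.
    for r in replicas:
        if not r.failed:
            r.value = committed["value"]
    return committed["value"]


if __name__ == "__main__":
    copies = [Replica("n1", 10, failed=True), Replica("n2", 10), Replica("n3", 10)]
    print(run_pet(copies, 5))  # prints 15: the threads on n2 and n3 survive
```

In this sketch the computation fails only when every thread hits a failed node, which mirrors the condition stated in the abstract.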
Additional information
Mustaque Ahamad received his B.E. (Hons.) degree in Electrical Engineering from the Birla Institute of Technology and Science, Pilani, India. He obtained his M.S. and Ph.D. degrees in Computer Science from the State University of New York at Stony Brook in 1983 and 1985, respectively. Since September 1985, he has been an Assistant Professor in the School of Information and Computer Science at the Georgia Institute of Technology, Atlanta. His research interests include distributed operating systems, distributed algorithms, fault-tolerant systems and performance evaluation.
Partha Dasgupta has been an Assistant Professor at Georgia Tech since 1984. He has a Ph.D. in Computer Science from the State University of New York at Stony Brook. He is the technical project director of the Clouds distributed operating systems project, as well as a co-principal investigator of Georgia Tech's NSF-CER award. His research interests include building distributed operating systems, distributed algorithms, fault-tolerant systems and distributed programming support.
Richard J. LeBlanc, Jr. received the B.S. degree in physics from Louisiana State University in 1972 and the M.S. and Ph.D. degrees in computer sciences from the University of Wisconsin-Madison in 1974 and 1977, respectively. He is currently a Professor in the School of Information and Computer Science of the Georgia Institute of Technology. His research interests include programming language design and implementation, programming environments, and software engineering. Dr. LeBlanc's current research work involves application of these interests in distributed processing systems. As co-director of the Clouds Project, he is studying language concepts and software engineering methodology for utilizing a highly reliable, object-based distributed system. He is also interested in specification-based software development methodologies and tools. Dr. LeBlanc is a member of the Association for Computing Machinery, the IEEE Computer Society and Sigma Xi.
This work was supported in part by NSF grants CCR-8619886 and CCR-8806358, and by RADC contract number F30602-86-C-0032.
Cite this article
Ahamad, M., Dasgupta, P. & LeBlanc, R.J. Fault-tolerant atomic computations in an object-based distributed system. Distrib Comput 4, 69–80 (1990). https://doi.org/10.1007/BF01786632
DOI: https://doi.org/10.1007/BF01786632