Abstract
Roll-forward recovery schemes were proposed to enhance the performance of fault-tolerant systems that employ checkpointing. In these schemes, multiple processors are used for simultaneous roll-forward and validation processing. This paper proposes a sample comparison approach to be used together with checkpointing, which further improves performance by reducing the overhead that checkpointing imposes. We also develop general analytical models for estimating availability that are applicable to any checkpointing scheme. Performance comparisons reveal that checkpointing schemes with sample comparison achieve higher availability than the same schemes without it, while requiring a larger checkpoint interval.
Cite this article
Park, GL., Yong, H.Y. A New Approach for High Performance Computing Systems with Various Checkpointing Schemes. J Supercomput 33, 65–78 (2005). https://doi.org/10.1007/s11227-005-0221-3
DOI: https://doi.org/10.1007/s11227-005-0221-3