A Novel Fault-Tolerant Parallel Algorithm

Wang, Panfeng; Du, Yunfei; Fu, Hongyi; Zhou, Haifang; Yang, Xuejun; Yang, Wenjing

doi:10.1007/978-3-540-76837-1_6

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4847)

Included in the following conference series:

International Workshop on Advanced Parallel Processing Technologies

Abstract

The mean time between failures (MTBF) of current high-performance computer systems is much shorter than the running time of many computational applications, even though those applications are the main workload of these systems. Checkpoint/restart is currently the most commonly used scheme for tolerating hardware failures in such applications, but its performance is limited when the number of processors becomes much larger. In this paper, we propose a novel fault-tolerant parallel algorithm, FPAPR. First, we introduce the basic idea of FPAPR. Second, we describe in detail how to implement an FPAPR program, using two NPB kernels as examples. Third, we analyze the overhead of FPAPR theoretically and find that it decreases as the number of processors increases. Finally, experimental results on a 512-CPU cluster show that the overhead introduced by the algorithm is very small.

Author information

Authors and Affiliations

National Laboratory for Paralleling and Distributed Processing, College of Computer, National University of Defense Technology, Changsha, Hunan, 410073, China
Panfeng Wang, Yunfei Du, Hongyi Fu, Haifang Zhou, Xuejun Yang & Wenjing Yang

Editor information

Ming Xu, Yinwei Zhan, Jiannong Cao, Yijun Liu

About this paper

Cite this paper

Wang, P., Du, Y., Fu, H., Zhou, H., Yang, X., Yang, W. (2007). A Novel Fault-Tolerant Parallel Algorithm. In: Xu, M., Zhan, Y., Cao, J., Liu, Y. (eds) Advanced Parallel Processing Technologies. APPT 2007. Lecture Notes in Computer Science, vol 4847. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76837-1_6

DOI: https://doi.org/10.1007/978-3-540-76837-1_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76836-4
Online ISBN: 978-3-540-76837-1
eBook Packages: Computer Science, Computer Science (R0)
