Abstract
Current supercomputers are approaching the petaflop level. These machines suffer a high number of interruptions within relatively short time intervals, so fault tolerance and preventive maintenance are key to extending the MTTI (Mean Time To Interrupt). In this paper we show how RADIC, an architecture for fault tolerance, provides different protection levels that avoid system interruptions and allow preventive maintenance tasks to be performed. Our experiments show that the solution is effective at maintaining high availability with a large MTTI.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Santos, G., Duarte, A., Rexachs, D., Luque, E. (2008). Providing Non-stop Service for Message-Passing Based Parallel Applications with RADIC. In: Luque, E., Margalef, T., Benítez, D. (eds) Euro-Par 2008 – Parallel Processing. Euro-Par 2008. Lecture Notes in Computer Science, vol 5168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85451-7_7
DOI: https://doi.org/10.1007/978-3-540-85451-7_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85450-0
Online ISBN: 978-3-540-85451-7
eBook Packages: Computer Science (R0)