Abstract
There is growing interest in deploying MPI over very large numbers of heterogeneous, geographically distributed resources. FT-MPI provides the fault tolerance necessary at this scale, but presents some issues when crossing multiple administrative domains. Using the H2O metacomputing framework, we add cross-administrative-domain interoperability and “pluggability” to FT-MPI. The latter feature allows us, using proxies, to transparently replace one vulnerable module, the FT-MPI name service, with fault-tolerant replacements. We present an algorithm for improving the performance of operations over the proxies, and we evaluate it in a comparison using the original name service, OpenLDAP, and HDNS, a current Emory research project.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dewolfs, D., Broeckhove, J., Sunderam, V., Fagg, G.E. (2006). FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services: A Case Study. In: Mohr, B., Träff, J.L., Worringen, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2006. Lecture Notes in Computer Science, vol 4192. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846802_24
DOI: https://doi.org/10.1007/11846802_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39110-4
Online ISBN: 978-3-540-39112-8
eBook Packages: Computer Science (R0)