Abstract
High-performance computing systems continue to grow: currently deployed systems exceed 160,000 cores, and systems exceeding 1,000,000 cores are planned. Without significant improvements in component reliability, partial system failures could become unacceptably frequent, limiting the usability of advanced computing infrastructures. In this work, we aim to ease the development of survivable systems and applications by implementing a reliable key/value data store based on a distributed hash table (DHT). Borrowing from techniques developed for unreliable wide-area systems, we implemented a distributed data service built with MPI [1] that enables user data structures to survive partial system failure. The service is based on a new implementation of the Kademlia [2] distributed hash table.
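As a rough illustration of the XOR metric that Kademlia [2] uses to decide which nodes are responsible for a key, the sketch below places replicas of a key on the nodes whose identifiers are closest to the key's identifier. It is not the service described in this paper: the function names (`make_id`, `closest_nodes`, `place_replicas`), the use of SHA-1 to derive identifiers, the replication factor of 3, and the idea of naming nodes after MPI ranks are all assumptions made for illustration.

```python
import hashlib

ID_BITS = 160  # Kademlia identifiers are 160 bits wide


def make_id(name: str) -> int:
    """Map an arbitrary string into the 160-bit ID space (SHA-1 is an assumption here)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance between two IDs: their bitwise XOR, read as an integer."""
    return a ^ b


def closest_nodes(key_id: int, node_ids: list[int], k: int) -> list[int]:
    """Return the k node IDs closest to key_id under the XOR metric."""
    return sorted(node_ids, key=lambda n: xor_distance(n, key_id))[:k]


def place_replicas(key: str, node_ids: list[int], k: int = 3) -> list[int]:
    """Choose which nodes would hold copies of `key` (hypothetical helper)."""
    return closest_nodes(make_id(key), node_ids, k)


if __name__ == "__main__":
    # Hypothetical setup: eight nodes named after MPI ranks.
    nodes = [make_id(f"rank-{r}") for r in range(8)]
    holders = place_replicas("user-table/row-17", nodes)

    # Simulate a partial failure: one replica holder disappears.
    failed = holders[0]
    survivors = [n for n in nodes if n != failed]

    # The remaining original holders are still among the closest live nodes,
    # so a lookup over the survivors can still reach a copy of the value.
    new_holders = place_replicas("user-table/row-17", survivors)
    print(set(holders[1:]) <= set(new_holders))
```

Because placement depends only on the key and the set of live node identifiers, any node can recompute where a value should live after a failure without consulting a central directory, which is the property that makes this style of replication attractive for surviving partial system failure.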
References
1. The MPI Forum: MPI-2: Extensions to the message-passing interface (1997)
2. Maymounkov, P., Mazières, D.: Kademlia: A peer-to-peer information system based on the XOR metric. In: Druschel, P., Kaashoek, M.F., Rowstron, A. (eds.) IPTPS 2002. LNCS, vol. 2429, pp. 53–65. Springer, Heidelberg (2002)
3. Lu, C.-d., Reed, D.A.: Assessing fault sensitivity in MPI applications. In: Proc. SC 2004 (2004)
4. Gropp, W., Lusk, E.: Fault tolerance in MPI programs. J. High Performance Computing Applications 18(3) (2004)
5. Latham, R., Ross, R., Thakur, R.: Can MPI be used for persistent parallel services? In: Mohr, B., Träff, J.L., Worringen, J., Dongarra, J. (eds.) PVM/MPI 2006. LNCS, vol. 4192, pp. 275–284. Springer, Heidelberg (2006)
6. Thakur, R., Gropp, W.: Open issues in MPI implementation. In: Proc. Asia-Pacific Computer Systems Architecture Conference (2007)
7. Chen, Z., Fagg, G.E., Gabriel, E., Langou, J., Angskun, T., Bosilca, G., Dongarra, J.: Fault tolerant high performance computing by a coding approach. In: Proc. Symposium on Principles and Practice of Parallel Programming (2005)
8. Schulz, M., Bronevetsky, G., Fernandes, R., Marques, D., Pingali, K., Stodghill, P.: Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs. In: Proc. SC 2004 (2004)
9. Bouteiller, A., Herault, T., Krawezik, G., Lemarinier, P., Cappello, F.: MPICH-V Project: A multiprotocol automatic fault-tolerant MPI. J. High Performance Computing Applications 20(3) (2006)
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wozniak, J.M., Jacobs, B., Latham, R., Lang, S., Son, S.W., Ross, R. (2009). Implementing Reliable Data Structures for MPI Services in High Component Count Systems. In: Ropo, M., Westerholm, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2009. Lecture Notes in Computer Science, vol 5759. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03770-2_39
DOI: https://doi.org/10.1007/978-3-642-03770-2_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03769-6
Online ISBN: 978-3-642-03770-2
eBook Packages: Computer Science, Computer Science (R0)