Implementing Reliable Data Structures for MPI Services in High Component Count Systems

Wozniak, Justin M.; Jacobs, Bryan; Latham, Robert; Lang, Sam; Son, Seung Woo; Ross, Robert

doi:10.1007/978-3-642-03770-2_39

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 5759)

Included in the following conference series:

European Parallel Virtual Machine / Message Passing Interface Users’ Group Meeting

Abstract

High performance computing systems continue to grow: currently deployed systems exceed 160,000 cores and systems exceeding 1,000,000 cores are planned. Without significant improvements in component reliability, partial system failure modes could become an unacceptably regular occurrence, limiting the usability of advanced computing infrastructures. In this work, we intend to ease the development of survivable systems and applications through the implementation of a reliable key/value data store based on a distributed hash table (DHT). Borrowing from techniques developed for unreliable wide-area systems, we implemented a distributed data service built with MPI [1] that enables user data structures to survive partial system failure. The service is based on a new implementation of the Kademlia [2] distributed hash table.
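The idea in the abstract, a key/value store that keeps user data available after some nodes fail, can be illustrated with a minimal single-process sketch. This is not the authors' implementation and uses no MPI: the class `ReplicatedStore`, the helpers `node_id` and `closest_nodes`, the `rank-N` node names, and the replica count are all hypothetical names chosen for illustration. It assumes only the XOR distance metric from the Kademlia design [2] and stores each value on the few nodes whose identifiers are closest to the key.

```python
import hashlib


def node_id(name):
    """Hash an arbitrary name (a key or a node label) to a 160-bit integer ID."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def xor_distance(a, b):
    """Kademlia-style distance: the bitwise XOR of two IDs, read as an integer."""
    return a ^ b


def closest_nodes(key, nodes, k):
    """Return the k node IDs closest to the key's ID under the XOR metric."""
    target = node_id(key)
    return sorted(nodes, key=lambda n: xor_distance(n, target))[:k]


class ReplicatedStore:
    """Toy in-memory stand-in for a DHT: one dict per simulated node (hypothetical)."""

    def __init__(self, node_names, replicas=3):
        self.replicas = replicas
        # Map each simulated node's ID to its local key/value table.
        self.tables = {node_id(n): {} for n in node_names}

    def put(self, key, value):
        # Write the value to every one of the `replicas` closest live nodes.
        for n in closest_nodes(key, list(self.tables), self.replicas):
            self.tables[n][key] = value

    def get(self, key):
        # Ask the closest live nodes in order; any surviving replica can answer.
        for n in closest_nodes(key, list(self.tables), self.replicas):
            if key in self.tables[n]:
                return self.tables[n][key]
        raise KeyError(key)

    def fail(self, node):
        # Simulate a partial system failure by dropping a node and its data.
        self.tables.pop(node, None)


if __name__ == "__main__":
    store = ReplicatedStore(["rank-%d" % i for i in range(16)], replicas=3)
    store.put("checkpoint/step", 42)
    holders = closest_nodes("checkpoint/step", list(store.tables), 3)
    store.fail(holders[0])
    store.fail(holders[1])
    print(store.get("checkpoint/step"))
```

In this sketch a value stays readable as long as at least one of its replica holders is alive, because the surviving holders remain among the closest live nodes to the key. A real service would also need routing, failure detection, and re-replication, none of which is shown here.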

References

[1] The MPI Forum: MPI-2: Extensions to the message-passing interface (1997)
[2] Maymounkov, P., Mazières, D.: Kademlia: A peer-to-peer information system based on the XOR metric. In: Druschel, P., Kaashoek, M.F., Rowstron, A. (eds.) IPTPS 2002. LNCS, vol. 2429, pp. 53–65. Springer, Heidelberg (2002)
[3] Lu, C.-d., Reed, D.A.: Assessing fault sensitivity in MPI applications. In: Proc. SC 2004 (2004)
[4] Gropp, W., Lusk, E.: Fault tolerance in MPI programs. J. High Performance Computing Applications 18(3) (2004)
[5] Latham, R., Ross, R., Thakur, R.: Can MPI be used for persistent parallel services? In: Mohr, B., Träff, J.L., Worringen, J., Dongarra, J. (eds.) PVM/MPI 2006. LNCS, vol. 4192, pp. 275–284. Springer, Heidelberg (2006)
[6] Thakur, R., Gropp, W.: Open issues in MPI implementation. In: Proc. Asia-Pacific Computer Systems Architecture Conference (2007)
[7] Chen, Z., Fagg, G.E., Gabriel, E., Langou, J., Angskun, T., Bosilca, G., Dongarra, J.: Fault tolerant high performance computing by a coding approach. In: Proc. Symposium on Principles and Practice of Parallel Programming (2005)
[8] Schulz, M., Bronevetsky, G., Fernandes, R., Marques, D., Pingali, K., Stodghill, P.: Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs. In: Proc. SC 2004 (2004)
[9] Bouteiller, A., Herault, T., Krawezik, G., Lemarinier, P., Cappello, F.: MPICH-V Project: A multiprotocol automatic fault-tolerant MPI. J. High Performance Computing Applications 20(3) (2006)

Author information

Authors and Affiliations

Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, 60439, USA
Justin M. Wozniak, Bryan Jacobs, Robert Latham, Sam Lang, Seung Woo Son & Robert Ross

Editor information

Editors and Affiliations

Department of Information Technology, Åbo Akademi, 20500, Turku, Finland
Matti Ropo & Jan Westerholm
Department of Electrical Engineering and Computer Science, University of Tennessee, 37996-3450, Knoxville, TN, USA
Jack Dongarra

About this paper

Cite this paper

Wozniak, J.M., Jacobs, B., Latham, R., Lang, S., Son, S.W., Ross, R. (2009). Implementing Reliable Data Structures for MPI Services in High Component Count Systems. In: Ropo, M., Westerholm, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2009. Lecture Notes in Computer Science, vol 5759. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03770-2_39

DOI: https://doi.org/10.1007/978-3-642-03770-2_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03769-6
Online ISBN: 978-3-642-03770-2
eBook Packages: Computer Science, Computer Science (R0)
