Skip to main content

Implementing Reliable Data Structures for MPI Services in High Component Count Systems

  • Conference paper
Recent Advances in Parallel Virtual Machine and Message Passing Interface (EuroPVM/MPI 2009)

Part of the book series: Lecture Notes in Computer Science ((LNPSE,volume 5759))

Abstract

High performance computing systems continue to grow: currently deployed systems exceed 160,000 cores and systems exceeding 1,000,000 cores are planned. Without significant improvements in component reliability, partial system failure modes could become an unacceptably regular occurrence, limiting the usability of advanced computing infrastructures. In this work, we intend to ease the development of survivable systems and applications through the implementation of a reliable key/value data store based on a distributed hash table (DHT). Borrowing from techniques developed for unreliable wide-area systems, we implemented a distributed data service built with MPI [1] that enables user data structures to survive partial system failure. The service is based on a new implementation of the Kademlia [2] distributed hash table.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. The MPI Forum: MPI-2: Extensions to the message-passing interface (1997)

    Google Scholar 

  2. Maymounkov, P., Mazières, D.: Kademlia: A peer-to-peer information system based on the XOR metric. In: Druschel, P., Kaashoek, M.F., Rowstron, A. (eds.) IPTPS 2002. LNCS, vol. 2429, pp. 53–65. Springer, Heidelberg (2002)

    Chapter  Google Scholar 

  3. Lu, C.-d., Reed, D.A.: Assessing fault sensitivity in MPI applications. In: Proc. SC 2004 (2004)

    Google Scholar 

  4. Gropp, W., Lusk, E.: Fault tolerance in MPI programs. J. High Performance Computing Applications 18(3) (2004)

    Google Scholar 

  5. Latham, R., Ross, R., Thakur, R.: Can MPI be used for persistent parallel services? In: Mohr, B., Träff, J.L., Worringen, J., Dongarra, J. (eds.) PVM/MPI 2006. LNCS, vol. 4192, pp. 275–284. Springer, Heidelberg (2006)

    Chapter  Google Scholar 

  6. Thakur, R., Gropp, W.: Open issues in MPI implementation. In: Proc. Asia-Pacific Computer Systems Architecture Conference (2007)

    Google Scholar 

  7. Chen, Z., Fagg, G.E., Gabriel, E., Langou, J., Angskun, T., Bosilca, G., Dongarra, J.: Fault tolerant high performance computing by a coding approach. In: Proc. Symposium on Principles and Practice of Parallel Programming (2005)

    Google Scholar 

  8. Schulz, M., Bronevetsky, G., Fernandes, R., Marques, D., Pingali, K., Stodghill, P.: Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs. In: Proc. SC 2004 (2004)

    Google Scholar 

  9. Bouteiller, A., Herault, T., Krawezik, G., Lemarinier, P., Cappello, F.: MPICH-V Project: A multiprotocol automatic fault-tolerant MPI. J. High Performance Computing Applications 20(3) (2006)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wozniak, J.M., Jacobs, B., Latham, R., Lang, S., Son, S.W., Ross, R. (2009). Implementing Reliable Data Structures for MPI Services in High Component Count Systems. In: Ropo, M., Westerholm, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2009. Lecture Notes in Computer Science, vol 5759. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03770-2_39

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-03770-2_39

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03769-6

  • Online ISBN: 978-3-642-03770-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics