Surviving Errors with OpenSHMEM

Bouteiller, Aurelien; Bosilca, George; Venkata, Manjunath Gorentla

doi:10.1007/978-3-319-50995-2_5

Aurelien Bouteiller¹⁷,
George Bosilca¹⁷ &
Manjunath Gorentla Venkata¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNPSE,volume 10007))

Included in the following conference series:

Workshop on OpenSHMEM and Related Technologies

479 Accesses

Abstract

Unexpected error conditions stem from a variety of underlying causes, including resource exhaustion, network failures, hardware failures, or program errors. As the scale of HPC systems continues to grow, so does the probability of encountering a condition that causes a failure; meanwhile, error recovery and run-through failure management are becoming mature, and interoperable HPC programming paradigms are beginning to feature advanced error management. As a result from these developments, it becomes increasingly desirable to gracefully handle error conditions in OpenSHMEM. In this paper, we present the design and rationale behind an extension of the OpenSHMEM API that can (1) notify user code of unexpected erroneous conditions, (2) permit customized user response to errors without incurring overhead on an error-free execution path, (3) propagate the occurence of an error condition to all Processing Elements, and (4) consistently close the erroneous epoch in order to resume the application.

M.G. Venkata—This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Concepts for OpenMP Target Offload Resilience

FINJ: A Fault Injection Tool for HPC Systems

Towards High Performance Resilience Using Performance Portable Abstractions

References

Amarasinghe, S., et al.: Exascale programming challenges. In: Proceedings of the Workshop on Exascale Programming Challenges, Marina del Rey, CA, USA. U.S Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR), July 2011. http://science.energy.gov/~/media/ascr/pdf/program-documents/docs/ProgrammingChallengesWorkshopReport.pdf
Balaji, P., Buntinas, D., Goodell, D., Gropp, W., Krishna, J., Lusk, E., Thakur, R.: PMI: a scalable parallel process-management interface for extreme-scale systems. In: Keller, R., Gabriel, E., Resch, M., Dongarra, J. (eds.) EuroMPI 2010. LNCS, vol. 6305, pp. 31–41. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15646-5_4. http://dl.acm.org/citation.cfm?id=1894122.1894127
Chapter Google Scholar
Bautista-Gomez, L., Tsuboi, S., Komatitsch, D., Cappello, F., Maruyama, N., Matsuoka, S.: FTI: high performance fault tolerance interface for hybrid systems. In: International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2011 (2011)
Google Scholar
Benoit, A., Cavelan, A., Robert, Y., Sun, H.: Assessing general-purpose algorithms to cope with fail-stop and silent errors. In: Jarvis, S.A., Wright, S.A., Hammond, S.D. (eds.) PMBS 2014. LNCS, vol. 8966, pp. 215–236. Springer, Heidelberg (2015). doi:10.1007/978-3-319-17248-4_11
Google Scholar
Bland, W., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J.: Post-failure recovery of MPI communication capability: design and rationale. Int. J. High Perform. Comput. Appl. 27(3), 244–254 (2013). http://hpc.sagepub.com/content/27/3/244.abstract
Article Google Scholar
Bosilca, G., Bouteiller, A., Guermouche, A., Herault, T., Sens, P., Robert, Y., Dongarra, J.J.: Failure detection and propagation in HPC systems. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016. ACM, New York (2016, to appear)
Google Scholar
Bosilca, G., Bouteiller, A., Brunet, E., Cappello, F., Dongarra, J., Guermouche, A., Herault, T., Robert, Y., Vivien, F., Zaidouni, D.: Unified model for assessing checkpointing protocols at extreme-scale. Concur. Comput. Pract. Exp. 26(17), 2772–2791 (2014). doi:10.1002/cpe.3173
Article Google Scholar
Bouteiller, A., Bosilca, G., Dongarra, J.J.: Plan B: Interruption of ongoing MPI operations to support failure recovery. In: Proceedings of the 22nd European MPI Users’ Group Meeting, EuroMPI 2015, pp. 11:1–11:9 (2015). http://doi.acm.org/10.1145/2802658.2802668
Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J.J.: Correlated set coordination in fault tolerant message logging protocols for many-core clusters. Concur. Comput. Pract. Exp. 25(4), 572–585 (2013). doi:10.1002/cpe.2859
Article Google Scholar
Chandra, T.D., Toueg, S.: Unreliable failure detectors for reliable distributed systems. J. ACM (JACM) 43(2), 225–267 (1996)
Article MathSciNet MATH Google Scholar
Chen, Z.: Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods. In: Proceedings of the PPoPP, pp. 167–176 (2013)
Google Scholar
Davies, T., Karlsson, C., Liu, H., Ding, C., Chen, Z.: High performance linpack benchmark: a fault tolerant implementation without checkpointing. In: Proceedings of the 25th ACM International Conference on Supercomputing (ICS 2011). ACM (2011)
Google Scholar
Dongarra, J., et al.: The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. 25(1), 3–60 (2011). doi:10.1177/1094342010391989
Article Google Scholar
Ferreira, K., Stearley, J., Laros III, J.H., Oldfield, R., Pedretti, K., Brightwell, R., Riesen, R., Bridges, P.G., Arnold, D.: Evaluating the viability of process replication reliability for exascale systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2011, pp. 44:1–44:12. ACM, New York (2011). http://doi.acm.org/10.1145/2063384.2063443
Fiala, D., Mueller, F., Engelmann, C., Riesen, R., Ferreira, K., Brightwell, R.: Detection and correction of silent data corruption for large-scale high-performance computing. In: Proceedings of the SC 2012, p. 78 (2012)
Google Scholar
Herault, T., Bouteiller, A., Bosilca, G., Gamell, M., Teranishi, K., Parashar, M., Dongarra, J.: Practical scalable consensus for pseudo-synchronous distributed systems. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015, pp. 31:1–31:12. ACM, New York (2015). http://doi.acm.org/10.1145/2807591.2807665
Lamport, L.: Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21(7), 558–565 (1978)
Article MATH Google Scholar
Lamport, L., Shostak, R., Pease, M.: The byzantine generals problem. ACM Trans. Program. Lang. Syst. 4(3), 382–401 (1982). doi:10.1145/357172.357176
Article MATH Google Scholar
Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B.R.: Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11 (2010). http://dx.doi.org/10.1109/SC.2010.18
Petrini, F., Frachtenberg, E., Hoisie, A., Coll, S.: Performance evaluation of the quadrics interconnection network. Cluster Comput. 6(2), 125–142 (2003). doi:10.1023/A:1022852505633
Article Google Scholar
Poole, S.W., Hernandez, O.R., Kuehn, J.A., Shipman, G.M., Curtis, A., Feind, K.: OpenSHMEM - toward a unified RMA model. In: Padua, D.A. (ed.) Encyclopedia of Parallel Computing, pp. 1379–1391. Springer, Heidelberg (2011)
Google Scholar
Schroeder, B., Gibson, G.: Understanding failures in petascale computers. J. Phys.: Conf. Ser. 78, 12–22 (2007). IOP Publishing
Google Scholar
Shahzad, F., Kreutzer, M., Zeiser, T., Machado, R., Pieper, A., Hager, G., Wellein, G.: Building a fault tolerant application using the GASPI communication layer. In: Proceedings of the 2015 IEEE International Conference on Cluster Computing, CLUSTER 2015, pp. 580–587. IEEE Computer Society, Washington (2015). http://dx.doi.org/10.1109/CLUSTER.2015.106
Zheng, Z., Chien, A.A., Teranishi, K.: Fault tolerance in an inner-outer solver: a GVR-enabled case study. In: Daydé, M., Marques, O., Nakajima, K. (eds.) VECPAR 2014. LNCS, vol. 8969, pp. 124–132. Springer, Heidelberg (2015). doi:10.1007/978-3-319-17353-5_11
Google Scholar

Download references

Author information

Authors and Affiliations

Innovative Computing Laboratory, University of Tennessee, Knoxville, USA
Aurelien Bouteiller & George Bosilca
Oak Ridge National Laboratory, Oak Ridge, USA
Manjunath Gorentla Venkata

Authors

Aurelien Bouteiller
View author publications
You can also search for this author in PubMed Google Scholar
George Bosilca
View author publications
You can also search for this author in PubMed Google Scholar
Manjunath Gorentla Venkata
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Aurelien Bouteiller .

Editor information

Editors and Affiliations

Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
Manjunath Gorentla Venkata
Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
Neena Imam
Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
Swaroop Pophale
Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
Tiffany M. Mintz

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Bouteiller, A., Bosilca, G., Venkata, M.G. (2016). Surviving Errors with OpenSHMEM. In: Gorentla Venkata, M., Imam, N., Pophale, S., Mintz, T. (eds) OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments. OpenSHMEM 2016. Lecture Notes in Computer Science(), vol 10007. Springer, Cham. https://doi.org/10.1007/978-3-319-50995-2_5

Download citation

DOI: https://doi.org/10.1007/978-3-319-50995-2_5
Published: 15 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50994-5
Online ISBN: 978-3-319-50995-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics