
Towards Local-Failure Local-Recovery in PDE Frameworks: The Case of Linear Solvers

Conference paper

High Performance Computing in Science and Engineering (HPCSE 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12456)

Abstract

It is expected that with the advent of exascale supercomputers the mean time between failures will decrease. Classical checkpoint-restart approaches are too expensive at that scale. Local-failure local-recovery (LFLR) strategies promise to mitigate these costs, but actually implementing them in any sufficiently large simulation environment is a challenging task. In this paper we discuss how LFLR methods can be incorporated into a PDE framework, focusing on the linear solvers as the innermost component. We discuss how Krylov solvers can be modified to support LFLR and present numerical tests. We exemplify our approach by reporting on the implementation of these features in the Dune framework: we present C++ software abstractions that simplify the incorporation of LFLR techniques and show how we use them in our solver library. To reduce the memory cost of full remote backups, we further investigate the benefits of lossy compression and in-memory checkpointing.
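The core idea stated in the abstract, restarting a Krylov iteration from an in-memory backup after a local data loss, can be illustrated with a minimal serial sketch. The code below is an assumption-laden stand-in and not the dune-istl interface: the operator is a 1D Laplacian, the names cg_with_backup and backup_interval are hypothetical, and the fault-detection hook is left as a placeholder.

```cpp
// Illustrative sketch only: a serial CG iteration that keeps an in-memory
// backup of the current iterate and can restart from it after a data loss.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// y = A*x for a fixed 1D Laplacian stencil (stand-in for the PDE operator)
void apply(const Vec& x, Vec& y) {
  const std::size_t n = x.size();
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = 2.0 * x[i];
    if (i > 0)     y[i] -= x[i - 1];
    if (i + 1 < n) y[i] -= x[i + 1];
  }
}

// CG with a periodic in-memory backup of the iterate; after a detected
// failure the solver restores x from the backup, rebuilds r and p, and
// continues as a restarted CG.
int cg_with_backup(const Vec& b, Vec& x, int maxit, double tol, int backup_interval) {
  const std::size_t n = b.size();
  Vec r(n), p(n), Ap(n), backup = x;
  apply(x, Ap);
  for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
  p = r;
  double rr = dot(r, r);
  for (int it = 0; it < maxit; ++it) {
    if (std::sqrt(rr) < tol) return it;
    apply(p, Ap);
    const double alpha = rr / dot(p, Ap);
    for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
    const double rr_new = dot(r, r);
    for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + (rr_new / rr) * p[i];
    rr = rr_new;
    if (it % backup_interval == 0) backup = x;    // cheap in-memory checkpoint
    if (/* fault detected, e.g. via an MPI/ULFM notification */ false) {
      x = backup;                                 // local recovery: restore the iterate
      apply(x, Ap);
      for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
      p = r; rr = dot(r, r);                      // restart the Krylov process
    }
  }
  return maxit;
}

int main() {
  Vec b(100, 1.0), x(100, 0.0);
  std::cout << "iterations: " << cg_with_backup(b, x, 500, 1e-8, 10) << "\n";
}
```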


Notes

  1. https://gitlab.dune-project.org/exadune/blackchannel-ulfm, BSD-3 licence.

  2. https://gitlab.dune-project.org/exadune/dune-common/tree/feature/ulfm-mpiguard and https://gitlab.dune-project.org/exadune/dune-istl/tree/fault_tolerance_interface; a sketch of the generic ULFM pattern these branches build on follows below.
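The footnoted repositories provide ULFM-based infrastructure (a resilient MPI wrapper and a fault-tolerance interface for dune-istl). As a hedged illustration of the generic ULFM pattern such code builds on, and not of the dune-common or dune-istl API, the following sketch detects a process failure via the return code of a collective and then revokes and shrinks the communicator. It assumes an MPI library that ships the MPIX_* ULFM extensions.

```cpp
// Generic ULFM recovery pattern (sketch): detect a failed rank through the
// error code of a collective, acknowledge and revoke, then shrink the
// communicator and continue on the surviving ranks.
#include <mpi.h>
#include <mpi-ext.h>   // MPIX_* ULFM extensions (e.g. Open MPI ULFM builds)

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  MPI_Comm comm = MPI_COMM_WORLD;
  MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);  // report failures as error codes

  int value = 1, sum = 0;
  int rc = MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, comm);

  if (rc != MPI_SUCCESS) {
    // A process failure was reported: acknowledge it, revoke the communicator
    // so every survivor observes the failure, and shrink to a working one.
    MPIX_Comm_failure_ack(comm);
    MPIX_Comm_revoke(comm);
    MPI_Comm survivors;
    MPIX_Comm_shrink(comm, &survivors);
    comm = survivors;
    // ...restore lost data from remote backups and resume the solver here...
  }

  MPI_Finalize();
  return 0;
}
```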


Acknowledgements

Supported by the German Research Foundation in the Priority Programme 1648 ‘Software for Exascale Computing’, grants GO 1758/2-2 and EN 1042/2-2; and under Germany’s Excellence Strategy EXC 2044–390685587, Mathematics Münster: Dynamics–Geometry–Structure.

Author information

Correspondence to Dominik Göddeke.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Altenbernd, M., Dreier, N.-A., Engwer, C., Göddeke, D. (2021). Towards Local-Failure Local-Recovery in PDE Frameworks: The Case of Linear Solvers. In: Kozubek, T., Arbenz, P., Jaroš, J., Říha, L., Šístek, J., Tichý, P. (eds) High Performance Computing in Science and Engineering. HPCSE 2019. Lecture Notes in Computer Science, vol. 12456. Springer, Cham. https://doi.org/10.1007/978-3-030-67077-1_2

  • DOI: https://doi.org/10.1007/978-3-030-67077-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67076-4

  • Online ISBN: 978-3-030-67077-1

  • eBook Packages: Computer Science, Computer Science (R0)
