Abstract
With the advent of exascale supercomputers, the mean time between failures is expected to decrease. Classical checkpoint-restart approaches are too expensive at that scale. Local-failure local-recovery (LFLR) strategies promise to reduce these costs, but implementing them in any sufficiently large simulation environment is a challenging task. In this paper we discuss how LFLR methods can be incorporated into a PDE framework, focusing on the linear solvers as the innermost component. We discuss how Krylov solvers can be modified to support LFLR and present numerical tests. We exemplify our approach by reporting on the implementation of these features in the Dune framework, present C++ software abstractions that simplify the incorporation of LFLR techniques, and show how we use them in our solver library. To reduce the memory costs of full remote backups, we further investigate the benefits of lossy compression and in-memory checkpointing.
Acknowledgements
Supported by the German Research Foundation in the Priority Programme 1648 ‘Software for Exascale Computing’, grants GO 1758/2-2 and EN 1042/2-2; and under Germany’s Excellence Strategy EXC 2044–390685587, Mathematics Münster: Dynamics–Geometry–Structure.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Altenbernd, M., Dreier, N.A., Engwer, C., Göddeke, D. (2021). Towards Local-Failure Local-Recovery in PDE Frameworks: The Case of Linear Solvers. In: Kozubek, T., Arbenz, P., Jaroš, J., Říha, L., Šístek, J., Tichý, P. (eds) High Performance Computing in Science and Engineering. HPCSE 2019. Lecture Notes in Computer Science, vol 12456. Springer, Cham. https://doi.org/10.1007/978-3-030-67077-1_2
DOI: https://doi.org/10.1007/978-3-030-67077-1_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67076-4
Online ISBN: 978-3-030-67077-1
eBook Packages: Computer Science, Computer Science (R0)