
Towards Local-Failure Local-Recovery in PDE Frameworks: The Case of Linear Solvers

Conference paper

High Performance Computing in Science and Engineering (HPCSE 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12456)

Abstract

It is expected that with the advent of exascale supercomputers the mean time between failures will decrease. Classical checkpoint-restart approaches are too expensive at that scale. Local-failure local-recovery (LFLR) strategies promise to mitigate these costs, but actually implementing them in any sufficiently large simulation environment is a challenging task. In this paper we discuss how LFLR methods can be incorporated into a PDE framework, focusing on the linear solvers as the innermost component. We discuss how Krylov solvers can be modified to support LFLR and present numerical tests. We exemplify our approach by reporting on the implementation of these features in the Dune framework: we present C++ software abstractions that simplify the incorporation of LFLR techniques and show how we use them in our solver library. To reduce the memory cost of full remote backups, we further investigate the benefits of lossy compression and in-memory checkpointing.
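The core idea stated in the abstract, restarting a Krylov iteration from an in-memory backup after a local data loss, can be illustrated with a minimal serial sketch. The code below is an assumption-laden stand-in and not the dune-istl interface: the operator is a 1D Laplacian, the names cg_with_backup and backup_interval are hypothetical, and the fault-detection hook is left as a placeholder.

```cpp
// Illustrative sketch only: a serial CG iteration that keeps an in-memory
// backup of the current iterate and can restart from it after a data loss.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// y = A*x for a fixed 1D Laplacian stencil (stand-in for the PDE operator)
void apply(const Vec& x, Vec& y) {
  const std::size_t n = x.size();
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = 2.0 * x[i];
    if (i > 0)     y[i] -= x[i - 1];
    if (i + 1 < n) y[i] -= x[i + 1];
  }
}

// CG with a periodic in-memory backup of the iterate; after a detected
// failure the solver restores x from the backup, rebuilds r and p, and
// continues as a restarted CG.
int cg_with_backup(const Vec& b, Vec& x, int maxit, double tol, int backup_interval) {
  const std::size_t n = b.size();
  Vec r(n), p(n), Ap(n), backup = x;
  apply(x, Ap);
  for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
  p = r;
  double rr = dot(r, r);
  for (int it = 0; it < maxit; ++it) {
    if (std::sqrt(rr) < tol) return it;
    apply(p, Ap);
    const double alpha = rr / dot(p, Ap);
    for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
    const double rr_new = dot(r, r);
    for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + (rr_new / rr) * p[i];
    rr = rr_new;
    if (it % backup_interval == 0) backup = x;    // cheap in-memory checkpoint
    if (/* fault detected, e.g. via an MPI/ULFM notification */ false) {
      x = backup;                                 // local recovery: restore the iterate
      apply(x, Ap);
      for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
      p = r; rr = dot(r, r);                      // restart the Krylov process
    }
  }
  return maxit;
}

int main() {
  Vec b(100, 1.0), x(100, 0.0);
  std::cout << "iterations: " << cg_with_backup(b, x, 500, 1e-8, 10) << "\n";
}
```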


Notes

  1. https://gitlab.dune-project.org/exadune/blackchannel-ulfm, BSD-3 licence.

  2. https://gitlab.dune-project.org/exadune/dune-common/tree/feature/ulfm-mpiguard and https://gitlab.dune-project.org/exadune/dune-istl/tree/fault_tolerance_interface; a sketch of the generic ULFM pattern these branches build on follows below.
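The footnoted repositories provide ULFM-based infrastructure (a resilient MPI wrapper and a fault-tolerance interface for dune-istl). As a hedged illustration of the generic ULFM pattern such code builds on, and not of the dune-common or dune-istl API, the following sketch detects a process failure via the return code of a collective and then revokes and shrinks the communicator. It assumes an MPI library that ships the MPIX_* ULFM extensions.

```cpp
// Generic ULFM recovery pattern (sketch): detect a failed rank through the
// error code of a collective, acknowledge and revoke, then shrink the
// communicator and continue on the surviving ranks.
#include <mpi.h>
#include <mpi-ext.h>   // MPIX_* ULFM extensions (e.g. Open MPI ULFM builds)

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  MPI_Comm comm = MPI_COMM_WORLD;
  MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);  // report failures as error codes

  int value = 1, sum = 0;
  int rc = MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, comm);

  if (rc != MPI_SUCCESS) {
    // A process failure was reported: acknowledge it, revoke the communicator
    // so every survivor observes the failure, and shrink to a working one.
    MPIX_Comm_failure_ack(comm);
    MPIX_Comm_revoke(comm);
    MPI_Comm survivors;
    MPIX_Comm_shrink(comm, &survivors);
    comm = survivors;
    // ...restore lost data from remote backups and resume the solver here...
  }

  MPI_Finalize();
  return 0;
}
```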


Acknowledgements

Supported by the German Research Foundation in the Priority Programme 1648 ‘Software for Exascale Computing’, grants GO 1758/2-2 and EN 1042/2-2; and under Germany’s Excellence Strategy EXC 2044–390685587, Mathematics Münster: Dynamics–Geometry–Structure.

Author information

Correspondence to Dominik Göddeke.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Altenbernd, M., Dreier, N.-A., Engwer, C., Göddeke, D. (2021). Towards Local-Failure Local-Recovery in PDE Frameworks: The Case of Linear Solvers. In: Kozubek, T., Arbenz, P., Jaroš, J., Říha, L., Šístek, J., Tichý, P. (eds) High Performance Computing in Science and Engineering. HPCSE 2019. Lecture Notes in Computer Science, vol. 12456. Springer, Cham. https://doi.org/10.1007/978-3-030-67077-1_2

  • DOI: https://doi.org/10.1007/978-3-030-67077-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67076-4

  • Online ISBN: 978-3-030-67077-1

  • eBook Packages: Computer Science, Computer Science (R0)
