Fault-Tolerant LU Factorization Is Low Cost

Coti, Camille; Petrucci, Laure; Torres González, Daniel Alberto

doi:10.1007/978-3-030-85665-6_33

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12820)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

At large scale, failures are statistically frequent and need to be taken into account. Tolerating failures has become a major challenge in parallel computing: as systems grow in size, failures become more common, and some computation units are expected to fail during the execution of a program. Algorithms used in these programs must be scalable while remaining resilient to the hardware failures that will happen during execution. In this paper, we present an algorithm that takes advantage of intrinsic properties of scalable communication-avoiding LU algorithms to make them fault-tolerant and to proceed with the computation in spite of failures. We evaluate the overhead of the fault tolerance mechanisms with respect to failure-free execution on both tall-and-skinny matrices (TSLU) and square matrices (CALU), as well as the cost of a failure during the execution.

Notes

1. https://www.top500.org/lists/2020/11/

Acknowledgement

Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations (see https://www.grid5000.fr).

The authors would like to thank Julien Langou for the discussions on the numerical stability of the computation of the L matrix.

Author information

Authors and Affiliations

LIPN, CNRS UMR 7030, Université Sorbonne Paris Nord, 99, avenue Jean-Baptiste Clément, 93430, Villetaneuse, France
Camille Coti, Laure Petrucci & Daniel Alberto Torres González

Corresponding author

Correspondence to Daniel Alberto Torres González.

Editor information

Editors and Affiliations

Universidade de Lisboa, Lisbon, Portugal
Leonel Sousa
Universidade de Lisboa, Lisbon, Portugal
Nuno Roma
Universidade de Lisboa, Lisbon, Portugal
Pedro Tomás

About this paper

Cite this paper

Coti, C., Petrucci, L., Torres González, D.A. (2021). Fault-Tolerant LU Factorization Is Low Cost. In: Sousa, L., Roma, N., Tomás, P. (eds) Euro-Par 2021: Parallel Processing. Euro-Par 2021. Lecture Notes in Computer Science, vol 12820. Springer, Cham. https://doi.org/10.1007/978-3-030-85665-6_33

DOI: https://doi.org/10.1007/978-3-030-85665-6_33
Published: 25 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85664-9
Online ISBN: 978-3-030-85665-6
eBook Packages: Computer Science, Computer Science (R0)
