Abstract
At large scale, failures are statistically frequent and must be taken into account. Tolerating failures has become a major challenge in parallel computing: as systems grow in size, failures become more common, and some computation units are expected to fail during the execution of a program. Algorithms used in these programs must therefore be scalable while remaining resilient to the hardware failures that will happen during execution. In this paper, we present an algorithm that takes advantage of intrinsic properties of scalable communication-avoiding LU algorithms to make them fault-tolerant, so that the computation can proceed in spite of failures. We evaluate the overhead of the fault-tolerance mechanisms with respect to failure-free execution on both tall-and-skinny matrices (TSLU) and square matrices (CALU), as well as the cost of a failure during the execution.
Acknowledgement
Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).
The authors would like to thank Julien Langou for the discussions on the numerical stability of the computation of the L matrix.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Coti, C., Petrucci, L., Torres González, D.A. (2021). Fault-Tolerant LU Factorization Is Low Cost. In: Sousa, L., Roma, N., Tomás, P. (eds) Euro-Par 2021: Parallel Processing. Euro-Par 2021. Lecture Notes in Computer Science, vol 12820. Springer, Cham. https://doi.org/10.1007/978-3-030-85665-6_33
DOI: https://doi.org/10.1007/978-3-030-85665-6_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85664-9
Online ISBN: 978-3-030-85665-6
eBook Packages: Computer Science, Computer Science (R0)