Abstract
On future large-scale systems, the mean time between failures (MTBF) is expected to decrease, so that many faults could occur during the solution of large problems. Consequently, it becomes critical to design parallel numerical linear algebra kernels that can survive faults. In that framework, we investigate the relevance of approaches relying on numerical techniques, which might be combined with more classical techniques for real large-scale parallel implementations. Our main objective is to provide robust resilient schemes so that the solver can keep converging in the presence of hard faults without restarting the calculation from scratch. For this purpose, we study interpolation-restart (IR) strategies. For a given numerical scheme, an IR strategy consists of extracting relevant information from the data still available after a fault. A well-selected part of the missing data is then regenerated through interpolation to constitute a meaningful input for restarting the numerical algorithm. In this paper, we revisit a few state-of-the-art methods in numerical linear algebra in the light of our IR strategies. Through a few numerical experiments, we qualitatively illustrate the robustness of the resulting resilient schemes with respect to the MTBF.
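The interpolation step can be made concrete with a minimal sketch for a linear system Ax = b. It assumes that a hard fault wipes a known set of entries of the current iterate x, and that those entries are regenerated by solving the small local system A[lost, lost] x_lost = b[lost] - A[lost, kept] x[kept] built from the surviving entries. The abstract does not specify the interpolation used, so this particular choice, the function names, and the dense pure-Python solver are all illustrative assumptions and not the chapter's implementation.

```python
def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix [M | rhs], copied so the inputs are left untouched.
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    y = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (aug[i][n] - tail) / aug[i][i]
    return y


def interpolate_lost_entries(A, b, x, lost):
    """Return a copy of x whose entries indexed by `lost` are regenerated.

    Hypothetical helper: assumes the diagonal block A[lost, lost] is
    nonsingular and that A and b themselves survived the fault.
    """
    lost = sorted(lost)
    lost_set = set(lost)
    kept = [j for j in range(len(b)) if j not in lost_set]
    # Local system: rows and columns of A restricted to the lost indices.
    local = [[A[i][j] for j in lost] for i in lost]
    # Move the contribution of the surviving entries to the right-hand side.
    rhs = [b[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    x_new = list(x)
    for i, value in zip(lost, solve_dense(local, rhs)):
        x_new[i] = value
    return x_new


# Usage: a small 1D Laplacian-like system in which entries 1 and 2 are lost.
A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
b = [0.0, 0.0, 0.0, 5.0]
x_after_fault = [1.0, 0.0, 0.0, 4.0]  # zeros stand in for the lost data
x_restart = interpolate_lost_entries(A, b, x_after_fault, [1, 2])
print(x_restart)
```

The regenerated vector would then be handed back to the iterative solver as its new initial guess, which is the restart half of the strategy. In this sketch, if the surviving entries happen to be exact, the regenerated ones are exact as well; otherwise their quality depends on how accurate the surviving part of the iterate is.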
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Agullo, E. et al. (2017). Hard Faults and Soft-Errors: Possible Numerical Remedies in Linear Algebra Solvers. In: Dutra, I., Camacho, R., Barbosa, J., Marques, O. (eds) High Performance Computing for Computational Science – VECPAR 2016. VECPAR 2016. Lecture Notes in Computer Science, vol 10150. Springer, Cham. https://doi.org/10.1007/978-3-319-61982-8_3
DOI: https://doi.org/10.1007/978-3-319-61982-8_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61981-1
Online ISBN: 978-3-319-61982-8
eBook Packages: Computer Science, Computer Science (R0)