Fault Tolerant Lanczos Eigensolver via an Invariant Checking Method

Journal of Electronic Testing

Abstract

An extensive survey of the literature shows that the Lanczos eigensolver is a popular iterative method for approximating a few maximal eigenvalues of a real symmetric matrix, particularly if the matrix is large and sparse. In recent years, graphics processing units (GPUs) have become a popular platform for scientific computing applications, many of which are based on linear algebra, and they are increasingly being used as the main computational units in supercomputers. This trend is expected to continue as the number of computations required by scientific applications reaches the petascale and exascale range. In this paper, building on our earlier work [22], we investigate in detail the error checking mechanism for the Lanczos eigensolver. We identify a low-cost invariant for efficient error checking and, through mathematical analysis, determine the efficiency of our mechanism when used by the Lanczos eigensolver. We evaluate the proposed fault-tolerant scheme using an open-source sparse eigensolver on a GPU platform, with and without the injection of faults. We use a large number of sparse matrices from real applications to determine the efficiency and efficacy of our method. Our implementation shows that the proposed fault-tolerant method has good error coverage and low overhead. To the best of our knowledge, we are the first to introduce such a scheme for the Lanczos method.
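The abstract does not state which invariant the paper checks, so the following is only a minimal sketch of the general idea of invariant checking in a Lanczos iteration, not the authors' method. It assumes a textbook property: in exact arithmetic each new Lanczos vector is orthogonal to the two preceding ones, which costs two extra dot products per step to verify. The function name `lanczos_checked`, the tolerance `tol`, and the `fault_at` injection hook are hypothetical names introduced for illustration, and the code uses dense pure-Python lists rather than a sparse GPU implementation.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    return [dot(row, x) for row in A]

def lanczos_checked(A, v0, m, tol=1e-8, fault_at=None):
    """Run up to m Lanczos steps on a symmetric matrix A.

    Returns (alphas, betas, fault_step); fault_step is None when every
    invariant check passed. The invariant used here (an assumption, not
    taken from the paper) is that the new Lanczos vector is orthogonal
    to the two previous ones.
    """
    norm0 = math.sqrt(dot(v0, v0))
    q_prev = [0.0] * len(v0)
    q = [x / norm0 for x in v0]
    beta = 0.0
    alphas, betas = [], []
    for j in range(m):
        w = matvec(A, q)
        if fault_at == j:
            w[0] += 1.0  # hypothetical injected fault in the matvec result
        alpha = dot(q, w)
        # Three-term recurrence: remove components along q and q_prev.
        w = [wi - alpha * qi - beta * pi for wi, qi, pi in zip(w, q, q_prev)]
        beta_next = math.sqrt(dot(w, w))
        alphas.append(alpha)
        if beta_next < tol:  # breakdown: an invariant subspace was found
            break
        q_next = [wi / beta_next for wi in w]
        # Invariant check: two O(n) dot products per iteration.
        if abs(dot(q_next, q)) > tol or abs(dot(q_next, q_prev)) > tol:
            return alphas, betas, j
        betas.append(beta_next)
        q_prev, q, beta = q, q_next, beta_next
    return alphas, betas, None

# Small symmetric tridiagonal test matrix (illustrative only).
A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
clean = lanczos_checked(A, [1.0, 0.0, 0.0, 0.0], 3)
faulty = lanczos_checked(A, [1.0, 0.0, 0.0, 0.0], 3, fault_at=1)
```

In this small case the clean run reports no fault, while the run with a fault injected at step 1 is stopped by the check at that step. Coverage of this particular check is not complete: a corruption that lies along the current Lanczos vector is absorbed into `alpha` and passes undetected, which is one reason a real scheme needs a carefully chosen invariant and tolerance.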

References

  1. Agerwala T (2010) Exascale computing: The challenges and opportunities in the next decade. In Proc. of the International Symposium on High Performance Computer Architecture (HPCA)

  2. Arnoldi W (1951) The principle of minimized iterations in the solution of the matrix eigenvalue problem. Quart Appl Math 9:17–29

  3. Balay S, Abhyankar S, Adams M, Brown J, Brune P, Buschelman K, Dalcin L, Eijkhout V, Gropp W, Kaushik D, Knepley M, May D, McInnes L, Rupp K, Sanan P, Smith B, Zampini S, Zhang H, Zhang H (2017) PETSc users manual. Technical Report ANL-95/11 - Revision 3.8, Argonne National Laboratory

  4. Balay S, Gropp W, McInnes L, Smith B (1997) Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163–202. Birkhäuser Press

  5. Braun C, Halder S, Wunderlich HJ (2014) A-ABFT: Autonomous algorithm-based fault tolerance for matrix multiplications on graphics processing units. In Proc. of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 443–454

  6. Bronevetsky G, de Supinski B (2008) Soft error vulnerability of iterative linear algebra methods. In Proc. of the International Conference on Supercomputing, pages 155–164

  7. Chen Z (2013) Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In Proc. of the Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 167–176

  8. Chen J, Liang X, Chen Z (2016) Online algorithm-based fault tolerance for Cholesky decomposition on heterogeneous systems with GPUs. In Proc. of the IEEE International Parallel and Distributed Processing Symposium (IPDPS)

  9. Davis TA, Hu Y (2011) The University of Florida sparse matrix collection. ACM Trans Math Soft 38(1):1:1–1:25

  10. Elliott J, Hoemmen M, Mueller F (2014) Evaluating the impact of SDC on the GMRES iterative solver. In Proc. of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1193–1202

  11. Golub GH, van Loan CF (1996) Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore, MD

  12. Hakkarinen D, Wu P, Chen Z (2015) Fail-stop failure algorithm-based fault tolerance for Cholesky decomposition. IEEE Trans Par Distr Sys 26(5):1323–1335

  13. Hernandez V, Roman JE, Tomas A, Vidal V (2006) Lanczos methods in SLEPc. Technical Report STR-5, Universitat Politècnica de València. Available at http://slepc.upv.es

  14. Hernandez V, Roman JE, Vidal V (2005) SLEPc: A scalable and flexible toolkit for the solution of eigenvalue problems. ACM Trans Math Soft 31(3):351–362

  15. Heroux MA (2009) Software challenges for extreme scale computing: Going from petascale to exascale systems. Int J High Perf Comput Appl 23(4):437–439

  16. Huang KH, Abraham JA (1984) Algorithm-based fault tolerance for matrix operations. IEEE Trans Comp C-33(6):518–528

  17. Kim H, Vuduc R, Baghsorkhi S, Choi J, Hwu W (2012) Performance analysis and tuning for general purpose graphics processing units (GPGPU). Synthesis Lectures on Computer Architecture

  18. Knyazev A (2001) Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM J Sci Comput 23(2):517–541

  19. Lanczos C (1950) An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J Res Nat Bur Stand 45(4):255–282

  20. Loh F, Ramanathan P, Saluja KK (2015) Transient fault resilient QR factorization on GPUs. In Proc. of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS ’15, pages 63–70

  21. Loh F, Saluja KK, Ramanathan P (2016) Fault tolerance through invariant checking for iterative solvers. In Proc. of the International Conference on VLSI Design and International Conference on Embedded Systems (VLSID), pages 481–486

  22. Loh F, Saluja KK, Ramanathan P (2020) Fault tolerance through invariant checking for the Lanczos eigensolver. In Proc. of the International Conference on VLSI Design and International Conference on Embedded Systems (VLSID), pages 13–18

  23. Nie B, Tiwari D, Gupta S, Smirni E, Rogers JH (2016) A large-scale study of soft-errors on GPUs in the field. In Proc. of the International Symposium on High Performance Computer Architecture (HPCA), pages 519–530

  24. NVIDIA (2016) NVIDIA GeForce GTX 1080. White Paper

  25. Oboril F, Tahoori MB, Heuveline V, Lukarski D, Weiss JP (2011) Numerical defect correction as an algorithm-based fault tolerance technique for iterative solvers. In Proc. of the IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), pages 144–153

  26. Shivakumar P, Kistler M, Keckler SW, Burger D, Alvisi L (2003) Modeling the impact of device and pipeline scaling on the soft error rate of processor elements. Technical Report 2002-19, Dept. of Computer Sciences, The University of Texas at Austin

  27. Scholl A, Braun C, Kochte MA, Wunderlich H (2015) Low-overhead fault-tolerance for the preconditioned conjugate gradient solver. In Proc. of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS), pages 60–65

  28. Shantharam M, Srinivasmurthy S, Raghavan P (2012) Fault tolerant preconditioned conjugate gradient for sparse linear system solution. In Proc. of the International Conference on Supercomputing, pages 69–78

  29. Siefert N, Jahinuzzaman S, Velamala J, Ascazubi R, Patel N, Gill B, Basile J, Hicks J (2015) Soft error rate improvements in 14-nm technology featuring second-generation 3D tri-gate transistors. IEEE Trans Nucl Sci 62(6):2570–2577

  30. Sloan J, Kumar R, Bronevetsky G (2012) Algorithmic approaches to low overhead fault detection for sparse linear algebra. In Proc. of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 1–12

  31. Wu P, Guan Q, DeBardeleben N, Blanchard S, Tao D, Liang X, Chen J, Chen Z (2016) Towards practical algorithm based fault tolerance in dense linear algebra. In Proc. of the 25th International Symposium on High-performance Parallel and Distributed Computing, HPDC ’16, pages 31–42

Author information

Corresponding author

Correspondence to Felix Loh.

Additional information

Responsible Editor: V. D. Agrawal.

Cite this article

Loh, F., Saluja, K.K. & Ramanathan, P. Fault Tolerant Lanczos Eigensolver via an Invariant Checking Method. J Electron Test 37, 409–422 (2021). https://doi.org/10.1007/s10836-021-05945-1

  • DOI: https://doi.org/10.1007/s10836-021-05945-1