Abstract
The reduced reliability of next-generation exascale systems means that the resilience properties of a numerical algorithm will become an important factor both in the choice of algorithm and in its analysis. The multigrid algorithm is the workhorse for the distributed solution of linear systems, but little is known about its resilience properties and convergence behavior in a fault-prone environment. In this work, we propose a probabilistic model for the effect of faults that involves random diagonal matrices. We summarize the results of a theoretical analysis of this model for the rate of convergence of fault-prone multigrid methods, which show that the standard multigrid method will not be resilient. Finally, we present a modification of the standard multigrid algorithm that will be resilient.
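One way to picture a fault model built on random diagonal matrices is the minimal sketch below. It rests on assumptions the abstract does not confirm: each component of a vector is lost independently with a hypothetical probability `q`, and a lost component is replaced by zero. Under those assumptions a fault acts on a vector `v` as multiplication by a diagonal matrix whose entries are independent Bernoulli variables. The name `apply_fault` is illustrative only.

```python
import random

def apply_fault(v, q, rng):
    """Return X @ v for a random diagonal X = diag(xi_1, ..., xi_n).

    Assumption (not taken from the chapter): each xi_i is 0 with
    probability q and 1 otherwise, independently of the others.
    """
    return [0.0 if rng.random() < q else x for x in v]

rng = random.Random(0)           # seeded so the sketch is reproducible
v = [1.0, 2.0, 3.0, 4.0]
faulty = apply_fault(v, 0.25, rng)
print(faulty)                    # each entry is either the original value or 0.0
```

With `q = 0` the vector is returned unchanged, and with `q = 1` every component is zeroed; intermediate values give a random mixture of surviving and lost components.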
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Ainsworth, M., Glusa, C. (2016). Multigrid at Scale?. In: Karasözen, B., Manguoğlu, M., Tezer-Sezgin, M., Göktepe, S., Uğur, Ö. (eds) Numerical Mathematics and Advanced Applications ENUMATH 2015. Lecture Notes in Computational Science and Engineering, vol 112. Springer, Cham. https://doi.org/10.1007/978-3-319-39929-4_24
DOI: https://doi.org/10.1007/978-3-319-39929-4_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39927-0
Online ISBN: 978-3-319-39929-4
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)