DOI:10.1145/2304576.2304590
research-article

Fault resilience of the algebraic multi-grid solver

Published: 25 June 2012

Abstract

As HPC system sizes grow to millions of cores and chip feature sizes continue to decrease, HPC applications become increasingly exposed to transient hardware faults. These faults can cause aborts and performance degradation. Most importantly, they can corrupt results. Thus, we must evaluate the fault vulnerability of key HPC algorithms to develop cost-effective techniques to improve application resilience.
We present an approach that analyzes the vulnerability of applications to faults, systematically reduces it by protecting the most vulnerable components, and predicts application vulnerability at large scales. We initially focus on sparse scientific applications and apply our approach in this paper to the Algebraic Multi Grid (AMG) algorithm. We empirically analyze AMG's vulnerability to hardware faults in both sequential and parallel (hybrid MPI/OpenMP) executions on up to 1,600 cores, and we propose and evaluate the use of targeted pointer replication to reduce it. Our techniques increase AMG's resilience to transient hardware faults by 50-80% and improve its scalability in faulty computational environments by 35%. Further, we show how to model AMG's scalability in fault-prone environments to predict execution times of large-scale runs accurately.
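The abstract names targeted pointer replication but does not describe its mechanism. As an illustration only, the sketch below assumes one plausible form: each protected value is stored three times and every read takes a majority vote. Python has no raw pointers, so an integer array index (here a CSR row pointer) stands in for one. The names `ReplicatedIndex` and `flip_bit` and the sample data are hypothetical and do not come from the paper.

```python
from collections import Counter


class ReplicatedIndex:
    """Holds three copies of an array index and votes on every read.

    Assumption: triple replication with majority voting. The paper's
    actual replication scheme is not given in the abstract.
    """

    def __init__(self, value):
        self.copies = [value, value, value]

    def read(self):
        # Majority vote: one corrupted copy is outvoted by the other two.
        value, count = Counter(self.copies).most_common(1)[0]
        if count < 2:
            raise RuntimeError("all replicas disagree; fault is uncorrectable")
        # Repair the corrupted copy so faults do not accumulate.
        self.copies = [value, value, value]
        return value


def flip_bit(value, bit):
    """Simulate a transient fault by flipping one bit of an integer."""
    return value ^ (1 << bit)


# Row-pointer array of a small CSR sparse matrix (hypothetical data).
row_ptr = [0, 2, 5, 7]
protected = [ReplicatedIndex(p) for p in row_ptr]

# Inject a fault into one replica of the second row pointer.
protected[1].copies[0] = flip_bit(protected[1].copies[0], 4)

# Reads still return the original values despite the injected fault.
recovered = [p.read() for p in protected]
```

Replicating only index and pointer data, rather than all numerical data, keeps the memory overhead small. That matches the "targeted" idea in the abstract, although which structures the authors actually protect is not stated here.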

Published In

ICS '12: Proceedings of the 26th ACM international conference on Supercomputing
June 2012
400 pages
ISBN:9781450313162
DOI:10.1145/2304576
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. algebraic multi-grid solver
  2. resilience
  3. transient faults

Qualifiers

  • Research-article

Conference

ICS'12
ICS'12: International Conference on Supercomputing
June 25 - 29, 2012
San Servolo Island, Venice, Italy

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Article Metrics

  • Downloads (Last 12 months): 10
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 17 Jan 2025

Citations

Cited By

  • (2024) ApproxDup: Developing an Approximate Instruction Duplication Mechanism for Efficient SDC Detection in GPGPUs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43:4 (1051-1064). DOI:10.1109/TCAD.2023.3330821. Online publication date: Apr-2024
  • (2024) Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link. SC24: International Conference for High Performance Computing, Networking, Storage and Analysis (1-18). DOI:10.1109/SC41406.2024.00100. Online publication date: 17-Nov-2024
  • (2024) AmgT: Algebraic Multigrid Solver on Tensor Cores. SC24: International Conference for High Performance Computing, Networking, Storage and Analysis (1-16). DOI:10.1109/SC41406.2024.00058. Online publication date: 17-Nov-2024
  • (2023) Recovering Detectable Uncorrectable Errors via Spatial Data Prediction. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (507-515). DOI:10.1145/3624062.3624120. Online publication date: 12-Nov-2023
  • (2023) Evaluating the Resiliency of Posits for Scientific Computing. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (477-487). DOI:10.1145/3624062.3624116. Online publication date: 12-Nov-2023
  • (2023) HPC Hardware Design Reliability Benchmarking With HDFIT. IEEE Transactions on Parallel and Distributed Systems, 34:3 (995-1006). DOI:10.1109/TPDS.2023.3237777. Online publication date: 1-Mar-2023
  • (2021) Understanding a program's resiliency through error propagation. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (362-373). DOI:10.1145/3437801.3441589. Online publication date: 17-Feb-2021
  • (2021) SpotSDC: Revealing the Silent Data Corruption Propagation in High-Performance Computing Systems. IEEE Transactions on Visualization and Computer Graphics, 27:10 (3938-3952). DOI:10.1109/TVCG.2020.2994954. Online publication date: 1-Oct-2021
  • (2020) Exploring Non-Volatility of Non-Volatile Memory for High Performance Computing Under Failures. 2020 IEEE International Conference on Cluster Computing (CLUSTER) (237-247). DOI:10.1109/CLUSTER49012.2020.00034. Online publication date: Sep-2020
  • (2020) Tracking scientific simulation using online time-series modelling. 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID) (202-211). DOI:10.1109/CCGrid49817.2020.00-73. Online publication date: May-2020
