ABSTRACT
Graphics Processing Units (GPUs) are now the most common general-purpose computing accelerators in High Performance Computing (HPC) systems. Their performance and energy efficiency enable extremely powerful HPC systems to be built. However, as machine scale increases, so does the reliability problem: failures on an exascale system are expected to occur every few hours.
We present data obtained at the Los Alamos Neutron Science Center and measure how algorithm optimizations and hardening strategies affect the Silent Data Corruption and crash sensitivity of modern GPUs. We also extend our reliability analysis by evaluating the Mean Executions Between Failures and the Mean Workload Between Failures of the different algorithm implementations. Moreover, we push the trade-off between reliability and performance further by applying hardening strategies to current optimized codes. We show that common strategies, such as ECC and checkpoint-rollback, can be no match for strategies like Algorithm-Based Fault Tolerance and even Duplication with Comparison.
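To illustrate the Algorithm-Based Fault Tolerance idea named above, here is a minimal sketch of the classic checksum scheme for matrix multiplication, in the style of Huang and Abraham (1984). It is not the GPU implementation evaluated in this work: the function names, the `corrupt` parameter used to inject a single fault, and the tolerance `tol` are all assumptions made for illustration.

```python
def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def abft_matmul(a, b, corrupt=None, tol=1e-9):
    """Multiply checksum-extended inputs, then verify the result.

    corrupt is a hypothetical (row, col, delta) fault injected into the
    product before checking, standing in for a radiation-induced error.
    Returns (c, bad_rows, bad_cols); empty lists mean no error detected.
    """
    full = matmul(with_column_checksum(a), with_row_checksum(b))
    if corrupt is not None:
        i, j, delta = corrupt
        full[i][j] += delta

    n, m = len(a), len(b[0])
    # A data row is suspect if its elements do not sum to its row checksum.
    bad_rows = [i for i in range(n)
                if abs(sum(full[i][:m]) - full[i][m]) > tol]
    # A data column is suspect if it does not sum to its column checksum.
    bad_cols = [j for j in range(m)
                if abs(sum(full[i][j] for i in range(n)) - full[n][j]) > tol]

    c = [row[:m] for row in full[:n]]
    return c, bad_rows, bad_cols
```

In this sketch a single corrupted element is located at the intersection of the flagged row and column, so it could be corrected from the checksums instead of re-running the whole multiplication. Duplication with Comparison would instead run `matmul` twice and compare the outputs, which detects the error but roughly doubles the work.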