DOI: 10.1145/2751504.2751510
research-article

The Path to Exascale: Code Optimizations and Hardening Solutions Reliability

Published: 15 June 2015

ABSTRACT

Graphics Processing Units (GPUs) are now the most common general-purpose computing accelerators in High Performance Computing (HPC) systems. Their performance and energy efficiency enable extremely powerful HPC systems to be built. However, as machine scale increases, so does the reliability problem: failures on an exascale system are expected to occur every few hours.
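To see why scale alone drives the failure rate up, consider a back-of-the-envelope sketch. It assumes independent devices with constant failure rates, so the system failure rate is the sum of the device rates. The function name, the device MTBF, and the device count are hypothetical placeholders for illustration, not values from this paper.

```python
def system_mtbf_hours(device_mtbf_hours: float, num_devices: int) -> float:
    """MTBF of a system that fails whenever any one of its devices fails.

    Assumes independent devices with constant (exponential) failure rates,
    so the rates add: lambda_system = num_devices * lambda_device.
    """
    if device_mtbf_hours <= 0 or num_devices <= 0:
        raise ValueError("inputs must be positive")
    return device_mtbf_hours / num_devices


# Hypothetical numbers, for illustration only (not measured data).
device_mtbf = 100_000.0  # hours between failures for one device (assumed)
devices = 50_000         # number of devices in the machine (assumed)
print(system_mtbf_hours(device_mtbf, devices))
```

Under these assumptions, a device that fails very rarely on its own still yields a machine-level failure every couple of hours once tens of thousands of such devices run together.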

We present data obtained at the Los Alamos Neutron Science Center and measure how algorithm optimizations and hardening strategies affect the Silent Data Corruption (SDC) and crash sensitivity of modern GPUs. We extend this reliability analysis by evaluating the Mean Executions Between Failures and Mean Workload Between Failures of the different algorithm implementations. We then push the trade-off between reliability and performance further by applying hardening strategies to already optimized codes. We show that common strategies, such as ECC and checkpoint-rollback, can be no match for strategies like Algorithm-Based Fault Tolerance and even Duplication with Comparison.
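As an illustration of what Algorithm-Based Fault Tolerance can look like, here is a minimal CPU-side sketch of checksum-protected matrix multiplication. It is not the GPU implementation evaluated in the paper. The function names and the `fault` hook, which emulates a silent data corruption in one output element, are assumptions made for this example.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def abft_matmul(a, b, tol=1e-9, fault=None):
    """Multiply a by b with checksum-based error detection.

    a gets an extra row holding its column sums and b gets an extra
    column holding its row sums, so the product carries its own row and
    column checksums. `fault` is a hypothetical hook (i, j, delta) that
    corrupts one output element to emulate a silent data corruption.
    """
    a_c = a + [[sum(col) for col in zip(*a)]]   # column-checksum version of a
    b_r = [row + [sum(row)] for row in b]       # row-checksum version of b
    c_f = matmul(a_c, b_r)                      # full-checksum product
    if fault is not None:
        i, j, delta = fault
        c_f[i][j] += delta                      # inject the emulated error
    n, m = len(a), len(b[0])
    # A row (column) is suspect when its data no longer sums to its checksum.
    bad_rows = [i for i in range(n)
                if abs(sum(c_f[i][:m]) - c_f[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_f[i][j] for i in range(n)) - c_f[n][j]) > tol]
    c = [row[:m] for row in c_f[:n]]            # strip checksums from result
    return c, bad_rows, bad_cols


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(abft_matmul(a, b))                    # clean run: no suspect rows or columns
print(abft_matmul(a, b, fault=(1, 0, 5)))   # corrupted run: flags row 1, column 0
```

A single corrupted element shows up as exactly one inconsistent row and one inconsistent column, which locates the error; the difference between the row sum and its checksum then gives the amount needed to correct it. Duplication with Comparison, by contrast, would run the computation twice and compare the outputs, which detects a mismatch but does not locate or correct it.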


Published in

FTXS '15: Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale
June 2015
78 pages
ISBN: 9781450335690
DOI: 10.1145/2751504

Copyright © 2015 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

FTXS '15 paper acceptance rate: 9 of 15 submissions, 60%
Overall acceptance rate: 16 of 25 submissions, 64%
