Soft error vulnerability prediction of GPGPU applications

Topçu, Burak; Öz, Işıl

doi:10.1007/s11227-022-04933-2

Published: 19 November 2022

Volume 79, pages 6965–6990, (2023)

The Journal of Supercomputing

Burak Topçu¹ &
Işıl Öz¹

Abstract

As graphics processing units (GPUs) evolve to offer high performance for general-purpose computations in addition to inherently fault-tolerant graphics applications, soft error reliability becomes a significant concern. Fault injection provides a method of evaluating the soft error vulnerability of target programs. Because fault injection experiments on complex GPU hardware structures take impractically long, prediction-based techniques that evaluate the soft error vulnerability of general-purpose GPU (GPGPU) programs from metrics in different domains become crucial for both HPC developers and GPU vendors. In this work, we propose machine learning (ML)-based prediction frameworks for the soft error vulnerability evaluation of GPGPU programs. We consider program characteristics, hardware usage, and performance metrics collected from simulation and profiling tools. We use regression models to predict the masked fault rates, and we build classification models to specify the vulnerability level of GPGPU programs based on their silent data corruption (SDC) and crash rates. Our prediction models achieve maximum prediction accuracy rates of 95.9%, 88.46%, and 85.7% for masked fault rates, SDCs, and crashes, respectively.
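To make the prediction targets named in the abstract concrete, the following minimal Python sketch shows how masked, SDC, and crash rates could be derived from the outcomes of a fault injection campaign, and how a rate could be mapped to a coarse vulnerability level for classification. It is an illustration only, not the authors' pipeline: the function names, the sample campaign, and the 10% and 30% thresholds are all assumptions that do not come from the article.

```python
from collections import Counter

# Outcome labels conventionally used in fault injection studies.
OUTCOMES = ("masked", "sdc", "crash")


def outcome_rates(outcomes):
    """Return the fraction of injections that ended in each outcome."""
    if not outcomes:
        raise ValueError("at least one fault injection outcome is required")
    counts = Counter(outcomes)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = len(outcomes)
    return {name: counts[name] / total for name in OUTCOMES}


def vulnerability_level(rate, low=0.10, high=0.30):
    """Map a rate to a coarse class label.

    The thresholds are hypothetical; they are not taken from the article.
    """
    if rate < low:
        return "low"
    if rate < high:
        return "medium"
    return "high"


# Hypothetical campaign: 100 injections into one GPGPU program.
campaign = ["masked"] * 70 + ["sdc"] * 25 + ["crash"] * 5
rates = outcome_rates(campaign)
labels = {
    "sdc": vulnerability_level(rates["sdc"]),
    "crash": vulnerability_level(rates["crash"]),
}
print(rates)
print(labels)
```

In a setup like this, the masked rate would serve as a continuous regression target, while the SDC and crash labels would serve as classification targets, with the program, hardware usage, and performance metrics as input features.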

Data availability

We share all source code for our prediction environment, our configuration files, and the metrics collected from both the profiler and the simulator in our GitHub repository: https://github.com/topcuburak/FaultPredictionOnGPGPUs.

Acknowledgements

This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK), Grant No: 119E011.

Funding

This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK), Grant No: 119E011.

Author information

Authors and Affiliations

Computer Engineering Department, Izmir Institute of Technology, Izmir, Turkey
Burak Topçu & Işıl Öz

Contributions

BT ran the experiments and wrote the main manuscript. IO defined the methodology, analyzed the results, and contributed to writing the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Işıl Öz.

Ethics declarations

Conflicts of interest

The authors declare that they have no competing interests.

Consent to participate

Not applicable

Consent for publication

Not applicable

Ethical approval

Not applicable

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Topçu, B., Öz, I. Soft error vulnerability prediction of GPGPU applications. J Supercomput 79, 6965–6990 (2023). https://doi.org/10.1007/s11227-022-04933-2

Accepted: 06 November 2022
Published: 19 November 2022
Issue Date: April 2023
DOI: https://doi.org/10.1007/s11227-022-04933-2
