Abstract
As graphics processing units (GPUs) evolve to offer high performance for general-purpose computations in addition to inherently fault-tolerant graphics applications, soft error reliability becomes a significant concern. Fault injection provides a method of evaluating the soft error vulnerability of target programs. Because fault injection experiments on complex GPU hardware structures take impractically long, prediction-based techniques that evaluate the soft error vulnerability of general-purpose GPU (GPGPU) programs from metrics of different domains become crucial for both HPC developers and GPU vendors. In this work, we propose machine learning (ML)-based prediction frameworks for the soft error vulnerability evaluation of GPGPU programs. We consider program characteristics, hardware usage, and performance metrics collected from simulation and profiling tools. We use regression models to predict the masked fault rates, and we build classification models to determine the vulnerability level of GPGPU programs based on their silent data corruption (SDC) and crash rates. Our prediction models achieve maximum prediction accuracy rates of 95.9%, 88.46%, and 85.7% for masked fault rates, SDCs, and crashes, respectively.
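As a rough illustration of the quantities named in the abstract, the sketch below derives masked, SDC, and crash rates from the outcomes of a fault injection campaign and maps a rate to a coarse vulnerability level. The outcome labels, the function names, the sample counts, and the 0.1/0.3 thresholds are all assumptions made for this example; they are not taken from the paper or its repository.

```python
from collections import Counter

# Hypothetical labels for the outcome of a single fault injection run.
OUTCOMES = ("masked", "sdc", "crash")


def outcome_rates(outcomes):
    """Return the fraction of injection runs that ended in each outcome."""
    if not outcomes:
        raise ValueError("need at least one fault injection outcome")
    counts = Counter(outcomes)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = len(outcomes)
    return {name: counts[name] / total for name in OUTCOMES}


def vulnerability_level(rate, thresholds=(0.1, 0.3)):
    """Map a rate to a class label; the thresholds are illustrative assumptions."""
    low, high = thresholds
    if rate < low:
        return "low"
    if rate < high:
        return "medium"
    return "high"


# Made-up campaign of 100 injections, used only to exercise the functions.
runs = ["masked"] * 75 + ["sdc"] * 20 + ["crash"] * 5
rates = outcome_rates(runs)
levels = {name: vulnerability_level(rates[name]) for name in ("sdc", "crash")}
```

In a setup like the one the abstract describes, the masked rate would serve as a regression target and the SDC and crash levels as classification targets, with program and hardware metrics as input features.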
Data availability
We share all source code related to our prediction environment, along with our configuration files and the metrics collected from both the profiler and the simulator, in our GitHub repository: https://github.com/topcuburak/FaultPredictionOnGPGPUs.
Acknowledgements
This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK), Grant No: 119E011.
Author information
Contributions
BT ran the experiments and wrote the main manuscript. IO defined the methodology, analyzed the results, and contributed to writing the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflicts of interest
The authors declare that they have no competing interests.
Consent to participate
Not applicable
Consent for publication
Not applicable
Ethical approval
Not applicable
About this article
Cite this article
Topçu, B., Öz, I. Soft error vulnerability prediction of GPGPU applications. J Supercomput 79, 6965–6990 (2023). https://doi.org/10.1007/s11227-022-04933-2
DOI: https://doi.org/10.1007/s11227-022-04933-2