Evaluating low-level software-based hardening techniques for configurable GPU architectures

The Journal of Supercomputing

Abstract

The high processing power of GPUs makes them attractive for safety-critical applications, where transient effects are a major concern and resilience must be enforced without compromising performance. Configurable softcore GPUs are a recent technology that enables detailed reliability assessment, which can guide the design of reliable GPU applications. This work investigates the reliability of the register files and the pipeline of a softcore GPU under radiation-induced faults and proposes software-based fault tolerance techniques to mitigate the resulting errors. Faults are simulated at the register transfer level in four case-study algorithms, and the Architectural Vulnerability Factor (AVF) and Mean Workload to Failure (MWTF) are evaluated across different GPU configurations. Results indicate that software-based techniques efficiently reduce AVF. In terms of MWTF, the best cases depend on an optimized balance between GPU configuration, application runtime, and AVF.
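
To make the trade-off between AVF and runtime concrete, the following is a minimal sketch and not the authors' evaluation flow. It assumes the commonly used definitions: AVF as the fraction of injected faults that cause an observable failure, and MWTF as 1 / (raw error rate × AVF × execution time). The `Campaign` class, the `raw_error_rate` parameter, and all numbers are hypothetical placeholders, not results from the article.

```python
from dataclasses import dataclass


@dataclass
class Campaign:
    """Outcome of one fault-injection campaign (all values hypothetical)."""
    injected: int      # number of faults injected
    failures: int      # injections that led to an observable failure
    runtime_s: float   # fault-free execution time of the application


def avf(c: Campaign) -> float:
    # Assumed definition: share of injected faults that cause a failure.
    return c.failures / c.injected


def mwtf(c: Campaign, raw_error_rate: float = 1.0) -> float:
    # Assumed definition: 1 / (raw error rate * AVF * execution time).
    # raw_error_rate is a placeholder; only ratios between runs are compared.
    denom = raw_error_rate * avf(c) * c.runtime_s
    return float("inf") if denom == 0 else 1.0 / denom


# Hypothetical numbers: hardening cuts AVF tenfold but costs 2.5x runtime.
baseline = Campaign(injected=10_000, failures=2_000, runtime_s=1.0)
hardened = Campaign(injected=10_000, failures=200, runtime_s=2.5)

gain = mwtf(hardened) / mwtf(baseline)
print(f"AVF: {avf(baseline):.3f} -> {avf(hardened):.3f}, MWTF gain: {gain:.1f}x")
```

With these placeholder values the tenfold AVF reduction is partly offset by the longer runtime, so a lower AVF alone does not guarantee the best MWTF.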



Acknowledgements

This work has been partially supported by the European Commission through the Horizon 2020 RESCUE-ETN project under grant 722325, Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) - Finance Code 001, Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), and Fundação de Amparo à pesquisa do Estado do RS (FAPERGS).

Author information

Corresponding author

Correspondence to Marcio M. Goncalves.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Goncalves, M.M., Condia, J.E.R., Reorda, M.S. et al. Evaluating low-level software-based hardening techniques for configurable GPU architectures. J Supercomput 78, 8081–8105 (2022). https://doi.org/10.1007/s11227-021-04154-z

  • DOI: https://doi.org/10.1007/s11227-021-04154-z
