Thermal neutrons: a possible threat for supercomputer reliability

The Journal of Supercomputing

Abstract

The high performance, high efficiency, and low cost of Commercial Off-The-Shelf (COTS) devices make them attractive for applications with strict reliability constraints. Today, COTS devices are adopted in HPC and in safety-critical applications such as autonomous driving. Unfortunately, the inexpensive natural boron widely used in the COTS chip manufacturing process makes these devices highly susceptible to thermal (low-energy) neutrons. In this paper, we demonstrate that thermal neutrons are a significant threat to COTS device reliability. For our study, we consider two DDR memories, an AMD APU, three NVIDIA GPUs, an Intel accelerator, and an FPGA executing a relevant set of algorithms. We consider different scenarios that affect the thermal neutron flux, such as weather, concrete walls and floors, and HPC liquid cooling systems. By correlating beam experiments with neutron detector data, we show that the thermal neutron FIT rate can be comparable to, or even higher than, the high-energy neutron FIT rate.

Acknowledgements

The authors would like to thank Robert Baumann and Gus Sinnis for their invaluable help and support. This work is based on experiments performed with STFC support (DOI: 10.5286/ISIS.E.RB2000036 and DOI: 10.5286/ISIS.E.RB1900122) and was partially sponsored by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project numbers 20190499ER and 20180017ER, by CAPES/PVE - Finance Code 001, by the EU H2020 Programme and by MCTI/RNP-Brazil under the HPC4E project, Grant Agreement No. 689772, and by the project FAPERGS 17/2551-0001 2020. Document approved for release with the code LA-UR-20-23114.

Author information

Corresponding author

Correspondence to Paolo Rech.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

FIT (failure in time) rate is a measure of the number of device failures in one billion (\(10^9\)) device-hours of operation.
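As a worked illustration of this definition, the short Python sketch below converts an observed failure count into a FIT rate and back into an expected time between failures. The function names and all numeric values are hypothetical and are not taken from the article; the second function additionally assumes a constant failure rate and independent devices.

```python
def fit_rate(failures: float, device_hours: float) -> float:
    """Return failures per 10^9 device-hours (the FIT definition above)."""
    if device_hours <= 0:
        raise ValueError("device_hours must be positive")
    return failures * 1e9 / device_hours


def mean_hours_between_failures(fit: float, devices: int = 1) -> float:
    """Expected hours between failures across `devices` identical parts.

    Assumption (not from the article): the failure rate is constant and
    devices fail independently, so the rates simply add up.
    """
    if fit <= 0 or devices <= 0:
        raise ValueError("fit and devices must be positive")
    return 1e9 / (fit * devices)


if __name__ == "__main__":
    # Hypothetical numbers: 2 failures observed on 1,000 devices
    # running for 10,000 hours each (10^7 device-hours in total).
    fit = fit_rate(failures=2, device_hours=1_000 * 10_000)
    print(f"FIT rate: {fit:.1f}")
    print(f"Hours between failures in a 1,000-device system: "
          f"{mean_hours_between_failures(fit, devices=1_000):.1f}")
```

Because the FIT rate is normalized per device, the system-level failure frequency grows with the number of devices, which is why even modest per-device rates matter at supercomputer scale.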

About this article

Cite this article

Oliveira, D., Blanchard, S., DeBardeleben, N. et al. Thermal neutrons: a possible threat for supercomputer reliability. J Supercomput 77, 1612–1634 (2021). https://doi.org/10.1007/s11227-020-03324-9

  • DOI: https://doi.org/10.1007/s11227-020-03324-9
