Abstract
The high performance, high efficiency, and low cost of Commercial Off-The-Shelf (COTS) devices make them attractive for applications with strict reliability constraints. Today, COTS devices are adopted in HPC and in safety-critical applications such as autonomous driving. Unfortunately, the inexpensive natural boron widely used in the COTS chip manufacturing process makes these devices highly susceptible to thermal (low-energy) neutrons. In this paper, we demonstrate that thermal neutrons are a significant threat to COTS device reliability. For our study, we consider two DDR memories, an AMD APU, three NVIDIA GPUs, an Intel accelerator, and an FPGA executing a relevant set of algorithms. We consider different scenarios that affect the thermal neutron flux, such as weather, concrete walls and floors, and HPC liquid cooling systems. Correlating beam experiments with neutron detector data, we show that the thermal neutron FIT rate could be comparable to, or even higher than, the high-energy neutron FIT rate.
Acknowledgements
The authors would like to thank Robert Baumann and Gus Sinnis for their invaluable help and support. This work is based on experiments performed thanks to the STFC support (DOI: 10.5286/ISIS.E.RB2000036 and DOI: 10.5286/ISIS.E.RB1900122) and was partially sponsored by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project numbers 20190499ER and 20180017ER, by CAPES/PVE - Finance Code 001, by the EU H2020 Programme and MCTI/RNP-Brazil under the HPC4E project, Grant Agreement No. 689772, and by the project FAPERGS 17/2551-0001 2020. Document approved for release with the code LA-UR-20-23114.
FIT (failure in time) rate is a measure of the number of device failures in one billion (\(10^9\)) device-hours of operation.
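The definition above can be turned into a small calculation. The following is a minimal sketch, not the authors' analysis code: the function names and example numbers are hypothetical. The second function additionally assumes the usual beam-testing convention that a device cross section (in cm²) multiplied by a neutron flux (in n/(cm²·h)) gives failures per device-hour, which is then scaled to \(10^9\) hours.

```python
# Minimal sketch of FIT arithmetic. All names and numbers are illustrative
# assumptions, not values taken from the paper.

FIT_HOURS = 1e9  # FIT counts failures per 10^9 device-hours


def fit_from_field_data(failures: float, devices: int, hours_per_device: float) -> float:
    """FIT rate from observed failures across a population of devices."""
    device_hours = devices * hours_per_device
    if device_hours <= 0:
        raise ValueError("total device-hours must be positive")
    return failures / device_hours * FIT_HOURS


def fit_from_cross_section(cross_section_cm2: float, flux_n_per_cm2_h: float) -> float:
    """FIT rate from a measured cross section and an environmental flux.

    Assumes failures per device-hour = cross section * flux.
    """
    if cross_section_cm2 < 0 or flux_n_per_cm2_h < 0:
        raise ValueError("cross section and flux must be non-negative")
    return cross_section_cm2 * flux_n_per_cm2_h * FIT_HOURS


if __name__ == "__main__":
    # Hypothetical example: 2 failures over 1,000 devices running 10,000 h each.
    print(fit_from_field_data(2, 1000, 10_000))
    # Hypothetical example: the same cross section under two different fluxes,
    # showing that the FIT rate scales linearly with the flux.
    print(fit_from_cross_section(1e-9, 10.0))
    print(fit_from_cross_section(1e-9, 20.0))
```

Because the FIT rate is linear in the flux under this assumption, anything that raises the local thermal neutron flux raises the thermal FIT rate by the same factor.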
Cite this article
Oliveira, D., Blanchard, S., DeBardeleben, N. et al. Thermal neutrons: a possible threat for supercomputer reliability. J Supercomput 77, 1612–1634 (2021). https://doi.org/10.1007/s11227-020-03324-9
DOI: https://doi.org/10.1007/s11227-020-03324-9