survey

A Methodology for Comparing the Reliability of GPU-Based and CPU-Based HPCs

Published: 06 February 2020

Abstract

Today, GPUs are widely used as coprocessors and accelerators in heterogeneous high-performance computing because of their many advantages. However, many studies emphasize that GPUs are not yet as reliable as desired. Although GPUs are more vulnerable to hardware errors than CPUs, their use in HPCs continues to grow. Moreover, because of the inherent reliability problems of GPUs, combining a large number of GPUs with CPUs can significantly increase the failure rates of HPCs. For this reason, analyzing the reliability characteristics of GPU-based HPCs has become an important issue, and in this study we evaluate the reliability of GPU-based HPCs. To this end, we first examined field data analysis studies of GPU-based and CPU-based HPCs and identified factors that could increase system failure and error rates. We then used these factors to compare GPU-based HPCs with CPU-based HPCs in terms of reliability, in order to point out the reliability challenges of GPU-based HPCs. Our primary goal is to present a study that can guide researchers in this field by indicating the current state of GPU-based heterogeneous HPCs and the requirements for the future in terms of reliability. Our second goal is to offer a methodology for comparing the reliability of GPU-based and CPU-based HPCs. To the best of our knowledge, this is the first survey to compare the reliability of GPU-based and CPU-based HPCs in a systematic manner.

References

[1]
2018. pizDaint Supercomputer. Retrieved from https://www.cscs.ch/computers/piz-daint/.
[2]
2018. Titan Supercomputer. Retrieved from https://www.olcf.ornl.gov/titan/.
[3]
2018. Top500 HPC List. Retrieved from https://www.top500.org.
[4]
Muhammad Alfian Amrizal, Pei Li, Mulya Agung, Ryusuke Egawa, and Hiroyuki Takizawa. 2018. A failure prediction-based adaptive checkpointing method with less reliance on temperature monitoring for HPC applications. In 2018 IEEE International Conference on Cluster Computing (CLUSTER’18). IEEE, 515--523.
[5]
Elisabeth Baseman, Nathan DeBardeleben, Kurt Ferreira, Scott Levy, Steven Raasch, Vilas Sridharan, Taniya Siddiqua, and Qiang Guan. 2016. Improving DRAM fault characterization through machine learning. In 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop. IEEE, 250--253.
[6]
Leonardo Bautista-Gomez, Ana Gainaru, Swann Perarnau, Devesh Tiwari, Saurabh Gupta, Christian Engelmann, Franck Cappello, and Marc Snir. 2016. Reducing waste in extreme scale systems through introspective analysis. In 2016 IEEE International Parallel and Distributed Processing Symposium. IEEE, 212--221.
[7]
Leonardo Bautista-Gomez, Ferad Zyulkyarov, Osman Unsal, and Simon McIntosh-Smith. 2016. Unprotected computing: A large-scale study of DRAM raw error rate on a supercomputer. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC’16). IEEE, 645--655.
[8]
Robin M. Betz, Nathan A. DeBardeleben, and Ross C. Walker. 2014. An investigation of the effects of hard and soft errors on graphics processing unit-accelerated molecular dynamics simulations. Concurrency and Computation: Practice and Experience 26, 13 (2014), 2134--2140.
[9]
Franck Cappello. 2009. Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities. International Journal of High Performance Computing Applications 23, 3 (2009), 212--226.
[10]
Franck Cappello, Al Geist, William Gropp, Sanjay Kale, Bill Kramer, and Marc Snir. 2014. Toward exascale resilience: 2014 update. Supercomputing Frontiers and Innovations 1, 1 (2014), 5--28.
[11]
Nicholas P. Cardo. [n.d.]. Detecting and managing GPU failures.
[12]
Athanasios Chatzidimitriou, Manolis Kaliorakis, Sotiris Tselonis, and Dimitris Gizopoulos. 2017. Performance-aware reliability assessment of heterogeneous chips. In 2017 IEEE 35th VLSI Test Symposium (VTS’17). IEEE, 1--6.
[13]
Min Chen, Shiwen Mao, and Yunhao Liu. 2014. Big data: A survey. Mobile Networks and Applications 19, 2 (2014), 171--209.
[14]
Daniel Dauwe, Sudeep Pasricha, Anthony A. Maciejewski, and Howard Jay Siegel. 2017. An analysis of resilience techniques for exascale computing platforms. In IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW’17). IEEE, 914--923.
[15]
Nathan DeBardeleben, Sean Blanchard, David Kaeli, and Paolo Rech. 2015. Field, experimental, and analytical data on large-scale HPC systems and evaluation of the implications for exascale system design. In 2015 IEEE 33rd VLSI Test Symposium (VTS’15). IEEE, 1--2.
[16]
Nathan DeBardeleben, Sean Blanchard, Laura Monroe, Phil Romero, Daryl Grunau, Craig Idler, and Cornell Wright. 2013. GPU behavior on a large HPC cluster. In European Conference on Parallel Processing. Springer, 680--689.
[17]
Nathan DeBardeleben, Sean Blanchard, Vilas Sridharan, Sudhanva Gurumurthi, Jon Stearley, K. Ferreira, and John Shalf. 2014. Extra bits on SRAM and DRAM errors--More data from the field. In IEEE Workshop on Silicon Errors in Logic-System Effects (SELSE’14).
[18]
David Defour and Eric Petit. 2013. GPUburn: A system to test and mitigate GPU hardware failures. In International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII’13). IEEE, 263--270.
[19]
Sheng Di and Franck Cappello. 2016. Adaptive impact-driven detection of silent data corruption for HPC applications. IEEE Transactions on Parallel and Distributed Systems 27, 10 (2016), 2809--2823.
[20]
Sheng Di, Rinku Gupta, Marc Snir, Eric Pershey, and Franck Cappello. 2017. LogAider: A tool for mining potential correlations of HPC log events. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. IEEE Press, 442--451.
[21]
Catello Di Martino, Marcello Cinque, and Domenico Cotroneo. 2012. Assessing time coalescence techniques for the analysis of supercomputer logs. In 2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’12). IEEE, 1--12.
[22]
Catello Di Martino, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Fabio Baccanico, Joseph Fullop, and William Kramer. 2014. Lessons learned from the analysis of system failures at petascale: The case of blue waters. In 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’14). IEEE, 610--621.
[23]
Martin Dimitrov, Mike Mantor, and Huiyang Zhou. 2009. Understanding software approaches for GPGPU reliability. In Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units. ACM, 94--104.
[24]
Jack Dongarra, Pete Beckman, Terry Moore, Patrick Aerts, Giovanni Aloisio, Jean-Claude Andre, David Barkai, Jean-Yves Berthou, Taisuke Boku, Bertrand Braunschweig, Franck Cappello, Barbara Chapman, Chi Xuebin Chi, Alok Choudhary, Sudip Dosanjh, Thom Dunning, Sandro Fiore, Al Geist, Bill Gropp, Robert Harrison, Mark Hereld, Michael Heroux, Adolfy Hoisie, Koh Hotta, Jin Zhong Jin, Yutaka Ishikawa, Fred Johnson, Sanjay Kale, Richard Kenway, David Keyes, Bill Kramer, Jesus Labarta, Alain Lichnewsky, Thomas Lippert, Bob Lucas, Barney MacCabe, Satoshi Matsuoka, Paul Messina, Peter Michielse, Bernd Mohr, Matthias S. Mueller, Wolfgang E. Nagel, Hiroshi Nakashima, Michael E. Papka, Dan Reed, Mitsuhisa Sato, Ed Seidel, John Shalf, David Skinner, Marc Snir, Thomas Sterling, Rick Stevens, Fred Streitz, Bob Sugar, Shinji Sumimoto, William Tang, John Taylor, Rajeev Thakur, Anne Trefethen, Mateo Valero, Aad Van Der Steen, Jeffrey Vetter, Peg Williams, Robert Wisniewski, and Kathy Yelick. 2011. The international exascale software project roadmap. International Journal of High Performance Computing Applications 25, 1 (2011), 3--60.
[25]
Nosayba El-Sayed and Bianca Schroeder. 2013. Reading between the lines of failure logs: Understanding how HPC systems fail. In 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’13). IEEE, 1--12.
[26]
Bo Fang, Karthik Pattabiraman, Matei Ripeanu, and Sudhanva Gurumurthi. 2014. GPU-QIN: A methodology for evaluating the error resilience of GPGPU applications. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’14). IEEE, 221--230.
[27]
Bo Fang, Karthik Pattabiraman, Matei Ripeanu, and Sudhanva Gurumurthi. 2016. A systematic methodology for evaluating the error resilience of GPGPU applications. IEEE Transactions on Parallel and Distributed Systems 27, 12 (2016), 3397--3411.
[28]
Bo Fang, Jiesheng Wei, Karthik Pattabiraman, and Matei Ripeanu. 2012. Evaluating error resiliency of GPGPU applications. In 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC’12). IEEE, 1502--1503.
[29]
Bo Fang, Jiesheng Wei, Karthik Pattabiraman, and Matei Ripeanu. 2012. Towards building error resilient GPGPU applications. In SC Companion: High Performance Computing, Networking Storage and Analysis.
[30]
Valerio Formicola, Saurabh Jha, Daniel Chen, Fei Deng, Amanda Bonnie, Mike Mason, Jim Brandt, Ann Gentile, Larry Kaplan, Jason Repik, Jeremy Enos, Mike Showerman, Annette Greiner, Zbigniew Kalbarczyk, Ravishankar K. Iyer, and Bill Krammer. 2017. Understanding fault scenarios and impacts through fault injection experiments in Cielo. Urbana 51 (2017), 61801.
[31]
Ana Gainaru, Franck Cappello, and William Kramer. 2012. Taming of the shrew: Modeling the normal and faulty behaviour of large-scale hpc systems. In 2012 IEEE 26th International Parallel 8 Distributed Processing Symposium (IPDPS’12). IEEE, 1168--1179.
[32]
Ana Gainaru, Franck Cappello, Marc Snir, and William Kramer. 2012. Fault prediction under the microscope: A closer look into HPC systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 77.
[33]
Al Geist and Daniel A. Reed. 2017. A survey of high-performance computing scaling challenges. International Journal of High Performance Computing Applications 31, 1 (2017), 104--113.
[34]
L. Bautista Gomez, Franck Cappello, Luigi Carro, Nathan DeBardeleben, Bo Fang, Sudhanva Gurumurthi, Karthik Pattabiraman, Paolo Rech, and M. Sonza Reorda. 2014. GPGPUs: How to combine high computational power with high reliability. In Proceedings of the Conference on Design, Automation 8 Test in Europe. European Design and Automation Association, 341.
[35]
Leonardo Bautista Gomez, Akira Nukada, Naoya Maruyama, Franck Cappello, and Satoshi Matsuoka. 2010. Low-overhead diskless checkpoint for hybrid computing systems. In 2010 International Conference on High Performance Computing (HiPC’10). IEEE, 1--10.
[36]
Narasimha Raju Gottumukkala, Chokchai Box Leangsuksun, Narate Taerat, Raja Nassar, and Stephen L. Scott. 2007. Reliability-aware resource allocation in HPC systems. In 2007 IEEE International Conference on Cluster Computing. IEEE, 312--321.
[37]
Narasimha Raju Gottumukkala, Raja Nassar, Mihaela Paun, Chokchai Box Leangsuksun, and Stephen L. Scott. 2010. Reliability of a system of k nodes for high performance computing applications. IEEE Transactions on Reliability 59, 1 (2010), 162--169.
[38]
Jiexing Gu, Ziming Zheng, Zhiling Lan, John White, Eva Hocks, and Byung-Hoon Park. 2008. Dynamic meta-learning for failure prediction in large-scale systems: A case study. In 37th International Conference on Parallel Processing, 2008 (ICPP’08). IEEE, 157--164.
[39]
Prashasta Gujrati, Yawei Li, Zhiling Lan, Rajeev Thakur, and John White. 2007. A meta-learning failure predictor for blue gene/l systems. In International Conference on Parallel Processing, 2007 (ICPP’07). IEEE, 40--40.
[40]
Saurabh Gupta, Tirthak Patel, Christian Engelmann, and Devesh Tiwari. 2017. Failures in large scale systems: Long-term measurement, analysis, and implications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 44.
[41]
Saurabh Gupta, Devesh Tiwari, Christopher Jantzi, James Rogers, and Don Maxwell. 2015. Understanding and exploiting spatial properties of system failures on extreme-scale HPC systems. In 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’15). IEEE, 37--44.
[42]
Thomas J. Hacker, Fabian Romero, and Christopher D. Carothers. 2009. An analysis of clustered failures on large supercomputing systems. Journal of Parallel and Distributed Computing 69, 7 (2009), 652--665.
[43]
Imran S. Haque and Vijay S. Pande. 2010. Hard data on soft errors: A large-scale assessment of real-world error rates in GPGPU. In Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing. IEEE Computer Society, 691--696.
[44]
Siva Kumar Sastry Hari, Sarita V. Adve, Helia Naeimi, and Pradeep Ramachandran. 2013. Relyzer: Application resiliency analyzer for transient faults. IEEE Micro 33, 3 (2013), 58--66.
[45]
Siva Kumar Sastry Hari, Timothy Tsai, Mark Stephenson, Stephen W. Keckler, and Joel Emer. 2015. Sassifi: Evaluating resilience of GPU applications. In Proceedings of the Workshop on Silicon Errors in Logic-System Effects (SELSE’15).
[46]
Siva Kumar Sastry Hari, Timothy Tsai, Mark Stephenson, Stephen W. Keckler, and Joel Emer. 2017. Sassifi: An architecture-level fault injection tool for GPU application resilience evaluation. In 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’17). IEEE, 249--258.
[47]
Eric Heien, Derrick Kondo, Ana Gainaru, Dan LaPine, Bill Kramer, and Franck Cappello. 2011. Modeling and tolerating heterogeneous failures in large parallel systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 45.
[48]
Marcus Hilbrich, Matthias Weber, and Ronny Tschüter. 2013. Automatic analysis of large data sets: A walk-through on methods from different perspectives. In 2013 International Conference on Cloud Computing and Big Data (CloudCom-Asia’13). IEEE, 373--380.
[49]
Andy A. Hwang, Ioan A. Stefanovici, and Bianca Schroeder. 2012. Cosmic rays don’t strike twice: Understanding the nature of DRAM errors and the implications for system design. In ACM SIGPLAN Notices, Vol. 47. ACM, 111--122.
[50]
Elmira Yu Kalimulina. 2017. Analysis of system reliability with control, dependent failures, and arbitrary repair times. International Journal of System Assurance Engineering and Management 8, 1 (2017), 180--188.
[51]
Sudarsun Kannan, Naila Farooqui, Ada Gavrilovska, and Karsten Schwan. 2014. Heterocheckpoint: Efficient checkpointing for accelerator-based systems. In 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’14). IEEE, 738--743.
[52]
David B. Kirk and W. Hwu Wen-Mei. 2016. Programming Massively Parallel Processors: A Hands-On Approach. Morgan Kaufmann.
[53]
Zhiling Lan, Jiexing Gu, Ziming Zheng, Rajeev Thakur, and Susan Coghlan. 2010. A study of dynamic meta-learning for failure prediction in large-scale systems. Journal of Parallel and Distributed Computing 70, 6 (2010), 630--643.
[54]
Scott Levy, Kurt B. Ferreira, Nathan DeBardeleben, Taniya Siddiqua, Vilas Sridharan, and Elisabeth Baseman. 2018. Lessons learned from memory errors observed over the lifetime of Cielo. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. IEEE Press, 43.
[55]
Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Morris Jette, and Ramendra Sahoo. 2006. Bluegene/l failure analysis and prediction models. In International Conference on Dependable Systems and Networks, 2006 (DSN’06). IEEE, 425--434.
[56]
Yudan Liu, Raja Nassar, Chokchai Leangsuksun, Nichamon Naksinehaboon, Mihaela Paun, and Stephen L. Scott. 2008. An optimal checkpoint/restart model for a large scale high performance computing system. In IEEE International Symposium on Parallel and Distributed Processing, 2008 (IPDPS’08). IEEE, 1--9.
[57]
Charng-Da Lu. 2013. Failure data analysis of HPC systems. arXiv preprint arXiv:1302.4779 (2013).
[58]
Naoya Maruyama, Akira Nukada, and Satoshi Matsuoka. 2010. A high-performance fault-tolerant software framework for memory on commodity GPUs. In 2010 IEEE International Symposium on Parallel 8 Distributed Processing (IPDPS’10). IEEE, 1--12.
[59]
Naoya Maruyama, Akira Nukada, and Satoshi Matsuoka. 2009. Software-based ECC for GPUs. In 2009 Symposium on Application Accelerators in High Performance Computing (SAAHPC’09), Vol. 107.
[60]
Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu. 2015. Revisiting memory errors in large-scale production data centers: Analysis and modeling of new trends from the field. In 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’15). IEEE, 415--426.
[61]
Sparsh Mittal and Jeffrey S. Vetter. 2015. A survey of CPU-GPU heterogeneous computing techniques. ACM Computing Surveys (CSUR) 47, 4 (2015), 69.
[62]
Sparsh Mittal and Jeffrey S. Vetter. 2016. A survey of techniques for modeling and improving reliability of computing systems. IEEE Transactions on Parallel and Distributed Systems 27, 4 (2016), 1226--1238.
[63]
Shubhendu S. Mukherjee, Joel Emer, and Steven K. Reinhardt. 2005. The soft error problem: An architectural perspective. In 11th International Symposium on High-Performance Computer Architecture, 2005 (HPCA-11’05). IEEE, 243--247.
[64]
Shubhendu S. Mukherjee, Christopher Weaver, Joel Emer, Steven K. Reinhardt, and Todd Austin. 2003. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003 (MICRO-36’03). IEEE, 29--40.
[65]
Onur Mutlu. 2013. Memory scaling: A systems architecture perspective. In 2013 5th IEEE International Memory Workshop (IMW’13). IEEE, 21--25.
[66]
Onur Mutlu and Lavanya Subramanian. 2015. Research problems and opportunities in memory systems. Supercomputing Frontiers and Innovations 1, 3 (2015), 19--55.
[67]
Bin Nie, Devesh Tiwari, Saurabh Gupta, Evgenia Smirni, and James H. Rogers. 2016. A large-scale study of soft-errors on GPUs in the field. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA’16). IEEE, 519--530.
[68]
Bin Nie, Ji Xue, Saurabh Gupta, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. 2017. Characterizing temperature, power, and soft-error behaviors in data center systems: Insights, challenges, and opportunities. In 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS’17). IEEE, 22--31.
[69]
Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. 2018. Machine learning models for GPU error prediction in a large scale HPC system. In 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’18). IEEE, 95--106.
[70]
Akira Nukada, Hiroyuki Takizawa, and Satoshi Matsuoka. 2011. NVCR: A transparent checkpoint-restart library for NVIDIA CUDA. In 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW’11). IEEE, 104--113.
[71]
Adam Oliner and Jon Stearley. 2007. What supercomputers say: A study of five system logs. In 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2007 (DSN’07). IEEE, 575--584.
[72]
Adam J. Oliner, Alex Aiken, and Jon Stearley. 2008. Alert detection in system logs. In 8th IEEE International Conference on Data Mining, 2008 (ICDM’08). IEEE, 959--964.
[73]
David J. Palframan, Nam Sung Kim, and Mikko H. Lipasti. 2014. Precision-aware soft error protection for GPUs. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA’14). IEEE, 49--59.
[74]
Byung H. Park, Saurabh Hukerikar, Ryan Adamson, and Christian Engelmann. 2017. Big data meets HPC log analytics: Scalable approach to understanding systems at extreme scale. In 2017 IEEE International Conference on Cluster Computing (CLUSTER’17). IEEE, 758--765.
[75]
Antonio Pecchia, Domenico Cotroneo, Zbigniew Kalbarczyk, and Ravishankar K. Iyer. 2011. Improving log-based field failure data analysis of multi-node computing systems. In 2011 IEEE/IFIP 41st International Conference on Dependable Systems 8 Networks (DSN’11). IEEE, 97--108.
[76]
Swann Perarnau and Leonardo Bautista-Gomez. 2016. Monitoring strategies for scalable dynamic checkpointing. In 2016 7th International Green and Sustainable Computing Conference (IGSC’16). IEEE, 1--8.
[77]
Behnam Pourghassemi and Aparna Chandramowlishwaran. 2017. cudaCR: An in-kernel application-level checkpoint/restart scheme for CUDA-enabled GPUs. In 2017 IEEE International Conference on Cluster Computing (CLUSTER’17). IEEE, 725--732.
[78]
Fritz G. Previlon, Babatunde Egbantan, Devesh Tiwari, Paolo Rech, and David R. Kaeli. 2017. Combining architectural fault-injection and neutron beam testing approaches toward better understanding of GPU soft-error resilience. In 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS’17). IEEE, 898--901.
[79]
Narasimha Raju, Y. Liu Gottumukkala, Chokchai B. Leangsuksun, Raja Nassar, and Stephen Scott. 2006. Reliability analysis in HPC clusters. In Proceedings of the High Availability and Performance Computing Workshop. 673--684.
[80]
Paolo Rech, Caroline Aguiar, R. Ferreira, Christopher Frost, and Luigi Carro. 2012. Neutron radiation test of graphic processing units. In 2012 IEEE 18th International On-Line Testing Symposium (IOLTS’12). IEEE, 55--60.
[81]
P. Rech, C. Aguiar, C. Frost, and L. Carro. 2013. An efficient and experimentally tuned software-based hardening strategy for matrix multiplication on GPUs. IEEE Transactions on Nuclear Science 60, 4 (2013), 2797--2804.
[82]
Paolo Rech, Laércio Lima Pilla, Philippe Olivier Alexandre Navaux, and Luigi Carro. 2014. Impact of GPUs parallelism management on safety-critical and HPC applications reliability. In 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’14). IEEE, 455--466.
[83]
Felipe Rosa, Fernanda Kastensmidt, Ricardo Reis, and Luciano Ost. 2015. A fast and scalable fault injection framework to evaluate multi/many-core soft error reliability. In 2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS’15). IEEE, 211--214.
[84]
Ramendra K. Sahoo, Adam J. Oliner, Irina Rish, Manish Gupta, José E. Moreira, Sheng Ma, Ricardo Vilalta, and Anand Sivasubramaniam. 2003. Critical event prediction for proactive management in large-scale computer clusters. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426--435.
[85]
Fernando Fernandes dos Santos and Paolo Rech. 2017. Analyzing the criticality of transient faults-induced SDCS on GPU applications. In Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems. ACM, 1.
[86]
Hamid Sarbazi-Azad. 2016. Advances in GPU Research and Practice. Morgan Kaufmann.
[87]
Horst Schirmeier, Martin Hoffmann, Christian Dietrich, Michael Lenz, Daniel Lohmann, and Olaf Spinczyk. 2015. FAIL*: An open and versatile fault-injection framework for the assessment of software-implemented hardware fault tolerance. In 2015 11th European Dependable Computing Conference (EDCC’15). IEEE, 245--255.
[88]
Bianca Schroeder and Garth Gibson. 2010. A large-scale study of failures in high-performance computing systems. IEEE Transactions on Dependable and Secure Computing 7, 4 (2010), 337--350.
[89]
Bianca Schroeder and Garth A. Gibson. 2007. Understanding failures in petascale computers. In Journal of Physics: Conference Series, Vol. 78. IOP Publishing, 012022.
[90]
Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. 2009. DRAM errors in the wild: A large-scale field study. In ACM SIGMETRICS Performance Evaluation Review, Vol. 37. ACM, 193--204.
[91]
John Shalf, Sudip Dosanjh, and John Morrison. 2010. Exascale computing technology challenges. In International Conference on High Performance Computing for Computational Science. Springer, 1--25.
[92]
Jeremy W. Sheaffer, David P. Luebke, and Kevin Skadron. 2007. A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors. In Graphics Hardware, Vol. 2007. 55--64.
[93]
Justin Y. Shi, Moussa Taifi, Abdallah Khreishah, and Jie Wu. 2011. Sustainable GPU computing at scale. In 2011 IEEE 14th International Conference on Computational Science and Engineering (CSE’11). IEEE, 263--272.
[94]
Lin Shi, Hao Chen, and Ting Li. 2013. Hybrid CPU/GPU checkpoint for GPU-based heterogeneous systems. In International Conference on Parallel Computing in Fluid Dynamics. Springer, 470--481.
[95]
Taniya Siddiqua, Athanasios E. Papathanasiou, Arijit Biswas, and Sudhanva Gurumurthi. 2013. Analysis and modeling of memory errors from large-scale field data collection. In Workshop on Silicon Errors in Logic -- System Effects (SELSE’13).
[96]
Taniya Siddiqua, Vilas Sridharan, Steven E. Raasch, Nathan DeBardeleben, Kurt B. Ferreira, Scott Levy, Elisabeth Baseman, and Qiang Guan. 2017. Lifetime memory reliability data from the field. In 2017 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT’17). IEEE, 1--6.
[97]
Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2015. Efficient GPU synchronization without scopes: Saying no to complex consistency models. In Proceedings of the 48th International Symposium on Microarchitecture. ACM, 647--659.
[98]
Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2017. HeteroSync: A benchmark suite for fine-grained synchronization on tightly coupled GPUs. In 2017 IEEE International Symposium on Workload Characterization (IISWC’17). IEEE, 239--249.
[99]
Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A. Chien, Paul Coteus, Nathan A. DeBardeleben, Pedro C. Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. 2014. Addressing failures in exascale computing. International Journal of High Performance Computing Applications 28, 2 (2014), 129--173.
[100]
Lizandro D. Solano-Quinde, Brett M. Bode, and Arun K. Somani. 2010. Coarse grain computation-communication overlap for efficient application-level checkpointing for GPUs. In 2010 IEEE International Conference on Electro/Information Technology (EIT’10). IEEE, 1--5.
[101]
Vilas Sridharan, Nathan DeBardeleben, Sean Blanchard, Kurt B. Ferreira, Jon Stearley, John Shalf, and Sudhanva Gurumurthi. 2015. Memory errors in modern systems: The good, the bad, and the ugly. In ACM SIGPLAN Notices, Vol. 50. ACM, 297--310.
[102]
Vilas Sridharan and Dean Liberty. 2012. A study of DRAM failures in the field. In 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC’12). IEEE, 1--11.
[103]
Vilas Sridharan, Jon Stearley, Nathan DeBardeleben, Sean Blanchard, and Sudhanva Gurumurthi. 2013. Feng shui of supercomputer memory positional effects in DRAM and SRAM faults. In 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC’13). IEEE, 1--11.
[104]
Jon Stearley, Robert Ballance, and Lara Bauman. 2012. A state-machine approach to disambiguating supercomputer event logs. In 2012 Workshop on Managing Systems Automatically and Dynamically (MAD’12). 155--192.
[105]
Jon Stearley and Adam J. Oliner. 2008. Bad words: Finding faults in Spirit’s syslogs. In 8th IEEE International Symposium on Cluster Computing and the Grid, 2008 (CCGRID’08). IEEE, 765--770.
[106]
Mark Stephenson, Siva Kumar Sastry Hari, Yunsup Lee, Eiman Ebrahimi, Daniel R. Johnson, David Nellans, Mike O’Connor, and Stephen W. Keckler. 2015. Flexible software profiling of GPU architectures. In ACM SIGARCH Computer Architecture News, Vol. 43. ACM, 185--197.
[107]
Narate Taerat, Nichamon Naksinehaboon, Clayton Chandler, James Elliott, Chokchai Leangsuksun, George Ostrouchov, Stephen L. Scott, and Christian Engelmann. 2009. Blue gene/l log analysis and time to interrupt estimation. In International Conference on Availability, Reliability and Security, 2009 (ARES’09). IEEE, 173--180.
[108]
Hiroyuki Takizawa, Katsuto Sato, Kazuhiko Komatsu, and Hiroaki Kobayashi. 2009. CheCUDA: A checkpoint/restart tool for CUDA applications. In 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies. IEEE, 408--413.
[109]
Jingweijia Tan, Nilanjan Goswami, Tao Li, and Xin Fu. 2011. Analyzing soft-error vulnerability on GPGPU microarchitecture. In 2011 IEEE International Symposium on Workload Characterization (IISWC’11). IEEE, 226--235.
[110]
Thanadech Thanakornworakij, Raja Nassar, Chokchai Box Leangsuksun, and Mihaela Paun. 2011. The effect of correlated failure on the reliability of HPC systems. In 2011 9th IEEE International Symposium on Parallel and Distributed Processing with Applications Workshops (ISPAW’11). IEEE, 284--288.
[111]
Thanadech Thanakornworakij, Raja Nassar, Chokchai Box Leangsuksun, and Mihaela Paun. 2013. Reliability model of a system of k nodes with simultaneous failures for high-performance computing applications. International Journal of High Performance Computing Applications 27, 4 (2013), 474--482.
[112]
Devesh Tiwari, Saurabh Gupta, George Gallarno, Jim Rogers, and Don Maxwell. 2015. Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 38.
[113]
Devesh Tiwari, Saurabh Gupta, James Rogers, Don Maxwell, Paolo Rech, Sudharshan Vazhkudai, Daniel Oliveira, Dave Londo, Nathan DeBardeleben, Philippe Navaux, Luigi Carro, and Arthur Bland. 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA’15). IEEE, 331--342.
[114]
Devesh Tiwari, Saurabh Gupta, James H. Rogers, and Don E. Maxwell. 2015. Experience with GPUs on the Titan Supercomputer from a Reliability, Performance and Power Perspective. Technical Report. Oak Ridge National Laboratory (ORNL), Oak Ridge, TN. Oak Ridge Leadership Computing Facility (OLCF).
[115]
Sotiris Tselonis and Dimitris Gizopoulos. 2016. GUFI: A framework for GPUs reliability assessment. In 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’16). IEEE, 90--100.
[116]
Alessandro Vallero, Dimitris Gizopoulos, and Stefano Di Carlo. 2017. SIFI: AMD southern islands GPU microarchitectural level fault injector. In IEEE International Symposium on On-Line Testing and Robust System Design (IOLTS’17). 138--144.
[117]
Alessandro Vallero, Sotiris Tselonis, Dimitris Gizopoulos, and Stefano Di Carlo. 2018. Multi-faceted microarchitecture level reliability characterization for NVIDIA and AMD GPUs. In 2018 IEEE 36th VLSI Test Symposium (VTS’18). IEEE, 1--6.
[118]
Guosai Wang, Lifei Zhang, and Wei Xu. 2017. What can we learn from four years of data center hardware failures? In 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’17). IEEE, 25--36.
[119]
Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael Jordan. 2009. Online system problem detection by mining patterns of console logs. In 9th IEEE International Conference on Data Mining, 2009 (ICDM’09). IEEE, 588--597.
[120]
Xinhai Xu, Yufei Lin, Tao Tang, and Yisong Lin. 2010. HiAL-Ckpt: A hierarchical application-level checkpointing for CPU-GPU hybrid systems. In 2010 5th International Conference on Computer Science and Education (ICCSE’10). IEEE, 1895--1899.
[121]
Xin-Hai Xu, Xue-Jun Yang, Jing-Ling Xue, Yu-Fei Lin, and Yi-Song Lin. 2012. PartialRC: A partial recomputing method for efficient fault recovery on GPGPUs. Journal of Computer Science and Technology 27, 2 (2012), 240--255.
[122]
Xuejun Yang, Zhiyuan Wang, Jingling Xue, and Yun Zhou. 2012. The reliability wall for exascale supercomputing. IEEE Transactions on Computers 61, 6 (2012), 767--779.
[123]
Keun Soo Yim, Cuong Pham, Mushfiq Saleheen, Zbigniew Kalbarczyk, and Ravishankar Iyer. 2011. Hauberk: Lightweight silent data corruption error detector for GPGPU. In 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS’11). IEEE, 287--300.
[124]
Mohamed Zahran. 2017. Heterogeneous computing: Here to stay. Communications of the ACM 60, 3 (2017), 42--45.
[125]
Ziming Zheng, Zhiling Lan, Rinku Gupta, Susan Coghlan, and Peter Beckman. 2010. A practical failure prediction with location and lead time for Blue Gene/P. In 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W’10). IEEE, 15--22.

    Published In

    ACM Computing Surveys  Volume 53, Issue 1
    January 2021
    781 pages
    ISSN:0360-0300
    EISSN:1557-7341
    DOI:10.1145/3382040
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 February 2020
    Accepted: 01 November 2019
    Revised: 01 January 2019
    Received: 01 May 2018
    Published in CSUR Volume 53, Issue 1

    Author Tags

    1. System failure
    2. checkpoint/recovery
    3. log file analysis

    Qualifiers

    • Survey
    • Refereed

    Article Metrics

    • Downloads (Last 12 months): 118
    • Downloads (Last 6 weeks): 14
    Reflects downloads up to 16 Feb 2025

    Cited By

    • (2024) Research and training of helmet recognition model based on deep learning. International Conference on Mechatronic Engineering and Artificial Intelligence (MEAI 2023). 10.1117/12.3025727 (189). Online publication date: 28-Feb-2024
    • (2022) GPU Devices for Safety-Critical Systems: A Survey. ACM Computing Surveys 55:7 (1-37). 10.1145/3549526. Online publication date: 15-Dec-2022
    • (2022) A Survey on Dynamic Fuzzy Machine Learning. ACM Computing Surveys 55:7 (1-42). 10.1145/3544013. Online publication date: 15-Dec-2022
    • (2022) Methodological Standards in Accessibility Research on Motor Impairments: A Survey. ACM Computing Surveys 55:7 (1-35). 10.1145/3543509. Online publication date: 15-Dec-2022
    • (2022) A comparison between CPU and GPU for image classification using Convolutional Neural Networks. 2022 International Conference on Communication, Computing and Internet of Things (IC3IoT) (1-4). 10.1109/IC3IOT53935.2022.9767990. Online publication date: 10-Mar-2022
    • (2021) Regional soft error vulnerability and error propagation analysis for GPGPU applications. The Journal of Supercomputing. 10.1007/s11227-021-04026-6. Online publication date: 23-Aug-2021
