Programmer-guided reliability for extreme-scale applications

Bernholdt, David E.; Elwasif, Wael R.; Kartsaklis, Christos; Lee, Seyong; Mintz, Tiffany M.

doi:10.1177/1094342016667625

Journal Article · November 30, 2016 · International Journal of High Performance Computing Applications

DOI: https://doi.org/10.1177/1094342016667625 · OSTI ID: 1464028

Bernholdt, David E. ^[1]; Elwasif, Wael R. ^[1]; Kartsaklis, Christos ^[1]; Lee, Seyong ^[1]; Mintz, Tiffany M. ^[1]

^[1] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

We present “programmer-guided reliability” (PGR) as a systematic conceptual approach to address the expected rise in soft errors in coming extreme-scale systems at the application level. The approach involves instrumentation of the application with code to detect data corruption errors. The location and nature of these error detectors are at the discretion of the programmer, who uses their knowledge and experience with the problem domain, the application, the solution algorithms, etc., to determine the most vulnerable areas of the code and the most appropriate ways to detect data corruption. To illustrate the approach, we provide examples of error detectors from four different benchmark-scale applications. We also describe a simple control framework that allows for runtime configuration of the error detectors without recompilation of the application, as well as dynamic reconfiguration during the execution of the application. Lastly, we discuss a number of future directions building on the basic PGR approach, including the incorporation of some general error detectors into the programming environment in order to make them more easily usable by the programmer.
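As a rough illustration of the idea in the abstract, the sketch below shows programmer-placed error detectors that can be switched on or off at run time without recompilation. It is not the authors' framework: the `DetectorRegistry` class, the `PGR_DETECTORS` environment variable, and both example checks are hypothetical names chosen for this example, and the invariants (non-negative density, conserved total) are assumed domain knowledge.

```python
import math
import os


class DetectorRegistry:
    """Hypothetical control layer: tracks which named detectors are enabled."""

    def __init__(self, env_var="PGR_DETECTORS"):
        # PGR_DETECTORS is an assumed variable name, not one from the paper.
        self.env_var = env_var
        self.enabled = set()
        self.reload()

    def reload(self):
        # Re-read the comma-separated list of enabled detectors. Calling this
        # during a run changes which checks are active without recompiling.
        raw = os.environ.get(self.env_var, "")
        self.enabled = {name.strip() for name in raw.split(",") if name.strip()}

    def check(self, name, predicate, *args):
        # Run the detector only if it is enabled. Returns True when no
        # corruption is suspected, or when the detector is switched off.
        if name not in self.enabled:
            return True
        return bool(predicate(*args))


def all_finite_and_nonnegative(values):
    # Assumed domain knowledge: a density field must be finite and >= 0.
    return all(math.isfinite(v) and v >= 0.0 for v in values)


def sum_conserved(values, expected_total, rel_tol=1e-9):
    # Assumed domain knowledge: the total is invariant across time steps.
    return math.isclose(math.fsum(values), expected_total, rel_tol=rel_tol)


if __name__ == "__main__":
    os.environ["PGR_DETECTORS"] = "range,conservation"
    registry = DetectorRegistry()

    density = [0.25, 0.5, 0.25]
    print(registry.check("range", all_finite_and_nonnegative, density))
    print(registry.check("conservation", sum_conserved, density, 1.0))

    density[1] = float("nan")  # simulate a corrupted value
    print(registry.check("range", all_finite_and_nonnegative, density))
```

The programmer decides where to call `check` and which invariant to test. The registry only decides whether a given detector runs, so the cost of detection can be tuned per run or changed mid-run by updating the variable and calling `reload()`.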

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

Grant/Contract Number: AC05-00OR22725

OSTI ID: 1464028

Journal Information: International Journal of High Performance Computing Applications, Vol. 32, Issue 5; ISSN 1094-3420

Publisher: SAGE

Country of Publication: United States

Language: English

Similar Records

Rolex: Resilience-oriented language extensions for extreme-scale systems

Journal Article · May 26, 2016 · Journal of Supercomputing

Lucas, Robert F.; Hukerikar, Saurabh

MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection

Conference · September 5, 2017

Subasi, Omer; Di, Sheng; Balaprakash, Prasanna; +5 more

Exploring versioned distributed arrays for resilience in scientific applications: Global view resilience

Journal Article · September 8, 2016 · International Journal of High Performance Computing Applications

Chien, Andrew A.; Balaji, Pavan; Dun, Nan; +14 more

Related Subjects

97 MATHEMATICS AND COMPUTING
applications
error detection
fault tolerance
resilience
soft errors
