Detection and correction of silent data corruption for large-scale high-performance computing
- Fiala, David; Mueller, Frank; Engelmann, Christian
-
2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
https://doi.org/10.1109/SC.2012.49
|
conference
|
November 2012 |
Gaining confidence in scientific applications through executable interface contracts
|
journal
|
July 2008 |
Transparent Redundant Computing with MPI
|
book
|
January 2010 |
Algorithm-based fault tolerance for floating-point operations in massively parallel systems
|
conference
|
January 1992 |
Fault recovery for a distributed SP-based delay constrained multicast routing algorithm
|
conference
|
January 2002 |
ABARIS: An Adaptable Fault Detection/Recovery Component Framework for MPIs
|
conference
|
March 2007 |
SWIFT: Software Implemented Fault Tolerance
|
conference
|
January 2005 |
PRASE: An Approach for Program Reliability Analysis with Soft Errors
|
conference
|
December 2008 |
Improving scientific software component quality through assertions
- Dahlgren, Tamara L.; Devanbu, Premkumar T.
-
Proceedings of the second international workshop on Software engineering for high performance computing system applications - SE-HPCS '05
https://doi.org/10.1145/1145319.1145341
|
conference
|
January 2005 |
See applications run and throughput jump: The case for redundant computing in HPC
|
conference
|
June 2010 |
Performance-Driven Interface Contract Enforcement for Scientific Components
|
book
|
January 2007 |
Adaptive incremental checkpointing for massively parallel systems
|
conference
|
January 2004 |
FITL: extending LLVM for the translation of fault-injection directives
|
conference
|
January 2015 |
OpenARC: Extensible OpenACC Compiler Framework for Directive-Based Accelerator Programming Study
|
conference
|
November 2014 |
ROSE::FTTransform - A source-to-source translation framework for exascale fault-tolerance research
- Lidman, Jacob; Quinlan, Daniel J.; Liao, Chunhua
-
2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012)
https://doi.org/10.1109/DSNW.2012.6264672
|
conference
|
June 2012 |
Applying 'design by contract'
|
journal
|
October 1992 |
Fault tolerant algorithms for heat transfer problems
|
journal
|
May 2008 |
Fault injection techniques and tools
|
journal
|
April 1997 |
ACR: automatic checkpoint/restart for soft and hard error protection
- Ni, Xiang; Meneses, Esteban; Jain, Nikhil
-
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13
https://doi.org/10.1145/2503210.2503266
|
conference
|
January 2013 |
CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems
|
conference
|
September 2009 |
Real-world design and evaluation of compiler-managed GPU redundant multithreading
|
conference
|
June 2014 |
Realization of User Level Fault Tolerant Policy Management through a Holistic Approach for Fault Correlation
|
conference
|
June 2011 |
High performance linpack benchmark: a fault tolerant implementation without checkpointing
|
conference
|
January 2011 |
Evaluating the viability of process replication reliability for exascale systems
- Ferreira, Kurt; Stearley, Jon; Laros, James H.
-
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11
https://doi.org/10.1145/2063384.2063443
|
conference
|
January 2011 |
Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation
|
journal
|
November 2005 |
Characterizing the impact of soft errors on iterative methods in scientific computing
|
conference
|
January 2011 |
Strategies for Fault Tolerance in Multicomponent Applications
|
journal
|
January 2011 |
Soft error vulnerability of iterative linear algebra methods
|
conference
|
January 2008 |
Fault resilience of the algebraic multi-grid solver
|
conference
|
January 2012 |
Addressing failures in exascale computing
|
journal
|
March 2014 |
Algorithm-based recovery for iterative methods without checkpointing
|
conference
|
January 2011 |
Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU
- Yim, Keun Soo; Pham, Cuong; Saleheen, Mushfiq
-
Distributed Processing Symposium (IPDPS), 2011 IEEE International Parallel & Distributed Processing Symposium
https://doi.org/10.1109/IPDPS.2011.36
|
conference
|
May 2011 |
Toward Exascale Resilience
|
journal
|
September 2009 |
Parallel Programmability and the Chapel Language
|
journal
|
August 2007 |
CheCUDA: A Checkpoint/Restart Tool for CUDA Applications
- Takizawa, Hiroyuki; Sato, Katsuto; Komatsu, Kazuhiko
-
2009 International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)
https://doi.org/10.1109/PDCAT.2009.78
|
conference
|
December 2009 |
Containment Domains: A Scalable, Efficient and Flexible Resilience Scheme for Exascale Systems
|
journal
|
January 2013 |