A Performance and Energy Comparison of Fault Tolerance Techniques for Exascale Computing Systems
|
conference
|
December 2016 |
Fault tolerance using lower fidelity data in adaptive mesh applications
|
conference
|
January 2013 |
Correcting soft errors online in LU factorization
- Davies, Teresa; Chen, Zizhong
HPDC '13: Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing
https://doi.org/10.1145/2462902.2462920
|
conference
|
October 2018 |
Partial Redundancy in HPC Systems with Non-Uniform Node Reliabilities
|
conference
|
November 2018 |
Investigating applications portability with the Uintah DAG-based runtime system on PetaScale supercomputers
- Meng, Qingyu; Humphrey, Alan; Schmidt, John
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13)
https://doi.org/10.1145/2503210.2503250
|
conference
|
January 2013 |
Resilience for Massively Parallel Multigrid Solvers
|
journal
|
January 2016 |
Scalable, fault tolerant membership for MPI tasks on HPC systems
|
conference
|
January 2006 |
Hybrid Checkpointing for MPI Jobs in HPC Environments
|
conference
|
December 2010 |
Exploring versioned distributed arrays for resilience in scientific applications: global view resilience
|
journal
|
September 2016 |
A Cell-Centered Adaptive Projection Method for the Incompressible Euler Equations
|
journal
|
September 2000 |
Improving Uintah's Scalability Through the Use of Portable Kokkos-Based Data Parallel Tasks
- Holmen, John K.; Humphrey, Alan; Sunderland, Daniel
PEARC17: Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact
https://doi.org/10.1145/3093338.3093388
|
conference
|
July 2017 |
Reducing Network Congestion and Synchronization Overhead During Aggregation of Hierarchical Data
|
conference
|
December 2017 |
MOL solvers for hyperbolic PDEs with source terms
|
journal
|
May 2001 |
On spatial adaptivity and interpolation when using the method of lines
|
journal
|
January 1998 |
Improving the performance of Uintah: A large-scale adaptive meshing computational framework
|
conference
|
April 2010 |
A node-centered local refinement algorithm for Poisson's equation in complex geometries
|
journal
|
November 2004 |
A scalable double in-memory checkpoint and restart scheme towards exascale
- Zheng, Gengbin; Kale, Laxmikant V.
2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W 2012)
https://doi.org/10.1109/DSNW.2012.6264677
|
conference
|
June 2012 |
Optimizing Checkpoints Using NVM as Virtual Memory
- Kannan, Sudarsun; Gavrilovska, Ada; Schwan, Karsten
2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS)
https://doi.org/10.1109/IPDPS.2013.69
|
conference
|
May 2013 |
High Order ENO and WENO Schemes for Computational Fluid Dynamics
|
book
|
January 1999 |
MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes
|
conference
|
January 2002 |
High performance linpack benchmark: a fault tolerant implementation without checkpointing
|
conference
|
January 2011 |
Berkeley lab checkpoint/restart (BLCR) for Linux clusters
|
journal
|
September 2006 |
McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression
|
journal
|
January 2013 |
Radiative Heat Transfer Calculation on 16384 GPUs Using a Reverse Monte Carlo Ray Tracing Approach with Adaptive Mesh Refinement
|
conference
|
May 2016 |
Granularity and the Cost of Error Recovery in Resilient AMR Scientific Applications
- Dubey, Anshu; Fujita, Hajime; Graves, Daniel T.
SC16: International Conference for High Performance Computing, Networking, Storage and Analysis
https://doi.org/10.1109/SC.2016.41
|
conference
|
November 2016 |
PIDX: Efficient Parallel I/O for Multi-resolution Multi-dimensional Scientific Datasets
|
conference
|
September 2011 |
Design and modeling of a non-blocking checkpointing system
- Sato, Kento; Maruyama, Naoya; Mohror, Kathryn
2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '12)
https://doi.org/10.1109/SC.2012.46
|
conference
|
November 2012 |
Correcting soft errors online in LU factorization
|
conference
|
January 2013 |
Uniformly high order accurate essentially non-oscillatory schemes, III
|
journal
|
August 1987 |
Algorithm-Based Fault Tolerance for Matrix Operations
|
journal
|
June 1984 |
On the history of multivariate polynomial interpolation
|
journal
|
October 2000 |
Extending the Uintah Framework through the Petascale Modeling of Detonation in Arrays of High Explosive Devices
|
journal
|
January 2016 |
Failures in large scale systems: long-term measurement, analysis, and implications
- Gupta, Saurabh; Patel, Tirthak; Engelmann, Christian
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17)
https://doi.org/10.1145/3126908.3126937
|
conference
|
January 2017 |
A study of numerical methods for hyperbolic conservation laws with stiff source terms
|
journal
|
January 1990 |
Uniformly High Order Accurate Essentially Non-oscillatory Schemes, III
|
journal
|
February 1997 |
Compiler-enhanced incremental checkpointing for OpenMP applications
|
conference
|
May 2009 |
Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters
- Di Martino, Catello; Kalbarczyk, Zbigniew; Iyer, Ravishankar K.
2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
https://doi.org/10.1109/DSN.2014.62
|
conference
|
June 2014 |
Adapting grid applications to safety using fault-tolerant methods: Design, implementation and evaluations
|
journal
|
February 2010 |
Resilience for Stencil Computations with Latent Errors
|
conference
|
August 2017 |
Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems
|
conference
|
January 2009 |
Compiler-enhanced incremental checkpointing for OpenMP applications
- Bronevetsky, Greg; Marques, Daniel J.; Pingali, Keshav K.
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '08)
https://doi.org/10.1145/1345206.1345253
|
conference
|
January 2008 |
Addressing Global Data Dependencies in Heterogeneous Asynchronous Runtime Systems on GPUs
- Peterson, Brad; Humphrey, Alan; Schmidt, John
Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware (ESPM2 '17)
https://doi.org/10.1145/3152041.3152082
|
conference
|
January 2017 |
FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World
|
book
|
January 2000 |
Adaptive Polynomial Interpolation on Evenly Spaced Meshes
|
journal
|
January 2007 |
Preserving Nonnegativity in Discontinuous Galerkin Approximations to Scalar Transport via Truncation and Mass Aware Rescaling (TMAR)
|
journal
|
November 2016 |