research-article

FTI: high performance fault tolerance interface for hybrid systems

Authors:

Leonardo Bautista-Gomez,

Dimitri Komatitsch,

Franck Cappello,

Naoya Maruyama,

Satoshi Matsuoka

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 32, Pages 1 - 32

https://doi.org/10.1145/2063384.2063427

Published: 12 November 2011

Abstract

Large scientific applications deployed on current petascale systems expend a significant amount of their execution time dumping checkpoint files to remote storage. New fault tolerant techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead high-frequency multi-level checkpoint technique in which we integrate a highly-reliable topology-aware Reed-Solomon encoding in a three-level checkpoint scheme. We efficiently hide the encoding time using one Fault-Tolerance dedicated thread per node. We implement our technique in the Fault Tolerance Interface FTI. We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw9.0 Tohoku Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1 petaflops runs (1152 GPUs) while checkpointing at high frequency.
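
The encoding idea in the abstract can be illustrated with a much simpler erasure code. The sketch below is not FTI's implementation: it swaps Reed-Solomon for single XOR parity, which can rebuild only one lost checkpoint per group, whereas Reed-Solomon can tolerate several. The group size, checkpoint contents, and function names are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode_parity(checkpoints):
    """Compute one parity block over a group of node checkpoints."""
    return xor_blocks(checkpoints)


def recover(surviving, parity):
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return xor_blocks(surviving + [parity])


# Hypothetical group of four nodes, each holding an 8-byte checkpoint.
group = [bytes([node] * 8) for node in (1, 2, 3, 4)]
parity = encode_parity(group)

# Simulate losing one node's checkpoint and rebuilding it from the rest.
lost_index = 2
survivors = group[:lost_index] + group[lost_index + 1:]
rebuilt = recover(survivors, parity)
assert rebuilt == group[lost_index]
```

In a real multi-level scheme the parity would be stored on a different node from the data it protects, so that losing one node never destroys both a checkpoint and the redundancy needed to rebuild it.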


Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

November 2011

866 pages

ISBN:9781450307710

DOI:10.1145/2063384

Conference Chair:
Scott Lathrop
University of Chicago
,
Program Chairs:
Jim Costa
Sandia National Laboratories
,
William Kramer
National Center for Supercomputing Applications

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SC '11

SC '11: International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 18, 2011

Seattle, Washington

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
