DOI: 10.1145/2063384.2063427
Research article

FTI: high performance fault tolerance interface for hybrid systems

Published: 12 November 2011

Abstract

Large scientific applications deployed on current petascale systems spend a significant amount of their execution time dumping checkpoint files to remote storage. New fault-tolerance techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead, high-frequency multi-level checkpoint technique in which we integrate a highly reliable, topology-aware Reed-Solomon encoding in a three-level checkpoint scheme. We efficiently hide the encoding time using one dedicated fault-tolerance thread per node. We implement our technique in the Fault Tolerance Interface (FTI). We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw 9.0 Tohoku, Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1 petaflops runs (1152 GPUs) while checkpointing at high frequency.
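The idea of hiding the encoding time behind a dedicated thread can be illustrated with a minimal Python sketch. It is not FTI's implementation or API: the names `xor_parity`, `recover`, `EncoderThread`, and `checkpoint_and_compute` are hypothetical, and a single XOR parity block stands in for the Reed-Solomon encoding, so it tolerates only one lost block. In CPython the global interpreter lock also limits how much CPU-bound work truly overlaps, so the sketch shows the structure rather than the performance.

```python
import threading
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equally sized blocks (a simple stand-in for Reed-Solomon)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(surviving_blocks, parity):
    """Rebuild the single missing block from the surviving blocks and the parity."""
    return xor_parity(list(surviving_blocks) + [parity])


class EncoderThread(threading.Thread):
    """Hypothetical dedicated thread that encodes a checkpoint in the background."""

    def __init__(self, blocks):
        super().__init__()
        # Snapshot the data so the application may keep changing its own state.
        self.blocks = [bytes(b) for b in blocks]
        self.parity = None

    def run(self):
        self.parity = xor_parity(self.blocks)


def checkpoint_and_compute(blocks, compute_step):
    """Start the encoding, run the application step meanwhile, then collect the parity."""
    encoder = EncoderThread(blocks)
    encoder.start()
    result = compute_step()  # application work proceeds while the encoder runs
    encoder.join()
    return result, encoder.parity
```

Calling `checkpoint_and_compute(blocks, step)` with three equally sized byte strings returns the result of `step()` together with a parity block; passing any two of the original blocks plus that parity to `recover` rebuilds the third.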


Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
866 pages
ISBN: 9781450307710
DOI: 10.1145/2063384

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
