Self-Checkpoint: An In-Memory Checkpoint Method Using Less Space and Its Practice on Fault-Tolerant HPL

Authors:
Xiongchao Tang, Jidong Zhai, Bowen Yu, Wenguang Chen, Weimin Zheng

Tsinghua University, Beijing, China

PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, January 2017, Pages 401–413
https://doi.org/10.1145/3018743.3018745

Published: 26 January 2017

ABSTRACT

Fault tolerance is increasingly important in high performance computing because system scale is growing substantially while system reliability is decreasing. In-memory (diskless) checkpointing has gained extensive attention as a way to avoid the I/O bottleneck of traditional disk-based checkpoint methods. However, applications that use existing in-memory checkpoint methods are left with little available memory. To provide high reliability, these methods either keep two copies of each checkpoint, so that a failure can be tolerated while the old checkpoint is being updated, or trade performance for space by flushing in-memory checkpoints to disk.

In this paper, we propose a novel in-memory checkpoint method, called self-checkpoint, which achieves the same reliability as previous in-memory checkpoint methods while increasing the memory available to applications by almost 50%. To validate the method, we apply self-checkpoint to an important problem, fault-tolerant HPL. We implement a scalable, fault-tolerant HPL based on this method, called SKT-HPL, and validate it on two large-scale systems. Experimental results with 24,576 processes show that SKT-HPL achieves over 95% of the performance of the original HPL. Compared to the state-of-the-art in-memory checkpoint method, it increases the available memory size by 47% and improves performance by 5%.
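
As background, the following is a minimal sketch of generic parity-based diskless checkpointing, in which a group of processes protects its in-memory checkpoints with one XOR parity block. It is not the self-checkpoint method itself, whose details the abstract does not give. The function names and the toy byte buffers standing in for per-process state are hypothetical and chosen only for illustration.

```python
from functools import reduce
from typing import Sequence


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints: Sequence[bytes]) -> bytes:
    """XOR all checkpoints in the group into a single parity block."""
    return reduce(xor_bytes, checkpoints)


def recover(survivors: Sequence[bytes], parity: bytes) -> bytes:
    """Rebuild the one lost checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Toy group of three "processes", each holding an in-memory checkpoint.
group = [b"rank0-state", b"rank1-state", b"rank2-state"]
parity = encode_parity(group)

# Simulate losing the memory of one process and rebuilding its checkpoint.
lost = 1
survivors = [c for i, c in enumerate(group) if i != lost]
restored = recover(survivors, parity)
```

In this sketch a single parity block tolerates the loss of one group member. If the only checkpoint copy is overwritten in place, a failure during that update leaves nothing consistent to recover from, which is the situation the abstract says earlier methods avoid by keeping two checkpoint copies.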

Index Terms

1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance
        1. Checkpoint / restart

Published in
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2017
476 pages
ISBN: 9781450344937
DOI: 10.1145/3018743
General Chair: Vivek Sarkar (Rice University, USA)
Program Chair: Lawrence Rauchwerger (Texas A&M University, USA)

ACM SIGPLAN Notices, Volume 52, Issue 8
PPoPP '17
August 2017
442 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3155284
Editor: Matthew Fluet
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
fault tolerance
fault-tolerant hpl
in-memory checkpoint
memory consumption
Qualifiers
- research-article