ABSTRACT
Fault tolerance is increasingly important in high-performance computing because system scale keeps growing while system reliability declines. In-memory (diskless) checkpointing has gained extensive attention as a way to avoid the I/O bottleneck of traditional disk-based checkpointing. However, applications that use existing in-memory checkpoint methods are left with little available memory. To provide high reliability, these methods either keep two copies of each checkpoint, so that a failure during the update of an old checkpoint can be tolerated, or trade performance for space by flushing in-memory checkpoints to disk.
In this paper, we propose a novel in-memory checkpoint method, called self-checkpoint, which not only achieves the same reliability as previous in-memory checkpoint methods but also increases the memory available to applications by almost 50%. To validate the method, we apply self-checkpoint to an important problem, fault-tolerant HPL. We implement a scalable, fault-tolerant HPL based on this method, called SKT-HPL, and validate it on two large-scale systems. Experimental results with 24,576 processes show that SKT-HPL achieves over 95% of the performance of the original HPL. Compared with the state-of-the-art in-memory checkpoint method, it increases the available memory size by 47% and the performance by 5%.
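The "almost 50%" figure is consistent with a simple memory accounting, sketched below under assumptions that the abstract does not state: each checkpoint copy is as large as the application data it protects, and parity or checksum storage is ignored. The function name `available_fraction` is hypothetical and is not part of SKT-HPL.

```python
from fractions import Fraction


def available_fraction(checkpoint_copies: int) -> Fraction:
    """Share of node memory left for application data.

    Assumption (not from the abstract): every checkpoint copy occupies
    as much memory as the application data, and encoding overhead is
    ignored.
    """
    return Fraction(1, 1 + checkpoint_copies)


# Two resident copies, as in the two-copy schemes the abstract describes.
double_copy = available_fraction(2)
# One resident copy, the idealized case for a scheme that avoids the second.
single_copy = available_fraction(1)

# Relative increase in memory available to the application.
gain = single_copy / double_copy - 1
print(double_copy, single_copy, gain)
```

Under this idealized model the application's share rises from one third to one half of node memory, a relative gain of 50%. Any extra space needed for encoding would push the real figure somewhat lower, which would fit the 47% reported in the experiments.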
Self-Checkpoint: An In-Memory Checkpoint Method Using Less Space and Its Practice on Fault-Tolerant HPL
PPoPP '17