DOI: 10.1145/3018743.3018745
research-article

Self-Checkpoint: An In-Memory Checkpoint Method Using Less Space and Its Practice on Fault-Tolerant HPL

Published: 26 January 2017

ABSTRACT

Fault tolerance is increasingly important in high-performance computing because system scale is growing substantially and system reliability is decreasing. In-memory (diskless) checkpointing has gained extensive attention as a way to avoid the I/O bottleneck of traditional disk-based checkpoint methods. However, previous in-memory checkpoint methods leave applications with little available memory. To provide high reliability, they either keep two copies of each checkpoint, so that a failure while an old checkpoint is being updated can be tolerated, or trade performance for space by flushing in-memory checkpoints to disk.
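The two-copy requirement mentioned above can be illustrated with a minimal sketch. It is a generic double-buffering illustration, not code from the paper: the class `DoubleCheckpoint`, its methods, and the `fail_midway` flag are hypothetical names introduced here, and a real in-memory checkpoint would also encode data across nodes, which this sketch omits.

```python
import copy


class DoubleCheckpoint:
    """Hypothetical two-slot in-memory checkpoint (illustration only)."""

    def __init__(self):
        self.slots = [None, None]  # two full-size checkpoint copies
        self.valid = None          # index of the last complete checkpoint

    def save(self, state, fail_midway=False):
        # Always write into the slot that is NOT the current valid one,
        # so the last complete checkpoint survives a failed update.
        target = 1 if self.valid == 0 else 0
        self.slots[target] = None  # slot is inconsistent while being updated
        if fail_midway:
            raise RuntimeError("simulated failure during checkpoint update")
        self.slots[target] = copy.deepcopy(state)
        self.valid = target        # commit: the new copy becomes the valid one

    def restore(self):
        if self.valid is None:
            raise LookupError("no complete checkpoint available")
        return copy.deepcopy(self.slots[self.valid])
```

In this sketch a failure during `save` leaves `valid` pointing at the older, complete copy, which is why the second slot is needed, and also why the application gives up memory for two checkpoint copies.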

In this paper, we propose a novel in-memory checkpoint method, called self-checkpoint, which not only achieves the same reliability as previous in-memory checkpoint methods but also increases the memory available to applications by almost 50%. To validate the method, we apply self-checkpoint to an important problem, fault-tolerant HPL. We implement a scalable, fault-tolerant HPL based on this method, called SKT-HPL, and validate it on two large-scale systems. Experimental results with 24,576 processes show that SKT-HPL achieves over 95% of the performance of the original HPL. Compared with the state-of-the-art in-memory checkpoint method, it increases the available memory size by 47% and performance by 5%.
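As a rough plausibility check on the "almost 50%" figure, the sketch below works through an idealized memory model. The model is an assumption made here, not the paper's own accounting: it supposes that the application data and each checkpoint copy are the same size, that a double-checkpoint scheme holds two copies, and that the proposed method needs roughly one. The function name `available_fraction` is hypothetical, and encoding overhead is ignored.

```python
from fractions import Fraction


def available_fraction(checkpoint_copies):
    """Share of memory left for application data when it coexists with
    `checkpoint_copies` equal-size checkpoint copies (idealized assumption)."""
    return Fraction(1, 1 + checkpoint_copies)


double_copy = available_fraction(2)  # data + two copies: 1/3 of memory
single_copy = available_fraction(1)  # data + one copy:   1/2 of memory

# Relative increase in memory available to the application.
relative_gain = (single_copy - double_copy) / double_copy
print(relative_gain)  # 1/2
```

Under these assumptions the upper bound on the gain is 50%, which is in line with the 47% improvement reported in the abstract.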

      • Published in

        PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
        January 2017
        476 pages
        ISBN: 9781450344937
        DOI: 10.1145/3018743

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States
