
Application monitoring and checkpointing in HPC: looking towards exascale systems

Published: 29 March 2012. DOI: 10.1145/2184512.2184574

ABSTRACT

As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors in maintaining efficiency and delivering improved compute performance over predecessor systems. One commonly used mechanism for providing application fault tolerance in parallel systems is checkpointing.

We demonstrate the impact of sub-optimal checkpoint intervals on application efficiency via simulation with real workload data. We find that application efficiency is relatively insensitive to error in estimation of an application's mean time to interrupt (AMTTI), a parameter central to calculating the optimal checkpoint interval. This result corroborates the trends predicted by previous analytical models. We also find that erring on the side of overestimation may be preferable to underestimation.
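For context, the optimal checkpoint interval referred to above is conventionally taken from Young's first-order approximation, tau_opt ≈ sqrt(2 · delta · M) for checkpoint dump time delta and mean time to interrupt M, with Daly's higher-order model as a refinement. The minimal sketch below evaluates Daly's exponential-failure efficiency expression while sweeping the AMTTI estimate used to pick the interval; the parameter values are illustrative and are not taken from the paper, but the sweep illustrates the kind of insensitivity the abstract describes, with efficiency moving only a handful of percentage points across a 16x range of estimates.

    import math

    # Young's first-order estimate of the optimal checkpoint interval:
    # tau_opt ~ sqrt(2 * delta * M), valid when the dump time delta is
    # much smaller than the mean time to interrupt M.
    def optimal_interval(delta, M):
        return math.sqrt(2.0 * delta * M)

    # Application efficiency (useful time / expected wall-clock time) under
    # Daly's exponential-failure wall-clock model, with restart time R.
    def efficiency(tau, delta, R, M):
        return tau / (M * math.exp(R / M) * math.expm1((tau + delta) / M))

    # Illustrative parameters (hypothetical, not taken from the paper):
    delta, R = 300.0, 600.0        # checkpoint dump and restart times [s]
    M_true = 4.0 * 3600.0          # the application's true AMTTI [s]

    # Choose the interval from a (possibly wrong) AMTTI estimate, then
    # evaluate the resulting efficiency against the true AMTTI.
    for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
        tau = optimal_interval(delta, factor * M_true)
        print(f"AMTTI estimate = {factor:4.2f} x true -> "
              f"tau = {tau / 60:5.1f} min, "
              f"efficiency = {efficiency(tau, delta, R, M_true):.3f}")

Note that this analytical sweep is only a stand-in for the paper's simulations, which use real workload data; the shallow optimum around tau_opt is what makes the reported insensitivity plausible.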

We further discuss how application monitoring and resilience frameworks can benefit from this insensitivity to error in AMTTI estimates. Finally, we discuss the importance of application monitoring at exascale and conclude with a discussion of challenges faced in the use of checkpointing at such extreme scales.
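One hypothetical way a monitoring framework could exploit that insensitivity is sketched below: a monitor that keeps a running mean of the gaps between observed application interrupts already provides an AMTTI estimate good enough to set the checkpoint interval, and, per the finding above, a generous initial guess errs in the preferable direction. The class and method names here are illustrative and are not part of any framework cited in this article.

    import time

    class AMTTIMonitor:
        """Hypothetical monitor: estimates AMTTI as the running mean of the
        gaps between observed application interrupts (node loss, crash, ...)."""

        def __init__(self, initial_guess_s):
            self.estimate_s = initial_guess_s   # coarse prior for the AMTTI
            self.last_interrupt = None
            self.count = 0

        def record_interrupt(self, timestamp_s=None):
            t = time.time() if timestamp_s is None else timestamp_s
            if self.last_interrupt is not None:
                gap = t - self.last_interrupt
                self.count += 1
                # incremental running mean of inter-interrupt gaps
                self.estimate_s += (gap - self.estimate_s) / self.count
            self.last_interrupt = t

        def checkpoint_interval(self, dump_time_s):
            """Young's first-order interval from the current AMTTI estimate."""
            return (2.0 * dump_time_s * self.estimate_s) ** 0.5

Because efficiency varies slowly around the optimum, even the first few recorded interrupts yield an estimate that is good enough to schedule checkpoints sensibly.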


Published in:
ACM-SE '12: Proceedings of the 50th Annual Southeast Regional Conference, March 2012, 424 pages. ISBN: 9781450312035. DOI: 10.1145/2184512.
Publisher: Association for Computing Machinery, New York, NY, United States.
Copyright © 2012 ACM.
