DOI: 10.1145/1851476.1851509
research-article

Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters

Published: 21 June 2010

ABSTRACT

As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors to consider in maintaining efficiency and providing improved computational performance over predecessor systems. One commonly used mechanism for providing application fault tolerance in parallel systems is the use of checkpointing.

Using a multi-cluster simulator, we study the impact of sub-optimal checkpoint intervals on overall application efficiency. We parameterize the simulator with a model of a 1926-node cluster and workload statistics from Los Alamos National Laboratory. We find that dramatically overestimating the application mean time to interrupt (AMTTI) has a fairly minor impact on application efficiency, while potentially having a much more severe impact on user-centric performance metrics such as queueing delay. We compare and contrast these results with the trends predicted by an analytical model.
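The sensitivity described above can be illustrated with a small calculation. The abstract does not say which analytical model the paper uses, so the sketch below assumes the classic first-order approximation, in which the optimum interval is sqrt(2 * checkpoint cost * mean time to interrupt) and the fraction of time lost is roughly cost/interval + interval/(2 * MTTI). The function names and the numbers (a 300 s checkpoint, a 24 h mean time to interrupt) are hypothetical and are not taken from the paper; restart time and queueing effects are ignored.

```python
import math


def first_order_interval(checkpoint_cost, mtti):
    """First-order optimum checkpoint interval: sqrt(2 * cost * MTTI)."""
    return math.sqrt(2.0 * checkpoint_cost * mtti)


def wasted_fraction(interval, checkpoint_cost, mtti):
    """Approximate fraction of wall-clock time lost.

    cost / interval     : time spent writing checkpoints
    interval / (2*MTTI) : expected rework after an interrupt
    Only reasonable while the interval is small compared with the MTTI.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtti)


def efficiency_with_estimate(checkpoint_cost, true_mtti, overestimate_factor):
    """Efficiency when the interval is chosen from a mis-estimated MTTI."""
    assumed_mtti = overestimate_factor * true_mtti
    interval = first_order_interval(checkpoint_cost, assumed_mtti)
    return 1.0 - wasted_fraction(interval, checkpoint_cost, true_mtti)


if __name__ == "__main__":
    cost = 300.0          # hypothetical checkpoint cost, seconds
    mtti = 24 * 3600.0    # hypothetical true mean time to interrupt, seconds
    for factor in (1.0, 2.0, 4.0):
        eff = efficiency_with_estimate(cost, mtti, factor)
        print(f"MTTI overestimated {factor:.0f}x -> efficiency {eff:.4f}")
```

Under these assumptions, overestimating the mean time to interrupt by a factor of four doubles the checkpoint interval (from 7200 s to 14400 s) but lowers the modelled efficiency only from about 0.917 to about 0.896. This simple model says nothing about queueing delay, which the abstract reports as the more sensitive metric.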


Published in

HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
June 2010
911 pages
ISBN: 9781605589428
DOI: 10.1145/1851476

Copyright © 2010 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 166 of 966 submissions (17%)
