research-article

Application monitoring and checkpointing in HPC: looking towards exascale systems

Authors:
William M. Jones, Coastal Carolina University, Conway, SC
John T. Daly, Center for Exceptional Computing, ACS, Fort Meade, MD
Nathan DeBardeleben, High Performance Computing, Los Alamos National Laboratory, Los Alamos, NM

ACM-SE '12: Proceedings of the 50th Annual Southeast Regional Conference, March 2012, Pages 262–267. https://doi.org/10.1145/2184512.2184574

Published: 29 March 2012

ABSTRACT

As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors in maintaining efficiency and delivering improved compute performance over predecessor systems. One commonly used mechanism for providing application fault tolerance in parallel systems is checkpointing.

We demonstrate the impact of sub-optimal checkpoint intervals on application efficiency via simulation with real workload data. We find that application efficiency is relatively insensitive to error in estimation of an application's mean time to interrupt (AMTTI), a parameter central to calculating the optimal checkpoint interval. This result corroborates the trends predicted by previous analytical models. We also find that erring on the side of overestimation may be preferable to underestimation.

We further discuss how application monitoring and resilience frameworks can benefit from this insensitivity to error in AMTTI estimates. Finally, we discuss the importance of application monitoring at exascale and conclude with a discussion of challenges faced in the use of checkpointing at such extreme scales.
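The insensitivity to AMTTI error described above can be illustrated with a small calculation. The sketch below is not the simulation used in the paper; it assumes the widely cited closed-form estimates of the optimal checkpoint interval (Young's first-order approximation and Daly's higher-order estimate) together with Daly's expected wall-time model for exponentially distributed interrupts. The function names and the parameter values (a 5-minute checkpoint cost `delta`, a 10-minute restart cost `R`, and a true AMTTI `M` of 24 hours) are hypothetical and chosen only for illustration.

```python
import math


def young_interval(delta: float, M: float) -> float:
    """Young's first-order optimal interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * delta * M)


def daly_interval(delta: float, M: float) -> float:
    """Daly's higher-order estimate of the optimal compute interval.

    delta is the checkpoint (dump) time and M the application mean time
    to interrupt, both in the same time unit.
    """
    if delta >= 2.0 * M:
        return M
    x = delta / (2.0 * M)
    return math.sqrt(2.0 * delta * M) * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - delta


def efficiency(tau: float, delta: float, M: float, R: float) -> float:
    """Solve time divided by expected wall time (assumed model).

    Assumes exponentially distributed interrupts and the expected wall
    time T_w = M * exp(R/M) * (exp((tau + delta)/M) - 1) * T_s / tau,
    so T_s cancels out of the ratio T_s / T_w.
    """
    return tau / (M * math.exp(R / M) * (math.exp((tau + delta) / M) - 1.0))


def efficiency_with_misestimate(factor: float, delta: float, M_true: float, R: float) -> float:
    """Efficiency when the interval is chosen from factor * M_true."""
    tau = daly_interval(delta, factor * M_true)
    return efficiency(tau, delta, M_true, R)


if __name__ == "__main__":
    delta, R, M_true = 5.0, 10.0, 1440.0  # minutes; hypothetical values
    print(f"Young interval: {young_interval(delta, M_true):.1f} min")
    print(f"Daly interval:  {daly_interval(delta, M_true):.1f} min")
    for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
        eff = efficiency_with_misestimate(factor, delta, M_true, R)
        print(f"AMTTI estimate = {factor:>4} x true value -> efficiency {eff:.4f}")
```

Under these assumptions, misestimating the AMTTI by a factor of two in either direction lowers the modeled efficiency by less than one percentage point, which is consistent with the flat region around the optimum that the abstract describes. This analytical model is too simple to show which direction of error is preferable; that conclusion rests on the paper's simulation with real workload data.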

Index Terms

1. General and reference
  1. Cross-computing tools and techniques
    1. Performance

Published in
ACM-SE '12: Proceedings of the 50th Annual Southeast Regional Conference
March 2012
424 pages
ISBN: 9781450312035
DOI: 10.1145/2184512
Conference Chair: Randy K. Smith, University of Alabama
Program Chair: Susan V. Vrbsky, University of Alabama
Copyright © 2012 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
checkpointing
exascale
prediction
resilience
simulation
Acceptance Rates
Overall Acceptance Rate: 178 of 377 submissions, 47%