ABSTRACT
As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors in maintaining efficiency and delivering improved compute performance over predecessor systems. One commonly used mechanism for providing application fault tolerance in parallel systems is checkpointing.
We demonstrate the impact of sub-optimal checkpoint intervals on application efficiency via simulation with real workload data. We find that application efficiency is relatively insensitive to error in estimation of an application's mean time to interrupt (AMTTI), a parameter central to calculating the optimal checkpoint interval. This result corroborates the trends predicted by previous analytical models. We also find that erring on the side of overestimation may be preferable to underestimation.
We further discuss how application monitoring and resilience frameworks can benefit from this insensitivity to error in AMTTI estimates. Finally, we consider the importance of application monitoring at exascale and conclude with the challenges of checkpointing at such extreme scales.
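The insensitivity described in the abstract can be illustrated with a small analytical sketch. This is not the simulation used in the paper. It assumes exponentially distributed interrupts, uses Young's first-order approximation for the checkpoint interval, and scores each interval with a standard first-order expected-runtime model. The function names, the error factors, and the numeric values for checkpoint time, restart time, and AMTTI are all hypothetical.

```python
import math


def young_interval(checkpoint_time, mtti):
    """Young's first-order approximation: tau = sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * checkpoint_time * mtti)


def efficiency(interval, checkpoint_time, restart_time, mtti):
    """Fraction of wall-clock time spent on useful work.

    Assumes exponentially distributed interrupts with mean `mtti`.
    Expected wall-clock time to complete one compute segment of length
    `interval` plus its checkpoint is M * exp(R/M) * (exp((tau+delta)/M) - 1).
    """
    expected_wall = (
        mtti
        * math.exp(restart_time / mtti)
        * math.expm1((interval + checkpoint_time) / mtti)
    )
    return interval / expected_wall


def efficiency_with_estimate(error_factor, checkpoint_time, restart_time, true_mtti):
    """Efficiency when the interval is derived from a mis-estimated AMTTI.

    `error_factor` scales the true AMTTI: 0.5 underestimates it by half,
    2.0 overestimates it by a factor of two.
    """
    estimated_mtti = error_factor * true_mtti
    interval = young_interval(checkpoint_time, estimated_mtti)
    return efficiency(interval, checkpoint_time, restart_time, true_mtti)


if __name__ == "__main__":
    # Hypothetical values, all in minutes.
    delta, restart, amtti = 5.0, 10.0, 24 * 60.0
    for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
        eff = efficiency_with_estimate(factor, delta, restart, amtti)
        print(f"AMTTI estimate x{factor:<4}: efficiency = {eff:.4f}")
```

Under these assumed values, an AMTTI estimate that is off by a factor of two in either direction changes the modeled efficiency by less than one percentage point. The sketch does not attempt to reproduce the paper's finding about overestimation versus underestimation, which comes from simulation with real workload data.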
Index Terms
- Application monitoring and checkpointing in HPC: looking towards exascale systems