DOI: 10.1145/3229710.3229755
research-article

Checkpoint Scheduling for Shared Usage of Burst-Buffers in Supercomputers

Published: 13 August 2018

Abstract

User-defined and system-level checkpointing have contrary properties. While user-defined checkpoints are smaller and simpler to recover, system-level checkpointing has better knowledge of the global system state and of parameters such as the expected mean time to failure (MTTF) per node. Either approach alone leads to non-optimal checkpoint times, intervals, or sizes, or to non-optimal use of I/O bandwidth when concurrent checkpoints compete for it.
We combine user-defined and system-level checkpointing to exploit the benefits of each and avoid its drawbacks. Applications frequently offer to create checkpoints. The system either accepts such an offer, based on the current system status and the implied cost of recalculating from the last checkpoint, or denies it, i.e., lets the application continue immediately without creating a checkpoint. To support this approach, we develop economic models for multi-application checkpointing on shared I/O resources that are dedicated to checkpointing (e.g., burst-buffers) by defining an appropriate goal function and solving a global optimization problem.
Using our models, the checkpoints of applications on a supercomputer are scheduled to use the available I/O bandwidth effectively and to minimize the failure overhead (checkpoint creations plus recalculations). Our simulations show a reduction in the overall failure overhead across all nodes of up to 30% for a typical supercomputer workload (HLRN). We can also derive the most cost-effective burst-buffer bandwidth for a given node MTTF and application workload.
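
The accept-or-deny decision described in the abstract can be illustrated with a minimal sketch. This is not the paper's economic model or its global optimization. It is a greedy, single-application rule under assumptions introduced here: node failures are independent and exponentially distributed, the burst-buffer bandwidth is split equally among concurrent writers, and an offer is accepted when the expected recalculation time it would save exceeds the time needed to write the checkpoint. All function and parameter names are hypothetical.

```python
import math


def failure_probability(nodes: int, node_mttf_s: float, horizon_s: float) -> float:
    """Probability that at least one of `nodes` fails within `horizon_s` seconds.

    Assumption: independent, exponentially distributed node failures.
    """
    job_failure_rate = nodes / node_mttf_s
    return 1.0 - math.exp(-job_failure_rate * horizon_s)


def checkpoint_cost_s(size_gb: float, total_bw_gb_per_s: float,
                      concurrent_writers: int) -> float:
    """Seconds needed to write one checkpoint to the shared burst-buffer.

    Assumption: bandwidth is shared equally between this application and
    `concurrent_writers` other applications writing at the same time.
    """
    share = total_bw_gb_per_s / (concurrent_writers + 1)
    return size_gb / share


def accept_offer(nodes: int, node_mttf_s: float, work_since_ckpt_s: float,
                 next_offer_in_s: float, size_gb: float,
                 total_bw_gb_per_s: float, concurrent_writers: int) -> bool:
    """Greedy rule: accept the offer if the expected recalculation time saved
    (failure probability until the next offer times the unsaved work)
    exceeds the time spent writing the checkpoint."""
    p_fail = failure_probability(nodes, node_mttf_s, next_offer_in_s)
    expected_rework_s = p_fail * work_since_ckpt_s
    cost_s = checkpoint_cost_s(size_gb, total_bw_gb_per_s, concurrent_writers)
    return expected_rework_s > cost_s


if __name__ == "__main__":
    five_years_s = 1.5768e8  # illustrative per-node MTTF, not taken from the paper
    # Idle burst-buffer: the 500 GB checkpoint is cheap, so the offer is accepted.
    print(accept_offer(1000, five_years_s, 3600, 600, 500, 100, 0))
    # Four other writers: the same checkpoint takes five times longer and is denied.
    print(accept_offer(1000, five_years_s, 3600, 600, 500, 100, 4))
```

The sketch only shows why contention matters: the same offer flips from accepted to denied once other applications occupy the burst-buffer. A global scheduler of the kind the abstract describes would weigh all applications' offers against each other rather than decide each one in isolation.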



Published In

ICPP Workshops '18: Workshop Proceedings of the 47th International Conference on Parallel Processing
August 2018
409 pages
ISBN:9781450365239
DOI:10.1145/3229710

In-Cooperation

  • University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. HPC
  2. checkpoint/restart
  3. optimization
  4. scheduler
  5. shared bandwidth

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '18 Comp

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%


Cited By

  • (2024) Rollback-Free Recovery for a High Performance Dense Linear Solver With Reduced Memory Footprint. IEEE Transactions on Parallel and Distributed Systems 35:7 (1307-1319). DOI: 10.1109/TPDS.2024.3400365. Online publication date: 13-May-2024
  • (2021) Combining XOR and Partner Checkpointing for Resilient Multilevel Checkpoint/Restart. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (277-288). DOI: 10.1109/IPDPS49936.2021.00036. Online publication date: May-2021
  • (2020) Solving Linear Systems on High Performance Hardware with Resilience to Multiple Hard Faults. 2020 International Symposium on Reliable Distributed Systems (SRDS) (266-275). DOI: 10.1109/SRDS51746.2020.00034. Online publication date: Sep-2020
  • (2020) FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing. Software for Exascale Computing - SPPEXA 2016-2019 (483-516). DOI: 10.1007/978-3-030-47956-5_16. Online publication date: 31-Jul-2020
