Checkpoint Scheduling for Shared Usage of Burst-Buffers in Supercomputers

Authors:

Masoud Gholami,

Florian Schintke,

Thorsten Schütt

ICPP Workshops '18: Workshop Proceedings of the 47th International Conference on Parallel Processing

Article No.: 44, Pages 1 - 10

https://doi.org/10.1145/3229710.3229755

Published: 13 August 2018

Abstract

User-defined and system-level checkpointing have contrasting properties. User-defined checkpoints are smaller and simpler to recover from, whereas system-level checkpointing has better knowledge of the global system state and of parameters such as the expected mean time to failure (MTTF) per node. Used alone, each approach leads to non-optimal checkpoint times, intervals, sizes, or I/O bandwidth usage when concurrent checkpoints conflict and compete for bandwidth.

We combine user-defined and system-level checkpointing to exploit the benefits of each and avoid their drawbacks. Applications frequently offer to create checkpoints. The system either accepts an offer, based on the current system status and the implied cost of recalculating from the last checkpoint, or denies it, in which case the application continues immediately without creating a checkpoint. To support this approach, we develop economic models for multi-application checkpointing on shared I/O resources that are dedicated to checkpointing (e.g., burst-buffers) by defining an appropriate goal function and solving a global optimization problem.
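The offer mechanism can be illustrated with a minimal sketch. It is not the economic model developed in the paper. It assumes independent, exponentially distributed node failures and a simple rule that compares the expected recalculation cost against the cost of writing the checkpoint, both in node-seconds. All names (`Offer`, `accept`, `failure_probability`) and all parameter values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Offer:
    """A checkpoint offer made by one application (all fields hypothetical)."""
    nodes: int             # nodes the application runs on
    size_gb: float         # size of the offered checkpoint in GB
    work_at_risk_s: float  # seconds computed since the last checkpoint
    next_offer_s: float    # seconds until the application offers again


def failure_probability(nodes: int, horizon_s: float, node_mttf_s: float) -> float:
    """P(at least one of `nodes` fails within `horizon_s`), assuming
    independent, exponentially distributed node failures."""
    return 1.0 - math.exp(-nodes * horizon_s / node_mttf_s)


def accept(offer: Offer, free_bandwidth_gb_s: float, node_mttf_s: float) -> bool:
    """Accept if the expected recalculation cost exceeds the write cost.
    Both costs are expressed in node-seconds."""
    if free_bandwidth_gb_s <= 0.0:
        return False  # no burst-buffer bandwidth available right now
    write_cost = offer.nodes * offer.size_gb / free_bandwidth_gb_s
    p_fail = failure_probability(offer.nodes, offer.next_offer_s, node_mttf_s)
    expected_recalc = p_fail * offer.nodes * offer.work_at_risk_s
    return expected_recalc > write_cost


FIVE_YEARS_S = 5 * 365 * 24 * 3600  # assumed per-node MTTF
old_work = Offer(nodes=1000, size_gb=1000.0, work_at_risk_s=3600.0, next_offer_s=600.0)
fresh_work = Offer(nodes=1000, size_gb=1000.0, work_at_risk_s=60.0, next_offer_s=600.0)
print(accept(old_work, 100.0, FIVE_YEARS_S))    # an hour of work at risk
print(accept(fresh_work, 100.0, FIVE_YEARS_S))  # only a minute of work at risk
```

With these assumed numbers the first offer is accepted and the second is denied: the write costs the same in both cases, but only the first protects enough work to be worth it. A global scheduler, as described in the abstract, would additionally weigh competing offers against each other rather than judging each one in isolation.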

Using our models, the checkpoints of applications on a supercomputer are scheduled to use the available I/O bandwidth effectively and to minimize the failure overhead (checkpoint creations plus recalculations). Our simulations show a reduction in the overall failure overhead across all nodes of up to 30% for a typical supercomputer workload (HLRN). We can also derive the most cost-effective burst-buffer bandwidth for a given node MTTF and application workload.
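To make the notion of failure overhead concrete, the sketch below uses the classic first-order, single-application model (Young's approximation) rather than the multi-application models of the paper. It assumes periodic checkpoints and a checkpoint write time much smaller than the MTTF. The function names and the example workload (1000 GB checkpoints, an application-level MTTF of 10,000 s) are illustrative assumptions only.

```python
import math


def overhead_fraction(interval_s, ckpt_s, mttf_s):
    """First-order failure overhead per unit of useful work:
    checkpoint creation (ckpt_s / interval_s) plus expected
    recalculation (about half an interval lost per failure)."""
    return ckpt_s / interval_s + interval_s / (2.0 * mttf_s)


def young_interval(ckpt_s, mttf_s):
    """Interval that minimizes overhead_fraction (first-order result)."""
    return math.sqrt(2.0 * ckpt_s * mttf_s)


def min_overhead_for_bandwidth(size_gb, bandwidth_gb_s, mttf_s):
    """Best achievable overhead when the checkpoint write time is
    determined by the available burst-buffer bandwidth."""
    ckpt_s = size_gb / bandwidth_gb_s
    return overhead_fraction(young_interval(ckpt_s, mttf_s), ckpt_s, mttf_s)


# Higher bandwidth shortens the write and lowers the minimum overhead.
for bw in (10.0, 20.0, 40.0):
    print(bw, round(min_overhead_for_bandwidth(1000.0, bw, 10_000.0), 4))
```

In this simplified model the minimum overhead shrinks only with the square root of the bandwidth, which hints at why a most cost-effective bandwidth can exist once the price of the burst-buffer is taken into account.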

Cited By

D. Loreti, M. Artioli, and A. Ciampolini. 2024. Rollback-Free Recovery for a High Performance Dense Linear Solver With Reduced Memory Footprint. IEEE Transactions on Parallel and Distributed Systems 35, 7 (2024), 1307--1319. Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1109/TPDS.2024.3400365
M. Gholami and F. Schintke. 2021. Combining XOR and Partner Checkpointing for Resilient Multilevel Checkpoint/Restart. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 277--288. Online publication date: May-2021.
https://doi.org/10.1109/IPDPS49936.2021.00036
D. Loreti, M. Artioli, and A. Ciampolini. 2020. Solving Linear Systems on High Performance Hardware with Resilience to Multiple Hard Faults. In 2020 International Symposium on Reliable Distributed Systems (SRDS). 266--275. Online publication date: Sep-2020.
https://doi.org/10.1109/SRDS51746.2020.00034
C. Weinhold, A. Lackorzynski, J. Bierbaum, M. Küttler, M. Planeta, H. Weisbach, M. Hille, H. Härtig, A. Margolin, D. Sharf, E. Levy, P. Gak, A. Barak, M. Gholami, F. Schintke, T. Schütt, A. Reinefeld, M. Lieber, and W. Nagel. 2020. FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing. In Software for Exascale Computing - SPPEXA 2016-2019. 483--516. Online publication date: 31-Jul-2020.
https://doi.org/10.1007/978-3-030-47956-5_16

Index Terms

Checkpoint Scheduling for Shared Usage of Burst-Buffers in Supercomputers
1. General and reference
  1. Cross-computing tools and techniques
    1. Performance
    2. Reliability
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance
        Checkpoint / restart
    2. Software system structures
      1. Software system models
        Massively parallel systems

Information

Published In

ICPP Workshops '18: Workshop Proceedings of the 47th International Conference on Parallel Processing

August 2018

409 pages

ISBN:9781450365239

DOI:10.1145/3229710

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Deutsche Forschungsgemeinschaft

Conference

ICPP '18 Comp: 47th International Conference on Parallel Processing Companion

August 13 - 16, 2018

Eugene, OR, USA

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
