Abstract
Scientific applications are large and complex. Task-based programming models are a popular approach to developing them because they are easy to program, handle complex workflows, and distribute the workload across large infrastructures. In these environments, either the hardware or the software may fail for a myriad of reasons: application logic, system software, memory, network, or disk. Re-executing a failed application can take hours, days, or even weeks, which delays the research. This article proposes a recovery system for dynamic task-based models to reduce the re-execution time of failed runs. The design encapsulates the automatic checkpointing of the execution in a checkpointing manager, leveraging different mechanisms that can be arbitrarily defined and tuned to fit the needs of each execution. Additionally, it offers an API call to establish snapshots of the execution from the application code. Experiments on a prototype implementation have reached a speedup of 1.9× on re-execution and shown no overhead on the execution time of successful first runs of specific applications.
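As an illustration of the idea of task-level checkpointing, the following is a minimal sketch and not the prototype described in the paper. The `CheckpointManager` class, its `run` method, the task identifiers, and the policy of saving every finished task are all assumptions made for this example. It shows how persisting task results lets a re-execution skip the tasks that completed before a failure.

```python
import os
import pickle
import tempfile


class CheckpointManager:
    """Hypothetical task-level checkpoint store (illustrative only)."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, task_id):
        return os.path.join(self.directory, f"{task_id}.pkl")

    def run(self, task_id, func, *args):
        """Return the saved result of task_id if present; otherwise run and save it."""
        path = self._path(task_id)
        if os.path.exists(path):
            with open(path, "rb") as fh:
                return pickle.load(fh)
        result = func(*args)
        # Write to a temporary file and rename it, so that a crash in the
        # middle of the write never leaves a truncated checkpoint behind.
        tmp = path + ".tmp"
        with open(tmp, "wb") as fh:
            pickle.dump(result, fh)
        os.replace(tmp, path)
        return result


executed = []  # records which tasks actually ran, for demonstration


def square(x):
    executed.append(("square", x))
    return x * x


def total(values):
    executed.append(("total", tuple(values)))
    return sum(values)


def workflow(manager, fail_before_total=False):
    partials = [manager.run(f"square-{i}", square, i) for i in range(4)]
    if fail_before_total:
        raise RuntimeError("simulated failure")
    return manager.run("total", total, partials)


checkpoint_dir = tempfile.mkdtemp()

# First run: the four "square" tasks finish, then the run fails.
try:
    workflow(CheckpointManager(checkpoint_dir), fail_before_total=True)
except RuntimeError:
    pass
first_run_tasks = len(executed)

# Re-execution: the saved results are reused and only "total" is executed.
executed.clear()
result = workflow(CheckpointManager(checkpoint_dir))
```

In this sketch every finished task is checkpointed. A tunable policy, such as saving only periodically or only at snapshot points requested from the application code, would replace the unconditional save inside `run`.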
Acknowledgements
This work has been supported by the Spanish Government (PID2019-107255GB), by Generalitat de Catalunya (contract 2017-SGR-01414), and by the European Commission through the Horizon 2020 Research and Innovation program under Grant Agreement No. 955558 (eFlows4HPC project). This work has also been 50% co-funded by the European Regional Development Fund under the framework of the ERDF Operational Programme for Catalunya 2014–2020.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Vergés, P., Lordan, F., Ejarque, J., Badia, R.M. (2023). Task-Level Checkpointing System for Task-Based Parallel Workflows. In: Singer, J., Elkhatib, Y., Blanco Heras, D., Diehl, P., Brown, N., Ilic, A. (eds) Euro-Par 2022: Parallel Processing Workshops. Euro-Par 2022. Lecture Notes in Computer Science, vol 13835. Springer, Cham. https://doi.org/10.1007/978-3-031-31209-0_19
DOI: https://doi.org/10.1007/978-3-031-31209-0_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-31208-3
Online ISBN: 978-3-031-31209-0
eBook Packages: Computer Science (R0)