Skip to main content

Task-Level Checkpointing System for Task-Based Parallel Workflows

  • Conference paper
  • First Online:
Euro-Par 2022: Parallel Processing Workshops (Euro-Par 2022)

Abstract

Scientific applications are large and complex; task-based programming models are a popular approach to developing these applications due to their ease of programming and ability to handle complex workflows and distribute their workload across large infrastructures. In these environments, either the hardware or the software may lead to failures from a myriad of origins: application logic, system software, memory, network, or disk. Re-executing a failed application can take hours, days, or even weeks, thus, dragging out the research. This article proposes a recovery system for dynamic task-based models to reduce the re-execution time of failed runs. The design encapsulates in a checkpointing manager the automatic checkpointing of the execution, leveraging different mechanisms that can be arbitrarily defined and tuned to fit the needs of each performance. Additionally, it offers an API call to establish snapshots of the execution from the application code. The experiments executed on a prototype implementation have reached a speedup of 1.9\(\times \) after re-execution and shown no overhead on the execution time on successful first runs of specific applications.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 54.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 69.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    Implementation with PyCOMPSs distributed within the dislib library [1].

  2. 2.

    Implementation with PyCOMPSs offered as a BioExcel Building Blocks (BioBB) [3].

References

  1. Cid-Fuentes, J.Á., et al.: dislib: large scale high performance machine learning in python. In: 2019 15th International Conference on eScience (eScience) (2019)

    Google Scholar 

  2. Babuji, Y., et al.: Parsl: pervasive parallel programming in python. CoRR (2019)

    Google Scholar 

  3. Andrio, P., et al.: Bioexcel building blocks, a software library for interoperable biomolecular simulation workflows. Sci. Data 6, 169 (2019)

    Article  Google Scholar 

  4. Badia, R.M., et al.: Comp superscalar, an interoperable programming framework. SoftwareX 3, 32–36 (2015)

    Article  Google Scholar 

  5. Badia, R.M., et al.: Enabling python to execute efficiently in heterogeneous distributed infrastructures with pycompss. In: PyHPC 2017. Association for Computing Machinery, New York (2017)

    Google Scholar 

  6. Bauer, M., et al.: Legion: expressing locality and independence with logical regions. In: SC 2012: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11 (2012)

    Google Scholar 

  7. Deelman, E., et al.: Pegasus, a workflow management system for science automation. Future Gener. Comput. Syst. 46, 17–35 (2014)

    Article  Google Scholar 

  8. Ejarque, J., Bertran, M., Cid-Fuentes, J.Á., Conejero, J., Badia, R.M.: Managing failures in task-based parallel workflows in distributed computing environments. In: Malawski, M., Rzadca, K. (eds.) Euro-Par 2020. LNCS, vol. 12247, pp. 411–425. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57675-2_26

    Chapter  Google Scholar 

  9. Quan, O., Xu, H.: The study of comparisons of three crossover operators in genetic algorithm for solving single machine scheduling problem (2015)

    Google Scholar 

  10. Qureshi, K., Khan, F., Manuel, P., Nazir, B.: A hybrid fault tolerance technique in grid computing system. J. Supercomput. 56, 106–128 (2011)

    Article  Google Scholar 

  11. Rocklin, M.: Dask: parallel computation with blocked algorithms and task scheduling, pp. 126–132 (2015)

    Google Scholar 

  12. Vanderster, D., Dimopoulos, N., Sobie, R.: Intelligent selection of fault tolerance techniques on the grid, pp. 69–76 (2007)

    Google Scholar 

Download references

Acknowledgements

This work has been supported by the Spanish Government (PID2019-107255GB), by Generalitat de Catalunya (contract 2017-SGR-01414), and by the European Commission through the Horizon 2020 Research and Innovation program under Grant Agreement No. 955558 (eFlows4HPC-project). This work has partially been co-funded with 50% by the European Regional Development Fund under the framework of the ERFD Operative Programme for Catalunya 2014–2020.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Pere Vergés .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Vergés, P., Lordan, F., Ejarque, J., Badia, R.M. (2023). Task-Level Checkpointing System for Task-Based Parallel Workflows. In: Singer, J., Elkhatib, Y., Blanco Heras, D., Diehl, P., Brown, N., Ilic, A. (eds) Euro-Par 2022: Parallel Processing Workshops. Euro-Par 2022. Lecture Notes in Computer Science, vol 13835. Springer, Cham. https://doi.org/10.1007/978-3-031-31209-0_19

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-31209-0_19

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-31208-3

  • Online ISBN: 978-3-031-31209-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics