Optimized Checkpointing Protocols for Data Parallel Programs

Bertolli, Carlo; Vanneschi, Marco

doi:10.3233/978-1-60750-530-3-433

Abstract

A main issue for fault tolerance techniques for high-performance applications based on checkpointing and rollback recovery is the feasibility of statically analyzing the overhead they induce. In this paper we show that, under the hypothesis of a general unstructured parallelism model (e.g. MPI, OpenMP), such an analysis is difficult, if not impossible, to achieve. To overcome this issue we propose an approach to fault tolerance based on structured parallel programming (e.g. divide-and-conquer and parallel sort). We show the gains of assuming such a programming model in the case of data parallel programs: we introduce an optimized checkpointing protocol and compare it with a Communication-Induced Checkpointing (CIC) protocol, which represents one of the most advanced solutions for general “unstructured” parallelism models.

IOS Press Copyright 2024
