Loading [MathJax]/extensions/MathMenu.js
An Algorithm-Based Fault Tolerance Strategy for the Bitonic Sort Parallel Algorithm | IEEE Conference Publication | IEEE Xplore

An Algorithm-Based Fault Tolerance Strategy for the Bitonic Sort Parallel Algorithm


Abstract:

High Performance Computing (HPC) systems are employed to solve hard problems and rely on parallel algorithms which present very long execution times - up to several days....Show More

Abstract:

High Performance Computing (HPC) systems are employed to solve hard problems and rely on parallel algorithms which present very long execution times - up to several days. These systems are expensive in terms of the computational resources required, including energy consumption. Thus, after failures occur it is highly desirable to loose as little of the work that has already been done as possible. In this work we present an Algorithm-Based Fault Tolerance (ABFT) strategy that can be applied to make a robust version of any hypercube-based parallel algorithm. Note that we do not assume a physical hypercube: after nodes crash, fault-free nodes autonomously adapt themselves according to a logical topology called VCube, preserving several logarithmic properties. The proposed strategy guarantees that the algorithm does not halt even after up to (N - 1) nodes crash, in a system of N nodes. We use parallel sorting as a case study, describing how to make a fault-tolerant version of the Bitonic Sort parallel algorithm. The algorithm was implemented in MPI using ULMF to handle faults. Experimental results are presented showing the performance and robustness of the proposed solution.
Date of Conference: 22-26 November 2021
Date Added to IEEE Xplore: 17 January 2022
ISBN Information:
Conference Location: Florianópolis, Brazil

Funding Agency:


Contact IEEE to Subscribe

References

References is not available for this document.