Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation | IEEE Conference Publication | IEEE Xplore


Abstract:

Traditionally, cluster computing has employed checkpointing to address fault tolerance. Recently, new models for parallel applications, namely MapReduce and Dryad, have grown in popularity. Their runtime systems provide their own re-execution-based fault tolerance mechanisms, but their failure characteristics have not been analyzed. Another development is the availability of failure data spanning years for systems of significant size at Los Alamos National Labs (LANL). The time between failures (TBF) for these systems is a poor fit to the exponential distribution assumed by optimization work in checkpointing, which brings those results into question. This paper describes a discrete event simulation driven by the LANL data and by models of parallel checkpointing and MapReduce tasks. The simulation allows us to evaluate the fault tolerance characteristics of these tasks, with the goal of minimizing the expected running time of a parallel program on a cluster in the presence of faults under both fault tolerance models.
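
The kind of experiment the abstract describes can be illustrated with a very small simulation of a checkpointed job. This is a minimal sketch under stated assumptions, not the paper's simulator: it does not use the LANL data, it models a single job rather than a cluster, and it ignores failures during restart. The function names (`simulate_run`, `mean_runtime`), the parameter values, and the choice of a Weibull distribution as the non-exponential alternative are all hypothetical and chosen only for illustration.

```python
import math
import random


def simulate_run(work, interval, ckpt_cost, restart_cost, next_tbf):
    """Wall-clock time to finish `work` units with periodic checkpoints.

    `next_tbf` is a zero-argument callable returning the time until the
    next failure. Assumption: a failure loses only the current segment,
    and no failure can occur while restarting.
    """
    clock = 0.0
    done = 0.0                      # work already protected by a checkpoint
    time_to_failure = next_tbf()
    while done < work:
        segment = min(interval, work - done)
        # The final segment ends the job, so it needs no checkpoint.
        is_last = done + segment >= work
        cost = segment + (0.0 if is_last else ckpt_cost)
        if time_to_failure >= cost:
            clock += cost
            time_to_failure -= cost
            done += segment
        else:
            # Failure mid-segment: lose the partial work, pay for restart.
            clock += time_to_failure + restart_cost
            time_to_failure = next_tbf()
    return clock


def mean_runtime(runs, work, interval, ckpt_cost, restart_cost, next_tbf):
    """Average of `simulate_run` over `runs` independent trials."""
    total = sum(
        simulate_run(work, interval, ckpt_cost, restart_cost, next_tbf)
        for _ in range(runs)
    )
    return total / runs


if __name__ == "__main__":
    rng = random.Random(1)
    mtbf = 200.0                    # hypothetical mean time between failures
    shape = 0.7                     # hypothetical Weibull shape parameter
    scale = mtbf / math.gamma(1.0 + 1.0 / shape)  # gives the same mean as mtbf

    samplers = {
        "exponential": lambda: rng.expovariate(1.0 / mtbf),
        "weibull": lambda: rng.weibullvariate(scale, shape),
    }
    for name, sampler in samplers.items():
        for interval in (10.0, 25.0, 50.0, 100.0):
            t = mean_runtime(2000, 1000.0, interval, 2.0, 5.0, sampler)
            print(f"{name:12s} interval={interval:6.1f} mean runtime={t:8.1f}")
```

Sweeping the checkpoint interval for each distribution shows how the interval that minimizes mean runtime can be compared across failure models with the same mean TBF, which is the sort of question a trace-driven simulation can answer without assuming exponential failures.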
Date of Conference: 31 August 2009 - 04 September 2009
Date Added to IEEE Xplore: 16 October 2009
Conference Location: New Orleans, LA, USA
