Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation

Abstract:

Traditionally, cluster computing has employed checkpointing to address fault tolerance. Recently, new models for parallel applications, namely MapReduce and Dryad, have grown in popularity, with runtime systems that provide their own re-execution-based fault tolerance mechanisms but with no analysis of their failure characteristics. Another development is the availability of failure data spanning years for systems of significant size at Los Alamos National Labs (LANL). However, the time between failures (TBF) for these systems is a poor fit to the exponential distribution assumed by optimization work in checkpointing, which brings those results into question. This paper describes a discrete event simulation driven by the LANL data and by models of parallel checkpointing and MapReduce tasks. The simulation allows us to evaluate the fault tolerance characteristics of these tasks, with the goal of minimizing the expected running time of a parallel program in a cluster in the presence of faults for both fault tolerance models.
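To make the idea of a failure-driven simulation of checkpointing concrete, the following is a minimal sketch and not the authors' simulator. Everything in it is an assumption made for illustration: the function names, the cost parameters, the rule that no failure occurs during a restart, and the use of a Weibull distribution with shape below 1 as a stand-in for a non-exponential TBF. The paper itself drives its simulation with the LANL failure data, which is not reproduced here.

```python
import math
import random


def simulate_checkpointed_run(work, interval, ckpt_cost, restart_cost, draw_tbf, rng):
    """Return the wall-clock time needed to finish `work` units of computation.

    Hypothetical model: the job checkpoints after every `interval` units of
    work, a failure discards all work since the last checkpoint, and no
    failure can strike while the job is restarting.
    """
    clock = 0.0
    done = 0.0                     # work protected by the last checkpoint
    next_failure = draw_tbf(rng)   # absolute time of the next failure
    while done < work:
        segment = min(interval, work - done)
        # The final segment is not followed by a checkpoint.
        cost = segment + (ckpt_cost if done + segment < work else 0.0)
        if clock + cost <= next_failure:
            clock += cost
            done += segment
        else:
            # Failure before the segment was saved: pay the restart cost,
            # then draw the time until the following failure.
            clock = next_failure + restart_cost
            next_failure = clock + draw_tbf(rng)
    return clock


def mean_runtime(interval, draw_tbf, trials=500, seed=0):
    """Average simulated runtime over `trials` independent runs (assumed costs)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += simulate_checkpointed_run(100.0, interval, 0.5, 1.0, draw_tbf, rng)
    return total / trials


MTBF = 20.0    # assumed mean time between failures
SHAPE = 0.7    # assumed Weibull shape; values below 1 give a decreasing hazard rate
SCALE = MTBF / math.gamma(1.0 + 1.0 / SHAPE)  # scale chosen so the Weibull mean equals MTBF


def exponential_tbf(rng):
    return rng.expovariate(1.0 / MTBF)


def weibull_tbf(rng):
    return rng.weibullvariate(SCALE, SHAPE)


if __name__ == "__main__":
    for interval in (2.0, 5.0, 10.0):
        print(interval,
              round(mean_runtime(interval, exponential_tbf), 1),
              round(mean_runtime(interval, weibull_tbf), 1))
```

Sweeping `interval` for each distribution shows how the checkpoint interval that minimizes mean runtime can depend on the assumed TBF distribution, even when both distributions share the same mean. Any numbers this toy model produces say nothing about the results reported in the paper.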

Published in: 2009 IEEE International Conference on Cluster Computing and Workshops

Date of Conference: 31 August 2009 - 04 September 2009

Date Added to IEEE Xplore: 16 October 2009

DOI: 10.1109/CLUSTR.2009.5289185

Conference Location: New Orleans, LA, USA
