research-article

FALCON: a system for reliable checkpoint recovery in shared grid environments

Authors:

Tanzima Zerin Islam,

Saurabh Bagchi,

Rudolf Eigenmann

SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

Article No.: 50, Pages 1 - 12

https://doi.org/10.1145/1654059.1654110

Published: 14 November 2009

Abstract

In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as the performance degradation they experience is tolerable. Unpredictable evictions of guest jobs, however, lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such "failures". Today's FGCS systems often use expensive, high-performance dedicated checkpoint servers, but in geographically distributed clusters this may incur high checkpoint transfer latencies. In this paper we present a system called Falcon that uses available disk resources of the FGCS machines as shared checkpoint repositories. Because an unavailable storage host may lead to loss of checkpoint data, we model failures of storage hosts and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with Falcon in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability.
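
To make the idea of choosing reliable checkpoint repositories concrete, the following is a minimal illustrative sketch and not Falcon's actual prediction algorithm, which the abstract does not specify. It assumes each storage host keeps a history of past availability observations and ranks hosts by the fraction of intervals in which they were up. The names `StorageHost`, `uptime_history`, `predicted_availability`, and `choose_repositories` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StorageHost:
    """A candidate checkpoint repository (hypothetical structure)."""
    name: str
    # Assumption: one boolean per past observation interval, True if the host was up.
    uptime_history: List[bool]


def predicted_availability(host: StorageHost) -> float:
    """Naive estimate: fraction of observed intervals in which the host was available."""
    if not host.uptime_history:
        return 0.0  # no observations, so treat the host as unreliable
    return sum(host.uptime_history) / len(host.uptime_history)


def choose_repositories(hosts: List[StorageHost], k: int) -> List[StorageHost]:
    """Return the k hosts with the highest estimated availability."""
    ranked = sorted(hosts, key=predicted_availability, reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    candidates = [
        StorageHost("host-a", [True, True, True, False]),
        StorageHost("host-b", [True, False, False, False]),
        StorageHost("host-c", [True, True, True, True]),
    ]
    for host in choose_repositories(candidates, k=2):
        print(host.name, predicted_availability(host))
```

A real predictor would likely also account for time-of-day patterns, network cost to each host, and the expected duration for which the checkpoint must remain retrievable; this sketch only shows the ranking step under the stated assumptions.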

Index Terms

1. Information systems
  1. Information systems applications

Published In

SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

November 2009

778 pages

ISBN:9781605587448

DOI:10.1145/1654059

Conference Chair:
Wilfred Pinfold

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC '09: International Conference for High Performance Computing, Networking, Storage and Analysis

November 14 - 20, 2009

Portland, Oregon

Acceptance Rates

SC '09 Paper Acceptance Rate 59 of 261 submissions, 23%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
