DOI: 10.1145/1654059.1654110
research-article

FALCON: a system for reliable checkpoint recovery in shared grid environments

Published: 14 November 2009

Abstract

In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as the performance degradation the owners experience is tolerable. However, unpredictable evictions of guest jobs lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such "failures". Today's FGCS systems often use expensive, high-performance dedicated checkpoint servers, but in geographically distributed clusters this may incur high checkpoint transfer latencies. In this paper we present a system called Falcon that uses available disk resources of the FGCS machines as shared checkpoint repositories. Because an unavailable storage host may lead to loss of checkpoint data, we model failures of storage hosts and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with Falcon in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability.
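
The abstract does not describe the failure model or the prediction algorithm, so the following is only an illustrative sketch of the general idea of ranking candidate storage hosts by predicted availability. It assumes an exponential lifetime model, which is not claimed to be Falcon's model, and every name in it (`StorageHost`, `mean_uptime_hours`, `survival_probability`, `choose_repositories`, the host names) is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class StorageHost:
    name: str
    # Hypothetical input: average observed time (hours) the host stays available.
    mean_uptime_hours: float


def survival_probability(host: StorageHost, horizon_hours: float) -> float:
    """Probability the host stays available for the whole horizon.

    Assumption: exponentially distributed availability periods. This is an
    illustrative stand-in, not the failure model from the paper.
    """
    return math.exp(-horizon_hours / host.mean_uptime_hours)


def choose_repositories(hosts, horizon_hours: float, k: int):
    """Return the k hosts most likely to survive until the checkpoint is needed."""
    ranked = sorted(
        hosts,
        key=lambda h: survival_probability(h, horizon_hours),
        reverse=True,
    )
    return ranked[:k]


hosts = [
    StorageHost("lab-a", 2.0),
    StorageHost("lab-b", 24.0),
    StorageHost("lab-c", 8.0),
]
chosen = choose_repositories(hosts, horizon_hours=4.0, k=2)
print([h.name for h in chosen])  # prints ['lab-b', 'lab-c']
```

A real selection policy would presumably also weigh checkpoint transfer cost between the compute host and each candidate repository; that trade-off is left out here to keep the sketch small.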


Cited By

  • (2012) McrEngine. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1-11. DOI: 10.5555/2388996.2389020. Online publication date: 10-Nov-2012
  • (2012) Pert. IEEE Transactions on Software Engineering 38:4, pages 909-922. DOI: 10.1109/TSE.2011.66. Online publication date: 1-Jul-2012
  • (2012) MCREngine. Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-11. DOI: 10.1109/SC.2012.77. Online publication date: 10-Nov-2012
  • (2010) Analysis and modeling of time-correlated failures in large-scale distributed systems. 2010 11th IEEE/ACM International Conference on Grid Computing, pages 65-72. DOI: 10.1109/GRID.2010.5697961. Online publication date: Oct-2010

    Published In

    SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
    November 2009
    778 pages
    ISBN:9781605587448
    DOI:10.1145/1654059
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Condor
    2. checkpointing
    3. cycle-sharing systems
    4. failure model
    5. reliability

    Qualifiers

    • Research-article

    Conference

    SC '09
    Acceptance Rates

    SC '09 Paper Acceptance Rate 59 of 261 submissions, 23%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
