DOI: 10.1145/1272366.1272372
Article

Failure-aware checkpointing in fine-grained cycle sharing systems

Published: 25 June 2007

Abstract

Fine-Grained Cycle Sharing (FGCS) systems aim to utilize the large amount of idle computational resources available on the Internet. Such systems allow guest jobs to run on a host as long as they do not significantly impact the host's local users. Because hosts are typically provided voluntarily, their availability fluctuates greatly. To provide fault tolerance to guest jobs without adding significant computational overhead, we propose failure-aware checkpointing techniques that apply knowledge of resource availability to select checkpoint repositories and to determine checkpoint intervals. We present schemes for selecting reliable and efficient repositories from the non-dedicated hosts that contribute their disk storage. These schemes are formulated as 0/1 programming problems to optimize the network overhead of transferring checkpoints and the work lost when a storage host is unavailable at the time it is needed to recover a guest job. We determine the checkpoint interval by comparing the cost of checkpointing immediately with the cost of delaying the checkpoint to a later time, which is a function of resource availability. We evaluate these techniques on an FGCS system called iShare, using trace-based simulation. The results show that they achieve better application performance than the prevalent methods, which checkpoint at a fixed period to dedicated checkpoint servers.
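
The two decisions described in the abstract can be illustrated with a toy sketch. This is not the paper's formulation, which the abstract does not spell out: the cost model, the function names (`select_repositories`, `should_checkpoint_now`), and all numbers are hypothetical assumptions chosen only to show the shape of a 0/1 selection problem and of a checkpoint-now-versus-delay comparison.

```python
from itertools import combinations
from math import prod


def select_repositories(hosts, k, lost_work):
    """Toy 0/1 selection solved by brute force (assumed model, not the paper's).

    hosts: list of (name, transfer_cost, p_unavailable) tuples.
    Picks exactly k hosts minimizing total transfer cost plus the expected
    work lost if every chosen host is unavailable at recovery time.
    """
    best = None
    for subset in combinations(hosts, k):
        transfer = sum(h[1] for h in subset)
        # Assumes host unavailabilities are independent.
        p_all_down = prod(h[2] for h in subset)
        cost = transfer + p_all_down * lost_work
        if best is None or cost < best[0]:
            best = (cost, [h[0] for h in subset])
    return best


def should_checkpoint_now(checkpoint_cost, work_at_risk, p_fail_if_delayed):
    """Checkpoint now if its cost does not exceed the expected loss of delaying.

    Simplified assumption: delaying risks losing work_at_risk with
    probability p_fail_if_delayed, which would come from an availability
    predictor in a real system.
    """
    return checkpoint_cost <= p_fail_if_delayed * work_at_risk


# Hypothetical hosts: (name, transfer cost, probability of being unavailable)
hosts = [("a", 5.0, 0.5), ("b", 2.0, 0.9), ("c", 3.0, 0.1)]
cost, chosen = select_repositories(hosts, k=2, lost_work=100.0)
```

In this toy instance the cheapest pair to transfer to (`b` and `c`) is not selected, because pairing `c` with the more available host `a` lowers the expected lost work by more than the extra transfer cost. Brute force is only practical for a handful of hosts; a real system would use a proper 0/1 programming solver or heuristic.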

    Published In

    HPDC '07: Proceedings of the 16th international symposium on High performance distributed computing
    June 2007
    256 pages
    ISBN:9781595936738
    DOI:10.1145/1272366

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. checkpointing
    2. cycle-sharing systems

    Qualifiers

    • Article

    Conference

    HPDC07

    Acceptance Rates

    Overall Acceptance Rate 166 of 966 submissions, 17%
