Article

Failure-aware checkpointing in fine-grained cycle sharing systems

Authors:

Rudolf Eigenmann,

Saurabh Bagchi

HPDC '07: Proceedings of the 16th international symposium on High performance distributed computing

Pages 33 - 42

https://doi.org/10.1145/1272366.1272372

Published: 25 June 2007

Abstract

Fine-Grained Cycle Sharing (FGCS) systems aim to utilize the large amount of idle computational resources available on the Internet. Such systems allow guest jobs to run on a host as long as they do not significantly impact the host's local users. Because hosts are typically provided voluntarily, their availability fluctuates greatly. To provide fault tolerance to guest jobs without adding significant computational overhead, we propose failure-aware checkpointing techniques that apply knowledge of resource availability to select checkpoint repositories and to determine checkpoint intervals. We present schemes for selecting reliable and efficient repositories from the non-dedicated hosts that contribute their disk storage. These schemes are formulated as 0/1 programming problems that optimize the network overhead of transferring checkpoints and the work lost when a storage host is unavailable at the time it is needed to recover a guest job. We determine the checkpoint interval by comparing the cost of checkpointing immediately with the cost of delaying the checkpoint to a later time, which is a function of resource availability. We evaluate these techniques on an FGCS system called iShare, using trace-based simulation. The results show that they achieve better application performance than the prevalent methods, which checkpoint at a fixed period to dedicated checkpoint servers.
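To make the 0/1 programming formulation in the abstract more concrete, the following is a minimal sketch of repository selection as a binary decision problem. It is not the paper's actual model: the function name `select_repositories`, the cost model (summed transfer cost plus lost work weighted by the probability that every chosen host is unavailable), the independence assumption between hosts, and all numbers are hypothetical and chosen only for illustration. The toy instance is solved by exhaustive enumeration, which is practical only for a handful of candidate hosts.

```python
from itertools import product


def select_repositories(transfer_cost, unavail_prob, lost_work, copies):
    """Solve a toy 0/1 program by brute force (illustrative cost model only).

    x[i] = 1 means candidate host i stores a copy of the checkpoint.
    Exactly `copies` hosts must be chosen. The objective is
        sum of transfer costs of the chosen hosts
        + lost_work * P(all chosen hosts are unavailable at recovery),
    where hosts are ASSUMED to become unavailable independently.
    """
    n = len(transfer_cost)
    best_x, best_cost = None, float("inf")
    # Enumerate every binary assignment vector x in {0, 1}^n.
    for x in product((0, 1), repeat=n):
        if sum(x) != copies:
            continue  # constraint: store exactly `copies` replicas
        network = sum(c for c, xi in zip(transfer_cost, x) if xi)
        p_all_down = 1.0
        for p, xi in zip(unavail_prob, x):
            if xi:
                p_all_down *= p
        cost = network + lost_work * p_all_down
        if cost < best_cost:
            best_x, best_cost = x, cost
    return best_x, best_cost


if __name__ == "__main__":
    # Hypothetical inputs: per-host transfer cost and unavailability probability.
    costs = [1.0, 2.0, 5.0]
    probs = [0.5, 0.1, 0.01]
    print(select_repositories(costs, probs, lost_work=100.0, copies=1))
    print(select_repositories(costs, probs, lost_work=100.0, copies=2))
```

In this toy instance the cheapest host to reach is not selected when a single copy is stored, because its high unavailability makes the expected lost work dominate the total cost. That is the kind of trade-off between network overhead and lost work that the abstract describes.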

Index Terms

Failure-aware checkpointing in fine-grained cycle sharing systems
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks

Published In

HPDC '07: Proceedings of the 16th international symposium on High performance distributed computing

June 2007

256 pages

ISBN:9781595936738

DOI:10.1145/1272366

General Chair: Carl Kesselman (USC/ISI)

Program Chairs: Jack Dongarra (University of Tennessee), David Walker (University of Cardiff)

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

HPDC07

HPDC07: International Symposium on High Performance Distributed Computing

June 25 - 29, 2007

Monterey, California, USA

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%
