Lightweight Virtual Machine Checkpoint and Rollback for Long-running Applications

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9530)

Abstract

Checkpoint/rollback is an effective way to ensure that long-running applications complete despite failures. However, it does not come for free: an application suffers downtime and a performance penalty while it is being checkpointed or rolled back, which adds overhead to its execution time. The problem is worse in virtualized environments, mainly because virtual machines are heavyweight. This paper proposes warmCR, a lightweight checkpoint/rollback system for virtual machines that aims to reduce the overhead it adds to application execution time. First, warmCR uses a redirect-on-write approach to create disk checkpoints and a copy-on-write method to create memory checkpoints live, reducing both downtime and checkpoint duration. Second, we propose a working-set-based rollback approach that provides short downtime without compromising application performance. Third, we propose workload-aware batched processing to trade off downtime against performance loss. In addition to presenting warmCR, we detail its implementation and provide extensive experimental results that demonstrate its efficiency and effectiveness.
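
The abstract names two general techniques, redirect-on-write for disk checkpoints and copy-on-write for live memory checkpoints. The following is a minimal, illustrative sketch of those two ideas in Python. It is not warmCR's implementation: the class names, the layered-dictionary disk, and the page list are all assumptions made for illustration, and a real system would operate on disk image files and guest memory pages inside the hypervisor.

```python
class RedirectOnWriteDisk:
    """Toy disk (hypothetical): after a checkpoint, writes go to a new layer."""

    def __init__(self, blocks):
        # layers[0] is the base image; each later layer holds redirected writes.
        self.layers = [dict(enumerate(blocks))]

    def checkpoint(self):
        # Freeze the current top layer and redirect later writes to a fresh one.
        # No existing block is copied, so the checkpoint itself is cheap.
        self.layers.append({})
        return len(self.layers) - 2  # index of the layer that was just frozen

    def write(self, block_no, data):
        self.layers[-1][block_no] = data

    def read(self, block_no):
        # The newest layer containing the block wins.
        for layer in reversed(self.layers):
            if block_no in layer:
                return layer[block_no]
        raise KeyError(block_no)

    def rollback(self, checkpoint_id):
        # Drop everything written after the checkpoint, then open a new layer.
        del self.layers[checkpoint_id + 1:]
        self.layers.append({})


class CopyOnWriteMemory:
    """Toy memory (hypothetical): old page contents are saved on first write."""

    def __init__(self, pages):
        self.pages = list(pages)
        self.saved = {}       # page number -> content at checkpoint time
        self.pending = set()  # pages not yet saved into the checkpoint

    def begin_checkpoint(self):
        # The guest keeps running; every page is marked as still to be saved.
        self.saved = {}
        self.pending = set(range(len(self.pages)))

    def write(self, page_no, data):
        # Preserve the checkpoint-time content before the first overwrite.
        if page_no in self.pending:
            self.saved[page_no] = self.pages[page_no]
            self.pending.discard(page_no)
        self.pages[page_no] = data

    def background_save(self, max_pages):
        # Save up to max_pages untouched pages; return True when complete.
        for page_no in sorted(self.pending)[:max_pages]:
            self.saved[page_no] = self.pages[page_no]
            self.pending.discard(page_no)
        return not self.pending
```

In this sketch, a disk checkpoint copies no data because later writes are simply redirected, while a memory checkpoint lets the guest keep writing because each page's old content is captured just before its first modification.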

Acknowledgement

We would like to thank the anonymous reviewers for their valuable comments and help in improving this paper. This work is supported by the National Key Technology Support Program under grant No. 2012BAH46B02.

Author information

Corresponding author

Correspondence to Zhiyu Hao.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Cui, L. et al. (2015). Lightweight Virtual Machine Checkpoint and Rollback for Long-running Applications. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9530. Springer, Cham. https://doi.org/10.1007/978-3-319-27137-8_42

  • DOI: https://doi.org/10.1007/978-3-319-27137-8_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27136-1

  • Online ISBN: 978-3-319-27137-8

  • eBook Packages: Computer Science, Computer Science (R0)
