VMckpt: lightweight and live virtual machine checkpointing

  • Research Paper
Science China Information Sciences

Abstract

Recent advances in virtualization technology provide a new approach to checkpoint/restart at the virtual machine (VM) level. In contrast to traditional process-level checkpointing, checkpointing at the virtualization layer offers several advantages, such as compatibility, transparency, flexibility and simplicity. However, because the virtualization layer has little semantic knowledge about the operating system and the applications running on top of it, VM-layer checkpointing requires saving the entire operating system state rather than the state of a single process. The resulting overhead may render the approach impractical. To reduce the size of a VM checkpoint, in this paper we propose a page eviction scheme and an incremental checkpointing mechanism that avoid saving unnecessary VM pages in the checkpoint. To keep the system online transparently, we propose a live checkpointing mechanism that saves the memory image in a copy-on-write (COW) manner. We implement these performance optimization mechanisms in a prototype system called VMckpt. Experimental results with a group of representative applications show that, in comparison with Xen's default checkpointing mechanism, our page eviction scheme and incremental checkpointing reduce the checkpoint file size by up to 87% and shorten the total checkpointing/restart time by up to 71%. The observed application downtime due to checkpointing can be reduced to as little as 300 ms.
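The two size-reduction ideas named in the abstract can be illustrated with a minimal sketch. This is not the VMckpt implementation and does not use any Xen interface. `ToyVM`, `checkpoint`, and `restore` are hypothetical names, and the sketch assumes that the set of free pages and the set of pages written since the last checkpoint are already known. Page eviction is modeled as skipping pages marked free, and incremental checkpointing as saving only pages dirtied since the previous checkpoint. The copy-on-write live mechanism is not modeled here.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size for the toy model


@dataclass
class ToyVM:
    """Hypothetical stand-in for guest memory; not a Xen data structure."""
    num_pages: int
    pages: list = field(init=False)
    dirty: set = field(init=False)
    free: set = field(init=False)

    def __post_init__(self):
        self.pages = [bytes(PAGE_SIZE) for _ in range(self.num_pages)]
        # Before the first checkpoint every page counts as dirty.
        self.dirty = set(range(self.num_pages))
        self.free = set()

    def write(self, pfn, data):
        """Store data in page pfn, mark it dirty and in use."""
        self.pages[pfn] = data.ljust(PAGE_SIZE, b"\0")[:PAGE_SIZE]
        self.dirty.add(pfn)
        self.free.discard(pfn)

    def release(self, pfn):
        """Mark page pfn as free, so its contents need not be saved."""
        self.free.add(pfn)


def checkpoint(vm):
    """Save only pages that are dirty and not free, then reset dirty tracking."""
    delta = {pfn: vm.pages[pfn] for pfn in sorted(vm.dirty - vm.free)}
    vm.dirty.clear()
    return delta


def restore(num_pages, deltas):
    """Rebuild memory by replaying checkpoints from oldest to newest."""
    pages = [bytes(PAGE_SIZE)] * num_pages
    for delta in deltas:
        for pfn, page in delta.items():
            pages[pfn] = page
    return pages


if __name__ == "__main__":
    vm = ToyVM(num_pages=8)
    vm.release(6)
    vm.release(7)
    full = checkpoint(vm)          # first checkpoint skips the two free pages
    vm.write(2, b"hello")
    incremental = checkpoint(vm)   # second checkpoint holds only the rewritten page
    print(len(full), sorted(incremental))
```

In this toy run the first checkpoint holds six of the eight pages because two are free, and the second holds only the single rewritten page. A real system would obtain the dirty set from the hypervisor's dirty-page tracking and the free set from guest cooperation, both of which are outside this sketch.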

Author information

Corresponding author

Correspondence to XiaoFei Liao.

About this article

Cite this article

Liu, H., Jin, H., Liao, X. et al. VMckpt: lightweight and live virtual machine checkpointing. Sci. China Inf. Sci. 55, 2865–2880 (2012). https://doi.org/10.1007/s11432-011-4501-7

  • DOI: https://doi.org/10.1007/s11432-011-4501-7
