Abstract
Recent advances in virtualization technology provide a new approach to checkpoint/restart at the virtual machine (VM) level. In contrast to traditional process-level checkpointing, checkpointing at the virtualization layer offers several advantages, such as compatibility, transparency, flexibility and simplicity. However, because the virtualization layer has little semantic knowledge about the operating system and the applications running on top of it, VM-layer checkpointing requires saving the entire operating system state rather than the state of a single process. This overhead may render the approach impractical. To reduce the size of a VM checkpoint, in this paper we propose a page eviction scheme and an incremental checkpointing mechanism that avoid saving unnecessary VM pages in the checkpoint. To keep the system online transparently, we propose a live checkpointing mechanism that saves the memory image in a copy-on-write (COW) manner. We implement these performance optimization mechanisms in a prototype system called VMckpt. Experimental results with a group of representative applications show that our page eviction scheme and incremental checkpointing can reduce the checkpoint file size by up to 87% and shorten the total checkpoint/restart time by up to 71%, in comparison with Xen's default checkpointing mechanism. The observed application downtime due to checkpointing can be reduced to as little as 300 ms.
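To make the idea of avoiding unnecessary pages concrete, the following is a minimal sketch of incremental checkpointing combined with exclusion of unused pages. It is a toy model written for illustration, not the VMckpt implementation: the `ToyVM` class, its `write` and `release` methods, the `checkpoint` and `restore` functions, and the 4096-byte page size are all hypothetical names and values assumed here.

```python
from typing import Dict, List, Set

PAGE_SIZE = 4096  # bytes per page; an illustrative value, not taken from the paper


class ToyVM:
    """Toy model of guest memory: page number -> page contents (hypothetical)."""

    def __init__(self, num_pages: int) -> None:
        self.pages: Dict[int, bytes] = {n: bytes(PAGE_SIZE) for n in range(num_pages)}
        # Before the first checkpoint every page counts as modified.
        self.dirty: Set[int] = set(self.pages)
        # Pages the guest is assumed to report as unused.
        self.free: Set[int] = set()

    def write(self, page: int, data: bytes) -> None:
        """Store data in a page (zero-padded) and mark the page dirty and in use."""
        self.pages[page] = data.ljust(PAGE_SIZE, b"\0")[:PAGE_SIZE]
        self.dirty.add(page)
        self.free.discard(page)

    def release(self, page: int) -> None:
        """Mark a page as unused so that checkpoints can skip it."""
        self.free.add(page)


def checkpoint(vm: ToyVM) -> Dict[int, bytes]:
    """Save only pages that are dirty and in use, then reset dirty tracking."""
    to_save = vm.dirty - vm.free
    image = {n: vm.pages[n] for n in sorted(to_save)}
    vm.dirty.clear()
    return image


def restore(num_pages: int, checkpoints: List[Dict[int, bytes]]) -> Dict[int, bytes]:
    """Rebuild memory by replaying checkpoints from oldest to newest."""
    pages = {n: bytes(PAGE_SIZE) for n in range(num_pages)}
    for image in checkpoints:
        pages.update(image)
    return pages


vm = ToyVM(num_pages=8)
for unused in range(4, 8):
    vm.release(unused)          # half of memory is unused
full = checkpoint(vm)           # first checkpoint: only the in-use pages
vm.write(1, b"hello")
incremental = checkpoint(vm)    # second checkpoint: only the page modified since
restored = restore(8, [full, incremental])
```

In this sketch the first checkpoint holds the four in-use pages and the second holds only the single page written afterwards, which mirrors why skipping unused and unmodified pages shrinks the checkpoint file. A real hypervisor would obtain dirty and free-page information from its own page-tracking facilities rather than from explicit method calls.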
Cite this article
Liu, H., Jin, H., Liao, X. et al. VMckpt: lightweight and live virtual machine checkpointing. Sci. China Inf. Sci. 55, 2865–2880 (2012). https://doi.org/10.1007/s11432-011-4501-7
DOI: https://doi.org/10.1007/s11432-011-4501-7