ABSTRACT
General-purpose virtual machine fault tolerance (VMFT) implementations are based on an epoch-based execution model, in which the outputs of a protected VM are buffered and released to the external world only at specific time points. Because this execution model increases both the magnitude and the variability of the per-packet round-trip delay and disrupts the delayed ACK mechanism, the TCP performance of a VM running under it tends to drop noticeably. This paper describes the design, implementation and evaluation of a set of optimizations that address the TCP performance problems caused by the epoch-based execution model. Measurements on a complete VMFT prototype implementation called Cuju demonstrate that the proposed optimizations eliminate most of these TCP performance losses when the MTU is 1500 bytes.
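The effect of output buffering on per-packet delay can be illustrated with a minimal sketch. It is not taken from Cuju: the function names, the 10 ms epoch length, and the rule that an output is held until the end of the epoch in which it was produced are all assumptions made for illustration, and checkpointing cost is ignored.

```python
import math


def release_time(send_time_ms: float, epoch_ms: float) -> float:
    """Return when a packet produced at send_time_ms leaves the output buffer.

    Assumption: outputs are held until the end of the epoch in which they
    were produced. A packet produced exactly on an epoch boundary is treated
    as belonging to the epoch that starts at that boundary.
    """
    return (math.floor(send_time_ms / epoch_ms) + 1) * epoch_ms


def added_delays(send_times_ms, epoch_ms):
    """Extra one-way delay each packet incurs because of buffering."""
    return [release_time(t, epoch_ms) - t for t in send_times_ms]


# Hypothetical send times (ms) under an assumed 10 ms epoch.
delays = added_delays([1.0, 4.0, 9.0, 12.0, 19.5], epoch_ms=10.0)
print(delays)  # [9.0, 6.0, 1.0, 8.0, 0.5]
```

Under these assumptions the added delay falls anywhere between nearly zero and a full epoch depending on where in the epoch a packet is produced, which is one way to see why both the size and the variation of the round-trip delay grow.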
Index Terms
- TCP Performance Optimization for Epoch-based Execution