ABSTRACT
Scale-up machines are not economical and may be unaffordable for small businesses, so in recent years scale-out has become the standard answer for data analysis, machine learning, and many other fields. However, scale-out frameworks introduce complex programming models that place a burden on developers. Single System Image (SSI), in which a cluster of machines appears to be one single system, has therefore been proposed to hide the complexity of distributed systems. Unfortunately, given the mature ecosystem of current mainstream operating systems (OSes), modifying an existing OS to implement SSI is non-trivial and may even be unaffordable. With the wide use of virtualization, we believe it is appealing to support SSI at the hypervisor level, without modifying guest OSes.
This paper presents GiantVM, an open-source distributed hypervisor that provides many-to-one virtualization to aggregate resources from multiple physical machines and presents a uniform hardware abstraction to the guest OS. GiantVM combines the benefits of scale-up and scale-out solutions: unmodified applications can run with a large amount of physical resources. GiantVM leverages distributed shared memory to aggregate memory, and we propose techniques to address the challenges of CPU and I/O virtualization in distributed environments. We have implemented GiantVM on top of QEMU-KVM, a state-of-the-art type-II hypervisor, and it can currently host conventional OSes such as Linux. Evaluations identify the performance bottleneck and show that GiantVM outperforms Spark by up to 3.4× on two text-processing programs.
Index Terms
- GiantVM: a type-II hypervisor implementing many-to-one virtualization