ABSTRACT
Containers are widely adopted thanks to their deployment and performance advantages over virtual machines. For many containerized data-intensive applications, however, bulky data transfers can cause performance problems. In particular, communication across co-located containers on the same host incurs large overheads from memory copies and the kernel's TCP stack. Existing solutions such as shared-memory networking and RDMA have their own limitations, including insufficient memory isolation and limited scalability.
This paper presents PipeDevice, a new system for low-overhead intra-host container communication. PipeDevice follows a hardware-software co-design approach --- it offloads data forwarding entirely onto hardware, which accesses application data in hugepages on the host, thereby eliminating the CPU overhead of memory copies and TCP processing. PipeDevice preserves memory isolation and scales well to a large number of connections, making it deployable in public clouds. Isolation is achieved by allocating dedicated memory to each connection from hugepages. To achieve high scalability, PipeDevice stores connection states entirely in host DRAM and manages them in software. Evaluation with a prototype implementation on a commodity FPGA shows that for delivering 80 Gbps across containers, PipeDevice saves 63.2% CPU compared to the kernel TCP stack, and 40.5% compared to FreeFlow. PipeDevice provides salient benefits to applications. For example, we port baidu-allreduce to PipeDevice and obtain ~2.2× gains in allreduce throughput.
- Achieving Fast, Scalable I/O for Virtualized Servers. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/scalable-i-o-virtualized-servers-paper.pdf.
- Amazon web service. https://aws.amazon.com/.
- AMD Zen 4 Epyc CPU. https://www.techradar.com/news/amd-zen-4-epyc-cpu-could-be-an-epic-128-core-256-thread-monster.
- Baidu-allreduce. https://github.com/baidu-research/baidu-allreduce.
- bpftrace: High-level tracing language for Linux systems. https://bpftrace.org/.
- Cilium. https://github.com/cilium/cilium.
- Cloud-Native Network Functions. https://www.cisco.com/c/en/us/solutions/service-provider/industry/cable/cloud-native-network-functions.html.
- Container management in 2021: In-depth guide. https://research.aimultiple.com/container-management/.
- containerd: an industry-standard container runtime with an emphasis on simplicity, robustness and portability. https://containerd.io/.
- Deep learning containers in Google Cloud. https://cloud.google.com/deep-learning-containers.
- Enable Istio proxy sidecar injection in Oracle cloud native environment. https://docs.oracle.com/en/learn/ocne-sidecars/index.html#introduction.
- F-Stack: A high performance userspace stack based on FreeBSD 11.0 stable. http://www.f-stack.org/.
- Fast memcpy with SPDK and Intel I/OAT DMA Engine. https://www.intel.com/content/www/us/en/developer/articles/technical/fast-memcpy-using-spdk-and-ioat-dma-engine.html.
- FreeFlow TCP. https://github.com/microsoft/Freeflow/tree/tcp.
- Gloo. https://github.com/facebookincubator/gloo.
- Implement mmap() for zero copy receive. https://lwn.net/Articles/752207/.
- Implementing TCP Sockets over RDMA. https://www.openfabrics.org/images/eventpresos/workshops2014/IBUG/presos/Thursday/PDF/09_Sockets-over-rdma.pdf.
- Information about the TCP chimney offload, receive side scaling, and network direct memory access features in Windows server 2008. https://support.microsoft.com/en-us/help/951037/information-about-the-tcp-chimney-offload-receive-side-scaling-and-net.
- Intel Arria 10 product table. https://www.intel.co.id/content/dam/www/programmable/us/en/pdfs/literature/pt/arria-10-product-table.pdf.
- Intel C610 Series Chipset Datasheet. https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/x99-chipset-pch-datasheet.pdf.
- Intel DSA specification. https://www.intel.com/content/www/us/en/develop/articles/intel-data-streaming-accelerator-architecture-specification.html.
- Intel QuickData Technology Software Guide. https://www.intel.com/content/dam/doc/white-paper/quickdata-technology-software-guide-for-linux-paper.pdf.
- IOAT benchmark. https://github.com/spdk/spdk/tree/master/examples/ioat/perf.
- io_uring. https://man.archlinux.org/man/io_uring.7.en.
- Istio. https://istio.io/latest/about/service-mesh/.
- Linkerd architecture. https://linkerd.io/2.11/reference/architecture/.
- Mellanox BlueField-2 DPU. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-2-dpu.pdf.
- Mellanox BlueField DPU DMA Guide. https://docs.nvidia.com/doca/sdk/dma-samples/index.html.
- Microsoft Azure. https://azure.microsoft.com/.
- NCCL. https://github.com/NVIDIA/nccl.
- Open MPI: Open source high performance computing. https://www.open-mpi.org/.
- Perftest. https://github.com/linux-rdma/perftest.
- Run Spark applications with Docker using Amazon EMR 6.x. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-docker.html.
- Seastar. http://www.seastar-project.org/.
- Spark and Docker: Your Spark development cycle just got 10x faster! https://towardsdatascience.com/spark-and-docker-your-spark-development-cycle-just-got-10x-faster-f41ed50c67fd.
- TCP mmap() program. https://lwn.net/Articles/752197/.
- What is container management and why is it important. https://searchitoperations.techtarget.com/definition/container-management-software.
- Why use Docker containers for machine learning development? https://aws.amazon.com/cn/blogs/opensource/why-use-docker-containers-for-machine-learning-development/.
- Zero-copy TCP receive. https://lwn.net/Articles/752188/.
- P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D.K. Panda. Zero copy sockets direct protocol over InfiniBand --- preliminary implementation and performance analysis. In Proc. IEEE ISPASS, 2004.
- Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. IX: A protected dataplane operating system for high throughput and low latency. In Proc. USENIX OSDI, 2014.
- Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. A cloud-scale acceleration architecture. In Proc. IEEE/ACM MICRO, 2016.
- Youmin Chen, Youyou Lu, and Jiwu Shu. Scalable RDMA RPC on reliable connection with efficient resource sharing. In Proc. ACM EuroSys, 2019.
- Yuchen Cheng, Chunghsuan Wu, Yanqiang Liu, Rui Ren, Hong Xu, Bin Yang, and Zhengwei Qi. OPS: Optimized shuffle management system for Apache Spark. In Proc. ACM ICPP, 2020.
- Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM: Fast remote memory. In Proc. USENIX NSDI, 2014.
- Weibei Fan, Jing He, Zhijie Han, Peng Li, and Ruchuan Wang. Intelligent resource scheduling based on locality principle in data center networks. IEEE Communications Magazine, 58(10):94--100, 2020.
- Philipp Fent, Alexander van Renen, Andreas Kipf, Viktor Leis, Thomas Neumann, and Alfons Kemper. Low-latency communication for fast DBMS using RDMA and shared memory. In Proc. IEEE ICDE, 2020.
- Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. Azure Accelerated Networking: SmartNICs in the public cloud. In Proc. USENIX NSDI, 2018.
- D. Goldenberg, M. Kagan, R. Ravid, and M.S. Tsirkin. Sockets Direct Protocol over InfiniBand in clusters: is it beneficial? In Proc. IEEE HOTI, 2005.
- Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. MegaPipe: A new programming interface for scalable network I/O. In Proc. USENIX OSDI, 2012.
- Zhiqiang He, Dongyang Wang, Binzhang Fu, Kun Tan, Bei Hua, Zhi-Li Zhang, and Kai Zheng. MasQ: RDMA for Virtual Private Cloud. In Proc. ACM SIGCOMM, 2020.
- Michio Honda, Giuseppe Lettieri, Lars Eggert, and Douglas Santry. PASTE: A network programming interface for non-volatile main memory. In Proc. USENIX NSDI, 2018.
- Jinho Hwang, K. K. Ramakrishnan, and Timothy Wood. NetVM: High performance and flexible networking using virtualization on commodity platforms. In Proc. USENIX NSDI, 2014.
- EunYoung Jeong, Shinae Wood, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and KyoungSoo Park. mTCP: A highly scalable user-level TCP stack for multicore systems. In Proc. USENIX NSDI, 2014.
- Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In Proc. USENIX OSDI, 2020.
- Anuj Kalia, Michael Kaminsky, and David Andersen. Datacenter RPCs can be general and fast. In Proc. USENIX NSDI, 2019.
- Anuj Kalia, Michael Kaminsky, and David G. Andersen. Using RDMA efficiently for key-value services. In Proc. ACM SIGCOMM, 2014.
- Anuj Kalia, Michael Kaminsky, and David G. Andersen. Design guidelines for high performance RDMA systems. In Proc. USENIX ATC, 2016.
- Anuj Kalia, Michael Kaminsky, and David G. Andersen. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In Proc. USENIX OSDI, 2016.
- Junaid Khalid, Eric Rozner, Wesley Felter, Cong Xu, Karthick Rajamani, Alexandre Ferreira, and Aditya Akella. Iron: Isolating network-based CPU in container environments. In Proc. USENIX NSDI, 2018.
- Daehyeok Kim, Tianlong Yu, Hongqiang Liu, Yibo Zhu, Jitu Padhye, Shachar Raindel, Chuanxiong Guo, Vyas Sekar, and Srinivasan Seshan. FreeFlow: Software-based virtual RDMA networking for containerized clouds. In Proc. USENIX NSDI, 2019.
- Sameer G. Kulkarni, Wei Zhang, Jinho Hwang, Shriram Rajagopalan, K. K. Ramakrishnan, Timothy Wood, Mayutan Arumaithurai, and Xiaoming Fu. NFVnice: Dynamic backpressure and scheduling for NFV service chains. In Proc. ACM SIGCOMM, 2017.
- Jiaxin Lei, Manish Munikar, Kun Suo, Hui Lu, and Jia Rao. Parallelizing packet processing in container overlay networks. In Proc. ACM EuroSys, 2021.
- Bojie Li, Tianyi Cui, Zibo Wang, Wei Bai, and Lintao Zhang. SocksDirect: Data-center sockets can be fast and compatible. In Proc. ACM SIGCOMM, 2020.
- Jian Li, Shuai Xue, Wang Zhang, Ruhui Ma, Zhengwei Qi, and Haibing Guan. When I/O interrupt becomes system bottleneck: Efficiency and scalability enhancement for SR-IOV network virtualization. IEEE Transactions on Cloud Computing, 7(4):1183--1196, 2019.
- Xiaofeng Lin, Yu Chen, Xiaodong Li, Junjie Mao, Jiaquan He, Wei Xu, and Yuanchun Shi. Scalable kernel TCP design and implementation for short-lived connections. In Proc. ASPLOS, 2016.
- Glenn K. Lockwood, Mahidhar Tatineni, and Rick Wagner. SR-IOV: Performance benefits for virtualized interconnects. In Proc. ACM XSEDE, 2014.
- Patrick MacArthur and Robert D. Russell. An Efficient Method for Stream Semantics over RDMA. In Proc. IEEE IPDPS, 2014.
- Ilias Marinos, Robert NM Watson, and Mark Handley. Network stack specialization for performance. In Proc. ACM SIGCOMM, 2014.
- YoungGyoun Moon, SeungEon Lee, Muhammad Asim Jamshed, and KyoungSoo Park. AccelTCP: Accelerating network applications with stateful TCP offloading. In Proc. USENIX NSDI, 2020.
- Jaehyun Nam, Seungsoo Lee, Hyunmin Seo, Phil Porras, Vinod Yegneswaran, and Seungwon Shin. BASTION: A security enforcement network stack for container networks. In Proc. USENIX ATC, 2020.
- Rolf Neugebauer, Gianni Antichi, José Fernando Zazo, Yury Audzevich, Sergio López-Buedo, and Andrew W. Moore. Understanding PCIe performance for end host networking. In Proc. ACM SIGCOMM, 2018.
- Zhixiong Niu, Hong Xu, Peng Cheng, Qiang Su, Yongqiang Xiong, Tao Wang, Dongsu Han, and Keith Winstein. NetKernel: Making network stack part of the virtualized infrastructure. In Proc. USENIX ATC, 2020.
- Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. A generic communication scheduler for distributed DNN training acceleration. In Proc. ACM SOSP, 2019.
- Arjun Singhvi, Aditya Akella, Dan Gibson, Thomas F. Wenisch, Monica Wong-Chan, Sean Clark, Milo M. K. Martin, Moray McLaren, Prashant Chandra, Rob Cauble, Hassan M. G. Wassel, Behnam Montazeri, Simon L. Sabato, Joel Scherpelz, and Amin Vahdat. 1RMA: Re-envisioning remote memory access for multi-tenant datacenters. In Proc. ACM SIGCOMM, 2020.
- Livio Soares and Michael Stumm. FlexSC: Flexible system call scheduling with exception-less system calls. In Proc. USENIX OSDI, 2010.
- Shin-Yeh Tsai and Yiying Zhang. LITE Kernel RDMA Support for Datacenter Applications. In Proc. ACM SOSP, 2017.
- Jian Yang, Joseph Izraelevitz, and Steven Swanson. FileMR: Rethinking RDMA networking for scalable persistent memory. In Proc. USENIX NSDI, 2020.
- Kenichi Yasukata, Michio Honda, Douglas Santry, and Lars Eggert. StackMap: Low-latency networking with the OS stack and dedicated NICs. In Proc. USENIX ATC, 2016.
- Tianlong Yu, Shadi Abdollahian Noghabi, Shachar Raindel, Hongqiang Liu, Jitu Padhye, and Vyas Sekar. FreeFlow: High performance container networking. In Proc. ACM HotNets, 2016.
- Wei Zhang, Guyue Liu, Wenhui Zhang, Neel Shah, Phil Lopreiato, Gregoire Todeschi, KK Ramakrishnan, and Timothy Wood. OpenNetVM: A platform for high performance network service chains. In Proc. ACM HotMiddlebox, 2015.
- Dongfang Zhao, Mohamed Mohamed, and Heiko Ludwig. Locality-aware scheduling for containers in cloud computing. IEEE Transactions on Cloud Computing, 8(2):635--646, 2020.
- Chao Zheng, Qiuwen Lu, Jia Li, Qinyun Liu, and Binxing Fang. A flexible and efficient container-based NFV platform for middlebox networking. In Proc. ACM SAC, 2018.
- Danyang Zhuo, Kaiyuan Zhang, Yibo Zhu, Hongqiang Harry Liu, Matthew Rockett, Arvind Krishnamurthy, and Thomas Anderson. Slim: OS kernel support for a low-overhead container overlay network. In Proc. USENIX NSDI, 2019.
Index Terms
- PipeDevice: a hardware-software co-design approach to intra-host container communication