ABSTRACT
Containers are widely adopted thanks to their deployment and performance advantages over virtual machines. For many containerized data-intensive applications, however, bulky data transfers can cause performance problems. In particular, communication across co-located containers on the same host incurs large overheads from memory copies and the kernel's TCP stack. Existing solutions such as shared-memory networking and RDMA have their own limitations, including insufficient memory isolation and limited scalability.
This paper presents PipeDevice, a new system for low-overhead intra-host container communication. PipeDevice follows a hardware-software co-design approach --- it offloads data forwarding entirely onto hardware, which accesses application data in hugepages on the host, thereby eliminating the CPU overhead of memory copies and TCP processing. PipeDevice preserves memory isolation and scales well to a large number of connections, making it deployable in public clouds. Isolation is achieved by allocating dedicated memory to each connection from hugepages. To achieve high scalability, PipeDevice stores connection states entirely in host DRAM and manages them in software. Evaluation with a prototype implementation on a commodity FPGA shows that for delivering 80 Gbps across containers, PipeDevice saves 63.2% CPU compared to the kernel TCP stack, and 40.5% compared to FreeFlow. PipeDevice provides salient benefits to applications. For example, we port baidu-allreduce to PipeDevice and obtain ~2.2× gains in allreduce throughput.
- Achieving Fast, Scalable I/O for Virtualized Servers. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/scalable-i-o-virtualized-servers-paper.pdf.
- Amazon web service. https://aws.amazon.com/.
- AMD Zen 4 Epyc CPU. https://www.techradar.com/news/amd-zen-4-epyc-cpu-could-be-an-epic-128-core-256-thread-monster.
- Baidu-allreduce. https://github.com/baidu-research/baidu-allreduce.
- bpftrace: High-level tracing language for Linux systems. https://bpftrace.org/.
- Cilium. https://github.com/cilium/cilium.
- Cloud-Native Network Functions. https://www.cisco.com/c/en/us/solutions/service-provider/industry/cable/cloud-native-network-functions.html.
- Container management in 2021: In-depth guide. https://research.aimultiple.com/container-management/.
- containerd: an industry-standard container runtime with an emphasis on simplicity, robustness and portability. https://containerd.io/.
- Deep learning containers in Google Cloud. https://cloud.google.com/deep-learning-containers.
- Enable Istio proxy sidecar injection in Oracle cloud native environment. https://docs.oracle.com/en/learn/ocne-sidecars/index.html#introduction.
- F-Stack: A high performance userspace stack based on FreeBSD 11.0 stable. http://www.f-stack.org/.
- Fast memcpy with SPDK and Intel I/OAT DMA Engine. https://www.intel.com/content/www/us/en/developer/articles/technical/fast-memcpy-using-spdk-and-ioat-dma-engine.html.
- FreeFlow TCP. https://github.com/microsoft/Freeflow/tree/tcp.
- Gloo. https://github.com/facebookincubator/gloo.
- Implement mmap() for zero copy receive. https://lwn.net/Articles/752207/.
- Implementing TCP Sockets over RDMA. https://www.openfabrics.org/images/eventpresos/workshops2014/IBUG/presos/Thursday/PDF/09_Sockets-over-rdma.pdf.
- Information about the TCP chimney offload, receive side scaling, and network direct memory access features in Windows server 2008. https://support.microsoft.com/en-us/help/951037/information-about-the-tcp-chimney-offload-receive-side-scaling-and-net.
- Intel Arria 10 product table. https://www.intel.co.id/content/dam/www/programmable/us/en/pdfs/literature/pt/arria-10-product-table.pdf.
- Intel C610 Series Chipset Datasheet. https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/x99-chipset-pch-datasheet.pdf.
- Intel DSA specification. https://www.intel.com/content/www/us/en/develop/articles/intel-data-streaming-accelerator-architecture-specification.html.
- Intel QuickData Technology Software Guide. https://www.intel.com/content/dam/doc/white-paper/quickdata-technology-software-guide-for-linux-paper.pdf.
- IOAT benchmark. https://github.com/spdk/spdk/tree/master/examples/ioat/perf.
- io_uring. https://man.archlinux.org/man/io_uring.7.en.
- Istio. https://istio.io/latest/about/service-mesh/.
- Linkerd architecture. https://linkerd.io/2.11/reference/architecture/.
- Mellanox BlueField-2 DPU. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-2-dpu.pdf.
- Mellanox BlueField DPU DMA Guide. https://docs.nvidia.com/doca/sdk/dma-samples/index.html.
- Microsoft Azure. https://azure.microsoft.com/.
- NCCL. https://github.com/NVIDIA/nccl.
- Open MPI: Open source high performance computing. https://www.open-mpi.org/.
- Perftest. https://github.com/linux-rdma/perftest.
- Run Spark applications with Docker using Amazon EMR 6.x. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-docker.html.
- Seastar. http://www.seastar-project.org/.
- Spark and Docker: Your Spark development cycle just got 10x faster! https://towardsdatascience.com/spark-and-docker-your-spark-development-cycle-just-got-10x-faster-f41ed50c67fd.
- TCP mmap() program. https://lwn.net/Articles/752197/.
- What is container management and why is it important. https://searchitoperations.techtarget.com/definition/container-management-software.
- Why use Docker containers for machine learning development? https://aws.amazon.com/cn/blogs/opensource/why-use-docker-containers-for-machine-learning-development/.
- Zero-copy TCP receive. https://lwn.net/Articles/752188/.
- P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D.K. Panda. Zero copy sockets direct protocol over InfiniBand --- preliminary implementation and performance analysis. In Proc. IEEE ISPASS, 2004.
- Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. IX: A protected dataplane operating system for high throughput and low latency. In Proc. USENIX OSDI, 2014.
- Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. A cloud-scale acceleration architecture. In Proc. IEEE/ACM MICRO, 2016.
- Youmin Chen, Youyou Lu, and Jiwu Shu. Scalable RDMA RPC on reliable connection with efficient resource sharing. In Proc. ACM EuroSys, 2019.
- Yuchen Cheng, Chunghsuan Wu, Yanqiang Liu, Rui Ren, Hong Xu, Bin Yang, and Zhengwei Qi. OPS: Optimized shuffle management system for Apache Spark. In Proc. ACM ICPP, 2020.
- Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM: Fast remote memory. In Proc. USENIX NSDI, 2014.
- Weibei Fan, Jing He, Zhijie Han, Peng Li, and Ruchuan Wang. Intelligent resource scheduling based on locality principle in data center networks. IEEE Communications Magazine, 58(10):94--100, 2020.
- Philipp Fent, Alexander van Renen, Andreas Kipf, Viktor Leis, Thomas Neumann, and Alfons Kemper. Low-latency communication for fast DBMS using RDMA and shared memory. In Proc. IEEE ICDE, 2020.
- Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. Azure Accelerated Networking: SmartNICs in the public cloud. In Proc. USENIX NSDI, 2018.
- D. Goldenberg, M. Kagan, R. Ravid, and M.S. Tsirkin. Sockets Direct Protocol over InfiniBand in clusters: is it beneficial? In Proc. IEEE HOTI, 2005.
- Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. MegaPipe: A new programming interface for scalable network I/O. In Proc. USENIX OSDI, 2012.
- Zhiqiang He, Dongyang Wang, Binzhang Fu, Kun Tan, Bei Hua, Zhi-Li Zhang, and Kai Zheng. MasQ: RDMA for Virtual Private Cloud. In Proc. ACM SIGCOMM, 2020.
- Michio Honda, Giuseppe Lettieri, Lars Eggert, and Douglas Santry. PASTE: A network programming interface for non-volatile main memory. In Proc. USENIX NSDI, 2018.
- Jinho Hwang, K. K. Ramakrishnan, and Timothy Wood. NetVM: High performance and flexible networking using virtualization on commodity platforms. In Proc. USENIX NSDI, 2014.
- EunYoung Jeong, Shinae Wood, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and KyoungSoo Park. mTCP: A highly scalable user-level TCP stack for multicore systems. In Proc. USENIX NSDI, 2014.
- Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In Proc. USENIX OSDI, 2020.
- Anuj Kalia, Michael Kaminsky, and David Andersen. Datacenter RPCs can be general and fast. In Proc. USENIX NSDI, 2019.
- Anuj Kalia, Michael Kaminsky, and David G. Andersen. Using RDMA efficiently for key-value services. In Proc. ACM SIGCOMM, 2014.
- Anuj Kalia, Michael Kaminsky, and David G. Andersen. Design guidelines for high performance RDMA systems. In Proc. USENIX ATC, 2016.
- Anuj Kalia, Michael Kaminsky, and David G. Andersen. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In Proc. USENIX OSDI, 2016.
- Junaid Khalid, Eric Rozner, Wesley Felter, Cong Xu, Karthick Rajamani, Alexandre Ferreira, and Aditya Akella. Iron: Isolating network-based CPU in container environments. In Proc. USENIX NSDI, 2018.
- Daehyeok Kim, Tianlong Yu, Hongqiang Liu, Yibo Zhu, Jitu Padhye, Shachar Raindel, Chuanxiong Guo, Vyas Sekar, and Srinivasan Seshan. FreeFlow: Software-based virtual RDMA networking for containerized clouds. In Proc. USENIX NSDI, 2019.
- Sameer G. Kulkarni, Wei Zhang, Jinho Hwang, Shriram Rajagopalan, K. K. Ramakrishnan, Timothy Wood, Mayutan Arumaithurai, and Xiaoming Fu. NFVnice: Dynamic backpressure and scheduling for NFV service chains. In Proc. ACM SIGCOMM, 2017.
- Jiaxin Lei, Manish Munikar, Kun Suo, Hui Lu, and Jia Rao. Parallelizing packet processing in container overlay networks. In Proc. ACM EuroSys, 2021.
- Bojie Li, Tianyi Cui, Zibo Wang, Wei Bai, and Lintao Zhang. SocksDirect: Data-center sockets can be fast and compatible. In Proc. ACM SIGCOMM, 2020.
- Jian Li, Shuai Xue, Wang Zhang, Ruhui Ma, Zhengwei Qi, and Haibing Guan. When I/O interrupt becomes system bottleneck: Efficiency and scalability enhancement for SR-IOV network virtualization. IEEE Transactions on Cloud Computing, 7(4):1183--1196, 2019.
- Xiaofeng Lin, Yu Chen, Xiaodong Li, Junjie Mao, Jiaquan He, Wei Xu, and Yuanchun Shi. Scalable kernel TCP design and implementation for short-lived connections. In Proc. ASPLOS, 2016.
- Glenn K. Lockwood, Mahidhar Tatineni, and Rick Wagner. SR-IOV: Performance benefits for virtualized interconnects. In Proc. ACM XSEDE, 2014.
- Patrick MacArthur and Robert D. Russell. An Efficient Method for Stream Semantics over RDMA. In Proc. IEEE IPDPS, 2014.
- Ilias Marinos, Robert NM Watson, and Mark Handley. Network stack specialization for performance. In Proc. ACM SIGCOMM, 2014.
- YoungGyoun Moon, SeungEon Lee, Muhammad Asim Jamshed, and KyoungSoo Park. AccelTCP: Accelerating network applications with stateful TCP offloading. In Proc. USENIX NSDI, 2020.
- Jaehyun Nam, Seungsoo Lee, Hyunmin Seo, Phil Porras, Vinod Yegneswaran, and Seungwon Shin. BASTION: A security enforcement network stack for container networks. In Proc. USENIX ATC, 2020.
- Rolf Neugebauer, Gianni Antichi, José Fernando Zazo, Yury Audzevich, Sergio López-Buedo, and Andrew W. Moore. Understanding PCIe performance for end host networking. In Proc. ACM SIGCOMM, 2018.
- Zhixiong Niu, Hong Xu, Peng Cheng, Qiang Su, Yongqiang Xiong, Tao Wang, Dongsu Han, and Keith Winstein. NetKernel: Making network stack part of the virtualized infrastructure. In Proc. USENIX ATC, 2020.
- Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. A generic communication scheduler for distributed DNN training acceleration. In Proc. ACM SOSP, 2019.
- Arjun Singhvi, Aditya Akella, Dan Gibson, Thomas F. Wenisch, Monica Wong-Chan, Sean Clark, Milo M. K. Martin, Moray McLaren, Prashant Chandra, Rob Cauble, Hassan M. G. Wassel, Behnam Montazeri, Simon L. Sabato, Joel Scherpelz, and Amin Vahdat. 1RMA: Re-envisioning remote memory access for multi-tenant datacenters. In Proc. ACM SIGCOMM, 2020.
- Livio Soares and Michael Stumm. FlexSC: Flexible system call scheduling with exception-less system calls. In Proc. USENIX OSDI, 2010.
- Shin-Yeh Tsai and Yiying Zhang. LITE Kernel RDMA Support for Datacenter Applications. In Proc. ACM SOSP, 2017.
- Jian Yang, Joseph Izraelevitz, and Steven Swanson. FileMR: Rethinking RDMA networking for scalable persistent memory. In Proc. USENIX NSDI, 2020.
- Kenichi Yasukata, Michio Honda, Douglas Santry, and Lars Eggert. StackMap: Low-latency networking with the OS stack and dedicated NICs. In Proc. USENIX ATC, 2016.
- Tianlong Yu, Shadi Abdollahian Noghabi, Shachar Raindel, Hongqiang Liu, Jitu Padhye, and Vyas Sekar. FreeFlow: High performance container networking. In Proc. ACM HotNets, 2016.
- Wei Zhang, Guyue Liu, Wenhui Zhang, Neel Shah, Phil Lopreiato, Gregoire Todeschi, KK Ramakrishnan, and Timothy Wood. OpenNetVM: A platform for high performance network service chains. In Proc. ACM HotMiddlebox, 2015.
- Dongfang Zhao, Mohamed Mohamed, and Heiko Ludwig. Locality-aware scheduling for containers in cloud computing. IEEE Transactions on Cloud Computing, 8(2):635--646, 2020.
- Chao Zheng, Qiuwen Lu, Jia Li, Qinyun Liu, and Binxing Fang. A flexible and efficient container-based NFV platform for middlebox networking. In Proc. ACM SAC, 2018.
- Danyang Zhuo, Kaiyuan Zhang, Yibo Zhu, Hongqiang Harry Liu, Matthew Rockett, Arvind Krishnamurthy, and Thomas Anderson. Slim: OS kernel support for a low-overhead container overlay network. In Proc. USENIX NSDI, 2019.
Index Terms
- PipeDevice: a hardware-software co-design approach to intra-host container communication