DOI: 10.1145/3555050.3569118

Research article

PipeDevice: a hardware-software co-design approach to intra-host container communication

Published: 30 November 2022

ABSTRACT

Containers are widely adopted due to their deployment and performance advantages over virtual machines. For many containerized data-intensive applications, however, bulk data transfers can pose serious performance issues. In particular, communication across co-located containers on the same host incurs large overheads from memory copies and the kernel's TCP stack. Existing solutions such as shared-memory networking and RDMA have their own limitations, including insufficient memory isolation and limited scalability.

This paper presents PipeDevice, a new system for low-overhead intra-host container communication. PipeDevice follows a hardware-software co-design approach: it offloads data forwarding entirely onto hardware, which accesses application data in hugepages on the host, thereby eliminating the CPU overhead of memory copies and TCP processing. PipeDevice preserves memory isolation and scales well to a large number of connections, making it deployable in public clouds. Isolation is achieved by allocating dedicated memory to each connection from hugepages. To achieve high scalability, PipeDevice stores connection states entirely in host DRAM and manages them in software. Evaluation with a prototype implementation on a commodity FPGA shows that, for delivering 80 Gbps across containers, PipeDevice saves 63.2% CPU compared to the kernel TCP stack and 40.5% compared to FreeFlow. PipeDevice also provides salient benefits to applications; for example, porting baidu-allreduce to PipeDevice yields ~2.2x gains in allreduce throughput.
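To make the per-connection isolation idea concrete, the sketch below shows how a dedicated hugepage-backed buffer could be reserved for each connection on Linux using the standard mmap(MAP_HUGETLB) interface. This is a minimal illustration under assumed details, not PipeDevice's actual API: the names conn_buf and conn_buf_alloc and the 2 MiB hugepage size are hypothetical, and the abstract does not specify how the hardware maps or accesses these regions.

/*
 * Minimal sketch (assumed details, not PipeDevice's actual API):
 * reserve a dedicated, hugepage-backed buffer per connection with
 * Linux mmap(MAP_HUGETLB). Offload hardware could then access such a
 * region directly; only the host-side allocation that gives each
 * connection its own isolated memory is shown here.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HUGEPAGE_SIZE (2UL * 1024 * 1024)  /* assume 2 MiB hugepages */

struct conn_buf {
    void  *base;  /* start of the connection's dedicated hugepage region */
    size_t len;   /* region size, a multiple of the hugepage size */
};

/* Allocate one hugepage-backed region owned exclusively by one connection. */
static int conn_buf_alloc(struct conn_buf *b, size_t n_hugepages)
{
    b->len  = n_hugepages * HUGEPAGE_SIZE;
    b->base = mmap(NULL, b->len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (b->base == MAP_FAILED)
        return -1;  /* e.g. no hugepages reserved via vm.nr_hugepages */
    return 0;
}

static void conn_buf_free(struct conn_buf *b)
{
    munmap(b->base, b->len);
}

int main(void)
{
    struct conn_buf buf;
    if (conn_buf_alloc(&buf, 4) != 0) {  /* 4 x 2 MiB for this connection */
        perror("mmap(MAP_HUGETLB)");
        return EXIT_FAILURE;
    }
    printf("connection buffer at %p, %zu bytes\n", buf.base, buf.len);
    conn_buf_free(&buf);
    return 0;
}

Because each connection owns its own region, a misbehaving container can only corrupt its own buffers, which matches the memory-isolation property the abstract claims; the actual system presumably also pins and registers these regions for hardware access, which this sketch omits.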

References

  1. Achieving Fast, Scalable I/O for Virtualized Servers. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/scalable-i-o-virtualized-servers-paper.pdf.
  2. Amazon Web Services. https://aws.amazon.com/.
  3. AMD Zen 4 Epyc CPU. https://www.techradar.com/news/amd-zen-4-epyc-cpu-could-be-an-epic-128-core-256-thread-monster.
  4. Baidu-allreduce. https://github.com/baidu-research/baidu-allreduce.
  5. bpftrace: High-level tracing language for Linux systems. https://bpftrace.org/.
  6. Cilium. https://github.com/cilium/cilium.
  7. Cloud-Native Network Functions. https://www.cisco.com/c/en/us/solutions/service-provider/industry/cable/cloud-native-network-functions.html.
  8. Container management in 2021: In-depth guide. https://research.aimultiple.com/container-management/.
  9. containerd: An industry-standard container runtime with an emphasis on simplicity, robustness and portability. https://containerd.io/.
  10. Deep learning containers in Google Cloud. https://cloud.google.com/deep-learning-containers.
  11. Enable Istio proxy sidecar injection in Oracle Cloud Native Environment. https://docs.oracle.com/en/learn/ocne-sidecars/index.html#introduction.
  12. F-Stack: A high performance userspace stack based on FreeBSD 11.0 stable. http://www.f-stack.org/.
  13. Fast memcpy with SPDK and Intel I/OAT DMA Engine. https://www.intel.com/content/www/us/en/developer/articles/technical/fast-memcpy-using-spdk-and-ioat-dma-engine.html.
  14. FreeFlow TCP. https://github.com/microsoft/Freeflow/tree/tcp.
  15. Gloo. https://github.com/facebookincubator/gloo.
  16. Implement mmap() for zero copy receive. https://lwn.net/Articles/752207/.
  17. Implementing TCP Sockets over RDMA. https://www.openfabrics.org/images/eventpresos/workshops2014/IBUG/presos/Thursday/PDF/09_Sockets-over-rdma.pdf.
  18. Information about the TCP chimney offload, receive side scaling, and network direct memory access features in Windows Server 2008. https://support.microsoft.com/en-us/help/951037/information-about-the-tcp-chimney-offload-receive-side-scaling-and-net.
  19. Intel Arria 10 product table. https://www.intel.co.id/content/dam/www/programmable/us/en/pdfs/literature/pt/arria-10-product-table.pdf.
  20. Intel C610 Series Chipset Datasheet. https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/x99-chipset-pch-datasheet.pdf.
  21. Intel DSA specification. https://www.intel.com/content/www/us/en/develop/articles/intel-data-streaming-accelerator-architecture-specification.html.
  22. Intel QuickData Technology Software Guide. https://www.intel.com/content/dam/doc/white-paper/quickdata-technology-software-guide-for-linux-paper.pdf.
  23. IOAT benchmark. https://github.com/spdk/spdk/tree/master/examples/ioat/perf.
  24. io_uring. https://man.archlinux.org/man/io_uring.7.en.
  25. Istio. https://istio.io/latest/about/service-mesh/.
  26. Linkerd architecture. https://linkerd.io/2.11/reference/architecture/.
  27. Mellanox BlueField-2 DPU. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/datasheet-nvidia-bluefield-2-dpu.pdf.
  28. Mellanox BlueField DPU DMA Guide. https://docs.nvidia.com/doca/sdk/dma-samples/index.html.
  29. Microsoft Azure. https://azure.microsoft.com/.
  30. NCCL. https://github.com/NVIDIA/nccl.
  31. Open MPI: Open source high performance computing. https://www.open-mpi.org/.
  32. Perftest. https://github.com/linux-rdma/perftest.
  33. Run Spark applications with Docker using Amazon EMR 6.x. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-docker.html.
  34. Seastar. http://www.seastar-project.org/.
  35. Spark and Docker: Your Spark development cycle just got 10x faster! https://towardsdatascience.com/spark-and-docker-your-spark-development-cycle-just-got-10x-faster-f41ed50c67fd.
  36. TCP mmap() program. https://lwn.net/Articles/752197/.
  37. What is container management and why is it important. https://searchitoperations.techtarget.com/definition/container-management-software.
  38. Why use Docker containers for machine learning development? https://aws.amazon.com/cn/blogs/opensource/why-use-docker-containers-for-machine-learning-development/.
  39. Zero-copy TCP receive. https://lwn.net/Articles/752188/.
  40. P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D. K. Panda. Zero copy sockets direct protocol over InfiniBand: preliminary implementation and performance analysis. In Proc. IEEE ISPASS, 2004.
  41. Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. IX: A protected dataplane operating system for high throughput and low latency. In Proc. USENIX OSDI, 2014.
  42. Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. A cloud-scale acceleration architecture. In Proc. IEEE/ACM MICRO, 2016.
  43. Youmin Chen, Youyou Lu, and Jiwu Shu. Scalable RDMA RPC on reliable connection with efficient resource sharing. In Proc. ACM EuroSys, 2019.
  44. Yuchen Cheng, Chunghsuan Wu, Yanqiang Liu, Rui Ren, Hong Xu, Bin Yang, and Zhengwei Qi. OPS: Optimized shuffle management system for Apache Spark. In Proc. ACM ICPP, 2020.
  45. Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM: Fast remote memory. In Proc. USENIX NSDI, 2014.
  46. Weibei Fan, Jing He, Zhijie Han, Peng Li, and Ruchuan Wang. Intelligent resource scheduling based on locality principle in data center networks. IEEE Communications Magazine, 58(10):94--100, 2020.
  47. Philipp Fent, Alexander van Renen, Andreas Kipf, Viktor Leis, Thomas Neumann, and Alfons Kemper. Low-latency communication for fast DBMS using RDMA and shared memory. In Proc. IEEE ICDE, 2020.
  48. Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. Azure Accelerated Networking: SmartNICs in the public cloud. In Proc. USENIX NSDI, 2018.
  49. D. Goldenberg, M. Kagan, R. Ravid, and M. S. Tsirkin. Sockets Direct Protocol over InfiniBand in clusters: is it beneficial? In Proc. IEEE HOTI, 2005.
  50. Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. MegaPipe: A new programming interface for scalable network I/O. In Proc. USENIX OSDI, 2012.
  51. Zhiqiang He, Dongyang Wang, Binzhang Fu, Kun Tan, Bei Hua, Zhi-Li Zhang, and Kai Zheng. MasQ: RDMA for Virtual Private Cloud. In Proc. ACM SIGCOMM, 2020.
  52. Michio Honda, Giuseppe Lettieri, Lars Eggert, and Douglas Santry. PASTE: A network programming interface for non-volatile main memory. In Proc. USENIX NSDI, 2018.
  53. Jinho Hwang, K. K. Ramakrishnan, and Timothy Wood. NetVM: High performance and flexible networking using virtualization on commodity platforms. In Proc. USENIX NSDI, 2014.
  54. EunYoung Jeong, Shinae Wood, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and KyoungSoo Park. mTCP: A highly scalable user-level TCP stack for multicore systems. In Proc. USENIX NSDI, 2014.
  55. Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In Proc. USENIX OSDI, 2020.
  56. Anuj Kalia, Michael Kaminsky, and David Andersen. Datacenter RPCs can be general and fast. In Proc. USENIX NSDI, 2019.
  57. Anuj Kalia, Michael Kaminsky, and David G. Andersen. Using RDMA efficiently for key-value services. In Proc. ACM SIGCOMM, 2014.
  58. Anuj Kalia, Michael Kaminsky, and David G. Andersen. Design guidelines for high performance RDMA systems. In Proc. USENIX ATC, 2016.
  59. Anuj Kalia, Michael Kaminsky, and David G. Andersen. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In Proc. USENIX OSDI, 2016.
  60. Junaid Khalid, Eric Rozner, Wesley Felter, Cong Xu, Karthick Rajamani, Alexandre Ferreira, and Aditya Akella. Iron: Isolating network-based CPU in container environments. In Proc. USENIX NSDI, 2018.
  61. Daehyeok Kim, Tianlong Yu, Hongqiang Liu, Yibo Zhu, Jitu Padhye, Shachar Raindel, Chuanxiong Guo, Vyas Sekar, and Srinivasan Seshan. FreeFlow: Software-based virtual RDMA networking for containerized clouds. In Proc. USENIX NSDI, 2019.
  62. Sameer G. Kulkarni, Wei Zhang, Jinho Hwang, Shriram Rajagopalan, K. K. Ramakrishnan, Timothy Wood, Mayutan Arumaithurai, and Xiaoming Fu. NFVnice: Dynamic backpressure and scheduling for NFV service chains. In Proc. ACM SIGCOMM, 2017.
  63. Jiaxin Lei, Manish Munikar, Kun Suo, Hui Lu, and Jia Rao. Parallelizing packet processing in container overlay networks. In Proc. ACM EuroSys, 2021.
  64. Bojie Li, Tianyi Cui, Zibo Wang, Wei Bai, and Lintao Zhang. SocksDirect: Data-center sockets can be fast and compatible. In Proc. ACM SIGCOMM, 2020.
  65. Jian Li, Shuai Xue, Wang Zhang, Ruhui Ma, Zhengwei Qi, and Haibing Guan. When I/O interrupt becomes system bottleneck: Efficiency and scalability enhancement for SR-IOV network virtualization. IEEE Transactions on Cloud Computing, 7(4):1183--1196, 2019.
  66. Xiaofeng Lin, Yu Chen, Xiaodong Li, Junjie Mao, Jiaquan He, Wei Xu, and Yuanchun Shi. Scalable kernel TCP design and implementation for short-lived connections. In Proc. ASPLOS, 2016.
  67. Glenn K. Lockwood, Mahidhar Tatineni, and Rick Wagner. SR-IOV: Performance benefits for virtualized interconnects. In Proc. ACM XSEDE, 2014.
  68. Patrick MacArthur and Robert D. Russell. An efficient method for stream semantics over RDMA. In Proc. IEEE IPDPS, 2014.
  69. Ilias Marinos, Robert N. M. Watson, and Mark Handley. Network stack specialization for performance. In Proc. ACM SIGCOMM, 2014.
  70. YoungGyoun Moon, SeungEon Lee, Muhammad Asim Jamshed, and KyoungSoo Park. AccelTCP: Accelerating network applications with stateful TCP offloading. In Proc. USENIX NSDI, 2020.
  71. Jaehyun Nam, Seungsoo Lee, Hyunmin Seo, Phil Porras, Vinod Yegneswaran, and Seungwon Shin. BASTION: A security enforcement network stack for container networks. In Proc. USENIX ATC, 2020.
  72. Rolf Neugebauer, Gianni Antichi, José Fernando Zazo, Yury Audzevich, Sergio López-Buedo, and Andrew W. Moore. Understanding PCIe performance for end host networking. In Proc. ACM SIGCOMM, 2018.
  73. Zhixiong Niu, Hong Xu, Peng Cheng, Qiang Su, Yongqiang Xiong, Tao Wang, Dongsu Han, and Keith Winstein. NetKernel: Making network stack part of the virtualized infrastructure. In Proc. USENIX ATC, 2020.
  74. Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. A generic communication scheduler for distributed DNN training acceleration. In Proc. ACM SOSP, 2019.
  75. Arjun Singhvi, Aditya Akella, Dan Gibson, Thomas F. Wenisch, Monica Wong-Chan, Sean Clark, Milo M. K. Martin, Moray McLaren, Prashant Chandra, Rob Cauble, Hassan M. G. Wassel, Behnam Montazeri, Simon L. Sabato, Joel Scherpelz, and Amin Vahdat. 1RMA: Re-envisioning remote memory access for multi-tenant datacenters. In Proc. ACM SIGCOMM, 2020.
  76. Livio Soares and Michael Stumm. FlexSC: Flexible system call scheduling with exception-less system calls. In Proc. USENIX OSDI, 2010.
  77. Shin-Yeh Tsai and Yiying Zhang. LITE kernel RDMA support for datacenter applications. In Proc. ACM SOSP, 2017.
  78. Jian Yang, Joseph Izraelevitz, and Steven Swanson. FileMR: Rethinking RDMA networking for scalable persistent memory. In Proc. USENIX NSDI, 2020.
  79. Kenichi Yasukata, Michio Honda, Douglas Santry, and Lars Eggert. StackMap: Low-latency networking with the OS stack and dedicated NICs. In Proc. USENIX ATC, 2016.
  80. Tianlong Yu, Shadi Abdollahian Noghabi, Shachar Raindel, Hongqiang Liu, Jitu Padhye, and Vyas Sekar. FreeFlow: High performance container networking. In Proc. ACM HotNets, 2016.
  81. Wei Zhang, Guyue Liu, Wenhui Zhang, Neel Shah, Phil Lopreiato, Gregoire Todeschi, K. K. Ramakrishnan, and Timothy Wood. OpenNetVM: A platform for high performance network service chains. In Proc. ACM HotMiddlebox, 2015.
  82. Dongfang Zhao, Mohamed Mohamed, and Heiko Ludwig. Locality-aware scheduling for containers in cloud computing. IEEE Transactions on Cloud Computing, 8(2):635--646, 2020.
  83. Chao Zheng, Qiuwen Lu, Jia Li, Qinyun Liu, and Binxing Fang. A flexible and efficient container-based NFV platform for middlebox networking. In Proc. ACM SAC, 2018.
  84. Danyang Zhuo, Kaiyuan Zhang, Yibo Zhu, Hongqiang Harry Liu, Matthew Rockett, Arvind Krishnamurthy, and Thomas Anderson. Slim: OS kernel support for a low-overhead container overlay network. In Proc. USENIX NSDI, 2019.

Published in

CoNEXT '22: Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies
November 2022
431 pages
ISBN: 9781450395083
DOI: 10.1145/3555050

Copyright © 2022 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

CoNEXT '22 paper acceptance rate: 28 of 151 submissions (19%). Overall acceptance rate: 198 of 789 submissions (25%).
