DOI: 10.1145/3555050.3569128
research-article

An ultra-low latency and compatible PCIe interconnect for rack-scale communication

Published: 30 November 2022

Abstract

Emerging network-attached resource disaggregation architecture requires ultra-low latency rack-scale communication. However, current hardware offloading (e.g., RDMA) and user-space (e.g., mTCP) communication schemes still rely on heavily layered protocol stacks that require translation between the PCIe bus and the network protocol, or complex connection/memory resource management within RNICs, inevitably adding latency overhead.
We argue that PCIe Non-Transparent Bridge (NTB) is a superior high-speed in-rack network technology to interconnect PCIe-attached machines or devices with the same PCIe fabric, since no translation is needed between PCIe and the network protocol. We present NTSocks, the first user-space in-rack interconnect over PCIe fabric, which virtualizes native NTB into high-level network functionalities for rack-scale systems with software-hardware co-design. NTSocks provides (1) compatibility with a fast socket-like abstraction, (2) multi-thread scalability using a core-driven data-plane model, and (3) fair and efficient resource sharing with a multi-tenant isolation mechanism. Even though PCIe NTB was originally designed for device communication across PCIe domains, NTSocks shows a flexible user-level indirection with performance close to bare-metal NTB while providing common network stack features. In evaluations with a latency-sensitive Key-Value Store, NTSocks achieves up to 24.5× and 1.58× better latency than kernel and RDMA sockets, respectively.

Cited By

  • (2025) Dynamic sharding model and performance optimization method for consortium blockchain. The Journal of Supercomputing 81:2. DOI: 10.1007/s11227-024-06870-8. Online publication date: 21-Jan-2025
  • (2024) Yggdrasil: Reducing Network I/O Tax with (CXL-Based) Distributed Shared Memory. Proceedings of the 53rd International Conference on Parallel Processing, 597-606. DOI: 10.1145/3673038.3673138. Online publication date: 12-Aug-2024
  • (2024) Revisiting Learned Index with Byte-addressable Persistent Storage. Proceedings of the 53rd International Conference on Parallel Processing, 929-938. DOI: 10.1145/3673038.3673113. Online publication date: 12-Aug-2024

    Information & Contributors

    Information

    Published In

    CoNEXT '22: Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies
    November 2022
    431 pages
    ISBN: 9781450395083
    DOI: 10.1145/3555050

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Badges

    • Best Paper

    Author Tags

    1. PCIe interconnect
    2. PCIe non-transparent bridging
    3. disaggregation
    4. high-speed networks
    5. rack-scale communication

    Qualifiers

    • Research-article

    Conference

    CoNEXT '22

    Acceptance Rates

    CoNEXT '22 Paper Acceptance Rate 28 of 151 submissions, 19%;
    Overall Acceptance Rate 198 of 789 submissions, 25%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 173
    • Downloads (Last 6 weeks): 14
    Reflects downloads up to 15 Feb 2025
