An ultra-low latency and compatible PCIe interconnect for rack-scale communication

Authors:

Jie Wu

CoNEXT '22: Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies

Pages 232 - 244

https://doi.org/10.1145/3555050.3569128

Published: 30 November 2022

Abstract

Emerging network-attached resource disaggregation architectures require ultra-low-latency rack-scale communication. However, current hardware-offloading (e.g., RDMA) and user-space (e.g., mTCP) communication schemes still rely on heavily layered protocol stacks that require translation between the PCIe bus and the network protocol, or complex connection/memory resource management within RNICs, which inevitably adds latency overhead.

We argue that the PCIe Non-Transparent Bridge (NTB) is a superior high-speed in-rack network technology for interconnecting PCIe-attached machines or devices on the same PCIe fabric, since no translation is needed between PCIe and a network protocol. We present NTSocks, the first user-space in-rack interconnect over a PCIe fabric, which virtualizes native NTB into high-level network functionality for rack-scale systems through software-hardware co-design. NTSocks provides (1) compatibility through a fast socket-like abstraction, (2) multi-thread scalability using a core-driven data-plane model, and (3) fair and efficient resource sharing through a multi-tenant isolation mechanism. Even though PCIe NTB was originally designed for device communication across PCIe domains, NTSocks provides a flexible user-level indirection with performance close to bare-metal NTB while offering common network stack features. In evaluations with a latency-sensitive key-value store, NTSocks improves latency by up to 24.5× and 1.58× over kernel and RDMA sockets, respectively.
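The abstract does not give NTSocks's API or memory layout, so the following is only an illustrative sketch of the general idea behind a socket-like send/receive interface layered on a shared memory window. Everything in it is an assumption made for illustration: the `WindowRing` class, the 8-byte header layout, and the use of a plain `bytearray` standing in for an NTB-mapped window. It is not the authors' design.

```python
import struct

HEADER = 8  # two little-endian u32 offsets: head (writer side), tail (reader side)


class WindowRing:
    """Hypothetical message ring stored inside one flat buffer.

    The buffer stands in for a memory window that both peers can access;
    a real NTB window would be mapped device memory, not a bytearray.
    """

    def __init__(self, window: bytearray):
        if len(window) <= HEADER + 4:
            raise ValueError("window too small")
        self.win = window
        self.cap = len(window) - HEADER  # bytes available for message data

    def _load(self, off: int) -> int:
        return struct.unpack_from("<I", self.win, off)[0]

    def _store(self, off: int, value: int) -> None:
        struct.pack_into("<I", self.win, off, value)

    def _copy_in(self, pos: int, data: bytes) -> None:
        # Byte-wise copy with wraparound at the end of the data region.
        for i, b in enumerate(data):
            self.win[HEADER + (pos + i) % self.cap] = b

    def _copy_out(self, pos: int, n: int) -> bytes:
        return bytes(self.win[HEADER + (pos + i) % self.cap] for i in range(n))

    def send(self, payload: bytes) -> bool:
        """Append one length-prefixed message; return False if it does not fit."""
        head, tail = self._load(0), self._load(4)
        used = (head - tail) % self.cap
        need = 4 + len(payload)
        # One byte is kept free so that head == tail always means "empty".
        if need > self.cap - 1 - used:
            return False
        self._copy_in(head, struct.pack("<I", len(payload)) + payload)
        self._store(0, (head + need) % self.cap)  # publish after the data is written
        return True

    def recv(self):
        """Pop the oldest message, or return None when the ring is empty."""
        head, tail = self._load(0), self._load(4)
        if head == tail:
            return None
        (n,) = struct.unpack("<I", self._copy_out(tail, 4))
        payload = self._copy_out(tail + 4, n)
        self._store(4, (tail + 4 + n) % self.cap)
        return payload


if __name__ == "__main__":
    ring = WindowRing(bytearray(HEADER + 32))
    ring.send(b"get k1")
    print(ring.recv())
```

In this sketch the sender writes the payload first and advances `head` last, so a reader polling the header never sees a partially written message. On real hardware that ordering would also need memory barriers and attention to PCIe write ordering, which the sketch leaves out.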

Cited By

Wang Y, Gong Z, Jia D, Tan A, Liu M (2025) Dynamic sharding model and performance optimization method for consortium blockchain. The Journal of Supercomputing 81:2. Online publication date: 21-Jan-2025
https://doi.org/10.1007/s11227-024-06870-8
Tang W, Han Y, Ai T, Li G, Yu B, Yang X (2024) Yggdrasil: Reducing Network I/O Tax with (CXL-Based) Distributed Shared Memory. Proceedings of the 53rd International Conference on Parallel Processing (597-606). Online publication date: 12-Aug-2024
https://dl.acm.org/doi/10.1145/3673038.3673138
Zhang R, Huang Y, Liang S, Sun S, Ma S, Huan C, Chen L, Lu Z, Xu Y, Yan M, Wu J (2024) Revisiting Learned Index with Byte-addressable Persistent Storage. Proceedings of the 53rd International Conference on Parallel Processing (929-938). Online publication date: 12-Aug-2024
https://dl.acm.org/doi/10.1145/3673038.3673113

Index Terms

An ultra-low latency and compatible PCIe interconnect for rack-scale communication
1. Networks
  1. Network types
    1. Data center networks

Information

Published In

CoNEXT '22: Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies

November 2022

431 pages

ISBN:9781450395083

DOI:10.1145/3555050

General Chairs:
Giuseppe Bianchi (University of Rome Tor Vergata, Italy),
Alessandro Mei (Sapienza University of Rome, Italy)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGCOMM: ACM Special Interest Group on Data Communication

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Best Paper

Qualifiers

Research-article

Conference

CoNEXT '22

Sponsor:

SIGCOMM

CoNEXT '22: The 18th International Conference on emerging Networking EXperiments and Technologies

December 6 - 9, 2022

Roma, Italy

Acceptance Rates

CoNEXT '22 Paper Acceptance Rate 28 of 151 submissions, 19%;

Overall Acceptance Rate 198 of 789 submissions, 25%

Bibliometrics

Total Citations: 3
Total Downloads: 691
Downloads (Last 12 months): 173
Downloads (Last 6 weeks): 14

Reflects downloads up to 15 Feb 2025
