DOI: 10.1145/3373376.3378528

Lynx: A SmartNIC-driven Accelerator-centric Architecture for Network Servers

Published: 13 March 2020

Abstract

This paper explores new opportunities afforded by the growing deployment of compute and I/O accelerators to improve the performance and efficiency of hardware-accelerated computing services in data centers.
We propose Lynx, an accelerator-centric network server architecture that offloads the server data and control planes to the SmartNIC, and enables direct networking from accelerators via a lightweight hardware-friendly I/O mechanism. Lynx enables the design of hardware-accelerated network servers that run without CPU involvement, freeing CPU cores and improving performance isolation for accelerated services. It is portable across accelerator architectures and allows the management of both local and remote accelerators, seamlessly scaling beyond a single physical machine.
We implement and evaluate Lynx on GPUs and the Intel Visual Compute Accelerator, as well as on two SmartNIC architectures: one with an FPGA, and another with an 8-core ARM processor. Compared to a traditional host-centric approach, Lynx achieves over 4x higher throughput for a GPU-centric face verification server, where it is used for GPU communication with an external database, and 25% higher throughput for a GPU-accelerated neural network inference service. For the latter workload, we show that a single SmartNIC may drive 4 local and 8 remote GPUs while achieving linear performance scaling without using the host CPU.



Published In

ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2020
1412 pages
ISBN:9781450371025
DOI:10.1145/3373376


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. hardware accelerators
  2. i/o services for accelerators
  3. operating systems
  4. server architecture
  5. smartnics

Qualifiers

  • Research-article

Funding Sources

  • Israel Science Foundation

Conference

ASPLOS '20

Acceptance Rates

Overall acceptance rate: 535 of 2,713 submissions (20%)
