DOI: 10.1145/3579371.3589077
research-article

SmartDS: Middle-Tier-centric SmartNIC Enabling Application-aware Message Split for Disaggregated Block Storage

Published: 17 June 2023

ABSTRACT

The widespread deployment of storage disaggregation in the cloud has facilitated flexible scaling and storage overprovisioning, allowing for high utilization of storage capacity and IOPS. Instead of utilizing remote storage protocols to access remote disks, a middle-tier is introduced between compute servers and storage servers in order to serve I/O requests from compute servers and provide computations such as compression and decompression. However, due to the need for a cloud to concurrently serve millions of VMs that require access to disaggregated storage, the middle-tier requires a massive number of servers to process network traffic between computing and storage nodes. For example, a major cloud company may deploy hundreds of thousands of high-end servers to provide such a service for its cloud storage, because the existing CPU-based middle-tier suffers from a severe issue of compute-intensive compression/decompression on high-throughput storage traffic. To address this issue, we introduce SmartDS, a middle-tier-centric SmartNIC that serves storage I/O requests with low latency and high throughput, while maintaining high flexibility and programmability. The key idea behind SmartDS is the application-aware message split (AAMS) mechanism, which allows for the processing of the message's header on the host CPU to achieve high flexibility, and the message's payload on the SmartDS. Experimental results demonstrate that SmartDS provides up to 4.3× more throughput than a CPU-based middle-tier and enables the linear scale-up of multiple network ports and multiple SmartNICs, thus significantly reducing cloud infrastructure costs for disaggregated block storage.
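The split described in the abstract can be illustrated with a minimal software sketch. This is not the SmartDS implementation: the header layout (`HEADER_FMT`), the `FLAG_COMPRESS` bit, and every function name below are hypothetical, and `zlib` stands in for whatever compression engine the hardware provides. The sketch only shows the division of labor: the host-side function sees the header alone and makes the decision, while the device-side function transforms the payload without the host touching it.

```python
import struct
import zlib

# Hypothetical header layout (an assumption, not from the paper):
# 1-byte op code, 1-byte flags, 4-byte payload length, network byte order.
HEADER_FMT = "!BBI"
HEADER_LEN = struct.calcsize(HEADER_FMT)  # 6 bytes
OP_WRITE = 1
FLAG_COMPRESS = 0x01


def split_message(msg: bytes):
    """Split a message into (header, payload) at the fixed header boundary."""
    return msg[:HEADER_LEN], msg[HEADER_LEN:]


def host_process_header(header: bytes) -> dict:
    """Host-CPU side: parse the header and decide how the payload is handled."""
    op, flags, length = struct.unpack(HEADER_FMT, header)
    return {"op": op, "compress": bool(flags & FLAG_COMPRESS), "length": length}


def device_process_payload(payload: bytes, decision: dict) -> bytes:
    """Device side stand-in: apply the host's decision to the payload."""
    if len(payload) != decision["length"]:
        raise ValueError("payload length does not match header")
    # zlib is only a placeholder for a hardware compression engine.
    return zlib.compress(payload) if decision["compress"] else payload


def handle(msg: bytes):
    """Route the header to the host logic and the payload to the device logic."""
    header, payload = split_message(msg)
    decision = host_process_header(header)
    return decision, device_process_payload(payload, decision)


if __name__ == "__main__":
    data = b"block storage payload " * 64
    msg = struct.pack(HEADER_FMT, OP_WRITE, FLAG_COMPRESS, len(data)) + data
    decision, out = handle(msg)
    print(decision, len(data), len(out))
```

Keeping the decision logic in `host_process_header` mirrors the flexibility argument in the abstract: the policy can change in host software while the payload path stays fixed.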

      • Published in

        ISCA '23: Proceedings of the 50th Annual International Symposium on Computer Architecture
        June 2023
        1225 pages
        ISBN:9798400700958
        DOI:10.1145/3579371

        Copyright © 2023 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 543 of 3,203 submissions, 17%
