DONGLE 2.0: Direct FPGA-Orchestrated NVMe Storage for HLS

Published: 17 September 2024

Abstract

Rapid growth in data size poses significant computational and memory challenges to data processing. FPGA accelerators and near-storage processing have emerged as compelling solutions for tackling the growing computational and memory requirements. Many FPGA-based accelerators have been shown to be effective in processing large data sets by leveraging the storage capability of either host-attached or FPGA-attached storage devices. However, the current HLS development environment does not allow direct access to host- or FPGA-attached NVMe storage from the HLS code. As such, users must frequently hand off between HLS and host code to access data in storage, and such a process requires tedious programming to ensure functional correctness. Moreover, since the HLS code uses radically different methods to access storage compared to DRAM, an HLS codebase targeting DRAM-based platforms cannot be easily ported to NVMe-based platforms, resulting in limited code portability and reusability. Furthermore, frequent suspension of the HLS kernel and synchronization between the CPU and FPGA introduce significant latency overhead and require sophisticated scheduling mechanisms to hide that latency.
To address these challenges, we propose a new HLS storage interface named DONGLE 2.0 that enables direct FPGA-orchestrated NVMe storage access. By providing a unified interface for storage and memory access, DONGLE 2.0 allows a single-source HLS program to target multiple memory/storage devices, thus making the codebase cleaner, more portable, and more efficient. DONGLE 2.0 extends DONGLE 1.0 [1] by adding support for host-attached storage. While its primary focus is still on FPGA NVMe access in near-storage configurations, the added host storage support ensures its compatibility with platforms that lack native support for FPGA-attached NVMe storage. We implemented a prototype of DONGLE 2.0 using an AMD/Xilinx Alveo U200 FPGA and a Solidigm DC-P4610 SSD. Our evaluation on various workloads showed a geometric mean speed-up of 2.3× and a 2.4× reduction in lines of code (LoC) compared to the state-of-the-art commercial platform when using FPGA-attached NVMe storage. Moreover, DONGLE 2.0 demonstrated a geometric mean speed-up of 1.5× and a 2.4× reduction in LoC compared to the state-of-the-art commercial platform when using host-attached NVMe storage.

References

[1]
Linus Y. Wong, Jialiang Zhang, and Jing (Jane) Li. 2023. DONGLE: Direct FPGA-orchestrated NVMe storage for HLS. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. Association for Computing Machinery, New York, NY, USA, 3–13. DOI:
[2]
Jialiang Zhang and Jing Li. 2018. Degree-aware hybrid graph traversal on FPGA-HMC platform. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Association for Computing Machinery, New York, NY, USA, 229–238. DOI:
[3]
Xinyu Chen, Hongshi Tan, Yao Chen, Bingsheng He, Weng-Fai Wong, and Deming Chen. 2021. ThunderGP: HLS-based graph processing framework on FPGAs. In Proceedings of the 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Association for Computing Machinery, New York, NY, USA, 69–80. DOI:
[4]
Yuwei Hu, Yixiao Du, Ecenur Ustun, and Zhiru Zhang. 2021. GraphLily: Accelerating graph linear algebra on HBM-equipped FPGAs. In Proceedings of the 2021 IEEE/ACM International Conference On Computer Aided Design. 1–9. DOI:
[5]
Jiajie Li, Yuze Chi, and Jason Cong. 2020. HeteroHalide: From image processing DSL to efficient FPGA acceleration. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Association for Computing Machinery, New York, NY, USA, 51–57. DOI:
[6]
Davide Conficconi, Eleonora D’Arnese, Emanuele Del Sozzo, Donatella Sciuto, and Marco D. Santambrogio. 2021. A framework for customizable FPGA-based image registration accelerators. In Proceedings of the 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Association for Computing Machinery, New York, NY, USA, 251–261. DOI:
[7]
Jialiang Zhang and Jing Li. 2017. Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Association for Computing Machinery, New York, NY, USA, 25–34. DOI:
[8]
Erwei Wang, James J. Davis, Georgios-Ilias Stavrou, Peter Y. K. Cheung, George A. Constantinides, and Mohamed Abdelfattah. 2022. Logic shrinkage: Learned FPGA netlist sparsity for efficient neural network inference. In Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Association for Computing Machinery, New York, NY, USA, 101–111. DOI:
[9]
Sang-Woo Jun, Ming Liu, Kermin Elliott Fleming, and Arvind. 2014. Scalable multi-access flash store for big data analytics. In Proceedings of the 2014 ACM/SIGDA International Symposium on Field-programmable Gate Arrays. Association for Computing Machinery, New York, NY, USA, 55–64. DOI:
[10]
Sang-Woo Jun, Shuotao Xu, and Arvind. 2017. Terabyte sort on FPGA-accelerated flash storage. In Proceedings of the 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines. 17–24. DOI:
[11]
Zhenyuan Ruan, Tong He, and Jason Cong. 2019. INSIDER: Designing in-storage computing system for emerging high-performance drive. In Proceedings of the 2019 USENIX Annual Technical Conference. USENIX Association, Renton, WA, 379–394. Retrieved from https://www.usenix.org/conference/atc19/presentation/ruan
[12]
Nikola Samardzic, Weikang Qiao, Vaibhav Aggarwal, Mau-Chung Frank Chang, and Jason Cong. 2020. Bonsai: High-performance adaptive merge tree sorting. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture. 282–294. DOI:
[13]
Joo Hwan Lee, Hui Zhang, Veronica Lagrange, Praveen Krishnamoorthy, Xiaodong Zhao, and Yang Seok Ki. 2020. SmartSSD: FPGA accelerated near-storage data analytics on SSD. IEEE Computer Architecture Letters 19, 2 (2020), 110–113. DOI:
[14]
Sahand Salamat, Armin Haj Aboutalebi, Behnam Khaleghi, Joo Hwan Lee, Yang Seok Ki, and Tajana Rosing. 2021. NASCENT: Near-storage acceleration of database sort on SmartSSD. In Proceedings of the 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Association for Computing Machinery, New York, NY, USA, 262–272. DOI:
[15]
Weikang Qiao, Jihun Oh, Licheng Guo, Mau-Chung Frank Chang, and Jason Cong. 2021. FANS: FPGA-accelerated near-storage sorting. In Proceedings of the 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines. 106–114. DOI:
[16]
Mohammadreza Soltaniyeh, Veronica Lagrange Moutinho Dos Reis, Matthew Bryson, Richard Martin, and Santosh Nagarakatte. 2021. Near-storage acceleration of database query processing with SmartSSDs. In Proceedings of the 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines. 265–265. DOI:
[17]
Yongjoo Jang, Sejin Kim, Daehoon Kim, Sungjin Lee, and Jaeha Kung. 2021. Deep partitioned training from near-storage computing to DNN accelerators. IEEE Computer Architecture Letters 20, 1 (2021), 70–73. DOI:
[18]
AMD/Xilinx. 2021. Vitis High-Level Synthesis User Guide (UG1399). (2021). Retrieved from https://docs.xilinx.com/r/2021.2-English/ug1399-vitis-hls/Getting-Started-with-Vitis-HLS
[19]
ARM. 2021. AMBA AXI and ACE Protocol Specification. (2021). Retrieved from https://developer.arm.com/documentation/ihi0022/hc/
[20]
Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. 2016. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. In Proceedings of the 2016 IEEE/ACM International Conference on Computer-Aided Design. 1–8. DOI:
[21]
NVM Express Workgroup. 2013. NVM Express 1.0e. (2013). Retrieved from https://nvmexpress.org/developers/nvme-specification/
[22]
Solidigm. 2018. DC P4610 Series 1.6TB, 2.5in PCIe 3.1 x4, 3D2, TLC. (2018). Retrieved from https://ark.intel.com/content/www/us/en/ark/products/140103/intel-ssd-dc-p4610-series-1-6tb-2-5in-pcie-3-1-x4-3d2-tlc.html
[23]
AMD/Xilinx. 2021. Xilinx Runtime (XRT) Release Notes (UG1451). (2021). Retrieved from https://docs.xilinx.com/r/2021.2-English/ug1451-xrt-release-notes
[24]
AMD/Xilinx. 2021. Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393). (2021). Retrieved from https://docs.xilinx.com/r/2021.2-English/ug1393-vitis-application-acceleration
[25]
AMD/Xilinx. 2020. Vivado Design Suite User Guide: Getting Started (UG910). (2020). Retrieved from https://docs.xilinx.com/r/2020.2-English/ug910-vivado-getting-started
[26]
AMD/Xilinx. 2021. Vitis Accelerated Libraries. (2021). Retrieved from https://github.com/Xilinx/Vitis_Libraries/tree/2021.2
[27]
Yuan Zhou, Udit Gupta, Steve Dai, Ritchie Zhao, Nitish Srivastava, Hanchen Jin, Joseph Featherston, Yi-Hsiang Lai, Gai Liu, Gustavo Angarita Velasquez, Wenping Wang, and Zhiru Zhang. 2018. Rosetta: A realistic high-level synthesis benchmark suite for software-programmable FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (February 2018).
[28]
Nicholas Beckwith, Jialiang Zhang, and Jing Jane Li. 2022. Augmenting HLS with zero-overhead application-specific address mapping for optane DCPMM. In Proceedings of the 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines. 1–9. DOI:
[29]
AMD/Xilinx. 2019. NVMe Host Accelerator v1.0 (PB058). (2019). Retrieved from https://docs.xilinx.com/v/u/en-US/pb058-nvme-host-accelerator
[30]
AMD/Xilinx. 2022. DMA/Bridge Subsystem for PCI Express v4.1 Product Guide (PG195). (2022). Retrieved from https://docs.xilinx.com/r/en-US/pg195-pcie-dma
[33]
Samsung. 2023. MZPLJ12THALA-00007(12.8TB) | SSD | Samsung Semiconductor USA. (2023). Retrieved from https://semiconductor.samsung.com/us/ssd/enterprise-ssd/pm1733-pm1735/mzplj12thala-00007/
[34]
Solidigm. 2023. Solidigm DC-P4610 mid-endurance PCIe 3.1 NVMe SSDs with balanced read/write throughput for mixed-workload applications | Solidigm D7 SSDs for data centers. (2023). Retrieved from https://www.solidigm.com/products/data-center/d7/p4610.html
[35]
Solidigm. 2023. D7-P5620 Mid-Endurance PCIe 4.0 NVMe SSD for data centers | Solidigm D7 SSD. (2023). Retrieved from https://www.solidigm.com/products/data-center/d7/p5620.html
[36]
Samsung. 2023. M393ABG40M5B-CYF(DDR4) | DRAM | Samsung Semiconductor Global. (2023). Retrieved from https://semiconductor.samsung.com/dram/module/rdimm/m393abg40m5b-cyf/
[37]
Samsung. 2023. M321RBGA0B40-CWK(DDR5) | DRAM | Samsung Semiconductor Global. (2023). Retrieved from https://semiconductor.samsung.com/dram/module/rdimm/m321rbga0b40-cwk/
[38]
CDW. 2023. CDW. Retrieved from https://www.cdw.com. (2023).
[39]
Alphabet. 2022. 2022 Alphabet Annual Report. Retrieved from https://abc.xyz/assets/d4/4f/a48b94d548d0b2fdc029a95e8c63/2022-alphabet-annual-report.pdf. (2022).
[40]
Meta. 2023. Q4 2022 Earnings. Retrieved from https://investor.fb.com/investor-events/event-details/2023/Q4-2022-Earnings/default.aspx. (2023).
[41]
Microsoft. 2022. Earnings Release FY22 Q4. Retrieved from https://www.microsoft.com/en-us/Investor/earnings/FY-2022-Q4/performance. (2022).
[42]
AMD/Xilinx. 2022. Alveo U200 and U250 Data Center Accelerator Cards Data Sheet (DS962). (2022). Retrieved from https://docs.xilinx.com/r/en-US/ds962-u200-u250
[43]
AMD/Xilinx. 2023. DDR4 Controller. Retrieved from https://www.xilinx.com/products/intellectual-property/ddr4.html. (2023).
[44]
AMD/Xilinx. 2023. UltraScale Architecture and Product Data Sheet: Overview. Retrieved from https://docs.xilinx.com/v/u/en-US/ds890-ultrascale-overview. (2023).
[45]
Dell. 2023. Dell Precision 7920 Rack Owner’s Manual. (2023). Retrieved from https://dl.dell.com/content/manual19462141-dell-precision-7920-rack-owner-s-manual.pdf
[46]
Intel. 2023. Intel Xeon Platinum 8490H Processor. Retrieved from https://www.intel.com/content/www/us/en/products/sku/231747/intel-xeon-platinum-8490h-processor-112-5m-cache-1-90-ghz/specifications.html. (2023).
[47]
Intel. 2020. Intel Optane Persistent Memory Start Up Guide. Retrieved from https://www.intel.com/content/www/us/en/support/articles/000055382/memory-and-storage/intel-optane-persistent-memory.html. (2020).
[48]
Broadcom. 2012. PEX8734, PCI Express Gen3 Switch, 32 Lanes, 8 Ports. (2012). Retrieved from https://docs.broadcom.com/doc/12351853
[49]
Piotr Luszczek, Jack J Dongarra, David Koester, Rolf Rabenseifner, Bob Lucas, Jeremy Kepner, John McCalpin, David Bailey, and Daisuke Takahashi. 2005. Introduction to the HPC Challenge Benchmark Suite. Technical Report. Lawrence Berkeley National Lab. Berkeley, CA (United States).
[50]
AMD/Xilinx. 2021. Vitis Accel Examples’ Repository. (2021). Retrieved from https://github.com/Xilinx/Vitis_Accel_Examples/tree/2021.2
[51]
Hayden Kwok-Hay So and Robert Brodersen. 2008. File system access from reconfigurable FPGA hardware processes in BORPH. In Proceedings of the 2008 International Conference on Field Programmable Logic and Applications. 567–570. DOI:
[52]
Sang-Woo Jun, Ming Liu, Sungjin Lee, Jamey Hicks, John Ankcorn, Myron King, Shuotao Xu, and Arvind. 2015. BlueDBM: An appliance for big data analytics. In Proceedings of the 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture. 1–13. DOI:
[53]
Yu Zou and Mingjie Lin. 2021. FERMAT: FPGA-accelerated heterogeneous computing platform near NVMe storage. In Proceedings of the 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines. 262–262. DOI:
[54]
Juan Camilo Vega, Qianfeng Clark Shen, and Paul Chow. 2020. SHIP: Storage for hybrid interconnected processors. In Proceedings of the 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines. 211–211. DOI:
[55]
Athanasios Stratikopoulos, Christos Kotselidis, John Goodacre, and Mikel Luján. 2018. FastPath: Towards wire-speed NVMe SSDs. In Proceedings of the 2018 28th International Conference on Field Programmable Logic and Applications. 170–1707. DOI:
[56]
Athanasios Stratikopoulos, Christos Kotselidis, John Goodacre, and Mikel Luján. 2020. FastPath_MP: Low overhead & energy-efficient FPGA-based storage multi-paths. ACM Transactions on Architecture and Code Optimization 17, 4 (November 2020), 37:1–37:23. DOI:
[57]
Shuotao Xu, Sungjin Lee, Sang-Woo Jun, Ming Liu, Jamey Hicks, and Arvind. 2016. Bluecache: A scalable distributed flash-based key-value store. In Proceedings of the VLDB Endowment 10, 4 (November 2016), 301–312. DOI:
[58]
Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu, and Arvind. 2018. GraFBoost: Using accelerated flash storage for external graph analytics. In Proceedings of the 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture. 411–424. DOI:
[59]
Sahand Salamat, Hui Zhang, Yang Seok Ki, and Tajana Rosing. 2022. NASCENT2: Generic near-storage sort accelerator for data analytics on SmartSSD. ACM Transactions on Reconfigurable Technology and Systems 15, 2, Article 16 (January 2022), 29 pages. DOI:
[60]
Ji-Hoon Kim, Yeo-Reum Park, Jaeyoung Do, Soo-Young Ji, and Joo-Young Kim. 2023. Accelerating large-scale graph-based nearest neighbor search on a computational storage platform. IEEE Transactions on Computers 72, 1 (January 2023), 278–290. DOI:
[61]
Hayden Kwok-Hay So and Robert Brodersen. 2008. A unified hardware/software runtime environment for FPGA-based reconfigurable computers using BORPH. ACM Transactions on Embedded Computing Systems 7, 2 (January 2008), 14:1–14:28. DOI:

Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 17, Issue 3
September 2024
434 pages
EISSN:1936-7414
DOI:10.1145/3613592
  • Editor: Deming Chen

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 September 2024
Online AM: 05 March 2024
Accepted: 11 February 2024
Revised: 22 December 2023
Received: 12 September 2023
Published in TRETS Volume 17, Issue 3

Author Tags

  1. Near-Storage Computing
  2. SmartSSD
  3. FPGA
  4. HLS

Qualifiers

  • Research-article
