
Demystifying the Soft and Hardened Memory Systems of Modern FPGAs for Software Programmers through Microbenchmarking

Published: 06 June 2022

Abstract

Both modern datacenter and embedded Field Programmable Gate Arrays (FPGAs) provide great opportunities for high-performance and high-energy-efficiency computing. With the growing public availability of FPGAs from major cloud service providers such as AWS, Alibaba, and Nimbix, as well as uniform hardware accelerator development tools (such as Xilinx Vitis and Intel oneAPI) for software programmers, hardware and software developers can now easily access FPGA platforms. However, it is nontrivial to develop efficient FPGA accelerators, especially for software programmers who use high-level synthesis (HLS).
The major goal of this article is to determine how to efficiently access the memory system of modern datacenter and embedded FPGAs in HLS-based accelerator designs. This is especially important for memory-bound applications; for example, a naive accelerator design may utilize less than 5% of the available off-chip memory bandwidth. To achieve our goal, we first identify a comprehensive set of factors that affect the memory bandwidth, including (1) the clock frequency of the accelerator design, (2) the number of concurrent memory access ports, (3) the data width of each port, (4) the maximum burst access length for each port, and (5) the size of consecutive data accesses. Then, we carefully design a set of HLS-based microbenchmarks to quantitatively evaluate the performance of the memory systems of datacenter FPGAs (Xilinx Alveo U200 and U280) and an embedded FPGA (Xilinx ZCU104) while varying these factors, and we provide insights into efficient memory access in HLS-based accelerator designs. Comparing the soft memory systems typically found on datacenter FPGAs with the hardened memory systems of embedded FPGAs, we further summarize their unique features and discuss effective approaches to leverage each. To demonstrate the usefulness of our insights, we also conduct two case studies that accelerate the widely used K-nearest neighbors (KNN) and sparse matrix-vector multiplication (SpMV) algorithms on datacenter FPGAs with a soft (and thus more flexible) memory system. Compared to the baseline designs, optimized designs leveraging our insights achieve about \(3.5\times\) and \(8.5\times\) speedups for the KNN and SpMV accelerators. Our final optimized KNN and SpMV designs on a Xilinx Alveo U200 FPGA fully utilize its off-chip memory bandwidth and achieve about \(5.6\times\) and \(3.4\times\) speedups over 24-core CPU implementations.
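The bandwidth factors enumerated in the abstract map directly onto a handful of HLS interface choices. As a rough illustration only (the kernel name, bundle names, and burst length below are hypothetical, not taken from the article), a Vitis-HLS-style kernel might widen each memory port to 512 bits and request long bursts like this:

```cpp
#include <cstdint>
#include <cassert>

// Hypothetical 512-bit word: 16 x 32-bit lanes, sized to match the full AXI
// data width of a DDR channel on Alveo-class cards (assumption for illustration).
struct Wide512 {
    uint32_t lane[16];
};

// Sketch of an HLS kernel that moves one wide word per cycle in long bursts.
// The pragmas use Vitis HLS syntax; a host compiler treats them as unknown
// pragmas and ignores them, so the kernel can still be tested in software.
extern "C" void copy_kernel(const Wide512* in, Wide512* out, int n_words) {
    // Separate bundles -> two concurrent memory ports; wide struct -> 512-bit
    // data width; max_*_burst_length -> longer AXI bursts per request.
#pragma HLS INTERFACE m_axi port=in  bundle=gmem0 max_read_burst_length=64
#pragma HLS INTERFACE m_axi port=out bundle=gmem1 max_write_burst_length=64
    for (int i = 0; i < n_words; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = in[i];  // consecutive accesses let HLS infer burst transfers
    }
}
```

Because the pragmas are inert outside the HLS tool flow, the same source can be compiled and functionally checked with an ordinary C++ compiler before synthesis.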


Cited By

  • (2024) "Co-Designing a 3D Transformation Accelerator for Versal-Based Image Registration." In 2024 IEEE 42nd International Conference on Computer Design (ICCD), 219–222. DOI: 10.1109/ICCD63220.2024.00041. Online publication date: 18-Nov-2024.
  • (2024) "FYalSAT: High-Throughput Stochastic Local Search K-SAT Solver on FPGA." IEEE Access 12, 65503–65512. DOI: 10.1109/ACCESS.2024.3397330. Online publication date: 2024.
  • (2023) "CHIP-KNNv2: A Configurable and High-Performance K-Nearest Neighbors Accelerator on HBM-based FPGAs." ACM Transactions on Reconfigurable Technology and Systems 16, 4, 1–26. DOI: 10.1145/3616873. Online publication date: 5-Dec-2023.

    Published In

    ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 4
    December 2022, 476 pages
    ISSN: 1936-7406
    EISSN: 1936-7414
    DOI: 10.1145/3540252
    Editor: Deming Chen

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 06 June 2022
    Online AM: 09 February 2022
    Accepted: 01 February 2022
    Revised: 01 December 2021
    Received: 01 September 2021
    Published in TRETS Volume 15, Issue 4


    Author Tags

    1. Datacenter FPGAs
    2. embedded FPGAs
    3. memory system
    4. HLS
    5. benchmarking

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • NSERC Discovery
    • Alliance
    • COHESA
    • CWSE PDF
    • Canada Foundation for Innovation John R. Evans Leaders Fund
    • British Columbia Knowledge Dev. Fund
    • Simon Fraser University New Faculty Start-up Grant
