
Demystifying the Soft and Hardened Memory Systems of Modern FPGAs for Software Programmers through Microbenchmarking

Published: 06 June 2022

Abstract

Both modern datacenter and embedded Field Programmable Gate Arrays (FPGAs) provide great opportunities for high-performance and high-energy-efficiency computing. With the growing public availability of FPGAs from major cloud service providers such as AWS, Alibaba, and Nimbix, as well as uniform hardware accelerator development tools (such as Xilinx Vitis and Intel oneAPI) for software programmers, hardware and software developers can now easily access FPGA platforms. However, it is nontrivial to develop efficient FPGA accelerators, especially for software programmers who use high-level synthesis (HLS).
The major goal of this article is to determine how to efficiently access the memory system of modern datacenter and embedded FPGAs in HLS-based accelerator designs. This is especially important for memory-bound applications; for example, a naive accelerator design may utilize less than 5% of the available off-chip memory bandwidth. To achieve our goal, we first identify a comprehensive set of factors that affect the memory bandwidth, including (1) the clock frequency of the accelerator design, (2) the number of concurrent memory access ports, (3) the data width of each port, (4) the maximum burst access length for each port, and (5) the size of consecutive data accesses. Then, we carefully design a set of HLS-based microbenchmarks to quantitatively evaluate the performance of the memory systems of datacenter FPGAs (Xilinx Alveo U200 and U280) and an embedded FPGA (Xilinx ZCU104) while varying these factors, and we provide insights into efficient memory access in HLS-based accelerator designs. Comparing the soft memory systems typically found on datacenter FPGAs with the hardened memory systems of embedded FPGAs, we further summarize their unique features and discuss effective approaches to leverage each. To demonstrate the usefulness of our insights, we also conduct two case studies that accelerate the widely used K-nearest neighbors (KNN) and sparse matrix-vector multiplication (SpMV) algorithms on datacenter FPGAs with a soft (and thus more flexible) memory system. Compared to the baseline designs, optimized designs leveraging our insights achieve about \(3.5\times\) and \(8.5\times\) speedups for the KNN and SpMV accelerators. Our final optimized KNN and SpMV designs on a Xilinx Alveo U200 FPGA fully utilize its off-chip memory bandwidth and achieve about \(5.6\times\) and \(3.4\times\) speedups over 24-core CPU implementations.
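The bandwidth factors enumerated in the abstract map directly onto a handful of HLS interface choices. As a rough illustration only (the kernel name, bundle names, and burst length below are hypothetical, not taken from the article), a Vitis-HLS-style kernel might widen each memory port to 512 bits and request long bursts like this:

```cpp
#include <cstdint>
#include <cassert>

// Hypothetical 512-bit word: 16 x 32-bit lanes, sized to match the full AXI
// data width of a DDR channel on Alveo-class cards (assumption for illustration).
struct Wide512 {
    uint32_t lane[16];
};

// Sketch of an HLS kernel that moves one wide word per cycle in long bursts.
// The pragmas use Vitis HLS syntax; a host compiler treats them as unknown
// pragmas and ignores them, so the kernel can still be tested in software.
extern "C" void copy_kernel(const Wide512* in, Wide512* out, int n_words) {
    // Separate bundles -> two concurrent memory ports; wide struct -> 512-bit
    // data width; max_*_burst_length -> longer AXI bursts per request.
#pragma HLS INTERFACE m_axi port=in  bundle=gmem0 max_read_burst_length=64
#pragma HLS INTERFACE m_axi port=out bundle=gmem1 max_write_burst_length=64
    for (int i = 0; i < n_words; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = in[i];  // consecutive accesses let HLS infer burst transfers
    }
}
```

Because the pragmas are inert outside the HLS tool flow, the same source can be compiled and functionally checked with an ordinary C++ compiler before synthesis.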


Cited By

  • (2024) "Co-Designing a 3D Transformation Accelerator for Versal-Based Image Registration." In 2024 IEEE 42nd International Conference on Computer Design (ICCD), 219–222. DOI: 10.1109/ICCD63220.2024.00041. Online publication date: 18-Nov-2024.
  • (2024) "FYalSAT: High-Throughput Stochastic Local Search K-SAT Solver on FPGA." IEEE Access 12, 65503–65512. DOI: 10.1109/ACCESS.2024.3397330. Online publication date: 2024.
  • (2023) "CHIP-KNNv2: A Configurable and High-Performance K-Nearest Neighbors Accelerator on HBM-based FPGAs." ACM Transactions on Reconfigurable Technology and Systems 16, 4, 1–26. DOI: 10.1145/3616873. Online publication date: 5-Dec-2023.

    Published In

    ACM Transactions on Reconfigurable Technology and Systems, Volume 15, Issue 4
    December 2022, 476 pages
    ISSN: 1936-7406
    EISSN: 1936-7414
    DOI: 10.1145/3540252
    Editor: Deming Chen

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 06 June 2022
    Online AM: 09 February 2022
    Accepted: 01 February 2022
    Revised: 01 December 2021
    Received: 01 September 2021
    Published in TRETS Volume 15, Issue 4


    Author Tags

    1. Datacenter FPGAs
    2. embedded FPGAs
    3. memory system
    4. HLS
    5. benchmarking

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • NSERC Discovery
    • Alliance
    • COHESA
    • CWSE PDF
    • Canada Foundation for Innovation John R. Evans Leaders Fund
    • British Columbia Knowledge Dev. Fund
    • Simon Fraser University New Faculty Start-up Grant
