On the RTL Implementation of FINN Matrix Vector Unit

Abstract
Field-programmable gate array (FPGA)–based accelerators are becoming increasingly popular for deep neural network (DNN) inference because they can scale performance through increasing degrees of specialization, via dataflow architectures or custom data-type precision. To lower the barrier for software engineers and data scientists to adopt FPGAs, C++- and OpenCL-based design entries with high-level synthesis (HLS) have been introduced, providing a higher level of abstraction than register-transfer level (RTL)–based design. HLS offers faster development, better maintainability, and more flexibility in design exploration when evaluating options for multi-dimensional tensors, convolutional layers, or different degrees of parallelism. For these reasons, HLS has been adopted by DNN accelerator generation frameworks such as FINN and hls4ml.
In this article, we present an alternative backend library for FINN, leveraging RTL. We investigate and evaluate, across a spectrum of design dimensions, the pros and cons of an RTL-based implementation versus the original HLS variant. We show that for smaller design parameters, RTL produces significantly smaller circuits than HLS. For larger circuits, however, the look-up table (LUT) count of the RTL-based design is slightly higher, by up to around 15%. On the other hand, HLS consistently requires more flip-flops (FFs; by orders of magnitude for smaller designs) and block RAMs (BRAMs; 2× more). This also impacts the critical path delay, with RTL producing significantly faster circuits, by up to around 80%. RTL also benefits from at least a 10× reduction in synthesis time. Finally, the results were validated in practice on two real-world use cases: a multi-layer perceptron (MLP) used in network intrusion detection and a ResNet convolutional network used in image recognition. Overall, since HLS frameworks code-generate the hardware design, the benefit of an easier design entry is less important. As such, the gains in synthesis time, together with design-dependent resource savings, make the RTL abstraction an attractive alternative.
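The flexibility HLS brings to design exploration can be illustrated with a minimal C++ sketch of a matrix-vector unit templated over the matrix shape and two folding parameters. The PE/SIMD parameter names mirror FINN's terminology for processing elements and input lanes, but the code below is a simplified, hypothetical illustration under those assumptions, not the actual FINN or hls4ml implementation:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch of a matrix-vector unit (MVU) as it might be written
// for C++ HLS (not the actual FINN source).  A MatrixH x MatrixW weight
// matrix is processed by PE parallel processing elements, each consuming
// SIMD input lanes per step.  In an HLS flow, the p- and s-loops would be
// fully unrolled and the outer loops pipelined via tool pragmas.
template <int MatrixH, int MatrixW, int PE, int SIMD>
void mvu(const int8_t weights[MatrixH][MatrixW],
         const int8_t in[MatrixW],
         int32_t out[MatrixH]) {
    static_assert(MatrixH % PE == 0, "row count must be divisible by PE");
    static_assert(MatrixW % SIMD == 0, "column count must be divisible by SIMD");
    for (int r = 0; r < MatrixH; r += PE) {      // one row tile per PE group
        for (int p = 0; p < PE; ++p) {           // spatially unrolled in HLS
            int32_t acc = 0;
            for (int c = 0; c < MatrixW; c += SIMD) {
                for (int s = 0; s < SIMD; ++s)   // SIMD-wide dot-product chunk
                    acc += weights[r + p][c + s] * in[c + s];
            }
            out[r + p] = acc;
        }
    }
}
```

Changing PE or SIMD here only changes template arguments; an HLS tool would map the unrolled loops onto parallel hardware, whereas the RTL backend evaluated in the article implements the equivalent folded structure directly.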