
On the RTL Implementation of FINN Matrix Vector Unit

Published: 9 November 2023

Abstract

Field-programmable gate array (FPGA)–based accelerators are becoming increasingly popular for deep neural network (DNN) inference due to their ability to scale performance with increasing degrees of specialization through dataflow architectures or custom data-type precision. To reduce the barrier for software engineers and data scientists to adopt FPGAs, C++- and OpenCL-based design entries with high-level synthesis (HLS) have been introduced. They provide higher abstraction than register-transfer level (RTL)–based design. HLS offers faster development time, better maintainability, and more flexibility in code exploration when evaluating several options for multi-dimensional tensors, convolutional layers, or different degrees of parallelism. For this reason, HLS has been adopted by DNN accelerator generation frameworks such as FINN and hls4ml.

In this article, we present an alternative backend library for FINN, leveraging RTL. We investigate and evaluate, across a spectrum of design dimensions, the pros and cons of an RTL-based implementation versus the original HLS variant. We show that for smaller design parameters, RTL produces significantly smaller circuits than HLS. For larger circuits, however, the look-up table (LUT) count of the RTL-based design is slightly higher, by up to around 15%. On the other hand, HLS consistently requires more flip-flops (FFs; with an orders-of-magnitude difference for smaller designs) and block RAMs (BRAMs; 2× more). This also impacts the critical path delay, with RTL producing significantly faster circuits, by up to around 80%. RTL also benefits from at least a 10× reduction in synthesis time. Finally, the results were validated in practice using two real-world use cases: a multi-layer perceptron (MLP) used in network intrusion detection and a convolutional network, ResNet, used in image recognition. Overall, since HLS frameworks generate the hardware design automatically, the benefit of easier design entry is less important. As such, the gains in synthesis time, together with design-dependent resource savings, make the RTL abstraction an attractive alternative.
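For context, the matrix-vector unit discussed here computes the core operation of a fully connected (or im2col-lowered convolutional) layer, y = W·x, folded over a grid of processing elements (PE, row parallelism) and SIMD lanes (column parallelism), which are FINN's standard folding parameters. The following is a minimal illustrative sketch, not the actual FINN or RTL implementation; the function name and tiling loop structure are assumptions chosen for clarity.

```python
# Illustrative sketch of a folded matrix-vector product (y = W @ x),
# tiled over PE rows and SIMD columns as in FINN's folding scheme.
# This is a behavioral model only, not the actual FINN/RTL code.

def folded_mvu(W, x, pe, simd):
    """Accumulate y = W @ x, processing a pe-by-simd tile per 'cycle'."""
    rows, cols = len(W), len(x)
    assert rows % pe == 0 and cols % simd == 0, "shapes must fold evenly"
    y = [0] * rows
    for rt in range(rows // pe):            # row tiles ("neuron folding")
        for ct in range(cols // simd):      # column tiles ("synapse folding")
            # One tile: pe parallel dot-product fragments of width simd.
            for p in range(pe):
                r = rt * pe + p
                acc = 0
                for s in range(simd):
                    c = ct * simd + s
                    acc += W[r][c] * x[c]
                y[r] += acc                 # running accumulator per output
    return y

# Small quantized example: 2x4 weight matrix, 4-element input.
W = [[1, -1, 2, 0],
     [0,  3, 1, -2]]
x = [1, 2, 3, 4]
print(folded_mvu(W, x, pe=2, simd=2))  # -> [5, 1], same as a direct W @ x
```

Whatever the folding (pe, simd), the result matches the unfolded product; folding only trades hardware parallelism against the number of cycles, which is the design-space dimension the HLS/RTL comparison above sweeps over.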



• Published in

  ACM Transactions on Embedded Computing Systems, Volume 22, Issue 6 (November 2023), 428 pages
  ISSN: 1539-9087
  EISSN: 1558-3465
  DOI: 10.1145/3632298
  Editor: Tulika Mitra


  Publisher: Association for Computing Machinery, New York, NY, United States

  Publication History

  • Published: 9 November 2023
  • Online AM: 14 July 2022
  • Accepted: 2 July 2022
  • Revised: 4 May 2022
  • Received: 29 December 2021
