
FPGA Logic Block Architectures for Efficient Deep Learning Inference

Published: 03 June 2020

Abstract

Reducing the precision of deep neural network (DNN) inference accelerators can yield large efficiency gains with little or no accuracy degradation compared to half- or single-precision floating point, by enabling more multiplication operations per unit area. A wide range of precisions fall on the Pareto-optimal curve of hardware efficiency vs. accuracy, with no single precision dominating, which makes the variable-precision capabilities of FPGAs very valuable. We propose three types of logic block architectural enhancements and fully evaluate a total of six architectures that improve the area efficiency of multiplications and additions implemented in the soft fabric. Increasing the LUT fracturability and adding two adders to the ALM (the 4-bit Adder Double Chain architecture) leads to a 1.5× area reduction for arithmetic-heavy machine learning (ML) kernels, while increasing their speed. In addition, this architecture also reduces the logic area of general applications by 6%, while increasing the critical path delay by only 1%. However, our highest-impact option, which adds a 9-bit shadow multiplier to the logic clusters, reduces the area and critical path delay of ML kernels by 2.4× and 1.2×, respectively. These large gains come at the cost of a 15% logic area increase for general applications.
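The low-precision multiply-accumulate (MAC) pattern that these logic block enhancements target can be sketched in plain NumPy. This is an illustrative assumption, not the paper's method: the quantization helper and the 4-bit-weight by 8-bit-activation pairing are hypothetical choices to show why narrow integer MACs are cheap. Each product fits in a few bits, so the multiplier and the running-sum adder can both be small, which is exactly what soft-fabric multipliers and adder chains exploit relative to floating point.

```python
import numpy as np

def quantize_symmetric(x, bits):
    """Symmetric uniform quantization: map floats to signed integers
    in [-(2**(bits-1)-1), 2**(bits-1)-1]; return codes and the scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)  # weights
a = rng.standard_normal(64).astype(np.float32)  # activations

# 4-bit weights x 8-bit activations: each product fits in 12 bits,
# so the whole dot product runs in narrow integer arithmetic.
qw, sw = quantize_symmetric(w, 4)
qa, sa = quantize_symmetric(a, 8)
int_dot = int(np.dot(qw, qa))   # pure integer multiply-accumulate
approx = int_dot * sw * sa      # one rescale back to real units
exact = float(np.dot(w, a))
print(f"exact={exact:.4f}  int4xint8 approx={approx:.4f}")
```

Because the scales factor out of the summation, only a single floating-point rescale is needed per dot product; everything inside the loop is integer work that maps onto LUTs, carry chains, or a small shadow multiplier.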


Cited By

  • A Stacked FPGA Utilizing 3D-SRAM with Latency Optimization. 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 400-406, Dec. 2024. DOI: 10.1109/MCSoC64144.2024.00072
  • Approximate Row-Merging-Based Multipliers for Neural Network Acceleration on FPGAs. IEEE Embedded Systems Letters 16, 2 (Jun. 2024), 126-129. DOI: 10.1109/LES.2023.3304678
  • Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs. 2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 54-65, May 2024. DOI: 10.1109/FCCM60383.2024.00015


Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 13, Issue 3 (September 2020), 182 pages.
ISSN: 1936-7406. EISSN: 1936-7414. DOI: 10.1145/3404107. Editor: Deming Chen.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 03 June 2020
      Online AM: 07 May 2020
      Accepted: 01 April 2020
      Revised: 01 March 2020
      Received: 01 October 2019
      Published in TRETS Volume 13, Issue 3


      Author Tags

      1. CAD tools
      2. Deep neural networks
      3. FPGA

      Qualifiers

      • Research-article
      • Research
      • Refereed

