
FPGA Logic Block Architectures for Efficient Deep Learning Inference

Published: 03 June 2020

Abstract

Reducing the precision of deep neural network (DNN) inference accelerators can yield large efficiency gains with little or no accuracy degradation compared to half- or single-precision floating point, by enabling more multiplication operations per unit area. A wide range of precisions fall on the Pareto-optimal curve of hardware efficiency vs. accuracy, with no single precision dominating, which makes the variable-precision capabilities of FPGAs very valuable. We propose three types of logic block architectural enhancements and fully evaluate a total of six architectures that improve the area efficiency of multiplications and additions implemented in the soft fabric. Increasing the LUT fracturability and adding two adders to the ALM (the 4-bit Adder Double Chain architecture) leads to a 1.5× area reduction for arithmetic-heavy machine learning (ML) kernels, while increasing their speed. In addition, this architecture also reduces the logic area of general applications by 6%, while increasing the critical path delay by only 1%. However, our highest-impact option, which adds a 9-bit shadow multiplier to the logic clusters, reduces the area and critical path delay of ML kernels by 2.4× and 1.2×, respectively. These large gains come at the cost of a 15% logic area increase for general applications.
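The low-precision multiply-accumulate (MAC) pattern that these logic block enhancements target can be sketched in plain NumPy. This is an illustrative assumption, not the paper's method: the quantization helper and the 4-bit-weight by 8-bit-activation pairing are hypothetical choices to show why narrow integer MACs are cheap. Each product fits in a few bits, so the multiplier and the running-sum adder can both be small, which is exactly what soft-fabric multipliers and adder chains exploit relative to floating point.

```python
import numpy as np

def quantize_symmetric(x, bits):
    """Symmetric uniform quantization: map floats to signed integers
    in [-(2**(bits-1)-1), 2**(bits-1)-1]; return codes and the scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)  # weights
a = rng.standard_normal(64).astype(np.float32)  # activations

# 4-bit weights x 8-bit activations: each product fits in 12 bits,
# so the whole dot product runs in narrow integer arithmetic.
qw, sw = quantize_symmetric(w, 4)
qa, sa = quantize_symmetric(a, 8)
int_dot = int(np.dot(qw, qa))   # pure integer multiply-accumulate
approx = int_dot * sw * sa      # one rescale back to real units
exact = float(np.dot(w, a))
print(f"exact={exact:.4f}  int4xint8 approx={approx:.4f}")
```

Because the scales factor out of the summation, only a single floating-point rescale is needed per dot product; everything inside the loop is integer work that maps onto LUTs, carry chains, or a small shadow multiplier.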


Cited By

  • A Stacked FPGA Utilizing 3D-SRAM with Latency Optimization. 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 400-406, Dec. 2024. DOI: 10.1109/MCSoC64144.2024.00072
  • Approximate Row-Merging-Based Multipliers for Neural Network Acceleration on FPGAs. IEEE Embedded Systems Letters 16, 2 (Jun. 2024), 126-129. DOI: 10.1109/LES.2023.3304678
  • Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs. 2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 54-65, May 2024. DOI: 10.1109/FCCM60383.2024.00015


Published In

ACM Transactions on Reconfigurable Technology and Systems, Volume 13, Issue 3 (September 2020), 182 pages.
ISSN: 1936-7406. EISSN: 1936-7414. DOI: 10.1145/3404107. Editor: Deming Chen.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 03 June 2020
      Online AM: 07 May 2020
      Accepted: 01 April 2020
      Revised: 01 March 2020
      Received: 01 October 2019
      Published in TRETS Volume 13, Issue 3


      Author Tags

      1. CAD tools
      2. Deep neural networks
      3. FPGA

      Qualifiers

      • Research-article
      • Research
      • Refereed

