
You Cannot Improve What You Do Not Measure: FPGA vs. ASIC Efficiency Gaps for Convolutional Neural Network Inference

Published: 12 December 2018

Abstract

Recently, deep learning (DL) has become best-in-class for numerous applications, but at a high computational cost that necessitates high-performance, energy-efficient acceleration. The reconfigurability of FPGAs is appealing given the rapid pace of change in DL models, but it also leads to lower performance and area efficiency than ASICs. In this article, we implement three state-of-the-art computing architectures (CAs) for convolutional neural network (CNN) inference on FPGAs and ASICs. By comparing the FPGA and ASIC implementations, we highlight the area and performance costs of programmability and pinpoint the inefficiencies in current FPGA architectures. We perform our experiments using three variations of these CAs for AlexNet, VGG-16, and ResNet-50 to allow extensive comparisons. We find that the performance gap varies significantly, from 2.8× to 6.3×, while the area gap is consistent across CAs, with an average FPGA-to-ASIC area ratio of 8.7. Among the different blocks of the CAs, the convolution engine, which constitutes up to 60% of the total area, has a high area ratio ranging from 13 to 31. Motivated by our FPGA vs. ASIC comparisons, we suggest FPGA architectural changes, such as increasing the DSP block count, enhancing low-precision support in DSP blocks, and rethinking the on-chip memories, to reduce the programmability gap for DL applications.
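
To make the gap metrics above concrete, the following is a minimal sketch of how a per-block FPGA-to-ASIC area comparison and a frequency-based performance gap could be computed from matched implementations. All block names and numbers in the sketch are hypothetical placeholders for illustration, not the paper's measured data.

```python
# Minimal sketch of an FPGA-vs-ASIC gap calculation.
# All numbers below are hypothetical placeholders, NOT the paper's results.

# Hypothetical per-block areas for one computing architecture (CA),
# in arbitrary but consistent units (e.g., mm^2 at the same technology node).
asic_area = {"conv_engine": 1.0, "on_chip_buffers": 0.8, "control": 0.2}
fpga_area = {"conv_engine": 18.0, "on_chip_buffers": 4.0, "control": 1.5}

def block_area_ratios(fpga, asic):
    """FPGA-to-ASIC area ratio for each block of the CA."""
    return {blk: fpga[blk] / asic[blk] for blk in asic}

def overall_area_ratio(fpga, asic):
    """Total-area ratio: sum of FPGA block areas over sum of ASIC block areas."""
    return sum(fpga.values()) / sum(asic.values())

def performance_gap(fpga_fmax_mhz, asic_fmax_mhz):
    """When both implementations are cycle-equivalent (same architecture and schedule),
    the performance gap reduces to the ratio of achievable clock frequencies."""
    return asic_fmax_mhz / fpga_fmax_mhz

if __name__ == "__main__":
    ratios = block_area_ratios(fpga_area, asic_area)
    total_asic = sum(asic_area.values())
    for blk, r in ratios.items():
        share = asic_area[blk] / total_asic
        print(f"{blk}: area ratio {r:.1f}x (ASIC area share {share:.0%})")
    print(f"overall FPGA-to-ASIC area ratio: {overall_area_ratio(fpga_area, asic_area):.1f}x")
    # Hypothetical achievable clock frequencies for the same design on both targets.
    print(f"performance gap: {performance_gap(fpga_fmax_mhz=200, asic_fmax_mhz=800):.1f}x")
```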



      Published In

      ACM Transactions on Reconfigurable Technology and Systems, Volume 11, Issue 3
      Special Issue on Deep Learning on FPGAs
      September 2018, 187 pages
      ISSN: 1936-7406
      EISSN: 1936-7414
      DOI: 10.1145/3299999
      Editor: Steve Wilton

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 12 December 2018
      Accepted: 01 July 2018
      Revised: 01 April 2018
      Received: 01 December 2017
      Published in TRETS Volume 11, Issue 3

      Author Tags

      1. ASIC
      2. Deep learning
      3. FPGA
      4. convolutional neural networks

      Qualifiers

      • Research-article
      • Research
      • Refereed

