
You Cannot Improve What You Do Not Measure: FPGA vs. ASIC Efficiency Gaps for Convolutional Neural Network Inference

Published: 12 December 2018

Abstract

Recently, deep learning (DL) has become best-in-class for numerous applications, but at a high computational cost that necessitates high-performance, energy-efficient acceleration. The reconfigurability of FPGAs is appealing given the rapid pace of change in DL models, but it also leads to lower performance and area efficiency than ASICs. In this article, we implement three state-of-the-art computing architectures (CAs) for convolutional neural network (CNN) inference on FPGAs and ASICs. By comparing the FPGA and ASIC implementations, we highlight the area and performance costs of programmability and pinpoint the inefficiencies in current FPGA architectures. We perform our experiments using three variations of these CAs for AlexNet, VGG-16, and ResNet-50 to allow extensive comparisons. We find that the performance gap varies significantly, from 2.8× to 6.3×, while the area gap is consistent across CAs, with an average FPGA-to-ASIC area ratio of 8.7. Among the different blocks of the CAs, the convolution engine, which constitutes up to 60% of the total area, has a high area ratio ranging from 13 to 31. Motivated by our FPGA vs. ASIC comparisons, we suggest FPGA architectural changes, such as increasing the DSP block count, enhancing low-precision support in DSP blocks, and rethinking the on-chip memories, to reduce the programmability gap for DL applications.
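
To make the gap metrics above concrete, the following is a minimal sketch of how a per-block FPGA-to-ASIC area comparison and a frequency-based performance gap could be computed from matched implementations. All block names and numbers in the sketch are hypothetical placeholders for illustration, not the paper's measured data.

```python
# Minimal sketch of an FPGA-vs-ASIC gap calculation.
# All numbers below are hypothetical placeholders, NOT the paper's results.

# Hypothetical per-block areas for one computing architecture (CA),
# in arbitrary but consistent units (e.g., mm^2 at the same technology node).
asic_area = {"conv_engine": 1.0, "on_chip_buffers": 0.8, "control": 0.2}
fpga_area = {"conv_engine": 18.0, "on_chip_buffers": 4.0, "control": 1.5}

def block_area_ratios(fpga, asic):
    """FPGA-to-ASIC area ratio for each block of the CA."""
    return {blk: fpga[blk] / asic[blk] for blk in asic}

def overall_area_ratio(fpga, asic):
    """Total-area ratio: sum of FPGA block areas over sum of ASIC block areas."""
    return sum(fpga.values()) / sum(asic.values())

def performance_gap(fpga_fmax_mhz, asic_fmax_mhz):
    """When both implementations are cycle-equivalent (same architecture and schedule),
    the performance gap reduces to the ratio of achievable clock frequencies."""
    return asic_fmax_mhz / fpga_fmax_mhz

if __name__ == "__main__":
    ratios = block_area_ratios(fpga_area, asic_area)
    total_asic = sum(asic_area.values())
    for blk, r in ratios.items():
        share = asic_area[blk] / total_asic
        print(f"{blk}: area ratio {r:.1f}x (ASIC area share {share:.0%})")
    print(f"overall FPGA-to-ASIC area ratio: {overall_area_ratio(fpga_area, asic_area):.1f}x")
    # Hypothetical achievable clock frequencies for the same design on both targets.
    print(f"performance gap: {performance_gap(fpga_fmax_mhz=200, asic_fmax_mhz=800):.1f}x")
```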



      Published In

      ACM Transactions on Reconfigurable Technology and Systems, Volume 11, Issue 3
      Special Issue on Deep Learning on FPGAs
      September 2018, 187 pages
      ISSN: 1936-7406
      EISSN: 1936-7414
      DOI: 10.1145/3299999
      Editor: Steve Wilton

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 12 December 2018
      Accepted: 01 July 2018
      Revised: 01 April 2018
      Received: 01 December 2017
      Published in TRETS Volume 11, Issue 3

      Author Tags

      1. ASIC
      2. Deep learning
      3. FPGA
      4. convolutional neural networks

      Qualifiers

      • Research-article
      • Research
      • Refereed

