Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

Published: 15 February 2018

Abstract

Three-dimensional convolutional neural networks (3D CNNs) are used effectively in many computer vision applications. Most previous work in this area has concentrated only on designing and optimizing accelerators for 2D CNNs, with few attempts made to accelerate 3D CNNs on FPGA. We find that accelerating 3D CNNs on FPGA is challenging due to their high computational complexity and storage demands. More importantly, although the computation patterns of 2D and 3D CNNs are analogous, the conventional approaches adopted for accelerating 2D CNNs may be unfit for 3D CNN acceleration. In this paper, to accelerate 2D and 3D CNNs within a single framework, we propose a uniform template-based architecture whose templates, built on the Winograd algorithm, enable fast development of 2D and 3D CNN accelerators. We also develop a uniform analytical model to facilitate efficient design space exploration of 2D and 3D CNN accelerators based on our architecture. Finally, we demonstrate the effectiveness of the template-based architecture by implementing accelerators for real-life 2D and 3D CNNs (VGG16 and C3D) on multiple FPGA platforms. On the S2C VUS440, we achieve up to 1.13 TOPS for VGG16 and 1.11 TOPS for C3D at low resource utilization. End-to-end comparisons show that our C3D implementation achieves gains of up to 13x in performance and 60x in energy efficiency over a CPU solution, and a 6.4x energy-efficiency gain over a GPU solution.
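
The templates described in the abstract are built on the Winograd minimal filtering algorithm. As a rough, hypothetical illustration of that underlying idea (not the authors' implementation), the following Python/NumPy sketch computes a 1D F(2,3) Winograd convolution, i.e. two outputs of a 3-tap filter from a 4-element input tile using 4 multiplications instead of 6, and checks it against direct convolution. The transform matrices follow Lavin and Gray's formulation; the function name and the NumPy dependency are assumptions made for this example only.

import numpy as np

# F(2,3) Winograd transform matrices (Lavin & Gray formulation).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)   # input transform
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                      # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)     # output transform

def winograd_f23(d, g):
    """Two valid outputs of conv(d, g) for a 4-element input tile d
    and a 3-tap filter g, via the Winograd F(2,3) transform."""
    U = G @ g      # filter transform (precomputable offline)
    V = BT @ d     # input transform
    M = U * V      # element-wise products: the 4 "real" multiplies
    return AT @ M  # inverse (output) transform

# Hypothetical usage: verify against direct convolution.
rng = np.random.default_rng(0)
d = rng.standard_normal(4)
g = rng.standard_normal(3)
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)

The 2D F(2x2, 3x3) and 3D variants used for CNN layers apply the same input, filter, and output transforms along each spatial (and, for 3D, temporal) dimension; that structural similarity is what a uniform template can exploit across 2D and 3D convolutions.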

Published In

FPGA '18: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2018
310 pages
ISBN:9781450356145
DOI:10.1145/3174243

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. 3D CNN
  2. Uniform templates
  3. Winograd algorithm

Qualifiers

  • Research-article

Funding Sources

  • National Program on Key Basic Research Project

Conference

FPGA '18

Acceptance Rates

FPGA '18 Paper Acceptance Rate 10 of 116 submissions, 9%;
Overall Acceptance Rate 125 of 627 submissions, 20%
