Abstract
Deep convolutional neural networks (CNNs) have achieved great success in various computer vision applications. State-of-the-art CNN models for large-scale applications are computationally intensive and memory hungry and are therefore mainly processed on high-performance processors such as server CPUs and GPUs. However, there is a growing demand for high-accuracy, real-time object detection in large-scale clusters and embedded systems, which calls for energy-efficient accelerators because of green-computing requirements or limited battery capacity. Owing to their energy efficiency and reconfigurability, Field-Programmable Gate Arrays (FPGAs) have been widely explored as CNN accelerators. In this article, we present an in-depth analysis of the computational complexity and memory footprint of each CNN layer type. We then propose a scalable parallel framework that exploits four levels of parallelism in hardware acceleration. We further put forward a systematic design space exploration methodology to search for the optimal solution that maximizes accelerator throughput under FPGA constraints such as on-chip memory, computational resources, external memory bandwidth, and clock frequency. Finally, we demonstrate the methodology by optimizing three representative CNNs (LeNet, AlexNet, and VGG-S) on a Xilinx VC709 board. The three accelerators achieve average performance of 424.7, 445.6, and 473.4 GOP/s, respectively, at a 100 MHz working frequency, significantly outperforming the CPU and prior work.
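The layer analysis and design space exploration summarized above can be sketched in a few lines. The snippet below is an illustrative toy model, not the paper's actual methodology: the cost model (multiply-accumulates counted as two operations, ceiling-divided channel tiling), the two unroll factors `p_in`/`p_out`, the DSP budget, and the AlexNet-like layer shapes are all assumptions made for the sketch.

```python
def conv_layer_stats(h, w, c_in, c_out, k):
    """Operation count (a multiply-accumulate counted as 2 ops) and weight
    footprint of one unit-stride convolutional layer."""
    ops = 2 * h * w * c_in * c_out * k * k
    weights = c_in * c_out * k * k
    return ops, weights


def explore(layers, dsp_budget=3600, freq_mhz=100):
    """Brute-force search over input/output-channel unroll factors that
    maximizes estimated throughput under a DSP budget -- a toy stand-in
    for a full design space exploration."""
    best = None
    total_ops = sum(conv_layer_stats(*l)[0] for l in layers)
    for p_out in range(1, 65):
        for p_in in range(1, 65):
            if p_out * p_in > dsp_budget:  # multiplier array must fit
                continue
            # Cycle estimate: each layer iterates over ceil-divided
            # channel tiles; -(-a // b) is ceiling division.
            cycles = sum(
                h * w * k * k * -(-c_in // p_in) * -(-c_out // p_out)
                for (h, w, c_in, c_out, k) in layers
            )
            gops = total_ops / cycles * freq_mhz / 1e3  # GOP/s at freq_mhz
            if best is None or gops > best[0]:
                best = (gops, p_out, p_in)
    return best


# AlexNet-like conv shapes (assumed): output H, output W, C_in, C_out, K
alexnet = [(55, 55, 3, 96, 11), (27, 27, 96, 256, 5),
           (13, 13, 256, 384, 3), (13, 13, 384, 384, 3),
           (13, 13, 384, 256, 3)]
print(explore(alexnet))
```

A real exploration would additionally constrain on-chip buffer sizes against BRAM capacity and check that the required external memory bandwidth is achievable, as the abstract's constraint list indicates.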
Index Terms
- Throughput-Optimized FPGA Accelerator for Deep Convolutional Neural Networks