Optimizing OpenCL-Based CNN Design on FPGA with Comprehensive Design Space Exploration and Collaborative Performance Modeling

Published: 23 June 2020

Abstract

Recent success in applying convolutional neural networks (CNNs) to object detection and classification has sparked great interest in accelerating CNNs with hardware such as field-programmable gate arrays (FPGAs). However, finding an efficient FPGA design for a given CNN model and FPGA board is not trivial, since it requires a strong background in hardware design and detailed knowledge of the target board. In this work, we address this problem through design space exploration with a collaborative framework. Our framework consists of three main parts: FPGA design generation, coarse-grained modeling, and fine-grained modeling. In the FPGA design generation, we propose a novel data structure, LoopTree, to capture the details of an FPGA design for CNN applications without writing the source code. Different LoopTrees, which represent different FPGA designs, are generated automatically in this process. A coarse-grained model evaluates LoopTrees at the operation level (e.g., add, mult) so that the most efficient LoopTrees can be selected. A fine-grained model, which is based on the source code, then refines the selected design in a cycle-accurate manner. A set of comprehensive OpenCL-based designs has been implemented on board to verify our framework. Average estimation errors of 8.87% and 4.8% have been observed for our coarse-grained and fine-grained models, respectively, which is much lower than the prevalent operation-statistics-based estimation obtained from a predefined formula for specific loop schedules.
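To make the idea concrete, the following is a minimal, hypothetical sketch of a LoopTree-style structure and an operation-level cost estimate. The class and field names (`LoopNode`, `trip_count`, `unroll`) and the cost formula are illustrative assumptions, not the paper's actual data structure or model: each node represents one loop level of a CNN layer, and unrolled iterations are assumed to execute in parallel.

```python
from dataclasses import dataclass, field

@dataclass
class LoopNode:
    """One loop level of a CNN layer (hypothetical LoopTree node)."""
    name: str                 # loop dimension, e.g. "out_channel"
    trip_count: int           # total iterations of this loop
    unroll: int = 1           # assumed hardware parallelism along this dimension
    children: list = field(default_factory=list)  # nested inner loops

def coarse_cost(node: LoopNode, ops_per_body: int = 1) -> int:
    """Operation-level (not cycle-accurate) cost estimate.

    Unrolled iterations are assumed to run in parallel, so each level
    contributes trip_count // unroll sequential steps, multiplied by
    the cost of its loop body (the sum over its children, or
    ops_per_body at a leaf)."""
    sequential = node.trip_count // node.unroll
    if not node.children:
        return sequential * ops_per_body
    return sequential * sum(coarse_cost(c, ops_per_body) for c in node.children)

# Example: a 3x3 convolution over a 32x32 feature map with 16 output
# channels, unrolled 4x along the output-channel loop; each multiply-
# accumulate counts as 2 operations.
tree = LoopNode("out_channel", 16, unroll=4, children=[
    LoopNode("row", 32, children=[
        LoopNode("col", 32, children=[
            LoopNode("kernel", 9),
        ]),
    ]),
])
print(coarse_cost(tree, ops_per_body=2))  # 73728 sequential operations
```

Under this sketch, enumerating candidate designs amounts to generating LoopTrees with different unroll and nesting choices and ranking them by `coarse_cost`, before a source-level, cycle-accurate model refines the survivors.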



• Published in

  ACM Transactions on Reconfigurable Technology and Systems, Volume 13, Issue 3
  September 2020, 182 pages
  ISSN: 1936-7406
  EISSN: 1936-7414
  DOI: 10.1145/3404107
  Editor: Deming Chen

  Copyright © 2020 ACM

  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

  Publisher

  Association for Computing Machinery, New York, NY, United States

  Publication History

  • Published: 23 June 2020
  • Online AM: 7 May 2020
  • Accepted: 1 April 2020
  • Revised: 1 March 2020
  • Received: 1 December 2019

      Qualifiers

      • research-article
      • Research
      • Refereed
