Optimizing OpenCL-Based CNN Design on FPGA with Comprehensive Design Space Exploration and Collaborative Performance Modeling

Published: 23 June 2020

Abstract

Recent success in applying convolutional neural networks (CNNs) to object detection and classification has sparked great interest in accelerating CNNs with hardware such as field-programmable gate arrays (FPGAs). However, finding an efficient FPGA design for a given CNN model and FPGA board is not trivial, since it requires a strong background in hardware design and detailed knowledge of the target board. In this work, we address this problem through design space exploration with a collaborative framework. Our framework consists of three main parts: FPGA design generation, coarse-grained modeling, and fine-grained modeling. In the FPGA design generation, we propose a novel data structure, LoopTree, to capture the details of an FPGA design for CNN applications without writing the source code. Different LoopTrees, which represent different FPGA designs, are generated automatically in this process. A coarse-grained model evaluates LoopTrees at the operation level (e.g., add, mult) so that the most efficient LoopTrees can be selected. A fine-grained model, which is based on the source code, then refines the selected design in a cycle-accurate manner. A set of comprehensive OpenCL-based designs has been implemented on board to verify our framework. Average estimation errors of 8.87% and 4.8% have been observed for our coarse-grained and fine-grained models, respectively, which is much lower than the prevalent operation-statistics-based estimation obtained from a predefined formula for specific loop schedules.
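To make the idea concrete, the following is a minimal, hypothetical sketch of a LoopTree-style structure and an operation-level cost estimate. The class and field names (`LoopNode`, `trip_count`, `unroll`) and the cost formula are illustrative assumptions, not the paper's actual data structure or model: each node represents one loop level of a CNN layer, and unrolled iterations are assumed to execute in parallel.

```python
from dataclasses import dataclass, field

@dataclass
class LoopNode:
    """One loop level of a CNN layer (hypothetical LoopTree node)."""
    name: str                 # loop dimension, e.g. "out_channel"
    trip_count: int           # total iterations of this loop
    unroll: int = 1           # assumed hardware parallelism along this dimension
    children: list = field(default_factory=list)  # nested inner loops

def coarse_cost(node: LoopNode, ops_per_body: int = 1) -> int:
    """Operation-level (not cycle-accurate) cost estimate.

    Unrolled iterations are assumed to run in parallel, so each level
    contributes trip_count // unroll sequential steps, multiplied by
    the cost of its loop body (the sum over its children, or
    ops_per_body at a leaf)."""
    sequential = node.trip_count // node.unroll
    if not node.children:
        return sequential * ops_per_body
    return sequential * sum(coarse_cost(c, ops_per_body) for c in node.children)

# Example: a 3x3 convolution over a 32x32 feature map with 16 output
# channels, unrolled 4x along the output-channel loop; each multiply-
# accumulate counts as 2 operations.
tree = LoopNode("out_channel", 16, unroll=4, children=[
    LoopNode("row", 32, children=[
        LoopNode("col", 32, children=[
            LoopNode("kernel", 9),
        ]),
    ]),
])
print(coarse_cost(tree, ops_per_body=2))  # 73728 sequential operations
```

Under this sketch, enumerating candidate designs amounts to generating LoopTrees with different unroll and nesting choices and ranking them by `coarse_cost`, before a source-level, cycle-accurate model refines the survivors.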



• Published in

  ACM Transactions on Reconfigurable Technology and Systems, Volume 13, Issue 3
  September 2020, 182 pages
  ISSN: 1936-7406
  EISSN: 1936-7414
  DOI: 10.1145/3404107
  Editor: Deming Chen

  Copyright © 2020 ACM

  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

  Publisher

  Association for Computing Machinery, New York, NY, United States

  Publication History

  • Published: 23 June 2020
  • Online AM: 7 May 2020
  • Accepted: 1 April 2020
  • Revised: 1 March 2020
  • Received: 1 December 2019

      Qualifiers

      • research-article
      • Research
      • Refereed
