DOI: 10.1145/3373087.3375311
Research article · Open access

Light-OPU: An FPGA-based Overlay Processor for Lightweight Convolutional Neural Networks

Published: 24 February 2020

Abstract

Lightweight convolutional neural networks (LW-CNNs) such as MobileNet, ShuffleNet, and SqueezeNet have emerged in recent years for fast inference on embedded and mobile systems. However, lightweight operations limit the acceleration potential of GPUs: they are memory-bound, and their parallelization patterns map poorly to SIMD execution. This calls for more specialized accelerators. In this paper, we propose Light-OPU, an FPGA-based overlay processor with a corresponding compilation flow for general LW-CNN acceleration. The software-hardware co-designed Light-OPU reformulates and decomposes lightweight operations for efficient acceleration. Moreover, our instruction architecture allows lightweight operations and conventional convolutions to share the main computation engine, which improves run-time resource efficiency and overall power efficiency. Finally, Light-OPU is software programmable: loading compiled code and kernel weights switches the target network without FPGA reconfiguration. Our experiments on seven major LW-CNNs show that Light-OPU achieves 5.5x lower latency and 3.0x higher power efficiency on average compared with the edge GPU NVIDIA Jetson TX2, and 1.3x to 8.4x better power efficiency compared with previous customized FPGA accelerators. To the best of our knowledge, Light-OPU is the first in-depth study of an FPGA-based general processor for LW-CNN acceleration with high performance and power efficiency, evaluated on all major LW-CNNs including the newly released MobileNetV3.
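The memory-bound nature of lightweight operations mentioned above can be made concrete with a back-of-the-envelope MAC count: depthwise-separable convolution (the core building block of MobileNet and similar LW-CNNs) cuts compute far more than it cuts data movement, lowering arithmetic intensity. The sketch below is illustrative only; the layer shape is a hypothetical MobileNet-like layer, not one taken from the paper.

```python
# Compare MAC counts of a standard convolution vs. a depthwise-separable
# convolution (depthwise k x k per channel, then pointwise 1x1), assuming
# stride 1 and 'same' padding so the output is also h x w.

def standard_conv_macs(h, w, cin, cout, k):
    """MACs for a standard k x k convolution over an h x w feature map."""
    return h * w * cin * cout * k * k

def depthwise_separable_macs(h, w, cin, cout, k):
    """MACs for depthwise (k x k, one filter per channel) + pointwise (1x1)."""
    depthwise = h * w * cin * k * k
    pointwise = h * w * cin * cout
    return depthwise + pointwise

# Hypothetical layer: 56x56 feature map, 128 -> 128 channels, 3x3 kernel.
std = standard_conv_macs(56, 56, 128, 128, 3)
sep = depthwise_separable_macs(56, 56, 128, 128, 3)

# ~8.4x fewer MACs, yet the input/output activation traffic is unchanged,
# so arithmetic intensity (MACs per byte moved) drops sharply -- the layer
# becomes memory-bound rather than compute-bound.
print(f"standard: {std:,} MACs, separable: {sep:,} MACs, ratio: {std / sep:.1f}x")
```

This is why a wide SIMD device tuned for dense convolution sits idle on such layers, and why an overlay processor that restructures these operations around its memory system can win.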



      Published In

      FPGA '20: Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
      February 2020
      346 pages
      ISBN:9781450370998
      DOI:10.1145/3373087

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. compiler
      2. fpga acceleration
      3. lightweight cnn
      4. processor

      Conference

      FPGA '20
      Acceptance Rates

      Overall Acceptance Rate 125 of 627 submissions, 20%
