DOI: 10.1145/3373087.3375311
Research article · Open access

Light-OPU: An FPGA-based Overlay Processor for Lightweight Convolutional Neural Networks

Published: 24 February 2020

Abstract

Lightweight convolutional neural networks (LW-CNNs) such as MobileNet, ShuffleNet, and SqueezeNet have emerged in recent years for fast inference on embedded and mobile systems. However, lightweight operations limit the acceleration potential of GPUs: they are memory-bound, and their parallelization patterns map poorly to SIMD execution. This calls for more specialized accelerators. In this paper, we propose Light-OPU, an FPGA-based overlay processor with a corresponding compilation flow for general LW-CNN acceleration. The software-hardware co-designed Light-OPU reformulates and decomposes lightweight operations for efficient acceleration. Moreover, our instruction architecture allows lightweight operations and conventional convolutions to share the main computation engine, which improves run-time resource efficiency and overall power efficiency. Finally, Light-OPU is software programmable: loading compiled code and kernel weights switches the target network without FPGA reconfiguration. Our experiments on seven major LW-CNNs show that Light-OPU achieves 5.5x lower latency and 3.0x higher power efficiency on average compared with the edge GPU NVIDIA Jetson TX2, and 1.3x to 8.4x better power efficiency compared with previous customized FPGA accelerators. To the best of our knowledge, Light-OPU is the first in-depth study of an FPGA-based general processor for LW-CNN acceleration with high performance and power efficiency, evaluated on all major LW-CNNs including the newly released MobileNetV3.
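The memory-bound nature of lightweight operations mentioned above can be made concrete with a back-of-the-envelope MAC count: depthwise-separable convolution (the core building block of MobileNet and similar LW-CNNs) cuts compute far more than it cuts data movement, lowering arithmetic intensity. The sketch below is illustrative only; the layer shape is a hypothetical MobileNet-like layer, not one taken from the paper.

```python
# Compare MAC counts of a standard convolution vs. a depthwise-separable
# convolution (depthwise k x k per channel, then pointwise 1x1), assuming
# stride 1 and 'same' padding so the output is also h x w.

def standard_conv_macs(h, w, cin, cout, k):
    """MACs for a standard k x k convolution over an h x w feature map."""
    return h * w * cin * cout * k * k

def depthwise_separable_macs(h, w, cin, cout, k):
    """MACs for depthwise (k x k, one filter per channel) + pointwise (1x1)."""
    depthwise = h * w * cin * k * k
    pointwise = h * w * cin * cout
    return depthwise + pointwise

# Hypothetical layer: 56x56 feature map, 128 -> 128 channels, 3x3 kernel.
std = standard_conv_macs(56, 56, 128, 128, 3)
sep = depthwise_separable_macs(56, 56, 128, 128, 3)

# ~8.4x fewer MACs, yet the input/output activation traffic is unchanged,
# so arithmetic intensity (MACs per byte moved) drops sharply -- the layer
# becomes memory-bound rather than compute-bound.
print(f"standard: {std:,} MACs, separable: {sep:,} MACs, ratio: {std / sep:.1f}x")
```

This is why a wide SIMD device tuned for dense convolution sits idle on such layers, and why an overlay processor that restructures these operations around its memory system can win.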



      Published In

      FPGA '20: Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
      February 2020
      346 pages
      ISBN:9781450370998
      DOI:10.1145/3373087

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. compiler
      2. fpga acceleration
      3. lightweight cnn
      4. processor

      Conference

      FPGA '20
      Acceptance Rates

      Overall Acceptance Rate 125 of 627 submissions, 20%
