research-article

High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-Design

Authors:
Anupreetham Anupreetham, Arizona State University, USA (ORCID: 0000-0002-4991-188X)
Mohamed Ibrahim, University of Toronto, Intel Corporation, Canada (ORCID: 0009-0006-8930-0692)
Mathew Hall, University of Toronto, Canada (ORCID: 0000-0002-2134-8247)
Andrew Boutros, University of Toronto, Vector Institute for AI, Canada (ORCID: 0000-0002-8044-1644)
Ajay Kuzhively, Arizona State University, USA (ORCID: 0009-0008-6780-6451)
Abinash Mohanty, Arizona State University, USA (ORCID: 0009-0006-9524-0366)
Eriko Nurvitadhi, Intel Corporation, USA (ORCID: 0000-0002-2347-9590)
Vaughn Betz, University of Toronto, Vector Institute for AI, Canada (ORCID: 0000-0003-0528-6493)
Yu Cao, Arizona State University, USA (ORCID: 0000-0001-5689-0768)
Jae-Sun Seo, Arizona State University, USA (ORCID: 0000-0002-4551-7789)

ACM Transactions on Reconfigurable Technology and Systems, Volume 17, Issue 1, Article No.: 1, pp 1–20, https://doi.org/10.1145/3634919

Published: 15 January 2024

Abstract

Object detection and classification is a key task in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3× higher throughput and 5× lower latency compared to the best prior FPGA-based solution with comparable accuracy.
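
To make the sequential dependency described in the abstract concrete, the following is a minimal sketch of the conventional greedy NMS baseline. It is not the pipelined algorithm the paper proposes. The function names (`iou`, `greedy_nms`), the corner-based box format, and the 0.5 IoU threshold are assumptions chosen for illustration. The sort over all scores is the step that forces NMS to wait until every box prediction has been generated.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept by conventional greedy NMS.

    Sorting by score needs every prediction up front, which is the
    sequential dependency between box generation and NMS.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep box i only if it does not overlap too much with any
        # higher-scoring box that was already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Hypothetical example data: two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)  # the second box is suppressed by the first
```

In this baseline, no box can be emitted until the last prediction has arrived and the full list has been sorted, so the NMS latency adds directly to that of the feature extractor.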

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Neural networks
      2. Reconfigurable computing

Published in
ACM Transactions on Reconfigurable Technology and Systems Volume 17, Issue 1
March 2024
446 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/3613534
Editor:
Deming Chen
University of Illinois, Urbana-Champaign, USA
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 15 January 2024
- Online AM: 4 December 2023
- Accepted: 16 November 2023
- Revised: 27 September 2023
- Received: 27 May 2023
Author Tags
FPGA accelerator
object detection
algorithm-hardware co-design
neural networks
Qualifiers
- research-article
