High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-Design

Published: 15 January 2024

Abstract

Object detection and classification is a key task in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3× higher throughput and 5× lower latency compared to the best prior FPGA-based solution with comparable accuracy.
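To make the sequential dependency described in the abstract concrete, the following is a minimal sketch of conventional greedy NMS for a single class. It is the standard textbook baseline, not the pipelined algorithm proposed in this article. The names `iou` and `greedy_nms`, the `(x_min, y_min, x_max, y_max)` box layout, and the 0.5 IoU threshold are illustrative assumptions. The global sort by confidence is the step that cannot begin until every box prediction is available.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept by conventional greedy NMS."""
    # Sorting needs the full set of predictions, so this step cannot
    # start until the detector has emitted its last box.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that was already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(greedy_nms(boxes, scores))  # the second box overlaps the first
```

In this example the second box has an IoU of about 0.68 with the first and is suppressed, while the third box does not overlap either and is kept.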

Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 17, Issue 1
March 2024
446 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3613534
Editor: Deming Chen

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 15 January 2024
• Online AM: 4 December 2023
• Accepted: 16 November 2023
• Revised: 27 September 2023
• Received: 27 May 2023
