Abstract
Object detection and classification is a key task in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3× higher throughput and 5× lower latency compared to the best prior FPGA-based solution with comparable accuracy.
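To make the sequential dependency concrete, the following is a minimal sketch of conventional greedy NMS, the kind of baseline the abstract describes as having to wait for all box predictions. It is not the pipelined algorithm proposed in the paper. The names `iou`, `greedy_nms`, and `iou_threshold`, the corner-coordinate box format, and the default threshold of 0.5 are assumptions chosen for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept by conventional greedy NMS."""
    # The sort needs every score up front, so this step cannot begin
    # until the detector has emitted its last box prediction.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that has already been kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(greedy_nms(boxes, scores))
```

In this sketch the global sort by confidence is what forces NMS to start only after the last box has been produced, which is the latency overhead the abstract refers to.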
Index Terms
- High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-Design