
A Low-Latency Hardware Accelerator for YOLO Object Detection Algorithms

  • Conference paper
  • First Online:
Advanced Parallel Processing Technologies (APPT 2023)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14103))


Abstract

Object detection is an important computer vision task with a wide range of applications, including autonomous driving, smart security, and other domains. However, its high computational requirements pose challenges for deployment on resource-limited edge devices, so dedicated hardware accelerators are needed to deliver improved detection speed and latency. Post-processing is a key step in object detection and involves intensive computation on the CPU or GPU. Its core is the non-maximum suppression (NMS) algorithm, which eliminates redundant boxes belonging to the same object. However, NMS becomes a bottleneck for hardware acceleration because it iterates repeatedly and must wait until all predicted boxes have been generated.
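The greedy baseline NMS described above can be sketched in a few lines (a reference sketch of the standard software algorithm, not the hardware-friendly variant proposed in the paper):

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    # Greedy NMS: keep the highest-scoring box, discard boxes that
    # overlap it above `thresh`, then repeat on the remainder.
    # Note the data dependence: every pass rescans the surviving
    # boxes, which is why the loop is hard to pipeline in hardware.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep
```

For example, two heavily overlapping boxes collapse to the higher-scoring one, while a distant box survives: `nms([(0,0,10,10), (1,1,11,11), (50,50,60,60)], [0.9, 0.8, 0.7])` keeps indices `[0, 2]`.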

In this paper, we propose a novel hardware-friendly NMS algorithm for FPGA accelerator design. Our algorithm alleviates the performance bottleneck of NMS by implementing the iterative algorithm as an efficient pipelined hardware circuit. We validate it on the VOC2007 dataset and show that it introduces only a 0.27% difference compared to the baseline NMS. Additionally, the exponential and sigmoid functions are extremely costly in hardware. To address this, we propose an approximate exponential function circuit that computes both functions with minimal logic cost and zero DSP cost.
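One common way to build such an approximation (shown here purely as an illustrative assumption; the paper's exact circuit may differ) is to rewrite e^x as a power of two, so that the integer part of the exponent becomes a barrel shift and the fractional part is covered by a cheap linear term, with the sigmoid reusing the same exponential unit:

```python
import math

LOG2E = math.log2(math.e)  # constant multiply in hardware

def approx_exp(x):
    # e^x = 2^(x * log2(e)); split the exponent into integer and
    # fractional parts. 2^n is a barrel shift, and 2^f on [0, 1) is
    # approximated linearly as 1 + f (max relative error ~6%),
    # so no DSP multiplier is needed for the fractional term.
    y = x * LOG2E
    n = math.floor(y)
    f = y - n
    return (1.0 + f) * (2.0 ** n)

def approx_sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)), reusing the same exp circuit.
    return 1.0 / (1.0 + approx_exp(-x))
```

The shift-and-add structure is why the logic cost stays low: the only multiplication is by the fixed constant log2(e), which can be folded into upstream fixed-point scaling.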

We deploy our post-processing accelerator on Xilinx’s Alveo U50 FPGA board. The final design achieves an end-to-end detection latency of 283 µs for the YOLOv2 model. Following the user guides provided by Xilinx and Intel, we converted the logic resources of different FPGA implementations into equivalent LUT resources, then compared the resource utilization of our acceleration module with that of the current state-of-the-art object detection system deployed on an Intel FPGA. Our design consumes 13.5× fewer LUT resources and far fewer DSP resources.

This work was partially supported by Open Fund (NO. OBCandETL-2022-06) of Space Advanced Computing and Electronic Information Laboratory of BICE.



Author information

Corresponding author

Correspondence to Dong Wang.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Wang, A., Ye, Y., Peng, Y., Zhang, D., Yan, Z., Wang, D. (2024). A Low-Latency Hardware Accelerator for YOLO Object Detection Algorithms. In: Li, C., Li, Z., Shen, L., Wu, F., Gong, X. (eds) Advanced Parallel Processing Technologies. APPT 2023. Lecture Notes in Computer Science, vol 14103. Springer, Singapore. https://doi.org/10.1007/978-981-99-7872-4_15


  • DOI: https://doi.org/10.1007/978-981-99-7872-4_15

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-7871-7

  • Online ISBN: 978-981-99-7872-4

  • eBook Packages: Computer Science (R0)
