
A small object detection architecture with concatenated detection heads and multi-head mixed self-attention mechanism

  • Research
  • Published in Journal of Real-Time Image Processing

Abstract

A novel detection method is proposed to address the challenge of detecting small objects. The method augments the YOLOv8n architecture with a dedicated small-object detection layer and introduces a Concat detection head to extract features more effectively. A new attention mechanism, Multi-Head Mixed Self-Attention (MMSA), is also added to strengthen the feature-extraction capability of the backbone. To improve detection sensitivity for small objects, the localization loss combines the Normalized Wasserstein Distance (NWD) with Intersection over Union (IoU), optimizing bounding-box regression. Experimental results on the TT100K dataset show that the mean average precision (mAP@0.5) reaches 88.1%, a 13.5% improvement over YOLOv8n. The method's versatility is further validated on the BDD100K dataset, where it is compared against a range of object-detection algorithms. The results demonstrate that the method delivers significant improvements and practical value for small-object detection. Code is available at https://github.com/CodeSworder/MMSA.
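The loss design described above can be sketched as follows. This is a minimal NumPy sketch of the idea, not the authors' implementation (see the linked repository for that): the normalizing constant `c`, the mixing weight `alpha`, and the exact blending form are assumptions, following the NWD formulation of Wang et al., in which each box is modelled as a 2D Gaussian and the distance is mapped to a similarity via an exponential.

```python
import numpy as np

def nwd(box1, box2, c=12.8):
    """Normalized Wasserstein Distance between two (cx, cy, w, h) boxes.

    Each box is modelled as a 2D Gaussian N([cx, cy], diag(w^2/4, h^2/4));
    the squared 2-Wasserstein distance between such Gaussians reduces to the
    squared Euclidean distance between the (cx, cy, w/2, h/2) vectors.
    The constant c is a dataset-dependent normalizer (an assumption here).
    """
    p1 = np.array([box1[0], box1[1], box1[2] / 2.0, box1[3] / 2.0])
    p2 = np.array([box2[0], box2[1], box2[2] / 2.0, box2[3] / 2.0])
    w2_squared = np.sum((p1 - p2) ** 2)
    return float(np.exp(-np.sqrt(w2_squared) / c))

def iou(box1, box2):
    """Plain IoU for (cx, cy, w, h) boxes."""
    x1a, y1a = box1[0] - box1[2] / 2, box1[1] - box1[3] / 2
    x2a, y2a = box1[0] + box1[2] / 2, box1[1] + box1[3] / 2
    x1b, y1b = box2[0] - box2[2] / 2, box2[1] - box2[3] / 2
    x2b, y2b = box2[0] + box2[2] / 2, box2[1] + box2[3] / 2
    iw = max(0.0, min(x2a, x2b) - max(x1a, x1b))
    ih = max(0.0, min(y2a, y2b) - max(y1a, y1b))
    inter = iw * ih
    union = box1[2] * box1[3] + box2[2] * box2[3] - inter
    return inter / union if union > 0 else 0.0

def localization_loss(pred, target, alpha=0.5):
    """Hypothetical blend of the NWD and IoU localization terms."""
    return alpha * (1.0 - nwd(pred, target)) + (1.0 - alpha) * (1.0 - iou(pred, target))
```

Because NWD compares boxes as Gaussians rather than by overlap area, it remains smooth and informative even when small boxes barely overlap, which is where plain IoU gradients degenerate; blending the two terms keeps the benefits of each.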


Data availability

The datasets used in this article are all publicly available; for example, the TT100K dataset can be obtained at https://cg.cs.tsinghua.edu.cn/traffic-sign/.


Author information

Contributions

Jianhong Mu—original draft; Qinghua Su—review & editing; Xiyu Wang, Wenhui Liang, Sheng Xu, and Kaizheng Wan assisted with some comparative experiments during the revision process. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Jianhong Mu or Qinghua Su.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Mu, J., Su, Q., Wang, X. et al. A small object detection architecture with concatenated detection heads and multi-head mixed self-attention mechanism. J Real-Time Image Proc 21, 184 (2024). https://doi.org/10.1007/s11554-024-01562-1
