Abstract
A novel detection method is proposed to address the challenge of detecting small objects. The method augments the YOLOv8n architecture with a small-object detection layer and introduces a Concat detection head to extract features more effectively. In addition, a new attention mechanism, Multi-Head Mixed Self-Attention (MMSA), is incorporated to strengthen the feature-extraction capability of the backbone. To improve sensitivity to small objects, the localization loss combines Normalized Wasserstein Distance (NWD) with Intersection over Union (IoU), optimizing bounding-box regression. On the TT100K dataset, the method reaches 88.1% mean average precision (mAP@0.5), a 13.5% improvement over YOLOv8n. Its generality is further validated on the BDD100K dataset, where it is compared with a range of object-detection algorithms. The results demonstrate that the method yields significant improvements and practical value in small-object detection. Code is available at https://github.com/CodeSworder/MMSA.
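The combined NWD/IoU localization loss described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: each box is modeled as a 2D Gaussian so the 2-Wasserstein distance has a closed form, and the normalization constant `c` and the mixing `ratio` are assumed hyperparameters.

```python
import math

def iou(a, b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nwd(a, b, c=12.8):
    """Normalized Wasserstein Distance: model each box as a Gaussian
    N([cx, cy], diag((w/2)^2, (h/2)^2)); the squared 2-Wasserstein
    distance between two such Gaussians has the closed form below.
    The constant c is an assumed dataset-dependent normalizer."""
    cxa, cya = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cxb, cyb = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    w2 = (cxa - cxb) ** 2 + (cya - cyb) ** 2 \
        + ((wa - wb) ** 2 + (ha - hb) ** 2) / 4
    return math.exp(-math.sqrt(w2) / c)

def box_loss(pred, target, ratio=0.5):
    """Weighted mix of IoU loss and NWD loss; `ratio` is hypothetical."""
    return ratio * (1 - iou(pred, target)) + (1 - ratio) * (1 - nwd(pred, target))
```

Unlike IoU, which is zero for any non-overlapping pair, NWD degrades smoothly with center distance, which is what makes the mixed loss less sensitive to the small positional errors that dominate tiny-object regression.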








Data availability
The datasets used in this article are all publicly available; detailed URLs can be found at https://cg.cs.tsinghua.edu.cn/traffic-sign/.
Author information
Contributions
Jianhong Mu—original draft; Qinghua Su—review & editing; Xiyu Wang, Wenhui Liang, Sheng Xu, and Kaizheng Wan assisted with some comparative experiments during the revision process. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mu, J., Su, Q., Wang, X. et al. A small object detection architecture with concatenated detection heads and multi-head mixed self-attention mechanism. J Real-Time Image Proc 21, 184 (2024). https://doi.org/10.1007/s11554-024-01562-1