Abstract
Small objects occupy only a limited area of an image, so feature extraction paradigms that are poorly suited to them further erode their already scarce information. In addition, inconsistencies between features at different levels of the FPN can lead to suboptimal feature fusion, which hinders the accurate representation of multi-scale features. As a result, even high-performance detectors struggle to recognize small objects effectively. To address these issues, we propose MLSA-YOLO, a small object detection algorithm based on multi-level feature fusion and scale adaptation. First, we restructured the network architecture using SPD-Conv together with the proposed Convolutional Space-to-Depth (CSPD) module to improve the network's capacity for capturing local spatial details and to preserve information during downsampling. Second, to address the challenges in feature fusion, we employed a three-layer PAFPN structure in the neck and combined it with the proposed Multi-Level Feature Fusion and Scale-Adaptive (MLSA) feature pyramid network. This design enhances the complementarity of multi-level information while filtering out the conflicting information generated during fusion. Third, to improve the quality of feature extraction, we incorporated the proposed DCN_C2f module into the neck network. This module captures foreground object features accurately and enhances the network's adaptability to geometric deformations of objects. Experimental results show that our approach outperforms other state-of-the-art detection algorithms on the VisDrone2019, DOTA, and FocusTiny datasets. Compared with YOLOv8s, mAP50 improved by 9.5%, 3.4%, and 5.1%, respectively.
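To illustrate why a space-to-depth rearrangement can preserve information during downsampling, the following is a minimal sketch of the generic operation that SPD-Conv builds on. It is not the CSPD module proposed in this work, whose internals are not described in the abstract. The function name `space_to_depth`, the nested-list layout `[channel][row][column]`, and the default `scale` of 2 are assumptions made for illustration; a practical implementation would operate on tensors in a deep learning framework.

```python
def space_to_depth(x, scale=2):
    """Rearrange a [C][H][W] feature map into [C * scale**2][H // scale][W // scale].

    Every input value is kept: spatial resolution is traded for channels,
    unlike strided convolution or pooling, which discard or merge pixels.
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    if height % scale or width % scale:
        raise ValueError("height and width must be divisible by scale")

    out = []
    # Each (dy, dx) offset selects one interleaved sub-grid of the input.
    for dy in range(scale):
        for dx in range(scale):
            for ch in range(channels):
                sub_grid = [
                    [x[ch][i][j] for j in range(dx, width, scale)]
                    for i in range(dy, height, scale)
                ]
                out.append(sub_grid)
    return out


# Hypothetical 1-channel, 4x4 feature map holding the values 0..15.
feature = [[[r * 4 + c for c in range(4)] for r in range(4)]]
rearranged = space_to_depth(feature)
print((len(rearranged), len(rearranged[0]), len(rearranged[0][0])))  # (4, 2, 2)
```

In this sketch the 4x4 map becomes four 2x2 maps that together contain all sixteen original values. A non-strided convolution applied afterwards can then learn how to combine them, which is the general idea behind replacing strided downsampling with space-to-depth.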









Code availability
The code for this study is available from the corresponding author upon reasonable request.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 52275003 and in part by the National Key Research and Development Program of China under Grant 2023YFB4704000.
Author information
Contributions
JP was involved in conceptualization, data curation, methodology, software, validation, visualization, writing of the original draft, and review and editing. KL was involved in supervision, validation, resources, funding acquisition, and review and editing. GW was involved in visualization and software. WX was involved in supervision. TR was involved in visualization and software. LY was involved in supervision, resources, funding acquisition, and review and editing.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Peng, J., Lv, K., Wang, G. et al. MLSA-YOLO: a multi-level feature fusion and scale-adaptive framework for small object detection. J Supercomput 81, 528 (2025). https://doi.org/10.1007/s11227-025-06961-0
DOI: https://doi.org/10.1007/s11227-025-06961-0