Abstract
Object detection in unmanned aerial vehicle (UAV) images has become an important research area in computer vision due to its unique value and challenges. UAV images are characterized by densely distributed small targets, large variations in target scale, and background noise, all of which reduce the accuracy and reliability of detection. To address these issues, we propose a small target detection network, called CSFCANet, based on Enhanced Scale Sequence Fusion and a channel-spatial fusion cross-attention mechanism. To handle the high proportion of small targets and the scale variation in UAV images, we employ Enhanced Scale Sequence Fusion, which integrates fine-grained information from shallow feature maps with semantic information from deep feature maps. Additionally, we incorporate a tiny target detection head to enhance the network's ability to extract fine-grained features for small targets. To address background noise, we propose a channel-spatial fusion cross-attention mechanism, which first computes attention on local patch block feature maps and then computes attention on global patch blocks. This captures both long-range dependencies and detailed information. The attention computation combines spatial description information and channel description information. Extensive experiments were conducted to validate the effectiveness of the model on the VisDrone benchmark dataset, the UAVDT dataset, and our self-built UAV power inspection dataset, PIDrone. Compared with the YOLOv8s model, CSFCANet improved mAP by 7% on PIDrone, 2.4% on VisDrone, and 3.6% on UAVDT.
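The local-then-global ordering described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the sketch swaps in a simple sigmoid gate in place of the paper's cross-attention, and the function names `channel_spatial_gate` and `local_then_global_attention`, the non-overlapping patch layout, and the mean-pooled descriptors are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_spatial_gate(x):
    """Gate in (0, 1) for a feature map x of shape (C, H, W).

    Assumption: the channel descriptor is a per-channel mean and the
    spatial descriptor is a per-position mean over channels; the two
    are summed by broadcasting and squashed with a sigmoid.
    """
    channel_desc = x.mean(axis=(1, 2), keepdims=True)  # (C, 1, 1)
    spatial_desc = x.mean(axis=0, keepdims=True)       # (1, H, W)
    return sigmoid(channel_desc + spatial_desc)        # (C, H, W)


def local_then_global_attention(feat, patch):
    """Apply the gate inside each local patch, then across patches."""
    feat = np.asarray(feat, dtype=np.float64)
    c, h, w = feat.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")

    # Stage 1 (local): gate each non-overlapping patch independently.
    out = np.empty_like(feat)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            block = feat[:, i:i + patch, j:j + patch]
            out[:, i:i + patch, j:j + patch] = block * channel_spatial_gate(block)

    # Stage 2 (global): summarise each patch as one token, gate over the
    # grid of tokens, and broadcast that gate back to full resolution.
    gh, gw = h // patch, w // patch
    tokens = out.reshape(c, gh, patch, gw, patch).mean(axis=(2, 4))  # (C, gh, gw)
    gate = channel_spatial_gate(tokens)
    gate = np.repeat(np.repeat(gate, patch, axis=1), patch, axis=2)  # (C, H, W)
    return out * gate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(3, 8, 8))
    y = local_then_global_attention(x, patch=4)
    print(y.shape)
```

In this sketch the local stage only sees statistics inside one patch, while the global stage sees one summary per patch, which is one plausible reading of how detail and long-range context could be combined.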







Data Availability
No datasets were generated or analysed during the current study.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Nos. 62072002, 62172004, and 62273001), the Anhui Province Collaborative Innovation Project (Nos. GXXT-2022-050 and GXXT-2022-053), the Outstanding Research and Innovation Team Project of Anhui Province (2022AH010005), and the Special Fund for Anhui Agriculture Research System (2021-2025).
Author information
Contributions
JLL and PC conceived the study; JLL, CHZ and BW participated in the methodology design; JLL and PC carried it out and drafted the manuscript. All authors revised the manuscript critically. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, J., Zheng, C., Chen, P. et al. Small object detection in UAV imagery based on channel-spatial fusion cross attention. SIViP 19, 302 (2025). https://doi.org/10.1007/s11760-025-03850-0
DOI: https://doi.org/10.1007/s11760-025-03850-0