Abstract
The scale imbalance between the backbone and the neck is a major cause of the inferior accuracy of general object detectors on small objects. A general object detector usually pairs a complex backbone with a lightweight neck: the complex backbone consumes large computational resources, while the lightweight neck struggles to fuse deep semantic information with shallow spatial information. As a result, the general detector suffers from severe scale imbalance when detecting small objects. Motivated by these observations, this paper proposes a novel detector named IUDet, which combines a lightweight backbone with a complex neck. A novel sampling strategy, named pixel-spanning merge (PSM), is introduced in the lightweight backbone to save computational cost; at the same time, it transfers features from the scale dimension to the spatial dimension, thus enhancing information interaction. Moreover, the neck is designed with element-wise summation of multi-scale features and an inverted U-shaped skip connection to improve the feature representation of small objects. Experimental results show that IUDet outperforms the most popular detectors on MS COCO 2017 and VisDrone DET2019 for small object detection.
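The abstract names two mechanisms without detailing them, so the following is only a minimal NumPy sketch under stated assumptions: `pixel_spanning_merge` assumes PSM behaves like a space-to-depth sampling (every span-th pixel of each sampling phase stacked along the channel axis), and `fuse_pyramid` assumes the neck's element-wise summation works like a generic top-down pyramid sum. Both function names are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def pixel_spanning_merge(x, span=2):
    # Hypothetical sketch of a PSM-like operation (space-to-depth style).
    # x: feature map of shape (C, H, W), with H and W divisible by `span`.
    # Sample every `span`-th pixel in each of the span*span phases and stack
    # the phases along the channel axis: (C, H, W) -> (C*span*span, H/span, W/span).
    phases = [x[:, i::span, j::span] for i in range(span) for j in range(span)]
    return np.concatenate(phases, axis=0)

def upsample2x(x):
    # Nearest-neighbour 2x upsampling: (C, H, W) -> (C, 2H, 2W).
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse_pyramid(features):
    # Generic top-down element-wise sum over a feature pyramid.
    # features: list of maps ordered shallow (high resolution) to deep
    # (low resolution), each half the resolution of the previous one,
    # with equal channel counts. Walk from the deepest level upward,
    # summing each upsampled deeper map into the shallower one.
    fused = features[-1]
    outs = [fused]
    for f in reversed(features[:-1]):
        fused = f + upsample2x(fused)
        outs.append(fused)
    return outs[::-1]  # shallow-to-deep order
```

For example, `pixel_spanning_merge` turns a (3, 4, 4) map into a (12, 2, 2) map without discarding any values, which is how such a sampling can downsample spatially while preserving information in the channel dimension.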
Data availability
The datasets generated and/or analyzed during the current study are not publicly available, because the project is ongoing and the data are needed for subsequent work, but they are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported by the National Key Research and Development Program (No. 2020YFC1523301), the Xi'an Major Scientific and Technological Achievements Transformation and Industrialization Project (No. 20GXSF0005), the National Natural Science Foundation of China (No. 62106199), and the Graduate Innovation Program of Northwestern University (No. CX2023185). The authors also thank the researchers who provided valuable comments and assistance during the writing and review of this paper.
Author information
Authors and Affiliations
Contributions
EC was involved in conceptualization, methodology, data curation, writing—original draft. CL helped in supervision, writing review—editing. HX contributed to investigation, writing review—editing. WZ helped in conceptualization, investigation.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Informed consent
Written informed consent for publication of this paper was obtained from Northwest University and all authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chai, E., Chen, L., Hao, X. et al. Mitigate the scale imbalance via multi-scale information interaction in small object detection. Neural Comput & Applic 36, 1699–1712 (2024). https://doi.org/10.1007/s00521-023-09122-7