Abstract
The Single Shot MultiBox Detector (SSD) performs well in object detection by using multiscale feature maps. However, SSD exhibits low accuracy on small objects. In this paper, a Recursive Attention-Enhanced Bidirectional Feature Pyramid Network (RA-BiFPN) is proposed. First, we design an attention-enhanced bidirectional feature pyramid network (A-BiFPN) to improve detection accuracy for small objects. The A-BiFPN is composed of a bidirectional feature pyramid network (BiFPN) and coordinate attention. The BiFPN employs top-down and bottom-up paths to aggregate features at different scales, so that features at all scales contain rich semantic and detailed information. These features support the coordinate attention, which embeds positional information into channel attention so that the network can more easily focus on the channels and locations in the feature map that are related to the object. Second, to enhance the ability of the A-BiFPN to characterize small targets, we adopt a recursive structure that feeds the output features of the A-BiFPN back into the backbone network. In this way, features pass through the bottom-up backbone repeatedly, enriching the representation power of the A-BiFPN. Experimental results show that, compared with the original SSD, the detection accuracy of our method on the PASCAL VOC, NWPU VHR-10, KITTI, and RSOD datasets improves by 2.65%, 7.98%, 7.02%, and 5.63%, respectively.
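The idea of embedding positional information into channel attention can be illustrated with a minimal, parameter-free sketch. It is not the A-BiFPN implementation: it assumes a feature map stored as nested lists of shape [C][H][W], and it omits the learned 1×1 convolutions and normalization that the published coordinate attention module applies to the pooled descriptors. The function names `coordinate_gates` and `apply_coordinate_attention` are hypothetical and chosen only for this illustration.

```python
import math


def sigmoid(v):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_gates(x):
    """Compute per-row and per-column gates for each channel.

    x is a feature map given as nested lists of shape [C][H][W].
    Returns (gate_h, gate_w) with shapes [C][H] and [C][W].
    Simplification (assumption): the pooled descriptors go straight
    into a sigmoid, with no learned transformation in between.
    """
    gate_h, gate_w = [], []
    for channel in x:
        height, width = len(channel), len(channel[0])
        # Average over the width: one descriptor per row,
        # so the vertical position is preserved.
        row_means = [sum(row) / width for row in channel]
        # Average over the height: one descriptor per column,
        # so the horizontal position is preserved.
        col_means = [
            sum(channel[i][j] for i in range(height)) / height
            for j in range(width)
        ]
        gate_h.append([sigmoid(m) for m in row_means])
        gate_w.append([sigmoid(m) for m in col_means])
    return gate_h, gate_w


def apply_coordinate_attention(x):
    """Reweight every element by the gate of its row and of its column."""
    gate_h, gate_w = coordinate_gates(x)
    return [
        [
            [x[c][i][j] * gate_h[c][i] * gate_w[c][j]
             for j in range(len(x[c][0]))]
            for i in range(len(x[c]))
        ]
        for c in range(len(x))
    ]


# Toy single-channel 2x2 map with one strong response at the bottom right.
feature_map = [[[0.0, 0.0], [0.0, 4.0]]]
attended = apply_coordinate_attention(feature_map)
```

Because each gate depends on a single row or a single column, the reweighting retains where along each axis a response occurred, which a single global pooling per channel would discard.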
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grants 61873246, 62072416, 62006213, and 62102373; the Program for Science & Technology Innovation Talents in Universities of Henan Province (21HASTIT028); the Natural Science Foundation of Henan (202300410495); and the Key Scientific Research Projects of Colleges and Universities in Henan Province (21A120010).
Ethics declarations
Conflict of Interests
We declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Zhang, H., Du, Q., Qi, Q. et al. A recursive attention-enhanced bidirectional feature pyramid network for small object detection. Multimed Tools Appl 82, 13999–14018 (2023). https://doi.org/10.1007/s11042-022-13951-4
DOI: https://doi.org/10.1007/s11042-022-13951-4