
A recursive attention-enhanced bidirectional feature pyramid network for small object detection

Multimedia Tools and Applications

Abstract

The Single Shot MultiBox Detector (SSD) performs well in object detection by using multiscale feature maps, but its accuracy on small objects is low. In this paper, a Recursive Attention-Enhanced Bidirectional Feature Pyramid Network (RA-BiFPN) is proposed. First, we design an attention-enhanced bidirectional feature pyramid network (A-BiFPN) to improve detection accuracy for small objects. The A-BiFPN is composed of a bidirectional feature pyramid network (BiFPN) and coordinate attention. The BiFPN employs top-down and bottom-up paths to aggregate features at different scales, so that features at all scales contain rich semantic and detailed information. These features support the coordinate attention, which embeds positional information into channel attention so that the network can easily focus on the channels and locations in the feature map that are related to the object. Second, to enhance the ability of the A-BiFPN to characterize small targets, we adopt a recursive structure that feeds the output features of the A-BiFPN back into the backbone network. In this way, the features pass through the bottom-up backbone repeatedly, which enriches the representation power of the A-BiFPN. Experimental results show that, compared with the original SSD, our method improves detection accuracy on the PASCAL VOC, NWPU VHR-10, KITTI and RSOD datasets by 2.65%, 7.98%, 7.02% and 5.63%, respectively.
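The three ideas in the abstract (bidirectional fusion, coordinate attention, and recursive feedback into the backbone) can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the function names (`toy_backbone`, `bidirectional_fuse`, `coordinate_attention`, `ra_bifpn`), the random weights, the three-level pyramid, the unweighted sums used for fusion, and the ReLU without normalization inside the attention are all hypothetical simplifications chosen for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def resize_nearest(x, size):
    """Nearest-neighbour resize of a (C, H, W) map to (C, size, size)."""
    idx_h = np.arange(size) * x.shape[1] // size
    idx_w = np.arange(size) * x.shape[2] // size
    return x[:, idx_h][:, :, idx_w]

def bidirectional_fuse(feats):
    """Fuse a fine-to-coarse list of (C, H, W) maps in both directions.

    Assumption: plain sums stand in for whatever fusion the paper uses.
    """
    td = list(feats)
    for i in range(len(feats) - 2, -1, -1):      # top-down: coarse -> fine
        td[i] = feats[i] + resize_nearest(td[i + 1], feats[i].shape[1])
    out = list(td)
    for i in range(1, len(feats)):               # bottom-up: fine -> coarse
        out[i] = td[i] + resize_nearest(out[i - 1], feats[i].shape[1])
    return out

def coordinate_attention(x, w_shared, w_h, w_w):
    """Simplified coordinate attention on a (C, H, W) map.

    w_shared: (M, C); w_h, w_w: (C, M). The matrix products play the
    role of 1x1 convolutions (hypothetical random weights in this sketch).
    """
    h = x.shape[1]
    pooled_h = x.mean(axis=2)                    # (C, H): pool along width
    pooled_w = x.mean(axis=1)                    # (C, W): pool along height
    y = np.concatenate([pooled_h, pooled_w], axis=1)
    y = np.maximum(w_shared @ y, 0.0)            # shared transform + ReLU
    gate_h = sigmoid(w_h @ y[:, :h])             # (C, H) row-wise gate
    gate_w = sigmoid(w_w @ y[:, h:])             # (C, W) column-wise gate
    return x * gate_h[:, :, None] * gate_w[:, None, :]

def toy_backbone(image, feedback=None, sizes=(8, 4, 2)):
    """Stand-in backbone: a pyramid built by resizing, plus optional feedback."""
    feats = [resize_nearest(image, s) for s in sizes]
    if feedback is not None:
        feats = [f + fb for f, fb in zip(feats, feedback)]
    return feats

def ra_bifpn(image, params, steps=2):
    """Run backbone -> fusion -> attention, feeding the output back each step."""
    feedback = None
    for _ in range(steps):
        fused = bidirectional_fuse(toy_backbone(image, feedback))
        feedback = [coordinate_attention(f, *params) for f in fused]
    return feedback

rng = np.random.default_rng(0)
C, M = 4, 2
params = (rng.normal(size=(M, C)), rng.normal(size=(C, M)), rng.normal(size=(C, M)))
image = rng.normal(size=(C, 8, 8))
outputs = ra_bifpn(image, params)
print([o.shape for o in outputs])
```

Because both gates lie strictly between 0 and 1, the attention step can only scale each activation down, by a factor that depends on its channel, row and column. That dependence on row and column is the sense in which positional information is embedded into channel attention.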




Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grants 61873246, 62072416, 62006213 and 62102373, the Program for Science & Technology Innovation Talents in Universities of Henan Province (21HASTIT028), the Natural Science Foundation of Henan (202300410495), and the Key Scientific Research Projects of Colleges and Universities in Henan Province (21A120010).

Author information

Corresponding author

Correspondence to Huanlong Zhang.

Ethics declarations

Conflict of Interests

We declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, H., Du, Q., Qi, Q. et al. A recursive attention-enhanced bidirectional feature pyramid network for small object detection. Multimed Tools Appl 82, 13999–14018 (2023). https://doi.org/10.1007/s11042-022-13951-4

  • DOI: https://doi.org/10.1007/s11042-022-13951-4
