
DAF-Net: dense attention feature pyramid network for multiscale object detection

  • Regular Paper
  • Published: 2024
International Journal of Multimedia Information Retrieval

Abstract

In recent years, object detection has become one of the most prominent components of computer vision. State-of-the-art object detectors now employ convolutional neural network (CNN) techniques alongside other deep neural network techniques to improve detection performance and accuracy. Most recent object detectors employ the feature pyramid network (FPN) and its variants, while others use combinations of attention mechanisms to achieve better performance. An open question is the inconsistency between lower-layer and upper-layer features, in terms of resolution, receptive field, and semantic information, when detecting objects. Although some researchers have attempted to address this issue, we exploit ideas from this line of work and propose a more capable architecture called the dense attention feature pyramid network (DAF-Net) for multiscale object detection. DAF-Net consists of two attention models: a spatial attention model and a channel attention model. Unlike other attention models, ours are lightweight and fully data-driven, and we implement a densely connected attention FPN to reduce the model's complexity and to avoid learning redundant feature maps. We first develop the two attention models, then use only the spatial attention model in the backbone of our network, and finally use both attention models in the FPN to filter and maintain a steady flow of semantic information from the lower layers, improving the model's accuracy and efficiency. Experimental results on underwater images from the National Natural Science Foundation of China (NSFC) (Underwater Image Dataset, retrieved from http://www.cnurpc.org/index.html), the MS COCO dataset, and the PASCAL VOC dataset show higher accuracy and better detection results for the proposed model compared to the benchmark model YOLOX-Darknet53 (Ge et al., YOLOX: Exceeding YOLO series in 2021, arXiv preprint arXiv:2107.08430). Our model achieved 70.2 mAP, 48.9 mAP, and 83.9 mAP on the NSFC, MS COCO, and PASCAL VOC datasets, respectively, compared with the benchmark model's 68.9 mAP on NSFC, 47.7 mAP on MS COCO, and 82.4 mAP on PASCAL VOC.
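To make the architecture described above concrete, the following is a minimal PyTorch sketch of a lightweight channel attention module and spatial attention module in the spirit of the abstract (and of the CBAM-style attention the paper cites [32]). The module names, reduction ratio, and kernel size are illustrative assumptions, not the authors' exact DAF-Net implementation.

# Minimal PyTorch sketch (an assumption-based illustration, not the
# authors' released code) of lightweight channel and spatial attention.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Pool global context per channel, then reweight the channels."""

    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio is assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        weights = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * weights                   # channel-wise reweighting


class SpatialAttention(nn.Module):
    """Pool across channels, then learn a per-pixel attention map."""

    def __init__(self, kernel_size: int = 7):  # kernel size is assumed
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)          # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)         # (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                            # pixel-wise reweighting


if __name__ == "__main__":
    feat = torch.randn(2, 256, 32, 32)     # a stand-in backbone feature map
    feat = SpatialAttention()(feat)        # backbone stage: spatial attention only
    feat = ChannelAttention(256)(feat)     # FPN stage: channel attention as well
    print(feat.shape)                      # torch.Size([2, 256, 32, 32])

In the arrangement the abstract describes, the spatial module alone would sit in the backbone, while both modules would gate the densely connected FPN features; the toy usage at the bottom mirrors that split.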


Data availability

The datasets that support the findings of this research are openly available from the National Natural Science Foundation of China (NSFC) at http://www.cnurpc.org/index.html (reference [50]), from COCO: Common Objects in Context at https://cocodataset.org/ (reference [44]), and from the Visual Object Classes Challenge 2012 (VOC2012) at http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html (reference [45]). They are also available upon request; interested parties may contact Divine Njengwie Achinek at achinekdivine002@dlmu.edu.cn to obtain access.

References

  1. Li Y, Zhou S, Chen H (2022) Attention-based fusion factor in FPN for object detection. Appl Intell 1–10.

  2. Cao J, Chen Q, Guo J, Shi R (2020) Attention-guided context feature pyramid network for object detection. arXiv preprint arXiv:2005.11475

  3. Feichtenhofer C, Pinz A, Zisserman A (2017) Detect to track and track to detect. In: Proceedings of the IEEE International Conference on Computer Vision (pp. 3038–3046).

  4. Yuan Y, Xiong Z, Wang Q (2017) An incremental framework for video-based traffic sign detection, tracking, and recognition. IEEE Trans Intell Transp Syst 18(7):1918–1929

  5. Chen K, Tao W (2018) Learning linear regression via single convolutional layer for visual object tracking. IEEE Trans Multimedia 21(1):86–97

  6. Hu H, Ma B, Shen J, Sun H, Shao L, Porikli F (2018) Robust object tracking using manifold regularized convolutional neural networks. IEEE Trans Multimedia 21(2):510–521

  7. Jiang H, Learned-Miller E (2017) Face detection with the Faster R-CNN. In: IEEE International Conference on Automatic Face & Gesture Recognition (pp. 650–657).

  8. Yang S, Luo P, Loy C-C, Tang X (2016) WIDER FACE: A face detection benchmark. In: IEEE Conference on Computer Vision and Pattern Recognition (pp. 5525–5533).

  9. Yang S, Luo P, Loy C-C, Tang X (2018) Faceness-Net: Face detection through deep facial part responses. IEEE Trans Pattern Anal Mach Intell 40(8):1845–1859

  10. Njengwie Achinek D, Shehi Shehu I, Mohamed Athuman A, Fu X (2021) Enhanced single shot multiple detection for real-time object detection in multiple scenes. In: The 5th International Conference on Computer Science and Application Engineering (pp. 1–9).

  11. Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2117–2125).

  12. Ge Z, Liu S, Wang F, Li Z, Sun J (2021) Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430.

  13. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779–788).

  14. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) SSD: Single shot multibox detector. In: European Conference on Computer Vision (pp. 21–37). Springer, Cham.

  15. Wang CY, Yeh IH, Liao HYM (2021) You only learn one representation: Unified network for multiple tasks. arXiv preprint arXiv:2105.04206.

  16. Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. Adv Neural Inf Proc Syst 28.

  17. He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).

  18. Pramanik A, Pal SK, Maiti J, Mitra P (2021) Granulated RCNN and multi-class deep sort for multi-object detection and tracking. IEEE Trans Emerg Top Comput Intell 6(1):171–181

  19. Liu S, Qi L, Qin H, Shi J, Jia J (2018) Path aggregation network for instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8759–8768).

  20. Kim SW, Kook HK, Sun JY, Kang MC, Ko SJ (2018) Parallel feature pyramid network for object detection. In: Proceedings of the European Conference on Computer Vision (ECCV) (pp. 234–250).

  21. Guo C, Fan B, Zhang Q, Xiang S, Pan C (2020) Augfpn: Improving multi-scale feature learning for object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12595–12604).

  22. Kong T, Sun F, Tan C, Liu H, Huang W (2018) Deep feature pyramid reconfiguration for object detection. In: Proceedings of the European conference on computer vision (ECCV) (pp. 169–185).

  23. Nie J, Anwer RM, Cholakkal H, Khan FS, Pang Y, Shao L (2019) Enriched feature guided refinement network for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision (pp. 9537–9546).

  24. Ghiasi G, Lin TY, Le QV (2019) Nas-fpn: Learning scalable feature pyramid architecture for object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7036–7045).

  25. Singh B, Davis LS (2018) An analysis of scale invariance in object detection snip. In: Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3578–3587).

  26. Wu Y, Jiang X, Fang Z, Gao Y, Fujita H (2021) Multi-modal 3d object detection by 2d-guided precision anchor proposal and multi-layer fusion. Appl Soft Comput 108:107405

  27. Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132–7141).

  28. Zhu Y, Zhao C, Guo H, Wang J, Zhao X, Lu H (2018) Attention CoupleNet: Fully convolutional attention coupling network for object detection. IEEE Trans Image Process 28(1):113–126

  29. Pirinen A, Sminchisescu C (2018) Deep reinforcement learning of region proposal networks for object detection. In: proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6945–6954).

  30. Sukhbaatar S, Grave E, Lample G, Jegou H, Joulin A (2019) Augmenting self-attention with persistent memory. arXiv preprint arXiv:1907.01470.

  31. Srinivas A, Lin TY, Parmar N, Shlens J, Abbeel P, Vaswani A (2021) Bottleneck transformers for visual recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16519–16529).

  32. Woo S, Park J, Lee JY, Kweon IS (2018) Cbam: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV) (pp. 3–19).

  33. Li W, Liu K, Zhang L, Cheng F (2020) Object detection based on an adaptive attention mechanism. Sci Rep 10(1):1–13

  34. Cao J, Chen Q, Guo J, Shi R (2020) Attention-guided context feature pyramid network for object detection. arXiv preprint arXiv:2005.11475.

  35. Goyal P, Dollár P, Girshick R, Noordhuis P, Wesolowski L, Kyrola A, He K (2017) Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677.

  36. Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV) (pp. 801–818).

  37. Birodkar V, Lu Z, Li S, Rathod V, Huang J (2021) The surprising impact of mask-head architecture on novel class segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7015–7025).

  38. Cheng B, Misra I, Schwing AG, Kirillov A, Girdhar R (2022) Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1290–1299).

  39. Han C, Zhao Q, Zhang S, Chen Y, Zhang Z, Yuan J (2022) YOLOPv2: Better, Faster, Stronger for Panoptic Driving Perception. arXiv preprint arXiv:2208.11434.

  40. Shafiee MJ, Chywl B, Li F, Wong A (2017) Fast YOLO: A fast you only look once system for real-time embedded object detection in video. arXiv preprint arXiv:1709.05943.

  41. Wang CY, Bochkovskiy A, Liao HYM (2022) YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696.

  42. Zheng W, Tang W, Jiang L, Fu CW (2021) SE-SSD: Self-ensembling single-stage object detector from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14494–14503).

  43. He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37(9):1904–1916

  44. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Zitnick CL (2014) Microsoft coco: Common objects in context. In: European conference on computer vision (pp. 740–755). Springer, Cham.

  45. Everingham M, Eslami SM, Van Gool L, Williams CK, Winn J, Zisserman A (2015) The pascal visual object classes challenge: a retrospective. Int J Comput Vision 111(1):98–136

  46. Liu S, Huang D, Wang Y (2019) Learning spatial fusion for single-shot object detection. arXiv preprint arXiv:1911.09516.

  47. Tan M, Pang R, Le QV (2020) Efficientdet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10781–10790).

  48. Huang X, Wang X, Lv W, Bai X, Long X, Deng K, Yoshie O (2021) PP-YOLOv2: A practical object detector. arXiv preprint arXiv:2104.10419.

  49. Jocher G (2021) Ultralytics YOLOv5, GitHub discussion #3181. https://github.com/ultralytics/yolov5/discussions/3181

  50. Underwater Image Dataset, National Natural Science Foundation of China (NSFC). Online, retrieved from http://www.cnurpc.org/index.html.

  51. Priyadarshni D, Kolekar MH (2020) Underwater object detection and tracking. In: Soft Computing: Theories and Applications (pp. 837–846). Springer, Singapore.

  52. Han F, Yao J, Zhu H, Wang C (2020) Underwater image processing and object detection based on deep CNN method. J Sensors

  53. Castrillón M, Déniz O, Hernández D, Lorenzo J (2011) A comparison of face and facial feature detectors based on the Viola-Jones general object detection framework. Mach Vis Appl 22(3):481–494

  54. Wang CY, Bochkovskiy A, Liao HYM (2023) YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7464–7475).

  55. Glenn J (2023) Ultralytics YOLOv8. https://github.com/ultralytics/ultralytics.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 62176037 and 61802043, by the Liaoning Revitalization Talents Program under Grant XLYC1908007, by the Liaoning Key Research and Development Program under Grant 201801728, by the Fundamental Research Funds for the Central Universities under Grants 3132016352 and 3132020215, and by the Dalian Science and Technology Innovation Fund under Grants 2018J12GX037 and 2019J11CY001.

Author information

Authors and Affiliations

Authors

Contributions

Divine Njengwie Achinek and Xianping Fu conceived the presented ideas, developed the theory, and performed the computations. Divine Njengwie Achinek developed and designed the methodology. Divine Njengwie Achinek, Ibrahim Shehi Shehu, and Mohamed Athuman carried out the experiments and validation with the help of Xianping Fu. Divine Njengwie Achinek and Ibrahim Shehi Shehu wrote the main manuscript with the help of Mohamed Athuman. Divine Njengwie Achinek prepared the figures and diagrams, and all authors reviewed the manuscript.

Corresponding author

Correspondence to Divine Njengwie Achinek.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest with regard to the content and publication of the findings documented in this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Achinek, D.N., Shehu, I.S., Athuman, A.M. et al. DAF-Net: dense attention feature pyramid network for multiscale object detection. Int J Multimed Info Retr 13, 18 (2024). https://doi.org/10.1007/s13735-024-00323-x

