Fine grained dual level attention mechanisms with spacial context information fusion for object detection

  • Theoretical Advances
  • Published in Pattern Analysis and Applications

Abstract

For a C×W×H feature map in an object detection task, information fusion usually relies on an attention mechanism: all C channels and the entire W×H space are each compressed via average/max pooling, and attention weight masks are then obtained through correlation calculations. This coarse-grained global operation ignores the differences among individual channels and among diverse spatial regions, resulting in inaccurate attention weights. In addition, mining the contextual information within the W×H space remains a challenge for object recognition and localization. To this end, we propose a Fine-Grained Dual Level Attention Mechanism joint Spacial Context Information Fusion module for object detection (FGDLAM&SCIF). It is a cascaded structure: first, we subdivide the feature space W×H into n subspaces (optimized as n = 4 in our experiments) and construct a global adaptive pooling and one-dimensional convolution algorithm to extract the feature channel weights on each subspace. Second, the C feature channels are divided into n (n = 4) sub-channels, and a multi-scale module is constructed in the W×H feature space to mine context information. Finally, row and column coding is used to fuse the two orthogonally, yielding enhanced features. The module is embeddable and can be transplanted into any object detection network, such as YOLOv4/v5, PP-YOLOE, and YOLOX, as well as backbones such as MobileNet and ResNet. Experiments on the MS COCO 2017 and Pascal VOC 2007 datasets verify its effectiveness and portability.
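The abstract outlines a three-stage cascade: per-subspace channel attention via global pooling plus one-dimensional convolution, multi-scale spatial context mining over channel groups, and orthogonal row/column fusion. The PyTorch sketch below illustrates one plausible reading of that pipeline; the 2×2 spatial grid for the n = 4 subspaces, the ECA-style 1-D convolution, the depthwise multi-scale kernels, and the coordinate-attention-style row/column step are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of the FGDLAM&SCIF pipeline as described in the abstract.
# Grid size, kernel sizes, and module boundaries are assumptions.
import torch
import torch.nn as nn


class FineGrainedChannelAttention(nn.Module):
    """Fine-grained channel attention: the W×H plane is split into a 2×2
    grid (n = 4 subspaces); each subspace is globally average-pooled and
    passed through a shared 1-D convolution (ECA-style) to produce its
    own channel weight vector."""

    def __init__(self, grid: int = 2, k: int = 3):
        super().__init__()
        self.grid = grid
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s = self.grid
        out = torch.empty_like(x)
        for i in range(s):
            for j in range(s):
                rs, re = i * h // s, (i + 1) * h // s
                cs, ce = j * w // s, (j + 1) * w // s
                sub = x[:, :, rs:re, cs:ce]
                z = sub.mean(dim=(2, 3))                                # (B, C) pooled descriptor
                wgt = self.conv1d(z.unsqueeze(1)).squeeze(1).sigmoid()  # (B, C) channel weights
                out[:, :, rs:re, cs:ce] = sub * wgt.view(b, c, 1, 1)
        return out


class MultiScaleSpatialContext(nn.Module):
    """Spatial context mining: the C channels are split into n = 4 groups,
    and each group passes through a depthwise conv with a different
    kernel size, giving multi-scale context over W×H."""

    def __init__(self, channels: int, kernels=(1, 3, 5, 7)):
        super().__init__()
        assert channels % len(kernels) == 0
        g = channels // len(kernels)
        self.branches = nn.ModuleList(
            nn.Conv2d(g, g, k, padding=k // 2, groups=g) for k in kernels
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = torch.chunk(x, len(self.branches), dim=1)
        return torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)


class RowColumnFusion(nn.Module):
    """Orthogonal row/column coding: pool along W and along H, re-encode
    each profile with a 1×1 conv, and re-weight the map by both factors."""

    def __init__(self, channels: int):
        super().__init__()
        self.row = nn.Conv2d(channels, channels, 1)
        self.col = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        row = self.row(x.mean(dim=3, keepdim=True)).sigmoid()  # (B, C, H, 1)
        col = self.col(x.mean(dim=2, keepdim=True)).sigmoid()  # (B, C, 1, W)
        return x * row * col


# Usage on a hypothetical 256-channel, 40×40 detector feature map:
feat = torch.randn(1, 256, 40, 40)
feat = FineGrainedChannelAttention()(feat)
feat = MultiScaleSpatialContext(256)(feat)
feat = RowColumnFusion(256)(feat)
print(feat.shape)  # torch.Size([1, 256, 40, 40])
```

Because each stage preserves the C×W×H shape, a cascade like this can in principle be dropped between any backbone and detection head, which matches the abstract's claim of portability across YOLO-family detectors and backbones.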


Data availability

Data will be made available upon reasonable request.


Funding

The authors have no financial or proprietary interests in any material discussed in this article.

Author information

Contributions

All authors agree to the submission of the manuscript with the author list as it appears on the title page.

Corresponding author

Correspondence to Chengwei Li.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Deng, H., Wang, C., Li, C. et al. Fine grained dual level attention mechanisms with spacial context information fusion for object detection. Pattern Anal Applic 27, 75 (2024). https://doi.org/10.1007/s10044-024-01290-z
