Abstract
In object detection, information fusion over a C×W×H feature map usually relies on an attention mechanism: all C channels and the entire W×H space are each compressed via average/max pooling, and attention weight masks are then computed from their correlations. This coarse-grained global operation ignores the differences among channels and among spatial regions, resulting in inaccurate attention weights. In addition, mining the contextual information in the W×H space remains a challenge for object recognition and localization. To this end, we propose a Fine-Grained Dual-Level Attention Mechanism joint Spatial Context Information Fusion module for object detection (FGDLAM&SCIF). It is a cascaded structure. First, we subdivide the feature space W×H into n subspaces (n = 4 is optimal in our experiments) and apply global adaptive pooling followed by a one-dimensional convolution to extract the feature channel weights of each subspace. Second, the C feature channels are divided into n (n = 4) sub-channels, and a multi-scale module is constructed in the W×H space to mine context information. Finally, row and column coding fuses the two orthogonally to obtain enhanced features. The module is embeddable and can be transplanted into any object detection network, such as YOLOv4/v5, PP-YOLOE, and YOLOX, as well as MobileNet and ResNet backbones. Experiments on the MS COCO 2017 and Pascal VOC 2007 datasets verify its effectiveness and portability.
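The channel branch of the first stage can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (a fixed averaging kernel standing in for the learned one-dimensional convolution, and splitting only along the W axis); it is not the authors' implementation, only the general pattern of per-subspace pooling, channel-wise 1-D convolution, and sigmoid reweighting.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def subspace_channel_attention(x, n=4, k=3):
    """Reweight a C x H x W feature map with per-subspace channel attention:
    split W into n spatial subspaces, global-average-pool each to a C-vector,
    smooth it with a 1-D convolution across channels (ECA-style), and use the
    sigmoid of the result as channel weights for that subspace only."""
    kernel = np.ones(k) / k              # placeholder 1-D conv weights (learned in practice)
    out = np.empty_like(x)
    start = 0
    for sub in np.array_split(x, n, axis=2):   # n subspaces along the W axis
        pooled = sub.mean(axis=(1, 2))         # global average pool -> (C,)
        weights = sigmoid(np.convolve(pooled, kernel, mode="same"))
        out[:, :, start:start + sub.shape[2]] = sub * weights[:, None, None]
        start += sub.shape[2]
    return out

feat = np.random.rand(16, 8, 8).astype(np.float32)   # toy C x H x W feature map
enhanced = subspace_channel_attention(feat)
print(enhanced.shape)   # (16, 8, 8)
```

Because each subspace gets its own channel-weight vector, two regions of the same channel can be amplified or suppressed independently, which is the fine-grained behavior the abstract contrasts with single global pooling.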
Data availability
Data will be made available upon reasonable request.
Funding
The authors have no financial or proprietary interests in any material discussed in this article.
Author information
Contributions
All authors agree to the submission of the manuscript with the author list as it appears on the title page.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Deng, H., Wang, C., Li, C. et al. Fine grained dual level attention mechanisms with spacial context information fusion for object detection. Pattern Anal Applic 27, 75 (2024). https://doi.org/10.1007/s10044-024-01290-z