Abstract
In object detection, information fusion over a C×W×H feature map usually relies on an attention mechanism: all C channels and the entire W×H space are each compressed via average/max pooling, and attention weight masks are then computed from their correlations. This coarse-grained global operation ignores the differences among channels and among spatial regions, resulting in inaccurate attention weights. In addition, mining the contextual information in the W×H space remains a challenge for object recognition and localization. To this end, we propose a Fine-Grained Dual-Level Attention Mechanism joint Spatial Context Information Fusion module for object detection (FGDLAM&SCIF). It is a cascaded structure. First, we subdivide the feature space W×H into n subspaces (n = 4 is optimal in our experiments) and apply global adaptive pooling followed by a one-dimensional convolution to extract the feature channel weights of each subspace. Second, the C feature channels are divided into n (n = 4) sub-channels, and a multi-scale module is constructed in the W×H space to mine context information. Finally, row and column coding fuses the two orthogonally to obtain enhanced features. The module is embeddable and can be transplanted into any object detection network, such as YOLOv4/v5, PP-YOLOE, and YOLOX, as well as MobileNet and ResNet backbones. Experiments on the MS COCO 2017 and Pascal VOC 2007 datasets verify its effectiveness and portability.
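The channel branch of the first stage can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (a fixed averaging kernel standing in for the learned one-dimensional convolution, and splitting only along the W axis); it is not the authors' implementation, only the general pattern of per-subspace pooling, channel-wise 1-D convolution, and sigmoid reweighting.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def subspace_channel_attention(x, n=4, k=3):
    """Reweight a C x H x W feature map with per-subspace channel attention:
    split W into n spatial subspaces, global-average-pool each to a C-vector,
    smooth it with a 1-D convolution across channels (ECA-style), and use the
    sigmoid of the result as channel weights for that subspace only."""
    kernel = np.ones(k) / k              # placeholder 1-D conv weights (learned in practice)
    out = np.empty_like(x)
    start = 0
    for sub in np.array_split(x, n, axis=2):   # n subspaces along the W axis
        pooled = sub.mean(axis=(1, 2))         # global average pool -> (C,)
        weights = sigmoid(np.convolve(pooled, kernel, mode="same"))
        out[:, :, start:start + sub.shape[2]] = sub * weights[:, None, None]
        start += sub.shape[2]
    return out

feat = np.random.rand(16, 8, 8).astype(np.float32)   # toy C x H x W feature map
enhanced = subspace_channel_attention(feat)
print(enhanced.shape)   # (16, 8, 8)
```

Because each subspace gets its own channel-weight vector, two regions of the same channel can be amplified or suppressed independently, which is the fine-grained behavior the abstract contrasts with single global pooling.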
Data availability
Data will be made available upon reasonable request.
Funding
The authors have no financial or proprietary interests in any material discussed in this article.
Author information
Contributions
All authors agree to the submission of the manuscript with the author list as it appears on the title page.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Deng, H., Wang, C., Li, C. et al. Fine grained dual level attention mechanisms with spacial context information fusion for object detection. Pattern Anal Applic 27, 75 (2024). https://doi.org/10.1007/s10044-024-01290-z