
A cooperative approach based on self-attention with interactive attribute for image caption

Multimedia Tools and Applications

Abstract

Image captioning is a challenging problem in image understanding, where most models are trained within a framework that combines a deep convolutional neural network with a recurrent neural network. However, the features extracted by the convolutional network capture only the salient regions and fail to cover fine details in the image. Moreover, the vanishing-gradient problem of recurrent neural networks causes earlier information to be lost as the number of time steps grows. In this paper, Cooperative Self-Attention (CSA) is proposed to address these problems. Compared with existing methods, our model enhances the image representation by fusing additional attribute information obtained from object detection. A sub-module named Inter-Attribute, which models the interactions among objects, is proposed to strengthen the context of the entities. Exploiting the advantages of self-attention, and unlike previous methods that predict the next word from only a single prior word and hidden state, our model attends over all of the words generated step by step to handle long-term dependencies. Compared with published state-of-the-art methods, CSA demonstrates outstanding performance.
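To make the description above concrete, the following is a minimal sketch (not the authors' implementation) of the two ideas the abstract mentions: fusing attribute features from object detection with the CNN image features, and predicting the next word by self-attending over all previously generated words rather than over a single recurrent hidden state. The module names, dimensions, and the simple concatenation-based fusion are assumptions made purely for illustration.

```python
# Illustrative sketch only; the real CSA model and its Inter-Attribute
# sub-module are more elaborate than this.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionSelfAttentionSketch(nn.Module):
    def __init__(self, d_model=512, vocab_size=10000):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.attr_proj = nn.Linear(d_model, d_model)   # project attribute embeddings
        self.img_proj = nn.Linear(2048, d_model)       # project CNN region features
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, img_feats, attr_feats, generated_ids):
        # img_feats:  (B, R, 2048) region features from the CNN backbone
        # attr_feats: (B, A, d_model) attribute embeddings from object detection
        # generated_ids: (B, T) all caption words generated so far
        # 1) Fuse image regions with attribute information (simple concatenation here).
        visual = torch.cat([self.img_proj(img_feats), self.attr_proj(attr_feats)], dim=1)
        # 2) Self-attention: the newest word queries every previous word and every
        #    visual/attribute token, so earlier context is not lost over time.
        words = self.word_emb(generated_ids)           # (B, T, d_model)
        context = torch.cat([visual, words], dim=1)    # (B, R+A+T, d_model)
        q = self.q(words[:, -1:, :])                   # query from the latest word
        k, v = self.k(context), self.v(context)
        attn = F.softmax(q @ k.transpose(1, 2) / (k.size(-1) ** 0.5), dim=-1)
        pooled = attn @ v                              # (B, 1, d_model)
        return self.out(pooled.squeeze(1))             # logits for the next word
```

The sketch only illustrates why attending over the entire generated sequence avoids the information loss that accumulates in a recurrent hidden state as the caption grows.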



Author information

Corresponding author

Correspondence to Ruixue Yang.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhao, D., Yang, R., Wang, Z. et al. A cooperative approach based on self-attention with interactive attribute for image caption. Multimed Tools Appl 82, 1223–1236 (2023). https://doi.org/10.1007/s11042-022-13279-z

