
A cooperative approach based on self-attention with interactive attribute for image caption

Multimedia Tools and Applications

Abstract

Image captioning is a challenging problem in image understanding, where most models are trained within a framework that combines a deep convolutional neural network with a recurrent neural network. However, the features extracted by the convolutional network capture only the salient regions and fail to cover fine details in the image. Moreover, the vanishing-gradient problem of recurrent neural networks causes earlier information to be lost as the number of time steps grows. In this paper, Cooperative Self-Attention (CSA) is proposed to address these problems. Compared with existing methods, our model enhances the image representation by fusing additional attribute information obtained from object detection. A sub-module named Inter-Attribute, which models the interactions among objects, is proposed to strengthen the context of the entities. Exploiting the advantages of self-attention, and unlike previous methods that predict the next word from only a single prior word and hidden state, our model attends over all of the words generated step by step to handle long-term dependencies. Compared with published state-of-the-art methods, CSA demonstrates outstanding performance.
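To make the description above concrete, the following is a minimal sketch (not the authors' implementation) of the two ideas the abstract mentions: fusing attribute features from object detection with the CNN image features, and predicting the next word by self-attending over all previously generated words rather than over a single recurrent hidden state. The module names, dimensions, and the simple concatenation-based fusion are assumptions made purely for illustration.

```python
# Illustrative sketch only; the real CSA model and its Inter-Attribute
# sub-module are more elaborate than this.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionSelfAttentionSketch(nn.Module):
    def __init__(self, d_model=512, vocab_size=10000):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.attr_proj = nn.Linear(d_model, d_model)   # project attribute embeddings
        self.img_proj = nn.Linear(2048, d_model)       # project CNN region features
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, img_feats, attr_feats, generated_ids):
        # img_feats:  (B, R, 2048) region features from the CNN backbone
        # attr_feats: (B, A, d_model) attribute embeddings from object detection
        # generated_ids: (B, T) all caption words generated so far
        # 1) Fuse image regions with attribute information (simple concatenation here).
        visual = torch.cat([self.img_proj(img_feats), self.attr_proj(attr_feats)], dim=1)
        # 2) Self-attention: the newest word queries every previous word and every
        #    visual/attribute token, so earlier context is not lost over time.
        words = self.word_emb(generated_ids)           # (B, T, d_model)
        context = torch.cat([visual, words], dim=1)    # (B, R+A+T, d_model)
        q = self.q(words[:, -1:, :])                   # query from the latest word
        k, v = self.k(context), self.v(context)
        attn = F.softmax(q @ k.transpose(1, 2) / (k.size(-1) ** 0.5), dim=-1)
        pooled = attn @ v                              # (B, 1, d_model)
        return self.out(pooled.squeeze(1))             # logits for the next word
```

The sketch only illustrates why attending over the entire generated sequence avoids the information loss that accumulates in a recurrent hidden state as the caption grows.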



Author information

Corresponding author

Correspondence to Ruixue Yang.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhao, D., Yang, R., Wang, Z. et al. A cooperative approach based on self-attention with interactive attribute for image caption. Multimed Tools Appl 82, 1223–1236 (2023). https://doi.org/10.1007/s11042-022-13279-z

