
A New Attention-Based LSTM for Image Captioning

Neural Processing Letters

Abstract

Image captioning aims to describe the content of an image with a complete and natural sentence. Recently, image captioning methods with an encoder-decoder architecture have made great progress, in which the LSTM has become the dominant decoder for generating the word sequence. However, in the decoding stage, the input vector remains the same at every step and is largely uncorrelated with the previously attended visual parts or the previously generated words. In this paper, we propose an attentional LSTM (ALSTM) and show how to integrate it into state-of-the-art automatic image captioning frameworks. In place of the traditional LSTM in existing models, ALSTM learns to refine its input vector from the network hidden states and sequential context information. ALSTM can thus attend to the most relevant visual features, such as spatial regions and visual relations, and focus on the most relevant context words. Moreover, we employ ALSTM as the decoder in several classical frameworks and show how to obtain effective visual/context attention for updating the input vector. Extensive quantitative and qualitative evaluations on the Flickr30K and MSCOCO datasets with the modified networks demonstrate the superiority of ALSTM: ALSTM-based methods generate high-quality descriptions by combining sequential context and visual relations.
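The abstract describes ALSTM only at a high level; the exact formulation appears in the full paper. As a rough illustration of the core idea, the sketch below shows one common way a captioning decoder can refine its LSTM input vector at each step from the previous hidden state and attended visual features. This is a minimal PyTorch sketch under stated assumptions, not the paper's method: all names (AttentionalLSTMDecoder, att_score, etc.) are hypothetical, and the additive attention form is an assumption.

```python
import torch
import torch.nn as nn

class AttentionalLSTMDecoder(nn.Module):
    """Hypothetical sketch: an LSTM decoder whose input vector is refined
    at every step by attending over image region features with the
    previous hidden state (in the spirit of the abstract, not the paper's
    exact equations)."""

    def __init__(self, embed_dim, feat_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)    # project region features
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)   # project hidden state
        self.att_score = nn.Linear(hidden_dim, 1)          # scalar attention score
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, feats, h, c):
        # feats: (B, R, feat_dim) region features; h, c: (B, hidden_dim)
        scores = self.att_score(torch.tanh(
            self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))  # (B, R, 1)
        alpha = torch.softmax(scores, dim=1)            # attention over regions
        context = (alpha * feats).sum(dim=1)            # (B, feat_dim)
        # Refined input: current word embedding fused with attended context,
        # so the LSTM input changes at every step instead of staying fixed.
        x = torch.cat([self.embed(word_ids), context], dim=1)
        h, c = self.cell(x, (h, c))
        return self.out(h), h, c                        # next-word logits, state
```

At generation time, `step` would be called once per time step, feeding back the sampled word and the updated `(h, c)` state, so the attended context (and hence the input vector) is recomputed from the evolving hidden state rather than held constant.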




Acknowledgements

This research was supported by the National Science and Technology Major Project (Grant No. 2020YFA0713504), the CERNET Innovation Project (Grant No. NGII20180309), and the Scientific Research Fund of Hunan Provincial Education Department (Grant No. 210153).

Author information


Corresponding author

Correspondence to Xieping Gao.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Xiao, F., Xue, W., Shen, Y. et al. A New Attention-Based LSTM for Image Captioning. Neural Process Lett 54, 3157–3171 (2022). https://doi.org/10.1007/s11063-022-10759-z

