Caption TLSTMs: combining transformer with LSTMs for image captioning

Yan, Jie; Xie, Yuxiang; Luan, Xidao; Guo, Yanming; Gong, Quanzhi; Feng, Suru

doi:10.1007/s13735-022-00228-7

Caption TLSTMs: combining transformer with LSTMs for image captioning

Regular Paper
Published: 23 March 2022

Volume 11, pages 111–121, (2022)
Cite this article

International Journal of Multimedia Information Retrieval Aims and scope Submit manuscript

Jie Yan¹,
Yuxiang Xie ORCID: orcid.org/0000-0002-6079-6907¹,
Xidao Luan²,
Yanming Guo¹,
Quanzhi Gong¹ &
…
Suru Feng¹

849 Accesses
6 Citations
Explore all metrics

Abstract

Image to captions has attracted widespread attention over the years. Recurrent neural networks (RNN) and their corresponding variants have been the mainstream when it comes to dealing with image captioning task for a long time. However, transformer-based models have shown powerful and promising performance on visual tasks contrary to classic neural networks. In order to extract richer and more robust multimodal intersection feature representation, we improve the original abstract scene graph to caption model and propose the Caption TLSTMs which is made up of two LSTMs with Transformer blocks in the middle of them in this paper. Compared with the model before improvement, the architecture of our Caption TLSTMs enables the entire network to make the most of the long-term dependencies and feature representation ability of the LSTM, while encoding the multimodal textual, visual and graphic information with the transformer blocks as well. Finally, experiments on VisualGenome and MSCOCO datasets have shown good performance in improving the general image caption generation quality, demonstrating the effectiveness of the Caption TLSTMs model.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 2

Stacked cross-modal feature consolidation attention networks for image captioning

Article 23 June 2023

Mozhgan Pourkeshavarz, Shahabedin Nabavi, … Mehrnoush Shamsfard

Integration of textual cues for fine-grained image captioning using deep CNN and LSTM

Article 19 October 2019

Neeraj Gupta & Anand Singh Jalal

Exploring Visual Relationship for Image Captioning

References

Kiros R, Salakhutdinov R, Zemel R (2014) Multimodal neural language models, In: International conference on machine learning, pp 595-603, PMLR
Hossain MZ, Sohel F, Shiratuddin MF, Laga HJACS (2019) A comprehensive survey of deep learning for image captioning. ACM Comput Surv 51(6):1–36
Article Google Scholar
Li R, Liang H, Shi Y, Feng F, Wang XJN (2020) Dual-CNN: a convolutional language decoder for paragraph image captioning. Neurocomputing 396:92–101
Article Google Scholar
Papineni K, Roukos S, Ward T, W-J. Zhu (2002) Bleu: a method for automatic evaluation of machine translation, In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 311-318
Banerjee S, Lavie A (2005) METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp 65-72
Lin C-Y (2004) Rouge: a package for automatic evaluation of summaries, In: Text summarization branches out, pp 74-81
Vedantam R, Lawrence Zitnick C, Parikh D (2015) Cider: consensus-based image description evaluation, In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4566-4575
Chen S, Jin Q, Wang P, Wu Q (2020) Say as you wish: Fine-grained control of image caption generation with abstract scene graphs, In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9962-9971
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator, In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156-3164
Anderson P et al (2018) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, In: 31st meeting of the IEEE/CVF conference on computer vision and pattern recognition, CVPR 2018, June 18, 2018 - June 22, 2018, Salt Lake City, UT, United states, pp 6077-6086: IEEE Computer Society
Kulkarni G et al. (2011) Baby talk: understanding and generating simple image descriptions, In: 2011 IEEE conference on computer vision and pattern recognition, CVPR 2011, pp 1601-1608: IEEE Computer Society
Yang X, Liu Y, Wang XJ (2021) ReFormer: the relational transformer for image captioning
Jiang W, Ma L, Jiang Y-G, Liu W, Zhang T (2018) Recurrent fusion network for image captioning, In: Proceedings of the european conference on computer vision (ECCV), pp 499-515
Xu K et al (2015) Show, attend and tell: Neural image caption generation with visual attention, In: 32nd international conference on machine learning, ICML 2015, July 6, 2015 - July 11, 2015, Lile, France, vol. 3, pp. 2048-2057: International machine learning society (IMLS)
Xu D, Zhu Y, Choy CB, Fei-Fei L (2017) Scene graph generation by iterative message passing, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5410-5419
Zellers R, Yatskar M, Thomson S, Choi Y (2018) Neural motifs: Scene graph parsing with global context, In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5831-5840
Sun G, Zhang C, Woodland PC (2021) Transformer language models with lstm-based cross-utterance information representation, In: ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7363-7367: IEEE
Graves A, Jaitly N, Mohamed A-r (2013) Hybrid speech recognition with deep bidirectional LSTM, In: 2013 IEEE workshop on automatic speech recognition and understanding, 2013, pp. 273-278: IEEE
Soltau H, Liao H, Sak HJ (2016) Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition
Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov RJ (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint, arXiv:1901.02860
Irie K, Zeyer A, Schlüter R, Ney HJ (2019) Language modeling with deep transformers
Vaswani A et al (2017) Attention is all you need. Advances in neural information processing systems 30, pp 5998-6008
Farhadi A et al (2010) Every picture tells a story: Generating sentences from images, In: European conference on computer vision, pp 15-29, Springer
Gupta A, Verma Y, Jawahar C (2012) Choosing linguistics over vision to describe images, In: Proceedings of the AAAI conference on artificial intelligence, vol. 26(1)
Ordonez V, Kulkarni G, Berg TJA (2011) Im2text: describing images using 1 million captioned photographs, vol. 24, pp 1143-1151
Zhang X et al. (2021) RSTNet: Captioning With Adaptive Attention on Visual and Non-Visual Words, In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15465-15474
Ushiku Y, Yamaguchi M, Mukuta Y, Harada T (2015) Common subspace for model and similarity: Phrase learning for caption generation from images, In: Proceedings of the IEEE international conference on computer vision, pp 2668-2676
Mitchell M et al. (2012) Midge: Generating image descriptions from computer vision detections, In: Proceedings of the 13th conference of the european chapter of the association for computational linguistics, pp 747-756
Sutskever I, Vinyals O, Le QVJ (2014) Sequence to sequence learning with neural networks
Jia X, Gavves E, Fernando B, Tuytelaars T (2015) Guiding the long-short term memory model for image caption generation, In: 15th IEEE international conference on computer vision, ICCV 2015, December 11, 2015 - December 18, 2015, Santiago, Chile, 2015, vol. 2015 international conference on computer vision, ICCV 2015, pp. 2407-2415: Institute of Electrical and Electronics Engineers Inc
Zhu X, Li L, Liu J, Peng H, Niu XJAS (2018) Captioning transformer with stacked attention modules. Appl Sci 8(5):739
Article Google Scholar
Johnson J et al. (2015) Image retrieval using scene graphs, In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3668-3678
Han K et al (2020) A survey on visual transformer. arXiv e-prints, arXiv:2111.06091
Khan S, Naseer M, Hayat M, Zamir SW, Khan FS, Shah MJ (2021) Transformers in vision: a survey. ACM Computing Surveys (CSUR)
Herdade S, Kappeler A, Boakye K, Soares JJ (2019) Image captioning: transforming objects into words
Li G, Zhu L, Liu P, Yang Y (2019) Entangled transformer for image captioning, In: Proceedings of the IEEE/CVF international conference on computer vision, pp 8928-8937
Cornia M, Stefanini M, Baraldi L, Cucchiara R (2020) Meshed-memory transformer for image captioning, In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10578-10587
Pan Y, Yao T, Li Y, Mei T (2020) X-linear attention networks for image captioning, In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10971-10980
Liu X, Zhang P, Yu C, Lu H, Qian X, Yang XJ (2021) A video is worth three views: trigeminal transformers for video-based person re-identification
Schlichtkrull M, Kipf TN, Bloem P, Van Den Berg R, Titov I, Welling M (2018) Modeling relational data with graph convolutional networks, In: European semantic web conference, pp 593-607, Springer
Krishna R et al (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 123(1):32–73
Article MathSciNet Google Scholar
Lin T-Y et al (2014) Microsoft coco: Common objects in context, In: European conference on computer vision, pp 740-755, Springer
Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions, In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3128-3137
Ren S, He K, Girshick R, Sun JJA (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst 28:91-99
Google Scholar
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database, In: 2009 IEEE conference on computer vision and pattern recognition, pp 248-255: Ieee

Download references

Acknowledgements

This work has been supported by the National Natural Science Foundation of China under Contract Nos. 61571453 and 61806218.

Author information

Authors and Affiliations

College of Systems Engineering, National University of Defense Technology, Changsha, 410000, Hunan, China
Jie Yan, Yuxiang Xie, Yanming Guo, Quanzhi Gong & Suru Feng
College of Computer Engineering and Applied Mathematics, Changsha University, Changsha, 410000, Hunan, China
Xidao Luan

Authors

Jie Yan
View author publications
You can also search for this author in PubMed Google Scholar
Yuxiang Xie
View author publications
You can also search for this author in PubMed Google Scholar
Xidao Luan
View author publications
You can also search for this author in PubMed Google Scholar
Yanming Guo
View author publications
You can also search for this author in PubMed Google Scholar
Quanzhi Gong
View author publications
You can also search for this author in PubMed Google Scholar
Suru Feng
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yuxiang Xie.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Yan, J., Xie, Y., Luan, X. et al. Caption TLSTMs: combining transformer with LSTMs for image captioning. Int J Multimed Info Retr 11, 111–121 (2022). https://doi.org/10.1007/s13735-022-00228-7

Download citation

Received: 07 December 2021
Revised: 26 February 2022
Accepted: 09 March 2022
Published: 23 March 2022
Issue Date: June 2022
DOI: https://doi.org/10.1007/s13735-022-00228-7

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Caption TLSTMs: combining transformer with LSTMs for image captioning

Abstract

Access this article

Similar content being viewed by others

Stacked cross-modal feature consolidation attention networks for image captioning

Integration of textual cues for fine-grained image captioning using deep CNN and LSTM

Exploring Visual Relationship for Image Captioning

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Caption TLSTMs: combining transformer with LSTMs for image captioning

Abstract

Access this article

Similar content being viewed by others

Stacked cross-modal feature consolidation attention networks for image captioning

Integration of textual cues for fine-grained image captioning using deep CNN and LSTM

Exploring Visual Relationship for Image Captioning

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation