Abstract
Image captioning, which aims to generate a natural language description of an image, has attracted extensive research interest in recent years. However, many approaches focus only on information about individual target objects without exploring the relationships between objects and their surroundings, which greatly limits the performance of captioning models. To address this issue, we propose a relation model that incorporates relational information between objects at different levels, from low-level box proposals to high-level region features, into the captioning model. Moreover, Transformer-based architectures have shown great success in image captioning, where image regions are encoded and then attended to produce attention vectors that guide caption generation. However, these attention vectors contain only image-level information and ignore textual information, which limits the model's capability in both the visual and textual domains. In this paper, we introduce a Textual Enhanced Transformer (TET) to incorporate textual information into the Transformer. TET consists of two modules: a text-guided Transformer and a self-attention Transformer, which perform semantic and visual attention, respectively, to guide the decoder to generate high-quality captions. We extensively evaluate our model on the MS COCO dataset; it achieves a 128.7 CIDEr-D score on the Karpathy split and a 126.3 CIDEr-D (c40) score on the official online evaluation server.
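To make the two-branch idea in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a text-guided branch in which embeddings of the words generated so far act as queries over encoded image regions (semantic attention), alongside a self-attention branch over the regions themselves (visual attention). The module name, dimensions, mean-pooling, and the concatenation-based fusion are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextualEnhancedBlock(nn.Module):
    """Illustrative sketch of a text-guided branch plus a region
    self-attention branch. Names and the fusion step are assumptions,
    not the paper's exact design."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Branch 1: textual queries attend over image regions (semantic attention).
        self.text_guided = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Branch 2: image regions attend over themselves (visual attention).
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_feats, region_feats):
        # text_feats:   (B, T, d_model) embeddings of the words generated so far
        # region_feats: (B, R, d_model) encoded image region features
        sem, _ = self.text_guided(text_feats, region_feats, region_feats)
        vis, _ = self.self_attn(region_feats, region_feats, region_feats)
        # Pool each branch and fuse into one guidance vector for the decoder.
        ctx = torch.cat([sem.mean(dim=1), vis.mean(dim=1)], dim=-1)
        return self.norm(self.fuse(ctx))  # (B, d_model)

# Usage with random tensors:
block = TextualEnhancedBlock()
text = torch.randn(2, 5, 512)      # 5 words generated so far
regions = torch.randn(2, 36, 512)  # 36 detected regions (Faster R-CNN style)
print(block(text, regions).shape)  # torch.Size([2, 512])
```

The key design point illustrated here is that the semantic branch conditions on text while the visual branch does not, so the fused vector carries information from both domains rather than image-level information alone.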
This research was supported by the National Key Research and Development Program of China under Grant No. 2020AAA0104903, and the National Natural Science Foundation of China under Grants 62072039, 62076242, and 61976208.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Song, L., Shi, Y., Xiao, X., Zhang, C., Xiang, S. (2021). Relational Attention with Textual Enhanced Transformer for Image Captioning. In: Ma, H., et al. Pattern Recognition and Computer Vision. PRCV 2021. Lecture Notes in Computer Science, vol 13021. Springer, Cham. https://doi.org/10.1007/978-3-030-88010-1_13
DOI: https://doi.org/10.1007/978-3-030-88010-1_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88009-5
Online ISBN: 978-3-030-88010-1