Abstract
Image captioning, which aims to generate a natural language description of an image, has attracted extensive research interest in recent years. However, many approaches focus only on information about individual target objects without exploring the relationships between objects and their surroundings, which greatly limits the performance of captioning models. To address this issue, we propose a relation model that incorporates relational information between objects at different levels, from low-level box proposals to high-level region features, into the captioning model. Moreover, Transformer-based architectures have shown great success in image captioning, where image regions are encoded and then attended to produce attention vectors that guide caption generation. However, these attention vectors contain only image-level information and ignore textual information, which limits the model's capability in both the visual and textual domains. In this paper, we introduce a Textual Enhanced Transformer (TET) to incorporate textual information into the Transformer. TET consists of two modules: a text-guided Transformer and a self-attention Transformer, which perform semantic and visual attention, respectively, to guide the decoder to generate high-quality captions. We extensively evaluate our model on the MS COCO dataset; it achieves a 128.7 CIDEr-D score on the Karpathy split and a 126.3 CIDEr-D (c40) score on the official online evaluation server.
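To make the two-branch idea in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a text-guided branch in which embeddings of the words generated so far act as queries over encoded image regions (semantic attention), alongside a self-attention branch over the regions themselves (visual attention). The module name, dimensions, mean-pooling, and the concatenation-based fusion are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextualEnhancedBlock(nn.Module):
    """Illustrative sketch of a text-guided branch plus a region
    self-attention branch. Names and the fusion step are assumptions,
    not the paper's exact design."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Branch 1: textual queries attend over image regions (semantic attention).
        self.text_guided = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Branch 2: image regions attend over themselves (visual attention).
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_feats, region_feats):
        # text_feats:   (B, T, d_model) embeddings of the words generated so far
        # region_feats: (B, R, d_model) encoded image region features
        sem, _ = self.text_guided(text_feats, region_feats, region_feats)
        vis, _ = self.self_attn(region_feats, region_feats, region_feats)
        # Pool each branch and fuse into one guidance vector for the decoder.
        ctx = torch.cat([sem.mean(dim=1), vis.mean(dim=1)], dim=-1)
        return self.norm(self.fuse(ctx))  # (B, d_model)

# Usage with random tensors:
block = TextualEnhancedBlock()
text = torch.randn(2, 5, 512)      # 5 words generated so far
regions = torch.randn(2, 36, 512)  # 36 detected regions (Faster R-CNN style)
print(block(text, regions).shape)  # torch.Size([2, 512])
```

The key design point illustrated here is that the semantic branch conditions on text while the visual branch does not, so the fused vector carries information from both domains rather than image-level information alone.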
This research was supported by the National Key Research and Development Program of China under Grant No. 2020AAA0104903, and the National Natural Science Foundation of China under Grants 62072039, 62076242, and 61976208.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Song, L., Shi, Y., Xiao, X., Zhang, C., Xiang, S. (2021). Relational Attention with Textual Enhanced Transformer for Image Captioning. In: Ma, H., et al. Pattern Recognition and Computer Vision. PRCV 2021. Lecture Notes in Computer Science, vol 13021. Springer, Cham. https://doi.org/10.1007/978-3-030-88010-1_13
DOI: https://doi.org/10.1007/978-3-030-88010-1_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88009-5
Online ISBN: 978-3-030-88010-1