
Relational Attention with Textual Enhanced Transformer for Image Captioning

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 13021)


Abstract

Image captioning, which aims to generate a natural language description of an image, has attracted extensive research interest in recent years. However, many approaches focus only on individual target objects without exploring the relationships between objects and their surroundings, which greatly limits the performance of captioning models. To address this issue, we propose a relation model that incorporates relational information between objects at different levels, including low-level box proposals and high-level region features, into the captioning model. Moreover, Transformer-based architectures have shown great success in image captioning, where image regions are encoded and then attended over to produce attention vectors that guide caption generation. However, these attention vectors carry only image-level information and ignore textual information, so they cannot exploit both the visual and textual domains. In this paper, we introduce a Textual Enhanced Transformer (TET) that injects textual information into the Transformer. TET consists of two modules, a text-guided Transformer and a self-attention Transformer, which perform semantic and visual attention, respectively, to guide the decoder toward high-quality captions. We extensively evaluate our model on the MS COCO dataset; it achieves a CIDEr-D score of 128.7 on the Karpathy split and a CIDEr-D (c40) score of 126.3 on the official online evaluation server.

This research was supported by the National Key Research and Development Program of China under Grant No. 2020AAA0104903, and the National Natural Science Foundation of China under Grants 62072039, 62076242, and 61976208.
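
The abstract only names the two TET modules, so the following PyTorch sketch is offered as a rough illustration of how a text-guided attention step could be wired into a Transformer decoder layer: masked self-attention over the partial caption, plus cross-attention over image region features and over textual (semantic) features. The class name, dimensions, and the additive fusion of the two attention outputs are assumptions made for illustration, not details taken from the paper.

# Minimal sketch, NOT the authors' implementation: a decoder layer mixing
# (i) masked self-attention over the partial caption, (ii) visual attention
# over encoded region features, and (iii) text-guided (semantic) attention
# over textual concept embeddings. Names and the additive fusion are assumed.
import torch
import torch.nn as nn


class TextualEnhancedDecoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, words, regions, semantics, causal_mask=None):
        # words:     (B, T, d) embeddings of the caption generated so far
        # regions:   (B, R, d) encoded image region features
        # semantics: (B, S, d) textual/semantic concept embeddings
        x, _ = self.self_attn(words, words, words, attn_mask=causal_mask)
        h = self.norms[0](words + x)

        v, _ = self.visual_attn(h, regions, regions)      # visual attention
        t, _ = self.text_attn(h, semantics, semantics)    # semantic attention
        h = self.norms[1](h + v + t)                      # assumed fusion: sum

        return self.norms[2](h + self.ffn(h))


# Toy usage: batch of 2, 12 caption tokens, 36 regions, 5 semantic concepts.
layer = TextualEnhancedDecoderLayer()
out = layer(torch.randn(2, 12, 512), torch.randn(2, 36, 512), torch.randn(2, 5, 512))
print(out.shape)  # torch.Size([2, 12, 512])

In the paper the semantic and visual streams are described as separate text-guided and self-attention Transformer modules; the single fused layer above is simply one compact way to show both attention paths acting on the same decoder state.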



Author information

Corresponding author

Correspondence to Lifei Song.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Song, L., Shi, Y., Xiao, X., Zhang, C., Xiang, S. (2021). Relational Attention with Textual Enhanced Transformer for Image Captioning. In: Ma, H., et al. Pattern Recognition and Computer Vision. PRCV 2021. Lecture Notes in Computer Science, vol. 13021. Springer, Cham. https://doi.org/10.1007/978-3-030-88010-1_13


  • DOI: https://doi.org/10.1007/978-3-030-88010-1_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88009-5

  • Online ISBN: 978-3-030-88010-1

  • eBook Packages: Computer Science, Computer Science (R0)
