Abstract
The text-based image captioning (TextCaps) task aims to describe a given image by reasoning over its scene text and visual objects simultaneously. Although previous works have achieved great success, they focus heavily on the text modality while neglecting other important visual information, and the correlations between objects and scene text are not fully exploited. Moreover, traditional transformer-based architectures ignore global information that reflects the entire image, which can lead to missing objects and erroneous reasoning. In this paper, we propose a Relation-aware Global-augmented Transformer (RGT) framework to tackle these problems. Specifically, we utilize a scene graph extracted from the image to explicitly model the relative semantic and spatial relationships of objects via a graph convolutional network, which not only enhances the visual representations but also encodes explicit semantic features of objects. In addition, we add a multi-modal alignment (MMA) module as a supplement to the multi-modal transformer to strengthen the association between scene text and objects. Finally, a global-augmented transformer (GAT) is designed to obtain a more comprehensive representation of the image, which alleviates the object-missing and erroneous-reasoning problems. Our method outperforms state-of-the-art models on the TextCaps dataset, improving CIDEr from 105.0 to 107.2.
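The abstract only names the components, so the sketch below is a rough illustration (not the paper's implementation) of the relation-aware encoding idea: a single graph-convolution layer that aggregates object region features along scene-graph edges, conditioning each message on a pairwise relation embedding. All names and dimensions (RelationAwareGCN, node_dim, rel_dim, hidden_dim) are assumptions introduced for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationAwareGCN(nn.Module):
    """Toy graph convolution over object nodes whose edges carry
    relation embeddings (e.g. semantic/spatial relations from a scene graph).
    Hypothetical sketch; not the architecture from the paper."""

    def __init__(self, node_dim: int, rel_dim: int, hidden_dim: int):
        super().__init__()
        # project (neighbor feature || relation embedding) into a message
        self.message = nn.Linear(node_dim + rel_dim, hidden_dim)
        # fuse the original node feature with the aggregated messages
        self.update = nn.Linear(node_dim + hidden_dim, node_dim)

    def forward(self, nodes, rel_emb, adj):
        # nodes:   (N, node_dim)   object region features
        # rel_emb: (N, N, rel_dim) relation embedding for each ordered pair
        # adj:     (N, N)          binary adjacency from the scene graph
        n = nodes.size(0)
        src = nodes.unsqueeze(0).expand(n, n, -1)            # neighbor features per pair
        msgs = F.relu(self.message(torch.cat([src, rel_emb], dim=-1)))
        msgs = msgs * adj.unsqueeze(-1)                       # zero out non-edges
        agg = msgs.sum(dim=1) / adj.sum(dim=1, keepdim=True).clamp(min=1)
        return F.relu(self.update(torch.cat([nodes, agg], dim=-1)))


if __name__ == "__main__":
    gcn = RelationAwareGCN(node_dim=2048, rel_dim=300, hidden_dim=512)
    nodes = torch.randn(5, 2048)              # 5 detected objects
    rel_emb = torch.randn(5, 5, 300)          # pairwise relation embeddings
    adj = (torch.rand(5, 5) > 0.5).float()    # toy scene-graph adjacency
    print(gcn(nodes, rel_emb, adj).shape)     # torch.Size([5, 2048])

In the full model, relation-enhanced object features of this kind would presumably be fed into the multi-modal transformer together with scene-text features; those later stages (MMA and GAT) are not sketched here.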
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, Q., Li, B., Ma, C. (2022). Relation-Aware Global-Augmented Transformer for TextCaps. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13529. Springer, Cham. https://doi.org/10.1007/978-3-031-15919-0_62
Print ISBN: 978-3-031-15918-3
Online ISBN: 978-3-031-15919-0