
Transductive Cross-Lingual Scene-Text Visual Question Answering

  • Conference paper
  • Neural Information Processing (ICONIP 2023)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14452)

Abstract

Multilingual modeling has gained increasing attention in recent years, as cross-lingual Text-based Visual Question Answering (TextVQA) requires understanding questions and answers across different languages. Current research mainly works on multimodal information, assuming that multilingual pretrained models are effective at encoding questions. However, the semantic comprehension of a text-based question varies between languages, creating challenges in directly deducing its answer from an image. To this end, we propose a novel multilingual text-based VQA framework suited for cross-language scenarios (CLVQA), which transductively considers multiple interactions between answer generation and the question. First, a question reading module densely connects encoding layers in a feedforward manner, allowing it to work adaptively with answering. Second, a multimodal OCR-based module decouples the OCR features in an image into visual, linguistic, and holistic parts to facilitate the localization of a target-language answer. Incorporating the enhancements from these two input encoding modules, the proposed framework produces its answer candidates mainly from the input image with an object detection module. Finally, a transductive answering module jointly understands the input multimodal information and the identified answer candidates at the multilingual level, autoregressively generating cross-lingual answers. Extensive experiments show that our framework outperforms state-of-the-art methods on both cross-lingual (English↔Chinese) and mono-lingual (English↔English and Chinese↔Chinese) tasks in terms of accuracy-based metrics. Moreover, significant improvements are achieved in zero-shot cross-lingual settings (French↔Chinese).

This work is partially supported by NSFC, China (No. 62276196).
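
To make the densely connected question reading idea concrete, here is a minimal PyTorch sketch: each encoding layer receives the concatenation of all earlier layers' outputs, so deeper layers can adaptively reuse shallow question features when cooperating with answering. The layer type, sizes, and fusion projections are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DenselyConnectedQuestionReader(nn.Module):
    """Each encoder layer reads the fused outputs of all previous layers."""

    def __init__(self, hidden: int = 768, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # Project the concatenation of all earlier outputs back to `hidden`
        # before each layer (dense, feedforward connectivity).
        self.fuse = nn.ModuleList(
            nn.Linear(hidden * (i + 1), hidden) for i in range(n_layers)
        )

    def forward(self, q_emb: torch.Tensor) -> torch.Tensor:
        # q_emb: (batch, seq_len, hidden) multilingual question token embeddings
        feats = [q_emb]
        for layer, fuse in zip(self.layers, self.fuse):
            dense_in = fuse(torch.cat(feats, dim=-1))
            feats.append(layer(dense_in))
        return feats[-1]
```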
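
Similarly, a hedged sketch of the OCR decoupling step: each OCR token is represented by a visual part (region appearance), a linguistic part (recognized-text embedding), and a holistic part (whole-token representation), each projected into a shared space before fusion. The feature names and dimensions below are assumptions for illustration, not the paper's interface.

```python
import torch
import torch.nn as nn

class DecoupledOCREncoder(nn.Module):
    """Project visual, linguistic, and holistic OCR parts into one space."""

    def __init__(self, d_vis: int = 2048, d_lin: int = 768, d_hol: int = 512,
                 hidden: int = 768):
        super().__init__()
        self.vis_proj = nn.Linear(d_vis, hidden)  # region appearance features
        self.lin_proj = nn.Linear(d_lin, hidden)  # recognized-text embeddings
        self.hol_proj = nn.Linear(d_hol, hidden)  # whole-token (holistic) features
        self.norm = nn.LayerNorm(hidden)

    def forward(self, vis: torch.Tensor, lin: torch.Tensor,
                hol: torch.Tensor) -> torch.Tensor:
        # Inputs: (batch, n_ocr_tokens, d_*); output: (batch, n_ocr_tokens, hidden).
        # Summing the projected parts keeps them separable upstream while giving
        # the answering module a single per-token representation.
        return self.norm(self.vis_proj(vis) + self.lin_proj(lin) + self.hol_proj(hol))
```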
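
Finally, the transductive answering module can be pictured as a pointer-augmented decoder: at each autoregressive step it jointly scores a fixed answer vocabulary and the image-derived answer candidates, with indices past the vocabulary size pointing at candidates. This is a simplified sketch under those assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TransductiveAnswerer(nn.Module):
    """Jointly score a fixed vocabulary and image-derived answer candidates."""

    def __init__(self, hidden: int = 768, vocab_size: int = 30000):
        super().__init__()
        self.vocab_head = nn.Linear(hidden, vocab_size)

    def step(self, dec_state: torch.Tensor, cand_feats: torch.Tensor) -> torch.Tensor:
        # dec_state: (batch, hidden); cand_feats: (batch, n_cand, hidden)
        vocab_logits = self.vocab_head(dec_state)                        # (B, V)
        cand_logits = torch.einsum("bh,bnh->bn", dec_state, cand_feats)  # (B, N)
        # Indices >= vocab_size point at candidates copied from the image.
        return torch.cat([vocab_logits, cand_logits], dim=-1)
```

In the full model, the decoder state would be updated autoregressively from each chosen token before the next step; the sketch shows only the joint scoring at a single decoding position.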





Author information

Correspondence to Lin Li.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Li, L., Zhang, H., Fang, Z., Xie, Z., Liu, J. (2024). Transductive Cross-Lingual Scene-Text Visual Question Answering. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Lecture Notes in Computer Science, vol 14452. Springer, Singapore. https://doi.org/10.1007/978-981-99-8076-5_33

  • DOI: https://doi.org/10.1007/978-981-99-8076-5_33

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8075-8

  • Online ISBN: 978-981-99-8076-5

  • eBook Packages: Computer Science (R0)
