Towards Multilingual Image Captioning Models that Can Read

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13068)

Abstract

Few current image captioning systems are capable of reading and integrating the read text into the generated descriptions, and none of them has been developed to address the problem from a bilingual perspective. The design of image captioning systems that can read and also work with different languages involves problems of a very diverse nature. In this work, we propose Multilingual M4C-Captioner, a bilingual architecture that can easily be trained on different languages with minor changes in the configuration. Our architecture is a modified version of M4C-Captioner and differs mainly in the text embedding module and the OCR embedding module: the former is modified to use a pre-trained multilingual version of BERT, and the latter to use pre-trained FastText vectors for the target languages. This paper presents results for English and Spanish; however, our proposal can easily be extended to more than 100 languages. Additionally, we provide the first synthetically translated version of the TextCaps dataset for image captioning with reading comprehension.
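The architectural change described above boils down to swapping two input embedding modules. The following minimal Python sketch is not the authors' implementation; it only illustrates the idea under stated assumptions (the Hugging Face checkpoint bert-base-multilingual-cased and the Spanish FastText file cc.es.300.bin are assumed to be available locally, and the helper names are hypothetical): caption tokens are embedded with pre-trained multilingual BERT, while OCR tokens are embedded with language-specific FastText vectors.

```python
# Minimal sketch (not the authors' code) of the two embedding changes:
# captions -> multilingual BERT, OCR tokens -> language-specific FastText.
import fasttext
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Text embedding module: pre-trained multilingual BERT (covers 100+ languages).
BERT_NAME = "bert-base-multilingual-cased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(BERT_NAME)
bert = AutoModel.from_pretrained(BERT_NAME)

# OCR embedding module: FastText CommonCrawl vectors for the target language.
ocr_vectors = fasttext.load_model("cc.es.300.bin")  # assumed local Spanish vectors

def embed_caption(caption: str) -> torch.Tensor:
    """One contextual 768-d vector per subword token of the caption."""
    inputs = tokenizer(caption, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)

def embed_ocr_tokens(tokens: list[str]) -> torch.Tensor:
    """One 300-d FastText vector per OCR token recognized in the image."""
    return torch.from_numpy(np.stack([ocr_vectors.get_word_vector(t) for t in tokens]))

if __name__ == "__main__":
    print(embed_caption("una señal de alto junto a la carretera").shape)  # [seq_len, 768]
    print(embed_ocr_tokens(["ALTO", "CALLE"]).shape)                      # [2, 300]
```

In the full model these vectors would feed the multimodal transformer of M4C-Captioner; switching the target language then mainly amounts to loading the FastText vectors for that language, since multilingual BERT already covers it.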


Code Availability

The software requirements, Python code and scripts, the translated dataset, and detailed instructions to reproduce our experiments are available in the following GitHub repository: https://github.com/gallardorafael/multilingual-mmf

Notes

  1. https://textvqa.org/textcaps/challenge.

  2. https://textvqa.org/textcaps.

  3. https://opensource.google/projects/open-images-dataset.

  4. https://opus.nlpl.eu/ (see the translation sketch after this list).

  5. https://github.com/google-research/bert/blob/master/multilingual.md.

  6. https://fasttext.cc/docs/en/crawl-vectors.html.

  7. https://mmf.sh/.
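Note 4 above points to the OPUS resources related to the synthetic translation mentioned in the abstract. As a hedged illustration only (the exact pipeline used to translate TextCaps is not described on this page, and the Helsinki-NLP/opus-mt-en-es checkpoint is an assumption), English captions could be machine-translated to Spanish with an OPUS-MT model as follows.

```python
# Illustrative sketch of translating English captions to Spanish with an
# OPUS-MT model; not necessarily the pipeline used by the authors.
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-en-es"  # assumed English->Spanish checkpoint
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)

def translate_captions(captions: list[str]) -> list[str]:
    """Translate a batch of English captions into Spanish."""
    batch = tokenizer(captions, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

if __name__ == "__main__":
    print(translate_captions(["a stop sign next to the road"]))
```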

References

  1. Amirian, S., Rasheed, K., Taha, T.R., Arabnia, H.R.: A short review on image caption generation with deep learning. In: Proceedings of the International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV), pp. 10–18. The Steering Committee of The World Congress in Computer Science (2019)

  2. Anderson, P., Fernando, B., Johnson, M., Gould, S.: SPICE: semantic propositional image caption evaluation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 382–398. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_24

  3. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)

  4. Borisyuk, F., Gordo, A., Sivakumar, V.: Rosetta: large scale system for text detection and recognition in images. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 71–79 (2018)

  5. Chen, Y.C., et al.: UNITER: learning universal image-text representations (2019)

  6. Denkowski, M., Lavie, A.: Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380 (2014)

  7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  8. Hossain, M.Z., Sohel, F., Shiratuddin, M.F., Laga, H.: A comprehensive survey of deep learning for image captioning. ACM Comput. Surv. (CsUR) 51(6), 1–36 (2019)

  9. Hu, R., Singh, A., Darrell, T., Rohrbach, M.: Iterative answer prediction with pointer-augmented multimodal transformers for textVQA. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9992–10002 (2020)

  10. Huang, L., Wang, W., Chen, J., Wei, X.Y.: Attention on attention for image captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4634–4643 (2019)

  11. Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019)

  12. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)

  13. Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A.: Advances in pre-training distributed word representations. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018) (2018)

  14. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)

  15. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497 (2015)

  16. Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: TextCaps: a dataset for image captioning with reading comprehension. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 742–758. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5_44

  17. Singh, A., et al.: MMF: a multimodal framework for vision and language research (2020). https://github.com/facebookresearch/mmf

  18. Sudholt, S., Fink, G.A.: PHOCNet: a deep convolutional neural network for word spotting in handwritten documents. In: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 277–282. IEEE (2016)

  19. Tiedemann, J.: The Tatoeba translation challenge - realistic data sets for low resource and multilingual MT. In: Proceedings of the Fifth Conference on Machine Translation, pp. 1174–1182. Association for Computational Linguistics, Online, November 2020. https://www.aclweb.org/anthology/2020.wmt-1.139

  20. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)

  21. Wang, J., Tang, J., Luo, J.: Multimodal attention with image text spatial relationship for OCR-based image captioning. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4337–4345 (2020)

22. Wang, Z., Bao, R., Wu, Q., Liu, S.: Confidence-aware non-repetitive multimodal transformers for TextCaps. arXiv preprint arXiv:2012.03662 (2020)

  23. Wolf, T., et al.: Huggingface’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)

  24. Xu, G., Niu, S., Tan, M., Luo, Y., Du, Q., Wu, Q.: Towards accurate text-based image captioning with content diversity exploration. arXiv preprint arXiv:2105.03236 (2021)

  25. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057. PMLR (2015)

  26. Yang, Z., et al.: TAP: text-aware pre-training for text-VQA and text-caption. arXiv preprint arXiv:2012.04638 (2020)

  27. Zhu, Q., Gao, C., Wang, P., Wu, Q.: Simple is not easy: a simple strong baseline for TextVQA and TextCaps. arXiv preprint arXiv:2012.05153 (2020)

Acknowledgments

We thank Microsoft Corporation, which kindly provided an Azure sponsorship with sufficient credits to perform all of our experiments.

Author information

Corresponding author

Correspondence to Rafael Gallardo García.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Gallardo García, R., Beltrán Martínez, B., Hernández Gracidas, C., Vilariño Ayala, D. (2021). Towards Multilingual Image Captioning Models that Can Read. In: Batyrshin, I., Gelbukh, A., Sidorov, G. (eds) Advances in Soft Computing. MICAI 2021. Lecture Notes in Computer Science, vol. 13068. Springer, Cham. https://doi.org/10.1007/978-3-030-89820-5_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-89820-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89819-9

  • Online ISBN: 978-3-030-89820-5

  • eBook Packages: Computer Science, Computer Science (R0)
