Abstract
Few current image captioning systems are capable of reading text in images and integrating it into the generated descriptions, and none of them was designed to address the problem bilingually. Designing image captioning systems that can read and also work across different languages involves problems of very different natures. In this work, we propose Multilingual M4C-Captioner, a bilingual architecture that can easily be trained on different languages with minor changes to its configuration. Our architecture is a modified version of the M4C-Captioner, differing mainly in the text embedding module and the OCR embedding module: the former is replaced with a pre-trained multilingual version of BERT, and the latter uses pre-trained FastText vectors for the target languages. This paper presents results for English and Spanish; however, our proposal can easily be extended to more than 100 languages. Additionally, we provide the first synthetically translated version of the TextCaps dataset for image captioning with reading comprehension.
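As a rough illustration of the two module swaps described above (not the authors' exact code), the following sketch loads a pre-trained multilingual BERT as the text encoder and pre-trained FastText vectors as the OCR token embedder. It assumes the HuggingFace transformers and fasttext Python packages; the model identifiers and the embed_ocr_tokens helper are illustrative assumptions.

```python
# Minimal sketch of the two embedding swaps; model names and the helper
# function are illustrative assumptions, not the authors' implementation.
import fasttext
import fasttext.util
from transformers import BertModel, BertTokenizer

# Text embedding module: a pre-trained, multilingual BERT replaces the
# English-only encoder used by the original M4C-Captioner.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
text_encoder = BertModel.from_pretrained("bert-base-multilingual-cased")

# OCR embedding module: pre-trained FastText vectors for the target
# language (Spanish here; use "en" for English).
fasttext.util.download_model("es", if_exists="ignore")
ocr_model = fasttext.load_model("cc.es.300.bin")

def embed_ocr_tokens(tokens):
    # FastText handles out-of-vocabulary OCR tokens via subword n-grams.
    return [ocr_model.get_word_vector(t) for t in tokens]
```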
Code Availability
The software requirements, Python code and scripts, the translated dataset and detailed instructions to reproduce our experiments are available on the following GitHub repository: https://github.com/gallardorafael/multilingual-mmf
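For reference, a synthetic Spanish translation of TextCaps captions could be produced with an off-the-shelf OPUS-MT model (cf. Tiedemann, 2020, in the references below). The sketch below is an assumption about such a pipeline, not necessarily the exact scripts in the repository; the model name Helsinki-NLP/opus-mt-en-es and the translate_captions helper are illustrative.

```python
# Hedged sketch: translating English captions to Spanish with an
# OPUS-MT model via HuggingFace transformers. Model name and batching
# are assumptions; the repository above contains the actual scripts.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate_captions(captions):
    # Tokenize a batch of English captions and generate Spanish output.
    batch = tokenizer(captions, return_tensors="pt",
                      padding=True, truncation=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(g, skip_special_tokens=True) for g in generated]

print(translate_captions(["a red stop sign at the corner of the street"]))
```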
References
Amirian, S., Rasheed, K., Taha, T.R., Arabnia, H.R.: A short review on image caption generation with deep learning. In: Proceedings of the International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV), pp. 10–18. The Steering Committee of The World Congress in Computer Science (2019)
Anderson, P., Fernando, B., Johnson, M., Gould, S.: SPICE: semantic propositional image caption evaluation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 382–398. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_24
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)
Borisyuk, F., Gordo, A., Sivakumar, V.: Rosetta: large scale system for text detection and recognition in images. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 71–79 (2018)
Chen, Y.C., et al.: UNITER: learning universal image-text representations. arXiv preprint arXiv:1909.11740 (2019)
Denkowski, M., Lavie, A.: Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380 (2014)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Hossain, M.Z., Sohel, F., Shiratuddin, M.F., Laga, H.: A comprehensive survey of deep learning for image captioning. ACM Comput. Surv. (CSUR) 51(6), 1–36 (2019)
Hu, R., Singh, A., Darrell, T., Rohrbach, M.: Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9992–10002 (2020)
Huang, L., Wang, W., Chen, J., Wei, X.Y.: Attention on attention for image captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4634–4643 (2019)
Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019)
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)
Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A.: Advances in pre-training distributed word representations. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018) (2018)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497 (2015)
Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: TextCaps: a dataset for image captioning with reading comprehension. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 742–758. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5_44
Singh, A., et al.: MMF: a multimodal framework for vision and language research (2020). https://github.com/facebookresearch/mmf
Sudholt, S., Fink, G.A.: PHOCNet: a deep convolutional neural network for word spotting in handwritten documents. In: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 277–282. IEEE (2016)
Tiedemann, J.: The Tatoeba translation challenge - realistic data sets for low resource and multilingual MT. In: Proceedings of the Fifth Conference on Machine Translation, pp. 1174–1182. Association for Computational Linguistics, Online, November 2020. https://www.aclweb.org/anthology/2020.wmt-1.139
Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)
Wang, J., Tang, J., Luo, J.: Multimodal attention with image text spatial relationship for OCR-based image captioning. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4337–4345 (2020)
Wang, Z., Bao, R., Wu, Q., Liu, S.: Confidence-aware non-repetitive multimodal transformers for TextCaps. arXiv preprint arXiv:2012.03662 (2020)
Wolf, T., et al.: HuggingFace's Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)
Xu, G., Niu, S., Tan, M., Luo, Y., Du, Q., Wu, Q.: Towards accurate text-based image captioning with content diversity exploration. arXiv preprint arXiv:2105.03236 (2021)
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057. PMLR (2015)
Yang, Z., et al.: TAP: text-aware pre-training for text-VQA and text-caption. arXiv preprint arXiv:2012.04638 (2020)
Zhu, Q., Gao, C., Wang, P., Wu, Q.: Simple is not easy: a simple strong baseline for TextVQA and TextCaps. arXiv preprint arXiv:2012.05153 (2020)
Acknowledgments
We thank Microsoft Corporation, which kindly provided an Azure sponsorship with sufficient credits to run all experiments.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Gallardo García, R., Beltrán Martínez, B., Hernández Gracidas, C., Vilariño Ayala, D. (2021). Towards Multilingual Image Captioning Models that Can Read. In: Batyrshin, I., Gelbukh, A., Sidorov, G. (eds.) Advances in Soft Computing. MICAI 2021. Lecture Notes in Computer Science, vol. 13068. Springer, Cham. https://doi.org/10.1007/978-3-030-89820-5_2
DOI: https://doi.org/10.1007/978-3-030-89820-5_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89819-9
Online ISBN: 978-3-030-89820-5