Abstract
Few current image captioning systems are capable of reading text in images and integrating it into the generated descriptions, and none of them was designed to address the problem bilingually. Designing image captioning systems that can read and also work across different languages involves problems of very different natures. In this work, we propose Multilingual M4C-Captioner, a bilingual architecture that can easily be trained on different languages with minor changes to its configuration. Our architecture is a modified version of the M4C-Captioner, differing mainly in the text embedding module and the OCR embedding module: the former is replaced with a pre-trained multilingual version of BERT, and the latter uses pre-trained FastText vectors for the target languages. This paper presents results for English and Spanish; however, our proposal can easily be extended to more than 100 languages. Additionally, we provide the first synthetically translated version of the TextCaps dataset for image captioning with reading comprehension.
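As a rough illustration of the two module swaps described above (not the authors' exact code), the following sketch loads a pre-trained multilingual BERT as the text encoder and pre-trained FastText vectors as the OCR token embedder. It assumes the HuggingFace transformers and fasttext Python packages; the model identifiers and the embed_ocr_tokens helper are illustrative assumptions.

```python
# Minimal sketch of the two embedding swaps; model names and the helper
# function are illustrative assumptions, not the authors' implementation.
import fasttext
import fasttext.util
from transformers import BertModel, BertTokenizer

# Text embedding module: a pre-trained, multilingual BERT replaces the
# English-only encoder used by the original M4C-Captioner.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
text_encoder = BertModel.from_pretrained("bert-base-multilingual-cased")

# OCR embedding module: pre-trained FastText vectors for the target
# language (Spanish here; use "en" for English).
fasttext.util.download_model("es", if_exists="ignore")
ocr_model = fasttext.load_model("cc.es.300.bin")

def embed_ocr_tokens(tokens):
    # FastText handles out-of-vocabulary OCR tokens via subword n-grams.
    return [ocr_model.get_word_vector(t) for t in tokens]
```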
Code Availability
The software requirements, Python code and scripts, the translated dataset and detailed instructions to reproduce our experiments are available on the following GitHub repository: https://github.com/gallardorafael/multilingual-mmf
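For reference, a synthetic Spanish translation of TextCaps captions could be produced with an off-the-shelf OPUS-MT model (cf. Tiedemann, 2020, in the references below). The sketch below is an assumption about such a pipeline, not necessarily the exact scripts in the repository; the model name Helsinki-NLP/opus-mt-en-es and the translate_captions helper are illustrative.

```python
# Hedged sketch: translating English captions to Spanish with an
# OPUS-MT model via HuggingFace transformers. Model name and batching
# are assumptions; the repository above contains the actual scripts.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate_captions(captions):
    # Tokenize a batch of English captions and generate Spanish output.
    batch = tokenizer(captions, return_tensors="pt",
                      padding=True, truncation=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(g, skip_special_tokens=True) for g in generated]

print(translate_captions(["a red stop sign at the corner of the street"]))
```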
References
Amirian, S., Rasheed, K., Taha, T.R., Arabnia, H.R.: A short review on image caption generation with deep learning. In: Proceedings of the International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV), pp. 10–18. The Steering Committee of The World Congress in Computer Science (2019)
Anderson, P., Fernando, B., Johnson, M., Gould, S.: SPICE: semantic propositional image caption evaluation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 382–398. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_24
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)
Borisyuk, F., Gordo, A., Sivakumar, V.: Rosetta: large scale system for text detection and recognition in images. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 71–79 (2018)
Chen, Y.C., et al.: UNITER: learning universal image-text representations. arXiv preprint arXiv:1909.11740 (2019)
Denkowski, M., Lavie, A.: Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380 (2014)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Hossain, M.Z., Sohel, F., Shiratuddin, M.F., Laga, H.: A comprehensive survey of deep learning for image captioning. ACM Comput. Surv. (CSUR) 51(6), 1–36 (2019)
Hu, R., Singh, A., Darrell, T., Rohrbach, M.: Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9992–10002 (2020)
Huang, L., Wang, W., Chen, J., Wei, X.Y.: Attention on attention for image captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4634–4643 (2019)
Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019)
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)
Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A.: Advances in pre-training distributed word representations. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018) (2018)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497 (2015)
Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: TextCaps: a dataset for image captioning with reading comprehension. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 742–758. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5_44
Singh, A., et al.: MMF: a multimodal framework for vision and language research (2020). https://github.com/facebookresearch/mmf
Sudholt, S., Fink, G.A.: PHOCNet: a deep convolutional neural network for word spotting in handwritten documents. In: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 277–282. IEEE (2016)
Tiedemann, J.: The Tatoeba translation challenge - realistic data sets for low resource and multilingual MT. In: Proceedings of the Fifth Conference on Machine Translation, pp. 1174–1182. Association for Computational Linguistics, Online, November 2020. https://www.aclweb.org/anthology/2020.wmt-1.139
Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)
Wang, J., Tang, J., Luo, J.: Multimodal attention with image text spatial relationship for OCR-based image captioning. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4337–4345 (2020)
Wang, Z., Bao, R., Wu, Q., Liu, S.: Confidence-aware non-repetitive multimodal transformers for TextCaps. arXiv preprint arXiv:2012.03662 (2020)
Wolf, T., et al.: HuggingFace's Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)
Xu, G., Niu, S., Tan, M., Luo, Y., Du, Q., Wu, Q.: Towards accurate text-based image captioning with content diversity exploration. arXiv preprint arXiv:2105.03236 (2021)
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057. PMLR (2015)
Yang, Z., et al.: TAP: text-aware pre-training for text-VQA and text-caption. arXiv preprint arXiv:2012.04638 (2020)
Zhu, Q., Gao, C., Wang, P., Wu, Q.: Simple is not easy: a simple strong baseline for TextVQA and TextCaps. arXiv preprint arXiv:2012.05153 (2020)
Acknowledgments
We thank Microsoft Corporation, which kindly provided an Azure sponsorship with sufficient credits to run all experiments.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Gallardo García, R., Beltrán Martínez, B., Hernández Gracidas, C., Vilariño Ayala, D. (2021). Towards Multilingual Image Captioning Models that Can Read. In: Batyrshin, I., Gelbukh, A., Sidorov, G. (eds.) Advances in Soft Computing. MICAI 2021. Lecture Notes in Computer Science, vol. 13068. Springer, Cham. https://doi.org/10.1007/978-3-030-89820-5_2
DOI: https://doi.org/10.1007/978-3-030-89820-5_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89819-9
Online ISBN: 978-3-030-89820-5