Abstract
Multimodal machine translation (MMT) uses visual information to guide text translation. However, recent studies have cast doubt on how much visual context actually contributes beyond text-only translation. To examine whether MMT models improve translation performance, we evaluate a current neural machine translation (NMT) system on the Multi30k dataset. Specifically, we judge the MMT model by comparing its performance with that of the NMT model. We also conduct text and multimodal degradation experiments to verify whether vision plays a role, and we compare the NMT and MMT models on sentences of different lengths to clarify the strengths and weaknesses of the MMT model. We find that the current NMT model surpasses the MMT model, suggesting that the impact of visual features may be limited. Visual features appear to matter mainly when a substantial number of words in the source text are masked.
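As an illustration of the source-degradation setup mentioned above, the following minimal Python sketch masks a fraction of source tokens before translation; the mask token, the ratio values, and the example caption are illustrative assumptions, not the exact procedure used in the paper.

```python
import random

def mask_source(tokens, ratio, mask_token="[MASK]", seed=0):
    """Randomly replace a fraction of source tokens with a mask token.

    This mirrors the kind of source-degradation probe described in the
    abstract: as more source words are hidden, a model that truly exploits
    the image should degrade more gracefully than a text-only model.
    """
    rng = random.Random(seed)
    n_mask = int(len(tokens) * ratio)
    positions = set(rng.sample(range(len(tokens)), n_mask)) if tokens else set()
    return [mask_token if i in positions else tok for i, tok in enumerate(tokens)]

# Example: degrade a Multi30k-style caption at increasing mask ratios,
# then feed each degraded source to the NMT and MMT systems for comparison.
sentence = "two young men are playing frisbee in the park".split()
for ratio in (0.0, 0.3, 0.6):
    print(ratio, " ".join(mask_source(sentence, ratio)))
```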




Data availability
All data included are freely available through the following repository: https://github.com/multi30k/dataset.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cui, S., Duan, K., Ma, W. et al. Does multimodal machine translation improve translation performance? Neural Comput & Applic 36, 13853–13864 (2024). https://doi.org/10.1007/s00521-024-09705-y
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00521-024-09705-y