Abstract
In an effort to enhance the machine translation (MT) quality of low-resource languages, we report the first study on multimodal machine translation (MMT) for the Manipuri\(\rightarrow\)English, Manipuri\(\rightarrow\)Hindi and Manipuri\(\rightarrow\)German language pairs. Manipuri is a morphologically rich language with limited computationally usable resources. No MMT dataset has been reported for these language pairs to date. To build the parallel datasets, we collected news articles containing images and associated text in English from a local daily newspaper and used English as a pivot language. The outputs of existing machine translation systems for these languages were manually post-edited to build the datasets. In addition to text, we build MT systems that exploit features from images and audio recordings in the source language, i.e., Manipuri. We carried out an extensive analysis of the MT systems trained with text-only and multimodal inputs using automatic metrics and human evaluation techniques. Our findings attest that integrating multiple correlated modalities enhances MT system performance in low-resource settings, achieving a significant improvement of up to +3 BLEU. The human assessment revealed that the fluency score of the MMT systems depends on the type of correlated auxiliary modality.
Data Availability
The dataset used in this work is available from Imphal Free Press subject to a licensing agreement. A request may be made to the authors to gain access to the data with permission from Imphal Free Press. A sample of the dataset is available on GitHub (https://github.com/LSMeetei/MnMultimodal) for reference.
Notes
Acronyms: O = Object, S = Subject, V = Verb.
Also known as the F1-score; it is the harmonic mean of precision and recall.
BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.5.1
chrF2+numchars.6+space.false+version.1.5.1
TER+tok.tercom-nonorm-punct-noasian-uncased+version.1.5.1
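Two of the notes above involve F-scores: the F1-score as the harmonic mean of precision and recall, and the chrF2 signature, where the "2" conventionally denotes β = 2, so that recall is weighted more heavily than precision. The relationship can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the helper name `f_beta` and the precision/recall values are assumptions made here.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the F1-score; beta = 2 weights recall more
    heavily than precision.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denom = beta ** 2 * precision + recall
    if denom == 0:
        # Both precision and recall are zero; define the score as zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Hypothetical values, for illustration only (not results from the article).
p, r = 0.5, 0.75
f1 = f_beta(p, r)            # plain harmonic mean: 2pr / (p + r)
f2 = f_beta(p, r, beta=2.0)  # recall-weighted variant
print(f"F1 = {f1:.4f}, F2 = {f2:.4f}")
```

With recall higher than precision, as in these illustrative values, the β = 2 score lies closer to recall than the F1-score does.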
Ethics declarations
Conflicts of interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Meetei, L.S., Singh, T.D. & Bandyopadhyay, S. Exploiting multiple correlated modalities can enhance low-resource machine translation quality. Multimed Tools Appl 83, 13137–13157 (2024). https://doi.org/10.1007/s11042-023-15721-2
DOI: https://doi.org/10.1007/s11042-023-15721-2