Abstract
A parallel corpus is a key component of statistical and neural machine translation (NMT). While most research focuses on machine translation itself, studies of corpus creation are limited for many languages, and no research paper on a Mizo–English corpus exists yet. A high-quality parallel corpus is required for Natural Language Processing tasks including machine translation, chatbots, transliteration, and cross-language information retrieval. This work investigates parallel corpus creation techniques and applies them to the Mizo–English language pair. A second goal is to test machine translation on the newly constructed corpus. We contributed to the LF Aligner tool, adding support for the Mizo language so that Mizo sentences could be aligned during corpus development. Our effort produced the first large-scale Mizo–English parallel corpus, with over 529K sentences. The pre-processed corpus was used to train a Mizo-to-English NMT system, which was evaluated using BLEU, Character F1 Score (ChrF), and Translation Edit Rate (TER). Our system achieved BLEU 45.08, ChrF 65.36, and TER 41.16, setting a new benchmark for Mizo-to-English translation.
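Of the three metrics, TER is the one where lower is better: it counts the edits needed to turn a system output into the reference, normalized by reference length. The following is a minimal sketch of that idea, assuming whitespace tokenization and omitting the block-shift operation that full TER includes, so it will not reproduce the scores reported above. The function names `edit_distance` and `simple_edit_rate` are illustrative and are not taken from the paper or from any evaluation toolkit.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simple_edit_rate(hypothesis, reference):
    """Edits divided by reference length, as a percentage.

    Simplification (assumption): no shift operation, unlike full TER.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(hyp, ref) / len(ref)


if __name__ == "__main__":
    # One missing word against a six-word reference.
    print(simple_edit_rate("the cat sat on mat", "the cat sat on the mat"))
```

For scores comparable with published results, an established evaluation toolkit should be used rather than this sketch.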