short-paper

Construction of Mizo: English Parallel Corpus for Machine Translation

Authors:
Thangkhanhau Haulai, Mizoram University, India (ORCID: 0000-0001-7154-5364)
Jamal Hussain, Mizoram University, India (ORCID: 0000-0001-5553-3654)

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 8, Article No.: 220, pp 1–12, https://doi.org/10.1145/3610404

Published: 24 August 2023


Abstract

A parallel corpus is a key component of statistical and neural machine translation (NMT). While most research focuses on machine translation itself, studies of corpus creation are limited for many languages, and no research paper on a Mizo–English corpus exists yet. A high-quality parallel corpus is required for natural language processing tasks including machine translation, chatbots, transliteration, and cross-language information retrieval. This work investigates parallel corpus creation techniques and applies them to the Mizo–English language pair. A second goal is to test machine translation on the newly constructed corpus. We contributed Mizo language support to the LF Aligner tool so that it could align Mizo sentences during corpus development. Our effort produced the first large-scale Mizo–English parallel corpus, with over 529K sentences. The pre-processed corpus was used for Mizo-to-English NMT, which was evaluated using BLEU, character n-gram F-score (ChrF), and Translation Edit Rate (TER). Our system achieved BLEU 45.08, ChrF 65.36, and TER 41.16, setting a new benchmark for Mizo-to-English translation.
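To give a rough sense of what a character-level metric such as ChrF measures, the following is a minimal sketch of a simplified sentence-level character n-gram F-score. It is not the evaluation tooling used by the authors, and it will not reproduce the scores reported above, which are normally computed at corpus level with standard tools. The function names `char_ngrams` and `simple_chrf` are hypothetical, and the settings (n-grams up to length 6, beta = 2, whitespace dropped) are common defaults assumed here for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after dropping spaces (an assumed default)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders for which either side has no n-grams
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)  # mean precision over n-gram orders
    r = sum(recalls) / len(recalls)        # mean recall over n-gram orders
    if p + r == 0:
        return 0.0
    # F-beta: beta = 2 weights recall more heavily than precision
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    # Illustrative English strings only; not taken from the corpus.
    print(simple_chrf("the cat sat on the mat", "the cat sits on the mat"))
```

Because the score is built from character n-grams rather than whole words, a hypothesis that differs from the reference only in a suffix still receives partial credit, which is one reason character-level metrics are often reported alongside BLEU for morphologically rich or low-resource languages.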


Index Terms

Construction of Mizo: English Parallel Corpus for Machine Translation
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Machine translation


Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 22, Issue 8
August 2023
373 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3615980
Editor: Imed Zitouni, Google, USA
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 August 2023
- Online AM: 21 July 2023
- Accepted: 13 July 2023
- Revised: 13 June 2023
- Received: 20 November 2022
Author Tags
Mizo
corpus construction
bilingual corpus
parallel text
machine translation
Qualifiers
- short-paper

Article Metrics
- Total Citations: 0
- Total Downloads: 233
- Downloads (Last 12 months): 233
- Downloads (Last 6 weeks): 17