Construction of Mizo: English Parallel Corpus for Machine Translation

Published: 24 August 2023

Abstract

A parallel corpus is a key component of statistical and neural machine translation (NMT). While most research focuses on machine translation itself, studies of corpus creation are limited for many languages, and no research paper on a Mizo–English corpus exists yet. A high-quality parallel corpus is required for natural language processing tasks including machine translation, chatbots, transliteration, and cross-language information retrieval. This work investigates parallel corpus creation techniques and applies them to the Mizo–English language pair. A further goal is to test machine translation on the newly constructed corpus. We contributed to the LF Aligner tool, adding support for the Mizo language so that Mizo sentences can be aligned during corpus development. Our effort produced the first large-scale Mizo–English parallel corpus, with over 529K sentences. The pre-processed corpus was used for Mizo-to-English NMT, and the resulting system was evaluated using BLEU, Character F1 Score (ChrF), and Translation Edit Rate (TER). Our system achieved BLEU 45.08, ChrF 65.36, and TER 41.16, setting a new benchmark for Mizo-to-English translation.
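The abstract does not say which tooling produced the reported scores. As a rough illustration of what a character n-gram F-score such as ChrF measures, here is a minimal sketch. The function names `char_ngrams` and `simple_chrf` are hypothetical, and the defaults (n-gram orders 1 to 6, beta = 2, whitespace ignored) are assumptions borrowed from common ChrF practice, not settings confirmed by the paper. This simplified sentence-level version is not expected to reproduce the reported numbers.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    # Illustrative English strings only; not taken from the corpus.
    print(simple_chrf("the cat sat on the mat", "the cat is on the mat"))
```

A score of 100 means the hypothesis and reference share every character n-gram, and 0 means they share none. Character-level matching gives partial credit for near-miss word forms, which is one reason such metrics are often reported alongside BLEU.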

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 8
      August 2023
      373 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3615980

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 24 August 2023
      • Online AM: 21 July 2023
      • Accepted: 13 July 2023
      • Revised: 13 June 2023
      • Received: 20 November 2022

      Qualifiers

      • short-paper