Manipuri–English comparable corpus for cross-lingual studies

  • Project Notes
  • Language Resources and Evaluation

Abstract

This paper presents Mni-EnCC, a temporally aligned Manipuri–English comparable corpus, to facilitate cross-lingual studies between Manipuri and English. Mni-EnCC was created by collating text from two publicly available online news sources in Manipur, namely Sangai Express and Poknapham. Although both publishers issue Manipuri and English editions, the editions are not translations of each other. Almost all of the Manipuri editions are created using proprietary tools that generate text in customized, non-standard, non-Unicode encodings. We develop tools to transform the non-Unicode text into Unicode text to generate the Manipuri corpus. We then verify and time-align all the articles using a semi-automated process. Furthermore, the quality of Mni-EnCC is evaluated on two prominent cross-lingual tasks: bilingual dictionary induction and machine translation. Experimental observations provide encouraging results, making it a suitable dataset for future cross-lingual studies on the Manipuri–English language pair. With the objective of promoting cross-lingual studies in Manipuri–English, we also plan to release the corpus and the supporting Unicode conversion tool.

Notes

  1. Meitei Mayek is another script used for writing Manipuri. However, as most of the presently available online Manipuri texts are in the Bengali script, we have considered only texts written in the Bengali script (https://en.wikipedia.org/wiki/Meitei_language).

  2. https://www.thesangaiexpress.com/.

  3. http://poknapham.in/.

  4. The Bengali and Meitei Mayek scripts are both used for writing Manipuri documents.

  5. https://www.pmindia.gov.in/en/mann-ki-baat/.

  6. https://comparable.limsi.fr/bucc2019/bucc-introduction.html.

  7. https://pib.gov.in/indexd.aspx.

  8. https://www.tdil-dc.in/.

  9. www.pmindia.gov.in.

  10. https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem.

  11. A glyph is a primary symbol from an agreed set of symbols that, alone or combined with other glyphs, is intended to represent a character for writing purposes.

  12. An integer value that uniquely identifies a character.

  13. This is true at the time of curating the dataset.

  14. https://docs.python.org/3/library/urllib.html.

  15. https://beautiful-soup-4.readthedocs.io/en/latest/.

  16. It also provides information about the article's theme (e.g., sports) and its publication date.

  17. The publication date and the date represented in the URLs differ by one day because the news articles on the website are updated a day after their publication in the respective newspaper.

  18. https://en.wikipedia.org/wiki/Byline.

  19. https://comparable.limsi.fr/bucc2020/bucc2020-task.html.

  20. https://github.com/moses-smt/mosesdecoder.

  21. https://github.com/facebookresearch/MUSE.

  22. https://github.com/artetxem/vecmap.

  23. https://fasttext.cc/.

  24. https://github.com/artetxem/monoses.

  25. https://github.com/artetxem/undreamt.

  26. https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl.

  27. https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.

  28. We will make the Unicode conversion codes, comparable corpora, and evaluation datasets available to other researchers on request.
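The one-day offset described in note 17 matters when articles are grouped by date for temporal alignment. The following Python fragment is a minimal sketch of that idea, not the authors' pipeline: the function names, the `(date, article_id)` input shape, and the assumption that the URL date is one day later than the print date are all illustrative.

```python
from collections import defaultdict
from datetime import date, timedelta


def publication_date(url_date, lag_days=1):
    """Estimate the print publication date from the date found in a URL.

    Assumption (hypothetical): the URL date is `lag_days` after the print date.
    """
    return url_date - timedelta(days=lag_days)


def align_by_date(mni_articles, en_articles):
    """Pair Manipuri and English articles that share a publication date.

    Each input is an iterable of (date, article_id) tuples. The result maps
    every date present in BOTH collections to a tuple
    (manipuri_ids, english_ids).
    """
    buckets = defaultdict(lambda: ([], []))
    for day, article_id in mni_articles:
        buckets[day][0].append(article_id)
    for day, article_id in en_articles:
        buckets[day][1].append(article_id)
    # Keep only the dates for which both languages contributed articles.
    return {day: pair for day, pair in sorted(buckets.items())
            if pair[0] and pair[1]}
```

Dates with articles in only one language are dropped here; a real pipeline might instead keep them for manual review, in line with the semi-automated verification mentioned in the abstract.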


Author information

Corresponding author

Correspondence to Lenin Laitonjam.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Tables 10 and 11 show the complete mapping tables for converting ASCII-based character(s) to the corresponding Unicode character(s) for the Sangai Express and Poknapham Manipuri texts, respectively.

Table 10 Mapping Table for the Sangai Express texts
Table 11 Mapping Table for the Poknapham texts
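A mapping table of this kind can be applied with a greedy longest-match substitution. The Python sketch below illustrates the general technique only; it is not the authors' conversion tool. The helper name `build_converter` and the entries in `DEMO_MAPPING` are hypothetical placeholders and are not taken from Tables 10 or 11.

```python
def build_converter(mapping):
    """Return a function that rewrites legacy-encoded text using `mapping`.

    `mapping` maps legacy ASCII sequences (one or more characters) to
    Unicode strings. Longer sequences are tried first at each position.
    """
    max_len = max(map(len, mapping), default=0)

    def convert(text):
        out = []
        i = 0
        while i < len(text):
            # Try the longest candidate sequence first, then shorter ones.
            for size in range(min(max_len, len(text) - i), 0, -1):
                chunk = text[i:i + size]
                if chunk in mapping:
                    out.append(mapping[chunk])
                    i += size
                    break
            else:
                # No entry matched: copy the character through unchanged.
                out.append(text[i])
                i += 1
        return "".join(out)

    return convert


# Hypothetical demo entries, NOT the contents of Tables 10 or 11.
DEMO_MAPPING = {
    "k": "\u0995",   # BENGALI LETTER KA
    "kh": "\u0996",  # BENGALI LETTER KHA
    "g": "\u0997",   # BENGALI LETTER GA
}

convert = build_converter(DEMO_MAPPING)
```

Trying the longest sequence first keeps a multi-character entry such as `"kh"` from being split into its single-character prefix. Legacy fonts may also store some glyphs in visual rather than logical order, and this sketch does not attempt any reordering.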

Cite this article

Laitonjam, L., Singh, S.R. Manipuri–English comparable corpus for cross-lingual studies. Lang Resources & Evaluation 57, 377–413 (2023). https://doi.org/10.1007/s10579-021-09576-y

  • DOI: https://doi.org/10.1007/s10579-021-09576-y
