Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Neural machine translation systems trained on low-resource languages produce sub-optimal results due to the scarcity of large parallel datasets. To alleviate this problem, parallel corpora can be mined from the web. Two key tasks in a parallel corpus mining pipeline are web document alignment and sentence alignment. Effective approaches for these tasks obtain vector representations of the documents (or sentences) belonging to the two languages and determine the alignment between the documents (or sentences) based on a semantic similarity scoring mechanism. Recently, document or sentence representations obtained from pre-trained multilingual language models (PMLMs) such as LASER, XLM-R and LaBSE have significantly improved benchmark scores in diverse natural language processing tasks. In this study, we carry out an empirical analysis of the effectiveness of these PMLMs on the document and sentence alignment tasks in the context of the low-resource language pairs Sinhala–English, Tamil–English and Sinhala–Tamil. Further, we introduce a weighting mechanism based on small-scale bilingual lexicons to improve the semantic similarity measurement between sentences and documents. Our results show that both document and sentence alignment can be further improved using our weighting mechanism. We have also compiled a gold-standard evaluation benchmark dataset for the document alignment and sentence alignment tasks for the considered language pairs. This dataset (https://github.com/kdissa/comparable-corpus) and the source code (https://github.com/nlpcuom/parallel_corpus_mining) are publicly released.
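
As a rough illustration of embedding-based alignment by semantic similarity, the sketch below scores every source–target pair with cosine similarity and greedily keeps the best one-to-one matches. It is a minimal sketch under stated assumptions, not the scoring mechanism evaluated in the paper: the toy two-dimensional vectors stand in for PMLM embeddings, and the names `cosine`, `greedy_align` and the `threshold` value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_align(src_vecs, tgt_vecs, threshold=0.5):
    """Return one-to-one (src_index, tgt_index, score) pairs, highest scores first.

    `threshold` is an assumed cut-off for illustration only.
    """
    scored = sorted(
        ((cosine(s, t), i, j)
         for i, s in enumerate(src_vecs)
         for j, t in enumerate(tgt_vecs)),
        reverse=True,
    )
    used_src, used_tgt, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # every remaining candidate scores lower
        if i in used_src or j in used_tgt:
            continue  # keep the alignment one-to-one
        used_src.add(i)
        used_tgt.add(j)
        pairs.append((i, j, score))
    return pairs


# Toy vectors standing in for sentence or document embeddings.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.1, 0.9], [0.9, 0.1]]
alignment = greedy_align(src, tgt)
```

In this toy input, source item 0 pairs with target item 1 and source item 1 with target item 0, since those pairs have the highest cosine similarity.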

Notes

  1. https://github.com/kdissa/comparable-corpus.

  2. mBERT was not considered since it does not include Sinhala.

  3. http://www.hirunews.lk.

  4. https://www.newsfirst.lk/.

  5. https://www.army.lk/.

  6. https://www.itnnews.lk.

  7. http://www.statmt.org/wmt20/translation-task.html.

  8. https://www.languagesdept.gov.lk/.

  9. We use k = 4 for all experiments in this work as it gave the best results in all our experiments.

  10. https://uom.lk/nlp

References

  1. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30

  2. Koehn P, Knowles R (2017) Six challenges for neural machine translation. In: Proceedings of the first workshop on neural machine translation. Association for Computational Linguistics, Vancouver, pp 28–39

  3. Ranathunga S, Lee ESA, Skenduli MP, Shekhar R, Alam M, Kaur R (2021) Neural machine translation for low-resource languages: a survey. arXiv preprint arXiv:2106.15115

  4. Kreutzer J, Caswell I, Wang L, Wahab A, van Esch D, Ulzii-Orshikh N et al (2022) Quality at a glance: an audit of web-crawled multilingual datasets. Trans Assoc Comput Linguist 10:50–72

  5. Bañón M, Chen P, Haddow B, Heafield K, Hoang H, Esplà-Gomis M et al (2020) ParaCrawl: web-scale acquisition of parallel corpora. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 4555–4567

  6. Buck C, Koehn P (2016) Findings of the WMT 2016 bilingual document alignment shared task. In: Proceedings of the first conference on machine translation: volume 2, shared task papers. Association for Computational Linguistics, Berlin, pp 554–563

  7. Resnik P (1998) Parallel strands: a preliminary investigation into mining the web for bilingual text. In: Conference of the association for machine translation in the Americas. Springer, pp 72–82

  8. Resnik P (1999) Mining the web for bilingual text. In: Proceedings of the 37th annual meeting of the association for computational linguistics, pp 527–534

  9. Papavassiliou V, Prokopidis P, Piperidis S (2016) The ilsp/arc submission to the wmt 2016 bilingual document alignment shared task. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 733–739

  10. Resnik P, Smith NA (2003) The web as a parallel corpus. Comput Linguist 29(3):349–380

  11. Espla-Gomis M, Forcada ML, Ortiz-Rojas S, Ferrández-Tordera J (2016) Bitextor’s participation in WMT’16: shared task on document alignment. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 685–691

  12. Etchegoyhen T, Gete H (2020) Handle with care: a case study in comparable corpora exploitation for neural machine translation. In: Proceedings of The 12th language resources and evaluation conference, pp 3799–3807

  13. El-Kishky A, Guzmán F (2020) Massively multilingual document alignment with cross-lingual sentence-mover’s distance. In: Proceedings of the 1st conference of the asia-pacific chapter of the association for computational linguistics and the 10th international joint conference on natural language processing. Association for Computational Linguistics, Suzhou, pp 616–625

  14. Varga D, Halácsy P, Kornai A, Nagy V, Németh L, Trón V (2007) Parallel corpora for medium density languages. Amsterdam Stud Theory Hist Linguist Sci Ser 4(292):247

  15. Munteanu DS, Marcu D (2005) Improving machine translation performance by exploiting non-parallel corpora. Comput Linguist 31(4):477–504

  16. Sarikaya R, Maskey S, Zhang R, Jan EE, Wang D, Ramabhadran B et al (2009) Iterative sentence-pair extraction from quasi-parallel corpora for machine translation. In: Tenth annual conference of the international speech communication association, pp 432–435

  17. Kvapilíková I, Artetxe M, Labaka G, Agirre E, Bojar O (2020) Unsupervised multilingual sentence embeddings for parallel corpus mining. In: Proceedings of the 58th annual meeting of the association for computational linguistics: student research workshop, pp 255–262

  18. Feng F, Yang Y, Cer D, Arivazhagan N, Wang W (2020) Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852

  19. Artetxe M, Schwenk H (2019) Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans Assoc Comput Linguist 7:597–610

  20. Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F et al (2020) Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 8440–8451

  21. Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). Association for Computational Linguistics, Minneapolis, pp 4171–4186

  22. Rajitha C, Piyarathna L, Sachintha D, Ranathunga S (2021) Metric learning in multilingual sentence similarity measurement for document alignment. In: Proceedings of the international conference on recent advances in natural language processing (RANLP 2021). Held online: INCOMA Ltd., pp 1150–1157. https://aclanthology.org/2021.ranlp-1.129

  23. Ni J, Ábrego GH, Constant N, Ma J, Hall KB, Cer D et al (2021) Sentence-t5: scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877

  24. Joshi P, Santy S, Budhiraja A, Bali K, Choudhury M (2020) The state and fate of linguistic diversity and inclusion in the NLP world. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 6282–6293

  25. Artetxe M, Schwenk H (2019) Margin-based parallel corpus mining with multilingual sentence embeddings. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, pp 3197–3203

  26. Koehn P, Khayrallah H, Heafield K, Forcada ML (2018) Findings of the wmt 2018 shared task on parallel corpus filtering. In: Proceedings of the third conference on machine translation: shared task papers, pp 726–739

  27. Chen J, Nie JY (2000) Parallel web text mining for cross-language IR. In: Content-based multimedia information access, vol 1. RIAO, pp 62–77

  28. Shi L, Niu C, Zhou M, Gao J (2006) A DOM tree alignment model for mining parallel data from the web. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics, pp 489–496

  29. Zafarian A, Sadeghi APA, Azadi F, Ghiasifard S, Panahloo ZA, Bakhshaei S et al (2015) AUT document alignment framework for BUCC workshop shared task. In: Proceedings of the eighth workshop on building and using comparable corpora, pp 79–87

  30. Li B, Gaussier E (2013) Exploiting comparable corpora for lexicon extraction: Measuring and improving corpus quality. In: Building and using comparable corpora. Springer, pp 131–149

  31. Ma X, Liberman M (1999) Bits: a method for bilingual text search over the web. In: Machine translation summit VII, pp 538–542

  32. Fung P, Cheung P (2004) Mining very-non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM. In: Proceedings of the 2004 conference on empirical methods in natural language processing, pp 57–63

  33. Ion R, Ceauşu A, Irimia E (2011) An expectation maximization algorithm for textual unit alignment. In: Proceedings of the 4th workshop on building and using comparable corpora: comparable corpora and the web, pp 128–135

  34. Gomes L, Lopes G (2016) First steps towards coverage-based document alignment. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 697–702

  35. Morin E, Hazem A, Boudin F, Loginova-Clouet E (2015) LINA: identifying comparable documents from Wikipedia. In: Proceedings of the eighth workshop on building and using comparable corpora. Association for Computational Linguistics, Beijing, pp 88–91

  36. Uszkoreit J, Ponte J, Popat A, Dubiner M (2010) Large scale parallel document mining for machine translation. In: Proceedings of the 23rd international conference on computational linguistics (Coling 2010), pp 1101–1109

  37. Rajitha M, Piyarathna L, Nayanajith M, Surangika S (2020) Sinhala and English document alignment using statistical machine translation. In: 2020 20th international conference on advances in ICT for emerging regions (ICTer). IEEE, pp 29–34

  38. Jakubina L, Langlais P (2016) Bad luc@ wmt 2016: a bilingual document alignment platform based on lucene. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 703–709

  39. Medveď M, Jakubíček M, Kovář V (2016) English-French document alignment based on keywords and statistical translation. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 728–732

  40. Buck C, Koehn P (2016) Quick and reliable document alignment via tf/idf-weighted cosine distance. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 672–678

  41. Germann U (2016) Bilingual document alignment with latent semantic indexing. In: Proceedings of the first conference on machine translation: volume 2, shared task papers. Association for Computational Linguistics, Berlin, pp 692–696

  42. Dara AA, Lin YC (2016) Yoda system for wmt16 shared task: bilingual document alignment. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 679–684

  43. Brown PF, Lai JC, Mercer RL (1991) Aligning sentences in parallel corpora. In: 29th Annual meeting of the association for computational linguistics. Association for Computational Linguistics, Berkeley, pp 169–176

  44. Gale WA, Church KW (1993) A program for aligning sentences in bilingual corpora. Comput Linguist 19(1):75–102

  45. Ma X (2006) Champollion: a robust parallel text sentence aligner. In: Proceedings of the fifth international conference on language resources and evaluation (LREC’06). European Language Resources Association (ELRA), Genoa, pp 489–492

  46. Munteanu DS, Marcu D (2002) Processing comparable corpora with bilingual suffix trees. In: Proceedings of the 2002 conference on empirical methods in natural language processing (EMNLP 2002), pp 289–295

  47. Stefanescu D, Ion R, Hunsicker S (2012) Hybrid parallel sentence mining from comparable corpora. In: Proceedings of the 16th annual conference of the European association for machine translation, pp 137–144

  48. Abdul-Rauf S, Schwenk H (2009) On the use of comparable corpora to improve SMT performance. In: Proceedings of the 12th conference of the european chapter of the ACL (EACL 2009), pp 16–23

  49. Mahata S, Das D, Bandyopadhyay S (2017) BUCC2017: a hybrid approach for identifying parallel sentences in comparable corpora. In: Proceedings of the 10th workshop on building and using comparable corpora, pp 56–59

  50. Azpeitia A, Etchegoyhen T, Garcia EM (2017) Weighted set-theoretic alignment of comparable sentences. In: Proceedings of the 10th workshop on building and using comparable corpora, pp 41–45

  51. Azpeitia A, Etchegoyhen T, Garcia EM (2018) Extracting parallel sentences from comparable corpora with STACC variants. In: Proceedings of the 11th workshop on building and using comparable corpora, pp 48–52

  52. Grégoire F, Langlais P (2017) Bucc 2017 shared task: a first attempt toward a deep learning framework for identifying parallel sentences in comparable corpora. In: Proceedings of the 10th workshop on building and using comparable corpora, pp 46–50

  53. Iyyer M, Manjunatha V, Boyd-Graber J, Daumé III H (2015) Deep unordered composition rivals syntactic methods for text classification. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers), pp 1681–1691

  54. Guo M, Shen Q, Yang Y, Ge H, Cer D, Abrego GH et al (2018) Effective parallel corpus mining using bilingual sentence embeddings. WMT 2018:165

  55. Leong C, Wong DF, Chao LS (2018) Um-paligner: neural network-based parallel sentence identification model. In: 11th Workshop on building and using comparable corpora, p 53

  56. Bouamor H, Sajjad H (2018) H2@ bucc18: parallel sentence extraction from comparable corpora using multilingual sentence embeddings. In: Proceedings of workshop on building and using comparable corpora, pp 43–47

  57. Hangya V, Fraser A (2019) Unsupervised parallel sentence extraction with parallel segment detection helps machine translation. In: Proceedings of the 57th annual meeting of the association for computational linguistics, pp 1224–1234

  58. Schwenk H, Chaudhary V, Sun S, Gong H, Guzmán F (2021) WikiMatrix: mining 135M parallel sentences in 1620 language pairs from Wikipedia. In: Proceedings of the 16th conference of the European chapter of the association for computational linguistics: main volume, pp 1351–1361

  59. Schwenk H, Wenzek G, Edunov S, Grave E, Joulin A, Fan A (2021) CCMatrix: mining billions of high-quality parallel sentences on the web. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers). Association for Computational Linguistics, pp 6490–6500

  60. Yang Y, Ábrego GH, Yuan S, Guo M, Shen Q, Cer D et al (2019) Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. In: Proceedings of the twenty-eighth international joint conference on artificial intelligence (IJCAI-19), pp 5370–5378

  61. Zweigenbaum P, Sharoff S, Rapp R (2018) Overview of the third BUCC shared task: spotting parallel sentences in comparable corpora. In: Proceedings of 11th workshop on building and using comparable corpora, pp 39–42

  62. Koehn P, Guzmán F, Chaudhary V, Pino J (2019) Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions. In: Proceedings of the fourth conference on machine translation (volume 3: shared task papers, day 2), pp 54–72

  63. Priyadarshani H, Rajapaksha M, Ranasinghe M, Sarveswaran K, Dias G (2019) Statistical machine learning for transliteration: transliterating names between Sinhala, Tamil and English. In: 2019 International conference on asian language processing (IALP). IEEE, pp 244–249

  64. Farhath F, Ranathunga S, Jayasena S, Dias G (2018) Integration of bilingual lists for domain-specific statistical machine translation for Sinhala-Tamil. In: Moratuwa engineering research conference (MERCon). IEEE, pp 538–543

  65. Thompson B, Koehn P (2019) Vecalign: improved sentence alignment in linear time and space. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 1342–1348

  66. Fernando A, Ranathunga S, Dias G (2020) Data augmentation and terminology integration for domain-specific Sinhala-English-Tamil statistical machine translation. arXiv preprint arXiv:2011.02821

  67. Guzmán F, Chen PJ, Ott M, Pino J, Lample G, Koehn P et al (2019) The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, pp 6098–6111

  68. Goyal N, Gao C, Chaudhary V, Chen PJ, Wenzek G, Ju D et al (2021) The flores-101 evaluation benchmark for low-resource and multilingual machine translation. arXiv preprint arXiv:2106.03193

  69. Liu Y, Gu J, Goyal N, Li X, Edunov S, Ghazvininejad M et al (2020) Multilingual denoising pre-training for neural machine translation. Trans Assoc Comput Linguist 8:726–742

  70. Thillainathan S, Ranathunga S, Jayasena S (2021) Fine-tuning self-supervised multilingual sequence-to-sequence models for extremely low-resource NMT. In: 2021 Moratuwa engineering research conference (MERCon). IEEE, pp 432–437

  71. Lee ESA, Thillainathan S, Nayak S, Ranathunga S, Adelani DI, Su R et al (2022) Pre-trained multilingual sequence-to-sequence models: a hope for low-resource language translation? arXiv preprint https://doi.org/10.48550/arXiv.2203.08850

  72. Ott M, Edunov S, Baevski A, Fan A, Gross S, Ng N et al (2019) fairseq: a fast, extensible toolkit for sequence modeling. In: Proceedings of NAACL-HLT 2019: demonstrations, pp 48–53

  73. Post M (2018) A call for clarity in reporting BLEU scores. In: Proceedings of the third conference on machine translation: research papers. Association for Computational Linguistics, Belgium, pp 186–191

Acknowledgements

Aloka Fernando was initially funded by the Accelerating Higher Education Expansion and Development (AHEAD) Operation of the Ministry of Education, Sri Lanka, funded by the World Bank. Currently, she is funded by a Senate Research Committee (SRC) grant from the University of Moratuwa, Sri Lanka. Dataset creation was funded by an SRC grant from University of Moratuwa, Sri Lanka.

Funding

Funding was provided by the Accelerating Higher Education Expansion and Development (AHEAD) Operation and by a Senate Research Committee (SRC) grant from the University of Moratuwa.

Author information

Correspondence to Aloka Fernando.

Ethics declarations

Conflict of interest

The authors have no conflict of interest to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Algorithms for using bilingual lexicons

Our improvements to the document alignment and sentence alignment algorithms use bilingual lexicons, as explained in Sects. 4.1.1 and 4.2.2, respectively. The supporting algorithms for term matching using person names (Algorithm 1) and the rest of the bilingual lexicons (Algorithm 2) are shown below.

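[Figure: Algorithm 1 (term matching using person names) and Algorithm 2 (term matching using the rest of the bilingual lexicons)]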
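
Because the algorithm listings are not reproduced in this text, the following is only a hedged sketch of what lexicon-based term matching and weighting could look like. It is not Algorithm 1 or Algorithm 2: the function names, the overlap measure, the `alpha` weight and the placeholder tokens are all assumptions made for illustration.

```python
def lexicon_overlap(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens with at least one lexicon translation in the target.

    `lexicon` maps a source term to a set of candidate target translations.
    """
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    matched = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_set)
    return matched / len(src_tokens)


def weighted_similarity(embedding_sim, overlap, alpha=0.5):
    """Boost an embedding similarity by the lexicon overlap.

    `alpha` is a hypothetical weight, not a value taken from the paper.
    """
    return embedding_sim * (1.0 + alpha * overlap)


# Placeholder tokens stand in for real source-language terms.
toy_lexicon = {
    "src_word_a": {"president"},
    "src_word_b": {"minister", "secretary"},
}
src_tokens = ["src_word_a", "src_word_b", "src_word_c"]
tgt_tokens = ["the", "president", "met", "the", "minister"]

overlap = lexicon_overlap(src_tokens, tgt_tokens, toy_lexicon)
score = weighted_similarity(0.8, overlap)
```

Here two of the three source tokens find a translation in the target, so the overlap is 2/3 and the assumed embedding similarity of 0.8 is scaled up accordingly. A multiplicative boost is only one possible design; an additive term or a learned weight would be equally plausible.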

Appendix B Document alignment results

Table 11 shows the document alignment results for each news source for the language pairs English–Sinhala, English–Tamil and Sinhala–Tamil. Table 7 reports the average of these individual scores across the news sources. The score in bold is the best F1 score for the given news source and language pair.

Table 11 Document alignment results in terms of recall (R), precision (P) and F1 with respect to each language pair. Here, BL refers to the recreated baseline [13] using LASER embeddings. Each bilingual lexicon was then added on top of this baseline and the experiments were repeated. The bilingual lexicons considered were Person Names (N), Designations (Ds), Dictionary (Dc) and Improved Dictionary (MDc). The same set of experiments was subsequently conducted with the PMLMs XLM-R and LaBSE.
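
For reference, recall, precision and F1 over aligned pairs can be computed as in the minimal sketch below. It uses the standard set-based definitions against a gold standard; the document identifiers are made up and the function name is hypothetical, so this may differ in detail from the evaluation script used for Table 11.

```python
def precision_recall_f1(predicted_pairs, gold_pairs):
    """Set-based precision, recall and F1 of predicted alignments against gold alignments."""
    predicted, gold = set(predicted_pairs), set(gold_pairs)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up document identifiers for illustration.
gold = {("en_1", "si_1"), ("en_2", "si_2"), ("en_3", "si_3"), ("en_4", "si_4")}
predicted = {("en_1", "si_1"), ("en_2", "si_2"), ("en_3", "si_9")}

p, r, f1 = precision_recall_f1(predicted, gold)
```

With two of the three predicted pairs correct and four gold pairs in total, precision is 2/3, recall is 1/2 and F1 is 4/7.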

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Fernando, A., Ranathunga, S., Sachintha, D. et al. Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages. Knowl Inf Syst 65, 571–612 (2023). https://doi.org/10.1007/s10115-022-01761-x
