Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages

Fernando, Aloka; Ranathunga, Surangika; Sachintha, Dilan; Piyarathna, Lakmali; Rajitha, Charith

doi:10.1007/s10115-022-01761-x

Regular Paper
Published: 17 October 2022

Knowledge and Information Systems, volume 65, pages 571–612 (2023)

Abstract

Neural machine translation systems trained on low-resource languages produce sub-optimal results due to the scarcity of large parallel datasets. To alleviate this problem, parallel corpora can be mined from the web. Two key tasks in a parallel corpus mining pipeline are web document alignment and sentence alignment. Effective approaches for these tasks obtain vector representations of the documents (or sentences) belonging to the two languages and determine the alignment between the documents (or sentences) based on a semantic similarity scoring mechanism. Recently, document or sentence representations obtained from pre-trained multilingual language models (PMLMs) such as LASER, XLM-R and LaBSE have significantly improved the benchmark scores in diverse natural language processing tasks. In this study, we carry out an empirical analysis of the effectiveness of these PMLMs on the document and sentence alignment tasks in the context of the low-resource language pairs Sinhala–English, Tamil–English and Sinhala–Tamil. Further, we introduce a weighting mechanism based on small-scale bilingual lexicons to improve the semantic similarity measurement between sentences and documents. Our results show that both document and sentence alignment can be further improved using our weighting mechanism. We have also compiled a gold-standard evaluation benchmark dataset for the document alignment and sentence alignment tasks for the considered language pairs. This dataset (https://github.com/kdissa/comparable-corpus) and the source code (https://github.com/nlpcuom/parallel_corpus_mining) are publicly released.

Notes

https://github.com/kdissa/comparable-corpus.
mBERT was not considered since it does not include Sinhala.
http://www.hirunews.lk.
https://www.newsfirst.lk/.
https://www.army.lk/.
https://www.itnnews.lk.
http://www.statmt.org/wmt20/translation-task.html.
https://www.languagesdept.gov.lk/.
We use k = 4 throughout this work, as it gave the best results in all our experiments.
https://uom.lk/nlp

References

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Koehn P, Knowles R (2017) Six challenges for neural machine translation. In: Proceedings of the first workshop on neural machine translation. Association for Computational Linguistics, Vancouver, pp 28–39
Ranathunga S, Lee ESA, Skenduli MP, Shekhar R, Alam M, Kaur R (2021) Neural machine translation for low-resource languages: a survey. arXiv preprint arXiv:2106.15115
Kreutzer J, Caswell I, Wang L, Wahab A, van Esch D, Ulzii-Orshikh N et al (2022) Quality at a glance: an audit of web-crawled multilingual datasets. Trans Assoc Comput Linguist 10:50–72
Bañón M, Chen P, Haddow B, Heafield K, Hoang H, Esplà-Gomis M et al (2020) ParaCrawl: web-scale acquisition of parallel corpora. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 4555–4567
Buck C, Koehn P (2016) Findings of the WMT 2016 bilingual document alignment shared task. In: Proceedings of the first conference on machine translation: volume 2, shared task papers. Association for Computational Linguistics, Berlin, pp 554–563
Resnik P (1998) Parallel strands: a preliminary investigation into mining the web for bilingual text. In: Conference of the association for machine translation in the Americas. Springer, pp 72–82
Resnik P (1999) Mining the web for bilingual text. In: Proceedings of the 37th annual meeting of the association for computational linguistics, pp 527–534
Papavassiliou V, Prokopidis P, Piperidis S (2016) The ilsp/arc submission to the wmt 2016 bilingual document alignment shared task. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 733–739
Resnik P, Smith NA (2003) The web as a parallel corpus. Comput Linguist 29(3):349–380
Espla-Gomis M, Forcada ML, Ortiz-Rojas S, Ferrández-Tordera J (2016) Bitextor’s participation in WMT’16: shared task on document alignment. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 685–691
Etchegoyhen T, Gete H (2020) Handle with care: a case study in comparable corpora exploitation for neural machine translation. In: Proceedings of The 12th language resources and evaluation conference, pp 3799–3807
El-Kishky A, Guzmán F (2020) Massively multilingual document alignment with cross-lingual sentence-mover’s distance. In: Proceedings of the 1st conference of the asia-pacific chapter of the association for computational linguistics and the 10th international joint conference on natural language processing. Association for Computational Linguistics, Suzhou, pp 616–625
Varga D, Halácsy P, Kornai A, Nagy V, Németh L, Trón V (2007) Parallel corpora for medium density languages. Amsterdam Stud Theory Hist Linguist Sci Ser 4(292):247
Munteanu DS, Marcu D (2005) Improving machine translation performance by exploiting non-parallel corpora. Comput Linguist 31(4):477–504
Sarikaya R, Maskey S, Zhang R, Jan EE, Wang D, Ramabhadran B et al (2009) Iterative sentence-pair extraction from quasi-parallel corpora for machine translation. In: Tenth annual conference of the international speech communication association, pp 432–435
Kvapilíková I, Artetxe M, Labaka G, Agirre E, Bojar O (2020) Unsupervised multilingual sentence embeddings for parallel corpus mining. In: Proceedings of the 58th annual meeting of the association for computational linguistics: student research workshop, pp 255–262
Feng F, Yang Y, Cer D, Arivazhagan N, Wang W (2020) Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852
Artetxe M, Schwenk H (2019) Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans Assoc Comput Linguist 7:597–610
Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F et al (2020) Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 8440–8451
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). Association for Computational Linguistics, Minneapolis, pp 4171–4186
Rajitha C, Piyarathna L, Sachintha D, Ranathunga S (2021) Metric learning in multilingual sentence similarity measurement for document alignment. In: Proceedings of the international conference on recent advances in natural language processing (RANLP 2021). Held online: INCOMA Ltd., pp 1150–1157. https://aclanthology.org/2021.ranlp-1.129
Ni J, Ábrego GH, Constant N, Ma J, Hall KB, Cer D et al (2021) Sentence-t5: scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877
Joshi P, Santy S, Budhiraja A, Bali K, Choudhury M (2020) The state and fate of linguistic diversity and inclusion in the NLP world. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 6282–6293
Artetxe M, Schwenk H (2019) Margin-based parallel corpus mining with multilingual sentence embeddings. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, pp 3197–3203
Koehn P, Khayrallah H, Heafield K, Forcada ML (2018) Findings of the wmt 2018 shared task on parallel corpus filtering. In: Proceedings of the third conference on machine translation: shared task papers, pp 726–739
Chen J, Nie JY (2000) Parallel web text mining for cross-language IR. In: Content-based multimedia information access, vol 1. RIAO, pp 62–77
Shi L, Niu C, Zhou M, Gao J (2006) A DOM tree alignment model for mining parallel data from the web. In: Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics, pp 489–496
Zafarian A, Sadeghi APA, Azadi F, Ghiasifard S, Panahloo ZA, Bakhshaei S et al (2015) AUT document alignment framework for BUCC workshop shared task. In: Proceedings of the eighth workshop on building and using comparable corpora, pp 79–87
Li B, Gaussier E (2013) Exploiting comparable corpora for lexicon extraction: Measuring and improving corpus quality. In: Building and using comparable corpora. Springer, pp 131–149
Ma X, Liberman M (1999) Bits: a method for bilingual text search over the web. In: Machine translation summit VII, pp 538–542
Fung P, Cheung P (2004) Mining very-non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM. In: Proceedings of the 2004 conference on empirical methods in natural language processing, pp 57–63
Ion R, Ceauşu A, Irimia E (2011) An expectation maximization algorithm for textual unit alignment. In: Proceedings of the 4th workshop on building and using comparable corpora: comparable corpora and the web, pp 128–135
Gomes L, Lopes G (2016) First steps towards coverage-based document alignment. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 697–702
Morin E, Hazem A, Boudin F, Loginova-Clouet E (2015) LINA: identifying comparable documents from Wikipedia. In: Proceedings of the eighth workshop on building and using comparable corpora. Association for Computational Linguistics, Beijing, pp 88–91
Uszkoreit J, Ponte J, Popat A, Dubiner M (2010) Large scale parallel document mining for machine translation. In: Proceedings of the 23rd international conference on computational linguistics (Coling 2010), pp 1101–1109
Rajitha M, Piyarathna L, Nayanajith M, Surangika S (2020) Sinhala and English document alignment using statistical machine translation. In: 2020 20th international conference on advances in ICT for emerging regions (ICTer). IEEE, pp 29–34
Jakubina L, Langlais P (2016) Bad luc@ wmt 2016: a bilingual document alignment platform based on lucene. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 703–709
Medveď M, Jakubíček M, Kovář V (2016) English-French document alignment based on keywords and statistical translation. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 728–732
Buck C, Koehn P (2016) Quick and reliable document alignment via tf/idf-weighted cosine distance. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 672–678
Germann U (2016) Bilingual document alignment with latent semantic indexing. In: Proceedings of the first conference on machine translation: volume 2, shared task papers. Association for Computational Linguistics, Berlin, pp 692–696
Dara AA, Lin YC (2016) Yoda system for wmt16 shared task: bilingual document alignment. In: Proceedings of the first conference on machine translation: volume 2, shared task papers, pp 679–684
Brown PF, Lai JC, Mercer RL (1991) Aligning sentences in parallel corpora. In: 29th Annual meeting of the association for computational linguistics. Association for Computational Linguistics, Berkeley, pp 169–176
Gale WA, Church KW (1993) A program for aligning sentences in bilingual corpora. Comput Linguist 19(1):75–102
Ma X (2006) Champollion: a robust parallel text sentence aligner. In: Proceedings of the fifth international conference on language resources and evaluation (LREC’06). European Language Resources Association (ELRA), Genoa, pp 489–492
Munteanu DS, Marcu D (2002) Processing comparable corpora with bilingual suffix trees. In: Proceedings of the 2002 conference on empirical methods in natural language processing (EMNLP 2002), pp 289–295
Stefanescu D, Ion R, Hunsicker S (2012) Hybrid parallel sentence mining from comparable corpora. In: Proceedings of the 16th annual conference of the European association for machine translation, pp 137–144
Abdul-Rauf S, Schwenk H (2009) On the use of comparable corpora to improve SMT performance. In: Proceedings of the 12th conference of the european chapter of the ACL (EACL 2009), pp 16–23
Mahata S, Das D, Bandyopadhyay S (2017) BUCC2017: a hybrid approach for identifying parallel sentences in comparable corpora. In: Proceedings of the 10th workshop on building and using comparable corpora, pp 56–59
Azpeitia A, Etchegoyhen T, Garcia EM (2017) Weighted set-theoretic alignment of comparable sentences. In: Proceedings of the 10th workshop on building and using comparable corpora, pp 41–45
Azpeitia A, Etchegoyhen T, Garcia EM (2018) Extracting parallel sentences from comparable corpora with STACC variants. In: Proceedings of the 11th workshop on building and using comparable corpora, pp 48–52
Grégoire F, Langlais P (2017) Bucc 2017 shared task: a first attempt toward a deep learning framework for identifying parallel sentences in comparable corpora. In: Proceedings of the 10th workshop on building and using comparable corpora, pp 46–50
Iyyer M, Manjunatha V, Boyd-Graber J, Daumé III H (2015) Deep unordered composition rivals syntactic methods for text classification. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers), pp 1681–1691
Guo M, Shen Q, Yang Y, Ge H, Cer D, Ábrego GH et al (2018) Effective parallel corpus mining using bilingual sentence embeddings. WMT 2018:165
Leong C, Wong DF, Chao LS (2018) Um-paligner: neural network-based parallel sentence identification model. In: 11th Workshop on building and using comparable corpora, p 53
Bouamor H, Sajjad H (2018) H2@ bucc18: parallel sentence extraction from comparable corpora using multilingual sentence embeddings. In: Proceedings of workshop on building and using comparable corpora, pp 43–47
Hangya V, Fraser A (2019) Unsupervised parallel sentence extraction with parallel segment detection helps machine translation. In: Proceedings of the 57th annual meeting of the association for computational linguistics, pp 1224–1234
Schwenk H, Chaudhary V, Sun S, Gong H, Guzmán F (2021) WikiMatrix: mining 135M parallel sentences in 1620 language pairs from Wikipedia. In: Proceedings of the 16th conference of the European chapter of the association for computational linguistics: main volume, pp 1351–1361
Schwenk H, Wenzek G, Edunov S, Grave E, Joulin A, Fan A (2021) CCMatrix: mining billions of high-quality parallel sentences on the web. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers). Association for Computational Linguistics, pp 6490–6500
Yang Y, Ábrego GH, Yuan S, Guo M, Shen Q, Cer D et al (2019) Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. In: Proceedings of the twenty-eighth international joint conference on artificial intelligence (IJCAI-19), pp 5370–5378
Zweigenbaum P, Sharoff S, Rapp R (2018) Overview of the third BUCC shared task: spotting parallel sentences in comparable corpora. In: Proceedings of 11th workshop on building and using comparable corpora, pp 39–42
Koehn P, Guzmán F, Chaudhary V, Pino J (2019) Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions. In: Proceedings of the fourth conference on machine translation (volume 3: shared task papers, day 2), pp 54–72
Priyadarshani H, Rajapaksha M, Ranasinghe M, Sarveswaran K, Dias G (2019) Statistical machine learning for transliteration: transliterating names between Sinhala, Tamil and English. In: 2019 International conference on asian language processing (IALP). IEEE, pp 244–249
Farhath F, Ranathunga S, Jayasena S, Dias G (2018) Integration of bilingual lists for domain-specific statistical machine translation for Sinhala-Tamil. In: Moratuwa engineering research conference (MERCon). IEEE, pp 538–543
Thompson B, Koehn P (2019) Vecalign: improved sentence alignment in linear time and space. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 1342–1348
Fernando A, Ranathunga S, Dias G (2020) Data augmentation and terminology integration for domain-specific Sinhala-English-Tamil statistical machine translation. arXiv preprint arXiv:2011.02821
Guzmán F, Chen PJ, Ott M, Pino J, Lample G, Koehn P et al (2019) The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, pp 6098–6111
Goyal N, Gao C, Chaudhary V, Chen PJ, Wenzek G, Ju D et al (2021) The flores-101 evaluation benchmark for low-resource and multilingual machine translation. arXiv preprint arXiv:2106.03193
Liu Y, Gu J, Goyal N, Li X, Edunov S, Ghazvininejad M et al (2020) Multilingual denoising pre-training for neural machine translation. Trans Assoc Comput Linguist 8:726–742
Thillainathan S, Ranathunga S, Jayasena S (2021) Fine-tuning self-supervised multilingual sequence-to-sequence models for extremely low-resource NMT. In: 2021 Moratuwa engineering research conference (MERCon). IEEE, pp 432–437
Lee ESA, Thillainathan S, Nayak S, Ranathunga S, Adelani DI, Su R et al (2022) Pre-trained multilingual sequence-to-sequence models: a hope for low-resource language translation? arXiv preprint https://doi.org/10.48550/arXiv.2203.08850
Ott M, Edunov S, Baevski A, Fan A, Gross S, Ng N et al (2019) fairseq: a fast, extensible toolkit for sequence modeling. In: Proceedings of NAACL-HLT 2019: demonstrations, pp 48–53
Post M (2018) A call for clarity in reporting BLEU scores. In: Proceedings of the third conference on machine translation: research papers. Association for Computational Linguistics, Belgium, pp 186–191

Acknowledgements

Aloka Fernando was initially funded by the Accelerating Higher Education Expansion and Development (AHEAD) Operation of the Ministry of Education, Sri Lanka, funded by the World Bank. Currently, she is funded by a Senate Research Committee (SRC) grant from the University of Moratuwa, Sri Lanka. Dataset creation was funded by an SRC grant from University of Moratuwa, Sri Lanka.

Funding

Funding was provided by the Accelerating Higher Education Expansion and Development (AHEAD) Operation and by a Senate Research Committee (SRC) grant from the University of Moratuwa.

Author information

Aloka Fernando, Dilan Sachintha, Lakmali Piyarathna and Charith Rajitha have contributed equally to this work.

Authors and Affiliations

Department of Computer Science and Engineering, University of Moratuwa, Katubedda, Sri Lanka
Aloka Fernando, Surangika Ranathunga, Dilan Sachintha, Lakmali Piyarathna & Charith Rajitha

Corresponding author

Correspondence to Aloka Fernando.

Ethics declarations

Conflict of interest

The authors have no conflict of interest to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Algorithms for using bilingual lexicons

Our improvements to the document alignment and sentence alignment algorithms use bilingual lexicons, as explained in Sects. 4.1.1 and 4.2.2, respectively. The supporting algorithms for term matching using person names (Algorithm 1) and the rest of the bilingual lexicons (Algorithm 2) are shown below.
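
The algorithm listings are not reproduced in this copy of the article. Purely as an illustration, and not as a reconstruction of Algorithm 1 or Algorithm 2, the following Python sketch shows one way a small bilingual lexicon could boost an embedding-based similarity score. The function names, the overlap ratio, the multiplicative form of the weighting and the `alpha` parameter are all assumptions made for this example; the article's own weighting scheme is described in Sects. 4.1.1 and 4.2.2.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def lexicon_overlap(src_tokens, tgt_tokens, lexicon):
    """Fraction of lexicon-covered source tokens whose translation occurs
    in the target text. `lexicon` maps a source term to a set of target
    terms (a hypothetical structure, assumed for this sketch)."""
    tgt = set(tgt_tokens)
    covered = [tok for tok in src_tokens if tok in lexicon]
    if not covered:
        return 0.0
    matched = sum(1 for tok in covered if lexicon[tok] & tgt)
    return matched / len(covered)


def weighted_similarity(src_vec, tgt_vec, src_tokens, tgt_tokens,
                        lexicon, alpha=0.2):
    """Embedding similarity scaled up by lexicon evidence.
    `alpha` is an assumed mixing weight, not a value from the article."""
    base = cosine(src_vec, tgt_vec)
    boost = lexicon_overlap(src_tokens, tgt_tokens, lexicon)
    return base * (1.0 + alpha * boost)
```

With a toy lexicon such as `{"a": {"x"}, "b": {"z"}}`, source tokens `["a", "b", "c"]` and target tokens `["x", "y"]`, only one of the two covered source terms finds its translation, so the overlap is 0.5 and the base similarity is scaled by `1 + 0.5 * alpha`.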

Appendix B Document alignment results

Table 11 shows the document alignment results for each news source for the language pairs English–Sinhala, English–Tamil and Sinhala–Tamil. In Table 7, the individual scores obtained for the news sources are averaged. The score in bold is the best F1 score for the corresponding news source and language pair.

Table 11 Document alignment results in terms of recall (R), precision (P) and F1 with respect to each language pair. Here, BL refers to the recreated baseline [13] using LASER embeddings. On top of this baseline, each bilingual lexicon was added and the experiments were repeated. The bilingual lexicons considered were Person Names (N), Designations (Ds), Dictionary (Dc) and Improved Dictionary (MDc). The same set of experiments was then conducted with the PMLMs XLM-R and LaBSE.
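
For readers unfamiliar with the metrics in the caption, the sketch below computes precision, recall and F1 from a set of predicted document pairs and a set of gold-standard pairs, using the standard definitions. The function name and the pair representation are assumptions for this example, and the article's exact evaluation protocol may differ in details not visible here.

```python
def alignment_scores(predicted, gold):
    """Return (precision, recall, f1) for predicted vs. gold aligned pairs.
    Each pair is a (source_id, target_id) tuple; this representation is
    assumed for illustration."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical identifiers: three predicted pairs, four gold pairs, two in common.
predicted_pairs = {("en1", "si1"), ("en2", "si3"), ("en3", "si3")}
gold_pairs = {("en1", "si1"), ("en2", "si2"), ("en3", "si3"), ("en4", "si4")}
p, r, f1 = alignment_scores(predicted_pairs, gold_pairs)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

In this toy case precision is 2/3, recall is 1/2 and F1 is 4/7.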

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Fernando, A., Ranathunga, S., Sachintha, D. et al. Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages. Knowl Inf Syst 65, 571–612 (2023). https://doi.org/10.1007/s10115-022-01761-x

Received: 27 April 2022
Revised: 04 July 2022
Accepted: 12 September 2022
Published: 17 October 2022
Issue Date: February 2023
DOI: https://doi.org/10.1007/s10115-022-01761-x
