Finding translations for low-frequency words in comparable corpora

Machine Translation

Abstract

Statistical methods for extracting translational equivalents from non-parallel corpora promise to ensure the required coverage and domain customisation of lexicons and to accelerate their compilation and maintenance. Rare words and expressions, which have low corpus frequencies, are a challenge for these methods. However, it is precisely rare words, such as newly introduced terminology and named entities, that are of most interest for practical lexical acquisition. In this article, we study ways of improving the extraction of low-frequency equivalents from bilingual comparable corpora. Our work is carried out within the general framework that discovers equivalences between words of different languages from similarities between their occurrence patterns in the respective monolingual corpora. We develop a method that aims to compensate for insufficient corpus evidence on rare words: prior to measuring cross-language similarities, the method uses same-language corpus data to model the co-occurrence vectors of rare words by predicting their unseen co-occurrences and smoothing rare, unreliable ones. Our experimental evaluation demonstrates that the proposed method delivers a consistent and significant improvement over the conventional approach to this task.
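To make the idea of modelling a rare word's co-occurrence vector from same-language data more concrete, the following is a minimal sketch of nearest-neighbour (similarity-based) smoothing. It is an illustration only, not the model described in the article: the function names, the cosine measure, the neighbourhood size `k`, the interpolation weight `lam` and the toy counts are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(u[f] * v[f] for f in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def to_probs(vec):
    """Normalise raw co-occurrence counts into a probability distribution."""
    total = sum(vec.values())
    return {f: c / total for f, c in vec.items()} if total else {}


def smooth_rare_vector(word, vectors, k=2, lam=0.5):
    """Hypothetical smoothing step: interpolate the rare word's own
    distribution with a similarity-weighted average of the distributions
    of its k nearest same-language neighbours."""
    own = to_probs(vectors[word])
    ranked = sorted(
        ((cosine(vectors[word], vec), other)
         for other, vec in vectors.items() if other != word),
        reverse=True,
    )[:k]
    ranked = [(sim, other) for sim, other in ranked if sim > 0]
    total_sim = sum(sim for sim, _ in ranked)
    if total_sim == 0:
        return own  # no usable neighbours: leave the vector unchanged
    neighbour = Counter()
    for sim, other in ranked:
        for feat, prob in to_probs(vectors[other]).items():
            neighbour[feat] += (sim / total_sim) * prob
    features = own.keys() | neighbour.keys()
    # Seen co-occurrences are smoothed; unseen ones receive predicted mass.
    return {f: lam * own.get(f, 0.0) + (1 - lam) * neighbour[f] for f in features}


# Toy same-language co-occurrence counts (invented for illustration).
vectors = {
    "stent": {"implant": 1, "artery": 1},
    "catheter": {"implant": 4, "artery": 6, "insert": 10},
    "valve": {"implant": 5, "artery": 3, "replace": 8},
    "invoice": {"pay": 9, "send": 7},
}
smoothed = smooth_rare_vector("stent", vectors)
```

In this toy setting the rare word "stent" gains non-zero weight for "insert" and "replace", which it never co-occurred with, while contexts of the unrelated word "invoice" are left out. The smoothed vector would then be the one compared across languages.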

Author information

Corresponding author

Correspondence to Ruslan Mitkov.

About this article

Cite this article

Pekar, V., Mitkov, R., Blagoev, D. et al. Finding translations for low-frequency words in comparable corpora. Machine Translation 20, 247–266 (2006). https://doi.org/10.1007/s10590-007-9029-7

  • DOI: https://doi.org/10.1007/s10590-007-9029-7
