Abstract
Statistical methods for extracting translational equivalents from non-parallel corpora promise to ensure the required coverage and domain customisation of lexicons and to accelerate their compilation and maintenance. A challenge for these methods is posed by rare words and expressions, which have low corpus frequencies. However, it is precisely such rare words, for example newly introduced terminology and named entities, that are of the greatest interest for practical lexical acquisition. In this article, we study ways of improving the extraction of low-frequency equivalents from bilingual comparable corpora. Our work is carried out within the general framework that discovers equivalences between words of different languages from similarities between their occurrence patterns in the respective monolingual corpora. We develop a method that aims to compensate for the insufficient corpus evidence available for rare words: prior to measuring cross-language similarities, the method uses same-language corpus data to model the co-occurrence vectors of rare words, predicting their unseen co-occurrences and smoothing rare, unreliable ones. Our experimental evaluation demonstrates that the proposed method delivers a consistent and significant improvement over the conventional approach to this task.
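The idea of modelling a rare word's co-occurrence vector from same-language data can be illustrated with a minimal sketch. It is not the method evaluated in the article; it assumes one common form of similarity-based smoothing, in which the rare word's co-occurrence distribution is interpolated with a similarity-weighted average of its nearest same-language neighbours. The function names, the parameters `k` and `weight`, and the toy counts are all hypothetical.

```python
import math

def normalise(vec):
    """Scale a sparse count vector so that its values sum to 1."""
    total = sum(vec.values())
    return {ctx: n / total for ctx, n in vec.items()} if total else {}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[ctx] for ctx, w in u.items() if ctx in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def smooth_rare_vector(word, vectors, k=2, weight=0.5):
    """Interpolate a rare word's distribution with its k nearest neighbours.

    Assumption: neighbours are ranked by cosine similarity and combined
    with similarity-proportional weights; `weight` is the share kept by
    the word's own observed distribution.
    """
    target = normalise(vectors[word])
    ranked = sorted(
        ((cosine(vectors[word], vec), other)
         for other, vec in vectors.items() if other != word),
        reverse=True,
    )[:k]
    ranked = [(sim, other) for sim, other in ranked if sim > 0]
    total_sim = sum(sim for sim, _ in ranked)
    if total_sim == 0:
        return target  # no usable neighbours: leave the vector unchanged
    neighbour_avg = {}
    for sim, other in ranked:
        for ctx, p in normalise(vectors[other]).items():
            neighbour_avg[ctx] = neighbour_avg.get(ctx, 0.0) + (sim / total_sim) * p
    contexts = set(target) | set(neighbour_avg)
    return {
        ctx: weight * target.get(ctx, 0.0) + (1 - weight) * neighbour_avg.get(ctx, 0.0)
        for ctx in contexts
    }

# Hypothetical toy counts: "stent" is rare, "catheter" is a frequent neighbour.
vectors = {
    "stent": {"implant": 1, "artery": 1},
    "catheter": {"implant": 4, "artery": 6, "insert": 10},
    "banana": {"eat": 8, "peel": 5},
}
smoothed = smooth_rare_vector("stent", vectors)
```

In this toy run the rare word gains a non-zero weight for the context "insert", which it was never observed with, while contexts of the unrelated word contribute nothing. The smoothed vector would then be passed to the cross-language similarity step in place of the raw counts.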
Cite this article
Pekar, V., Mitkov, R., Blagoev, D. et al. Finding translations for low-frequency words in comparable corpora. Machine Translation 20, 247–266 (2006). https://doi.org/10.1007/s10590-007-9029-7
DOI: https://doi.org/10.1007/s10590-007-9029-7