Finding translations for low-frequency words in comparable corpora

Pekar, Viktor; Mitkov, Ruslan; Blagoev, Dimitar; Mulloni, Andrea

doi:10.1007/s10590-007-9029-7

Published: 23 February 2008

Machine Translation, volume 20, pages 247–266 (2006)

Abstract

Statistical methods to extract translational equivalents from non-parallel corpora hold the promise of ensuring the required coverage and domain customisation of lexicons as well as accelerating their compilation and maintenance. Rare words and expressions, which have low corpus frequencies, pose a challenge for these methods. However, it is rare words such as newly introduced terminology and named entities that are of the greatest interest for practical lexical acquisition. In this article, we study ways of improving the extraction of low-frequency equivalents from bilingual comparable corpora. Our work is carried out within the general framework that discovers equivalences between words of different languages using similarities between their occurrence patterns in the respective monolingual corpora. We develop a method that aims to compensate for insufficient corpus evidence on rare words: prior to measuring cross-language similarities, the method uses same-language corpus data to model co-occurrence vectors of rare words by predicting their unseen co-occurrences and smoothing rare, unreliable ones. Our experimental evaluation demonstrates that the proposed method delivers a consistent and significant improvement over the conventional approach to this task.
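The general idea of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the smoothing formula, so the interpolation with the mean vector of the k nearest same-language neighbours, the cosine measure, the seed dictionary used to map context features across languages, the toy English and German data, and all function and parameter names (`smooth`, `translate_vector`, `rank_candidates`, `k`, `lam`) are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def smooth(word, vectors, k=2, lam=0.5):
    """Interpolate a word's vector with the mean vector of its k nearest
    same-language neighbours (an assumed smoothing scheme, not the paper's)."""
    target = vectors[word]
    neighbours = sorted(
        (w for w in vectors if w != word),
        key=lambda w: cosine(target, vectors[w]),
        reverse=True,
    )[:k]
    if not neighbours:
        return dict(target)
    features = set(target)
    for n in neighbours:
        features.update(vectors[n])
    smoothed = {}
    for f in features:
        mean = sum(vectors[n].get(f, 0.0) for n in neighbours) / len(neighbours)
        # lam = 1.0 keeps the original vector; lower values trust neighbours more.
        smoothed[f] = lam * target.get(f, 0.0) + (1.0 - lam) * mean
    return smoothed


def translate_vector(vec, seed):
    """Map source-language context features to the target language using a
    seed dictionary; features missing from the dictionary are dropped."""
    mapped = {}
    for f, w in vec.items():
        if f in seed:
            mapped[seed[f]] = mapped.get(seed[f], 0.0) + w
    return mapped


def rank_candidates(word, src_vectors, tgt_vectors, seed, k=2, lam=0.5):
    """Rank target-language words by similarity to the smoothed, mapped vector."""
    mapped = translate_vector(smooth(word, src_vectors, k, lam), seed)
    return sorted(
        tgt_vectors,
        key=lambda t: cosine(mapped, tgt_vectors[t]),
        reverse=True,
    )


# Toy co-occurrence counts (invented for illustration only).
SRC = {
    "myocarditis": {"heart": 1},  # rare word with very little evidence
    "carditis": {"heart": 4, "infection": 3, "patient": 2},
    "endocarditis": {"heart": 3, "infection": 4, "patient": 1},
    "invoice": {"payment": 5, "customer": 3},
}
TGT = {
    "myokarditis": {"herz": 3, "infektion": 4, "patient": 2},
    "herzschlag": {"herz": 5, "sport": 2},
    "rechnung": {"zahlung": 5, "kunde": 3},
}
SEED = {
    "heart": "herz",
    "infection": "infektion",
    "patient": "patient",
    "payment": "zahlung",
    "customer": "kunde",
}

print(rank_candidates("myocarditis", SRC, TGT, SEED))           # smoothed
print(rank_candidates("myocarditis", SRC, TGT, SEED, lam=1.0))  # unsmoothed
```

On this toy data, the unsmoothed vector of the rare word contains a single context feature and is closest to an unrelated target word that shares it, whereas the smoothed vector inherits contexts from its same-language neighbours and ranks the intended equivalent first. Real comparable corpora would require weighted co-occurrence counts and a far larger seed dictionary.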

Author information

Authors and Affiliations

ILP, University of Wolverhampton, Stafford St, Wolverhampton, WV1 1SB, UK
Viktor Pekar & Ruslan Mitkov
Department of Informatics, University of Plovdiv, Plovdiv, 4003, Bulgaria
Dimitar Blagoev
Expert System, Via F. Zeni 8, Rovereto, 38068, Italy
Andrea Mulloni

Corresponding author

Correspondence to Ruslan Mitkov.

About this article

Cite this article

Pekar, V., Mitkov, R., Blagoev, D. et al. Finding translations for low-frequency words in comparable corpora. Machine Translation 20, 247–266 (2006). https://doi.org/10.1007/s10590-007-9029-7

Received: 03 July 2007
Accepted: 05 December 2007
Published: 23 February 2008
Issue Date: March 2006
DOI: https://doi.org/10.1007/s10590-007-9029-7
