Extracting Multilingual Lexicons from Parallel Corpora

Computers and the Humanities

Abstract

This paper describes our recent work on the automatic extraction of translation equivalents from parallel corpora. We present three increasingly complex algorithms: a simple iterative baseline and two more elaborate non-iterative versions. The baseline algorithm is described mainly for illustrative purposes, whereas the non-iterative algorithms show how different working hypotheses can be used, motivated by different kinds of applications and, to some extent, by the languages concerned. The first two algorithms rely on cross-lingual part-of-speech (POS) preservation; the third does not require POS invariance as an extraction condition. The algorithms were evaluated on three different corpora and several language pairs.
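The abstract gives no algorithmic detail, so the following is only a minimal sketch of how an iterative extractor that relies on POS preservation might look. It is not the authors' implementation. The function name `extract_equivalents`, the `min_count` threshold, the input layout (aligned units as lists of `(lemma, pos)` pairs) and the selection rule (keep pairs whose co-occurrence count is maximal for both the source and the target item) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def extract_equivalents(aligned_units, min_count=2):
    """Greedy iterative extraction of candidate translation pairs.

    aligned_units: iterable of (source_tokens, target_tokens), where each
    token is a (lemma, pos) tuple. Hypothetical input format.
    Returns a sorted list of (source_lemma, target_lemma, pos, count).
    """
    # Count how often two same-POS items appear in the same aligned unit.
    cooc = Counter()
    for src, tgt in aligned_units:
        for s, t in product(set(src), set(tgt)):
            if s[1] == t[1]:  # POS preservation: only pair equal tags
                cooc[(s, t)] += 1

    lexicon = []
    while cooc:
        # Highest count seen for each remaining source and target item.
        best_src, best_tgt = {}, {}
        for (s, t), n in cooc.items():
            best_src[s] = max(best_src.get(s, 0), n)
            best_tgt[t] = max(best_tgt.get(t, 0), n)

        # Keep pairs that are the best choice in both directions.
        chosen = [
            (s, t, n)
            for (s, t), n in cooc.items()
            if n >= min_count and n == best_src[s] and n == best_tgt[t]
        ]
        if not chosen:
            break  # nothing left above the threshold

        lexicon.extend((s[0], t[0], s[1], n) for s, t, n in chosen)

        # Remove the matched items so later rounds consider only the rest.
        used_src = {s for s, _, _ in chosen}
        used_tgt = {t for _, t, _ in chosen}
        cooc = Counter({
            (s, t): n
            for (s, t), n in cooc.items()
            if s not in used_src and t not in used_tgt
        })

    return sorted(lexicon)
```

Raw co-occurrence counts are used here only to keep the sketch short; an association score could be substituted without changing the loop. Dropping the `s[1] == t[1]` test would give a variant that does not treat POS invariance as an extraction condition.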

About this article

Cite this article

Tufiş, D., Barbu, A.M. & Ion, R. Extracting Multilingual Lexicons from Parallel Corpora. Computers and the Humanities 38, 163–189 (2004). https://doi.org/10.1023/B:CHUM.0000031172.03949.48

  • DOI: https://doi.org/10.1023/B:CHUM.0000031172.03949.48
