Extracting Multilingual Lexicons from Parallel Corpora

Tufiş, Dan; Barbu, Ana Maria; Ion, Radu

doi:10.1023/B:CHUM.0000031172.03949.48

Published: May 2004

Volume 38, pages 163–189, (2004)

Computers and the Humanities

Abstract

The paper describes our recent developments in the automatic extraction of translation equivalents from parallel corpora. We describe three increasingly complex algorithms: a simple iterative baseline and two more elaborate non-iterative versions. While the baseline algorithm is described mainly for illustrative purposes, the non-iterative algorithms outline the use of different working hypotheses, which may be motivated by different kinds of applications and, to some extent, by the languages concerned. The first two algorithms rely on cross-lingual POS preservation, whereas for the third, POS invariance is not an extraction condition. The algorithms were evaluated on three different corpora and several pairs of languages.
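
The abstract gives no algorithmic detail, so the following is only a rough sketch of what an iterative extraction loop with a POS-preservation filter might look like; it is not the authors' algorithm. The function name `extract_equivalents`, the Dice score, the `min_count` threshold, and the toy English-Romanian data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def extract_equivalents(pairs, min_count=2):
    """Greedy, iterative extraction of 1:1 translation-equivalent candidates.

    `pairs` is a list of aligned sentence pairs; each sentence is a list of
    (lemma, pos) tokens. Hypothetical sketch: the scoring function and the
    threshold are illustrative assumptions, not taken from the paper.
    """
    # Work on sets so that accepted tokens can be removed in later rounds.
    remaining = [(set(src), set(tgt)) for src, tgt in pairs]
    lexicon = []
    while True:
        cooc, src_freq, tgt_freq = Counter(), Counter(), Counter()
        for src, tgt in remaining:
            src_freq.update(src)
            tgt_freq.update(tgt)
            for s, t in product(src, tgt):
                if s[1] == t[1]:  # POS preservation: keep same-POS pairs only
                    cooc[(s, t)] += 1
        # Dice coefficient over sentence-level frequencies (assumed score).
        scored = [
            (2 * n / (src_freq[s] + tgt_freq[t]), n, s, t)
            for (s, t), n in cooc.items()
            if n >= min_count
        ]
        if not scored:
            return lexicon
        _, _, s, t = max(scored)  # best candidate; ties broken by tuple order
        lexicon.append((s[0], t[0], s[1]))
        # Remove the accepted pair wherever both tokens co-occur.
        for src, tgt in remaining:
            if s in src and t in tgt:
                src.discard(s)
                tgt.discard(t)


toy_corpus = [
    ([("house", "N"), ("red", "A")], [("casa", "N"), ("rosu", "A")]),
    ([("house", "N"), ("big", "A")], [("casa", "N"), ("mare", "A")]),
    ([("car", "N"), ("red", "A")], [("masina", "N"), ("rosu", "A")]),
]
lexicon = extract_equivalents(toy_corpus)
```

Dropping the `s[1] == t[1]` test would give a loop in which POS invariance is no longer an extraction condition, at the cost of a much larger candidate space.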

Author information

Authors and Affiliations

Romanian Academy (RACAI), 13, “13 Septembrie”, 050711, Bucharest 5, Romania
Dan Tufiş, Ana Maria Barbu & Radu Ion

About this article

Cite this article

Tufiş, D., Barbu, A.M. & Ion, R. Extracting Multilingual Lexicons from Parallel Corpora. Computers and the Humanities 38, 163–189 (2004). https://doi.org/10.1023/B:CHUM.0000031172.03949.48

Issue Date: May 2004
DOI: https://doi.org/10.1023/B:CHUM.0000031172.03949.48