Abstract
This paper describes our recent developments in the automatic extraction of translation equivalents from parallel corpora. We describe three increasingly complex algorithms: a simple iterative baseline and two more elaborate non-iterative versions. While the baseline algorithm is described mainly for illustrative purposes, the non-iterative algorithms illustrate the use of different working hypotheses, which may be motivated by different kinds of applications and, to some extent, by the languages concerned. The first two algorithms rely on cross-lingual part-of-speech (POS) preservation, whereas the third does not require POS invariance as an extraction condition. The algorithms were evaluated on three different corpora and several language pairs.
Cite this article
Tufiş, D., Barbu, A.M. & Ion, R. Extracting Multilingual Lexicons from Parallel Corpora. Computers and the Humanities 38, 163–189 (2004). https://doi.org/10.1023/B:CHUM.0000031172.03949.48