Abstract
Automatically generating bilingual dictionaries from parallel, manually translated texts is a well-established technique that works well in practice. However, parallel texts are a scarce resource. It is therefore also desirable to be able to generate dictionaries from pairs of comparable monolingual corpora. For most languages, such corpora are much easier to acquire, and often in considerably larger quantities. In this paper we present the implementation of an algorithm that exploits such corpora with good success. Based on the assumption that word co-occurrence patterns in different languages are related, the algorithm expands a small base lexicon. For improved performance, it also implements a novel interlingua approach: if corpora of more than two languages are available, the translations from one language to another can be determined not only directly, but also indirectly via a pivot language.
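The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the window size, the use of cosine similarity, the toy corpora, and all function names (`cooccurrence_vectors`, `project`, `rank_translations`, `translate_via_pivot`) are assumptions made for illustration only. The sketch builds co-occurrence vectors per language, maps a source vector into the target vocabulary through a small base lexicon, and ranks target words by similarity, either directly or via a pivot language.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def project(vector, lexicon):
    """Re-express a source-language vector in target-language dimensions,
    keeping only the context words covered by the base lexicon."""
    projected = Counter()
    for context_word, count in vector.items():
        if context_word in lexicon:
            projected[lexicon[context_word]] += count
    return projected


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_translations(word, src_vecs, tgt_vecs, lexicon):
    """Return (candidate, score) pairs, best candidate first."""
    query = project(src_vecs[word], lexicon)
    scores = [(cand, cosine(query, vec)) for cand, vec in tgt_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def translate_via_pivot(word, src_vecs, piv_vecs, tgt_vecs, src_to_piv, piv_to_tgt):
    """Indirect route: take the best pivot-language candidate, then translate it."""
    pivot_word = rank_translations(word, src_vecs, piv_vecs, src_to_piv)[0][0]
    return rank_translations(pivot_word, piv_vecs, tgt_vecs, piv_to_tgt)


# Hypothetical toy corpora: comparable in content, but not parallel resources.
en = cooccurrence_vectors([["dog", "barks", "loudly"], ["cat", "meows", "softly"]])
fr = cooccurrence_vectors([["chien", "aboie", "fort"], ["chat", "miaule", "doucement"]])
de = cooccurrence_vectors([["hund", "bellt", "laut"], ["katze", "miaut", "leise"]])

# Hypothetical small base lexicons; "dog" is deliberately missing from all of them.
en_de = {"barks": "bellt", "loudly": "laut", "meows": "miaut", "softly": "leise"}
en_fr = {"barks": "aboie", "loudly": "fort", "meows": "miaule", "softly": "doucement"}
fr_de = {"aboie": "bellt", "fort": "laut", "miaule": "miaut", "doucement": "leise"}

direct = rank_translations("dog", en, de, en_de)
indirect = translate_via_pivot("dog", en, fr, de, en_fr, fr_de)
print(direct[0][0], indirect[0][0])  # both routes rank "hund" first
```

On this toy data the direct route and the route via French agree on the same candidate. How the two sources of evidence would be combined in a real system is left open here, since the abstract does not specify it.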
Acknowledgements
Part of this research was supported by a Marie Curie Intra European Fellowship within the 6th European Community Framework Programme. We thank Olivier Ferret for valuable comments concerning French language resources.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rapp, R., Zock, M. (2009). Automatic Dictionary Expansion Using Non-parallel Corpora. In: Fink, A., Lausen, B., Seidel, W., Ultsch, A. (eds) Advances in Data Analysis, Data Handling and Business Intelligence. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01044-6_29
DOI: https://doi.org/10.1007/978-3-642-01044-6_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01043-9
Online ISBN: 978-3-642-01044-6
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)