Automatic Dictionary Expansion Using Non-parallel Corpora

  • Conference paper
Advances in Data Analysis, Data Handling and Business Intelligence

Abstract

Automatically generating bilingual dictionaries from parallel, manually translated texts is a well-established technique that works well in practice. However, parallel texts are a scarce resource, so it is desirable to be able to generate dictionaries from pairs of comparable monolingual corpora as well. For most languages, such corpora are much easier to acquire, and often in considerably larger quantities. In this paper we present the implementation of an algorithm that exploits such corpora with good success. Based on the assumption that the co-occurrence patterns of words in different languages are related, it expands a small base lexicon. For improved performance, it also implements a novel interlingua approach: if corpora of more than two languages are available, the translations from one language to another can be determined not only directly, but also indirectly via a pivot language.
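
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, which the abstract does not specify. It assumes tiny hand-made corpora, a fixed context window, cosine similarity as the comparison measure, and a two-step lookup as the pivot strategy; all function names (`cooccurrence_vectors`, `translate`, `via_pivot`) and all data are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            context = vectors.setdefault(word, Counter())
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    context[tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (an assumed measure)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def translate(word, src_vecs, tgt_vecs, seed):
    """Rank target words that are not yet in the seed lexicon as translations.

    The source word's context is mapped into the target language through the
    seed lexicon, then compared with each candidate's context.
    """
    mapped = Counter()
    for ctx_word, count in src_vecs.get(word, {}).items():
        if ctx_word in seed:
            mapped[seed[ctx_word]] += count
    known = set(seed.values())
    scores = {}
    for cand, vec in tgt_vecs.items():
        if cand in known:
            continue  # already covered by the base lexicon
        restricted = {k: c for k, c in vec.items() if k in known}
        scores[cand] = cosine(mapped, restricted)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def via_pivot(word, src_vecs, piv_vecs, tgt_vecs, seed_src_piv, seed_piv_tgt):
    """Indirect route: source -> best pivot word -> best target word."""
    pivot_word = translate(word, src_vecs, piv_vecs, seed_src_piv)[0][0]
    return translate(pivot_word, piv_vecs, tgt_vecs, seed_piv_tgt)[0][0]


# Toy, non-parallel corpora (invented for illustration only).
english = [["the", "dog", "drinks", "water"], ["the", "cat", "drinks", "milk"],
           ["the", "dog", "eats", "meat"], ["the", "cat", "eats", "fish"]]
german = [["der", "hund", "trinkt", "wasser"], ["die", "katze", "trinkt", "milch"],
          ["der", "hund", "frisst", "fleisch"], ["die", "katze", "frisst", "fisch"]]
french = [["le", "chien", "boit", "eau"], ["le", "chat", "boit", "lait"],
          ["le", "chien", "mange", "viande"], ["le", "chat", "mange", "poisson"]]

en_vecs = cooccurrence_vectors(english)
de_vecs = cooccurrence_vectors(german)
fr_vecs = cooccurrence_vectors(french)

# Small base lexicons (hypothetical seeds).
seed_en_de = {"dog": "hund", "cat": "katze", "water": "wasser", "meat": "fleisch"}
seed_en_fr = {"dog": "chien", "cat": "chat", "water": "eau", "meat": "viande"}
seed_fr_de = {"chien": "hund", "chat": "katze", "eau": "wasser", "viande": "fleisch"}

direct = translate("drinks", en_vecs, de_vecs, seed_en_de)[0][0]
indirect = via_pivot("drinks", en_vecs, fr_vecs, de_vecs, seed_en_fr, seed_fr_de)
print(direct, indirect)  # trinkt trinkt
```

In this toy setting the direct English-to-German lookup and the indirect lookup via French agree on the same candidate. With real corpora the two routes could presumably be combined to rank candidates, but how the paper does so is not stated in the abstract.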

Acknowledgements

Part of this research was supported by a Marie Curie Intra European Fellowship within the 6th European Community Framework Programme. We thank Olivier Ferret for valuable comments concerning French language resources.

Author information

Correspondence to Reinhard Rapp.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rapp, R., Zock, M. (2009). Automatic Dictionary Expansion Using Non-parallel Corpora. In: Fink, A., Lausen, B., Seidel, W., Ultsch, A. (eds) Advances in Data Analysis, Data Handling and Business Intelligence. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01044-6_29
