Abstract
With the explosive growth of the World Wide Web, large-scale comparable corpora have become more abundant and accessible than parallel corpora. Strategies for extracting bilingual terminology from comparable texts therefore deserve more attention, both to enrich existing bilingual lexicons and thesauri and to enhance Cross-Language Information Retrieval. In this paper, we focus on enhancing Cross-Language Information Retrieval with a two-stage corpus-based translation model: bi-directional extraction of bilingual terminology from comparable corpora, followed by selection of the best translation alternatives on the basis of morphological knowledge. We evaluate the impact of comparable corpora on Cross-Language Information Retrieval performance; the results indicate a clearly positive effect, especially when the model is linearly combined with bilingual dictionaries and applied to the Japanese-English language pair.
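The abstract does not give the formulas behind the bi-directional extraction or the linear combination, so the following is only a minimal sketch of how such steps might look. Everything in it is an assumption made for illustration: the function names, the sum-to-one normalization, the mixing weight `weight`, and the toy scores are hypothetical and are not taken from the paper.

```python
def bidirectional_candidates(src_to_tgt, tgt_to_src, term):
    """Keep target candidates for `term` that also point back to `term`.

    Assumption: each mapping is {term: {candidate: similarity_score}}.
    """
    kept = {}
    for candidate, score in src_to_tgt.get(term, {}).items():
        # A candidate survives only if the reverse direction proposes `term`.
        if term in tgt_to_src.get(candidate, {}):
            kept[candidate] = score
    return kept


def normalize(scores):
    """Scale scores so they sum to 1 (hypothetical normalization choice)."""
    total = sum(scores.values())
    if total == 0:
        return dict(scores)
    return {t: s / total for t, s in scores.items()}


def combine_linear(corpus_scores, dictionary_scores, weight=0.5):
    """Rank translations by weight * corpus + (1 - weight) * dictionary.

    `weight` is an illustrative parameter, not a value from the paper.
    """
    corpus = normalize(corpus_scores)
    dictionary = normalize(dictionary_scores)
    candidates = set(corpus) | set(dictionary)
    combined = {
        t: weight * corpus.get(t, 0.0) + (1 - weight) * dictionary.get(t, 0.0)
        for t in candidates
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy usage with made-up scores (not experimental data).
corpus_scores = {"bank": 3.0, "shore": 1.0}
dictionary_scores = {"bank": 1.0, "finance": 1.0}
ranked = combine_linear(corpus_scores, dictionary_scores, weight=0.5)
print([term for term, _ in ranked])  # order: bank, finance, shore
```

In this sketch a candidate supported by both sources ("bank") outranks candidates that only one source proposes, which is the intuition behind combining a corpus-based model with a dictionary. How the paper actually weights or normalizes the two sources is not stated in the abstract.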
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sadat, F. (2010). Exploiting Comparable Corpora for Cross-Language Information Retrieval. In: Zhang, BT., Orgun, M.A. (eds) PRICAI 2010: Trends in Artificial Intelligence. PRICAI 2010. Lecture Notes in Computer Science, vol 6230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15246-7_66
DOI: https://doi.org/10.1007/978-3-642-15246-7_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15245-0
Online ISBN: 978-3-642-15246-7
eBook Packages: Computer Science (R0)