ABSTRACT
The bottleneck for dictionary-based cross-language information retrieval is the lack of comprehensive dictionaries, particularly across many different languages. We introduce a methodology by which multilingual dictionaries (for Spanish and Swedish) emerge automatically from simple seed lexicons. These seed lexicons are generated automatically, by cognate mapping, from previously manually constructed Portuguese and German, as well as English, sources. Lexical and semantic hypotheses are then validated, and new ones iteratively generated, using co-occurrence patterns of hypothesized translation synonyms in parallel corpora. We evaluate the newly derived dictionaries on a large medical document collection in a cross-language retrieval setting.
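As a rough illustration of the two steps named in the abstract (cognate-based seed generation, then validation against parallel text), the following Python sketch may help. It is not the authors' implementation. The string substitution rules, the function names, the Portuguese-to-Spanish pairing, the acceptance threshold, and the toy sentences are all assumptions made for this example; the abstract specifies none of them.

```python
from collections import Counter

# Hypothetical Portuguese -> Spanish substitution rules (illustrative only).
RULES = [("dade", "dad"), ("lh", "j"), ("ss", "s")]


def cognate_candidates(source_word, rules=RULES):
    """Apply the rules cumulatively, keeping rewritten and unrewritten forms."""
    candidates = {source_word}
    for old, new in rules:
        candidates |= {c.replace(old, new) for c in candidates}
    return candidates


def seed_lexicon(source_words, target_vocab, rules=RULES):
    """Keep only candidates that are attested in the target vocabulary."""
    lexicon = {}
    for word in source_words:
        hits = cognate_candidates(word, rules) & target_vocab
        if hits:
            lexicon[word] = hits
    return lexicon


def validate(lexicon, parallel_pairs, min_ratio=0.5):
    """Keep a hypothesized pair if, among aligned sentence pairs whose source
    side contains the source word, the target word appears on the target side
    in at least `min_ratio` of them (an assumed, simplistic criterion)."""
    validated = {}
    for src, targets in lexicon.items():
        src_count = 0
        co_occurrences = Counter()
        for src_sent, tgt_sent in parallel_pairs:
            if src in src_sent.split():
                src_count += 1
                tgt_tokens = set(tgt_sent.split())
                for tgt in targets:
                    if tgt in tgt_tokens:
                        co_occurrences[tgt] += 1
        kept = {
            tgt for tgt in targets
            if src_count and co_occurrences[tgt] / src_count >= min_ratio
        }
        if kept:
            validated[src] = kept
    return validated


# Toy data, invented for this example.
source_words = ["obesidade", "olho", "rim"]
target_vocab = {"obesidad", "ojo", "rinon"}
parallel_pairs = [
    ("obesidade e olho", "obesidad y ojo"),
    ("olho seco", "ojo seco"),
    ("obesidade grave", "sobrepeso grave"),
]

seed = seed_lexicon(source_words, target_vocab)
confirmed = validate(seed, parallel_pairs, min_ratio=0.6)
```

In this toy setup, raising `min_ratio` discards hypotheses that the parallel sentences support only weakly. An iterative variant would feed the confirmed pairs back in to propose further hypotheses, which this sketch does not attempt.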
Index Terms
- Bootstrapping dictionaries for cross-language information retrieval