Using Query-Relevant Documents Pairs for Cross-Lingual Information Retrieval

Pinto, David; Juan, Alfons; Rosso, Paolo

doi:10.1007/978-3-540-74628-7_81

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4629)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

The World Wide Web is a natural setting for cross-lingual information retrieval. The European Union is a typical example of a multilingual scenario, where users have to deal with information published in at least 20 languages. Given queries in a source language and a target corpus in another language, the typical approach is to translate either the query or the target dataset into the other language. Other approaches use parallel corpora to obtain a statistical dictionary of words across the different languages. In this work, we propose to use a training corpus made up of a set of Query-Relevant Document Pairs (QRDP) in a probabilistic cross-lingual information retrieval approach based on the IBM alignment model 1 for statistical machine translation. Our approach has two main advantages over those that use direct translation and parallel corpora. First, we do not obtain a translation of the query, but a set of associated words which share their meaning in some way; the resulting dictionary is therefore, in a broad sense, more semantic than a translation dictionary. Second, since the queries are supervised, we work in a more restricted domain than when using a general parallel corpus (it is well known that results in a restricted domain are better than those obtained in a general context). To assess the quality of our approach, we compared its results with those obtained by directly translating the queries with a query translation system, and observed promising results.
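As an illustration of the idea described in the abstract, the following is a minimal sketch of IBM alignment model 1 trained with EM on query-relevant document pairs instead of parallel sentences. It is not the authors' implementation. The direction of the model (probability of a query word given a document word), the `<null>` token, the function names `train_ibm1` and `score`, and the toy English/Spanish data are all assumptions made for this example.

```python
from collections import defaultdict

NULL = "<null>"  # assumed empty-word token, as in standard IBM model 1


def train_ibm1(pairs, iterations=10):
    """Estimate t[(q, d)] = p(query word q | document word d) with EM.

    pairs: list of (query_tokens, document_tokens) tuples, where each
    document is assumed to be relevant to its query.
    """
    query_vocab = {q for query, _ in pairs for q in query}
    uniform = 1.0 / len(query_vocab)
    t = defaultdict(lambda: uniform)  # uniform start for every word pair

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per document word
        for query, doc in pairs:
            doc_words = [NULL] + list(doc)
            for q in query:
                # E-step: spread one unit of mass for q over the document words
                norm = sum(t[(q, d)] for d in doc_words)
                for d in doc_words:
                    frac = t[(q, d)] / norm
                    count[(q, d)] += frac
                    total[d] += frac
        # M-step: renormalise so that sum over q of t[(q, d)] is 1 for each d
        t = defaultdict(
            float, {(q, d): c / total[d] for (q, d), c in count.items()}
        )
    return t


def score(query, doc, t):
    """Model 1 likelihood of the query given the document (up to a constant)."""
    doc_words = [NULL] + list(doc)
    p = 1.0
    for q in query:
        p *= sum(t.get((q, d), 0.0) for d in doc_words) / len(doc_words)
    return p


# Hypothetical toy training set: English queries, Spanish relevant documents.
toy_pairs = [
    (["house"], ["casa"]),
    (["green", "house"], ["casa", "verde"]),
    (["green"], ["verde"]),
]
toy_table = train_ibm1(toy_pairs)
ranking = sorted([["casa"], ["verde"]],
                 key=lambda d: score(["house"], d, toy_table),
                 reverse=True)
```

The learned table plays the role of the statistical dictionary mentioned above: because documents are only relevant to their queries, not translations of them, the strongest entries are associated words rather than strict translations. Ranking then amounts to sorting target-language documents by `score` for a given source-language query.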

This work has been partially supported by the MCyT TIN2006-15265-C06-04 research project, as well as by the BUAP-701 PROMEP/103.5/05/1536 grant.

Author information

Authors and Affiliations

Department of Information Systems and Computation, Polytechnic University of Valencia, Spain
David Pinto, Alfons Juan & Paolo Rosso
Faculty of Computer Science, B. Autonomous University of Puebla, Mexico
David Pinto

Editor information

Václav Matoušek, Pavel Mautner

About this paper

Cite this paper

Pinto, D., Juan, A., Rosso, P. (2007). Using Query-Relevant Documents Pairs for Cross-Lingual Information Retrieval. In: Matoušek, V., Mautner, P. (eds) Text, Speech and Dialogue. TSD 2007. Lecture Notes in Computer Science, vol 4629. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74628-7_81

DOI: https://doi.org/10.1007/978-3-540-74628-7_81
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74627-0
Online ISBN: 978-3-540-74628-7
eBook Packages: Computer Science, Computer Science (R0)
