Abstract
Learning to rank (LTR), as a machine learning technique for ranking tasks, has become one of the most popular research topics in the area of information retrieval (IR). Cross-lingual information retrieval (CLIR), in which the language of the query is different from the language of the documents, is one of the important IR tasks that can potentially benefit from LTR. Our focus in this paper is the use of LTR for CLIR. To rank the documents in the target language in response to the query in the source language, we propose a local query-dependent approach based on LTR for CLIR, which is called LQ-DLTR for CLIR. The core idea of LQ-DLTR for CLIR is the use of the local characteristics of similar queries to construct the LTR model, instead of using a single global ranking model for all queries. Since the query and the documents are in different languages, the traditional features that are used in LTR cannot be used directly for CLIR. Thus, defining appropriate features is a major step in the use of LTR for CLIR. In this paper, three categories of cross-lingual features are defined: query–document features, document features, and query features. To define the cross-lingual features, translation resources are used to fill the gap between the documents and the queries. Then, in LQ-DLTR for CLIR, a neighborhood of similar queries based on cross-lingual query features is used to create a local ranking function by the LTR algorithm for a given query. The LTR algorithm uses two cross-lingual feature sets, namely document features and query–document features, to learn the model. The query features that are used to identify the neighbors are not involved in the learning phase. 
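The local, query-dependent idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Euclidean distance, the pointwise least-squares ranker used as a stand-in for the LTR algorithm, and all function and field names (`nearest_queries`, `fit_linear_ranker`, `rank_locally`, `query_feats`, `docs`) are assumptions made only for illustration. The sketch shows the division of labour stated in the abstract: query features select the neighborhood, while only the per-document feature vectors (standing in for document and query–document features) are used to learn the local model.

```python
import math


def nearest_queries(query_feats, train_query_feats, k):
    """Indices of the k training queries closest to query_feats.

    Euclidean distance is an assumption; the abstract only says that
    neighbors are found from cross-lingual query features.
    """
    dists = [(math.dist(query_feats, f), i) for i, f in enumerate(train_query_feats)]
    return [i for _, i in sorted(dists)[:k]]


def fit_linear_ranker(examples, epochs=200, lr=0.05):
    """Pointwise stand-in for an LTR algorithm: least squares fitted by SGD.

    examples: list of (feature_vector, relevance_label) pairs.
    """
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def rank_locally(query_feats, candidates, train_set, k=2):
    """Rank candidate documents with a model trained only on similar queries.

    train_set:  list of {"query_feats": [...], "docs": [(features, label), ...]}
    candidates: list of (doc_id, features) for the new query
    """
    # Query features are used only to pick the neighborhood.
    idx = nearest_queries(query_feats, [t["query_feats"] for t in train_set], k)
    # The local model is learned from the neighbors' labeled documents.
    local_examples = [ex for i in idx for ex in train_set[i]["docs"]]
    w = fit_linear_ranker(local_examples)
    scored = [(sum(wi * xi for wi, xi in zip(w, x)), doc_id) for doc_id, x in candidates]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

With two groups of training queries that reward different features, the same pair of candidate documents can be ordered differently depending on which group the new query falls into, which is the behavior a single global model cannot reproduce.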
Experimental results indicate that the CLIR performance improves with the use of cross-lingual features that use several translations and their probabilities to compute the features, compared to the use of monolingual features in traditional LTR, which translate a query according to the best translation and ignore the probabilities. Moreover, experimental results show that LQ-DLTR for CLIR outperforms the baseline information retrieval methods and other LTR ranking models in terms of the MAP and NDCG measures.
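The contrast between the two kinds of features can be made concrete with a small sketch. The abstract does not give the exact feature definitions, so the expected term frequency below (the translation-probability-weighted sum of target-language term counts) is an assumed, simplified example of a cross-lingual query–document feature; the translation table, its probabilities, and the function names are hypothetical.

```python
def probabilistic_tf(query_term, doc_tokens, translations):
    """Expected frequency of a source-language query term in a target-language
    document: sum over candidate translations t of P(t | term) * tf(t, doc)."""
    return sum(p * doc_tokens.count(t) for t, p in translations[query_term].items())


def best_translation_tf(query_term, doc_tokens, translations):
    """Monolingual-style feature: keep only the most probable translation
    and ignore the probabilities of the remaining candidates."""
    candidates = translations[query_term]
    best = max(candidates, key=candidates.get)
    return doc_tokens.count(best)


# Hypothetical translation table and document, for illustration only.
translations = {"bank": {"banque": 0.7, "rive": 0.3}}
doc_tokens = ["la", "rive", "du", "fleuve", "rive"]

weighted = probabilistic_tf("bank", doc_tokens, translations)
one_best = best_translation_tf("bank", doc_tokens, translations)
```

In this toy case the one-best feature is zero because the document uses only the less probable translation, whereas the probability-weighted feature still gives the document partial credit.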
Notes
The HAMSHAHRI corpus is a standard collection that was used in the Ad Hoc Track of CLEF 2008 and CLEF 2009.
CLEF Adhoc Multilingual Task: The evaluation packages are available via the ELRA catalogue (http://catalog.elra.info).
The CLEF Test Suite for the CLEF2000–2003 Campaigns, catalogue reference: ELRA-E0008.
Acknowledgements
We are grateful to the anonymous reviewers for their constructive comments. This research was supported in part by a grant from the School of Computer Science, Institute for Research in Fundamental Sciences (No. CS 1397-4-55).
Cite this article
Ghanbari, E., Shakery, A. Query-dependent learning to rank for cross-lingual information retrieval. Knowl Inf Syst 59, 711–743 (2019). https://doi.org/10.1007/s10115-018-1232-8
DOI: https://doi.org/10.1007/s10115-018-1232-8