Skip to main content
Log in

Query-dependent learning to rank for cross-lingual information retrieval

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Learning to rank (LTR), as a machine learning technique for ranking tasks, has become one of the most popular research topics in the area of information retrieval (IR). Cross-lingual information retrieval (CLIR), in which the language of the query is different from the language of the documents, is one of the important IR tasks that can potentially benefit from LTR. Our focus in this paper is the use of LTR for CLIR. To rank the documents in the target language in response to the query in the source language, we propose a local query-dependent approach based on LTR for CLIR, which is called LQ-DLTR for CLIR. The core idea of LQ-DLTR for CLIR is the use of the local characteristics of similar queries to construct the LTR model, instead of using a single global ranking model for all queries. Since the query and the documents are in different languages, the traditional features that are used in LTR cannot be used directly for CLIR. Thus, defining appropriate features is a major step in the use of LTR for CLIR. In this paper, three categories of cross-lingual features are defined: query–document features, document features, and query features. To define the cross-lingual features, translation resources are used to fill the gap between the documents and the queries. Then, in LQ-DLTR for CLIR, a neighborhood of similar queries based on cross-lingual query features is used to create a local ranking function by the LTR algorithm for a given query. The LTR algorithm uses two cross-lingual feature sets, namely document features and query–document features, to learn the model. The query features that are used to identify the neighbors are not involved in the learning phase. Experimental results indicate that the CLIR performance improves with the use of cross-lingual features that use several translations and their probabilities to compute the features, compared to the use of monolingual features in traditional LTR, which translate a query according to the best translation and ignore the probabilities. Moreover, experimental results show that LQ-DLTR for CLIR outperforms the baseline information retrieval methods and other LTR ranking models in terms of the MAP and NDCG measures.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10

Similar content being viewed by others

Notes

  1. The HAMSHAHRI corpus is a standard collection that has been used in the Ad Hoc Track of the CLEF2008 and 2009.

  2. CLEF Adhoc Multilingual Task: The evaluation packages are available via the ELRA catalogue (http://catalog.elra.info).

  3. The CLEF Test Suite for the CLEF2000–2003 Campaigns, catalogue reference: ELRA-E0008.

  4. www.clef-initiative.eu.

  5. http://snowball.tartarus.org/.

  6. http://lemurproject.org/ranklib.php.

References

  1. AleAhmad A, Amiri H, Darrudi E, Rahgozar M, Oroumchian F (2009) Hamshahri: a standard Persian text collection. Knowl-Based Syst 22(5):382–387

    Article  Google Scholar 

  2. Amini MR, Usunier N, Goutte C (2009) Learning from multiple partially observed views-an application to multilingual text categorization. In: Advances in neural information processing systems 22. The MIT Press, pp 28–36

  3. Azarbonyad H, Shakery A, Faili H (2012) Using learning to rank approach for parallel corpora based cross language information retrieval. In: Proceedings of the 20th European conference on artificial intelligence. IOS Press, pp 79–84

  4. Azarbonyad H, Shakery A, Faili H (2013) Exploiting multiple translation resources for English-Persian cross language information retrieval. In: Information access evaluation. Multilinguality, multimodality, and visualization: 4th international conference of the CLEF initiative. Springer, pp 93–99

  5. Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web search engine. Comput Netw ISDN Syst 30(1–7):107–117

    Article  Google Scholar 

  6. Cao Z, Qin T, Liu TY, Tsai MF, Li H (2007) Learning to rank: from pairwise approach to listwise approach. In: Proceedings of the 24th international conference on machine learning. ACM, pp 129–136

  7. Cronen-Townsend S, Zhou Y, Croft WB (2002) Predicting query performance. In: Proceedings of the 25th international ACM SIGIR conference on research and development in information retrieval. ACM, pp 299–306

  8. Dadashkarimi J, Shakery A, Faili H (2014) A probabilistic translation method for dictionary-based cross-lingual information retrieval in agglutinative languages. arXiv preprint arXiv:1411.1006

  9. Darwish K, Oard DW (2003) Probabilistic structured query methods. In: Proceedings of the 26th international ACM SIGIR conference on research and development in informaiton retrieval. ACM, pp 338–344

  10. Ferro N, Silvello G (2016a) 3.5K runs, 5K topics, 3M assessments and 70M measures: What trends in 10 years of Adhoc-ish CLEF? Info Process Manag 53(1):175–202

    Article  Google Scholar 

  11. Ferro N, Silvello G (2016b) The CLEF monolingual grid of points. In: Information access evaluation. Multilinguality, multimodality, and interaction: 7th international conference of the CLEF initiative. Springer, pp 16–27

  12. Gao W, Blitzer J, Zhou M, Wong KF (2009) Exploiting bilingual information to improve web search. In: Proceedings of the 47th annual meeting of the association for computational linguistics and the 4th international joint conference on natural language processing (ACL-IJCNLP). Association for Computational Linguistics, pp 1075–1083

  13. Gao W, Niu C, Zhou M, Wong KF (2009) Joint ranking for multilingual web search. In: Proceedings of the 31st European conference on IR research. Springer, pp 114–125

  14. Geng X, Liu TY, Qin T, Arnold A, Li H, Shum HY (2008) Query dependent ranking using k-nearest neighbor. In: Proceedings of the 31st international ACM SIGIR conference on research and development in information retrieval. ACM, pp 115–122

  15. He B, Ounis I (2004) Inferring query performance using pre-retrieval predictors. In: Proceedings of the 10th symposium on string processing and information retrieval. Springer, pp 43–54

  16. Hedlund T, Airio E, Keskustalo H, Lehtokangas R, Pirkola A, Jarvelin K (2004) Dictionary-based cross-language information retrieval: learning experiences from CLEF 20002002. Inf Retr 7(1/2):99–119

    Article  Google Scholar 

  17. Herbert B, Szarva G, Gurevych I (2011) Combining query translation techniques to improve cross-language information retrieval. In: Proceedings of the 33rd European conference on IR research. Springer, pp 712–715

  18. Hieber F (2015) Translation-based ranking in cross-language information retrieval. Ph.D. thesis, Department of Computational Linguistics, Heidelberg University

  19. Jabbari F, Bakhshaei S, Ziabary SMM, Khadivi S (2012) Developing an open-domain English-Farsi translation system using AFEC: Amirkabir Bilingual Farsi-English Corpus. In: Proceedings of the 4th workshop on computational approaches to Arabic script-based Languages. ACM, pp 17–23

  20. Jarvelin K, Kekalainen J (2002) Cumulated gain-based evaluation of IR techniques. ACM Trans Inf Syst 20(4):422–446

    Article  Google Scholar 

  21. Kashefi O (2018) MIZAN: a large persian-english parallel corpus. arXiv preprint arXiv:1801.02107

  22. Kim S, Ko Y, Oard DW (2015) Combining lexical and statistical translation evidence for cross-language information retrieval. J Assoc Inf Sci Technol 66(1):23–39

    Article  Google Scholar 

  23. Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the 10th machine translation summit, pp 79–86

  24. Kraaij W, De Jong F (2004) Transitive probabilistic CLIR models. In: Proceedings of the 7th international RIAO conference, CID, pp 69–81

  25. Kraaij W, Westerveld T, Hiemstra D (2002) The importance of prior probabilities for entry page search. In: Proceedings of the 25th international ACM SIGIR conference on research and development in information retrieval. ACM, pp 27–34

  26. Li H (2014) Learning to rank for information retrieval and natural language processing. Synth Lect Hum Lang Technol 7(3):1–121

    Article  Google Scholar 

  27. Liu TY (2011) Learning to rank for information retrieval. Springer, Berlin

    Book  MATH  Google Scholar 

  28. Lwin PHM (2012) Query dependent ranking for information retrieval based on query clustering. Int J Inf Commun Technol 2(1):25–30

    Google Scholar 

  29. Manning CD, Raghavan P, Schutze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge

    Book  MATH  Google Scholar 

  30. Mansouri A, Faili H (2012) State-of-the-art English to Persian statistical machine translation system. In: Proceedings of the 16th CSI international symposium on artificial intelligence and signal processing. IEEE, pp 174–179

  31. Miangah TM (2009) Constructing a large-scale english-persian parallel corpus. Meta: Trans J 54(1):181–188

    Article  Google Scholar 

  32. Ni W, Huang Y, Xie M (2008) A query dependent approach to learning to rank for information retrieval. In: Proceedings of the 9th international conference on web-age information management. IEEE, pp 262–269

  33. Nie JY (2010) Cross-language information retrieval. Synth Lect Hum Lang Technol 3(1):1–125

    Article  Google Scholar 

  34. Nie JY, Isabelle P, Plamondon P, Foster G (1998) Using a probabilistic translation model for cross-language information retrieval. In: Proceedings of the 6th workshop on very large Corpora. Association for Computational Linguistics, pp 18–27

  35. Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51

    Article  MATH  Google Scholar 

  36. Peng J, MacDonald C, Ounis I (2010) Learning to select a ranking function. In: Proceedings of the 32nd European conference on IR research. Springer, pp 114–126

  37. Rahimi R, Shakery A (2013) A language modeling approach for extracting translation knowledge from comparable corpora. In: Proceedings of the 35th European conference on IR research. Springer, pp 606–617

  38. Rahimi R, Shakery A, King I (2015a) Extracting translations from comparable corpora for cross-Language information retrieval using the language modeling framework. Inf Process Manag 52(2):299–318

    Article  Google Scholar 

  39. Rahimi R, Shakery A, King I (2015b) Multilingual information retrieval in the language modeling framework. Inf Retr 18(3):246–281

    Article  Google Scholar 

  40. Robertson S, Walker S, Jones S, Hancock-Beaulieu M, Gatford M (1994) Okapi at TREC-3. In: Proceedings of the 3rd text retrieval conference (TREC-3), pp 109–126

  41. Sari S, Adriani M (2014) Learning to rank for determining relevant document in Indonesian-English cross language information retrieval using BM25. In: International conference on advanced computer science and information system. IEEE, pp 309–314

  42. Schamoni S (2013) Reducing feature space for learning to rank in cross-language information retrieval. Ph.D. thesis, University Heidelberg

  43. Schamoni S, Riezler S (2015) Combining orthogonal information in large-scale cross-language information retrieval. In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. ACM, pp 943–946

  44. Scholer F, Williams HE, Turpin A (2004) Query association surrogates for web search. J Am Soc Inf Sci Technol 55(7):637–650

    Article  Google Scholar 

  45. Sharma VK, Mittal N (2016) Cross lingual information retrieval (CLIR): review of tools, challenges and translation approaches corpora ontology NER Google translator Homonymy Polysemy. In: Information systems design and intelligent applications, Vol. 433. Springer, pp 699–708

  46. Tiedemann J (2012) Parallel data, tools and interfaces in OPUS. In: Proceedings of the eight international conference on language resources and evaluation, European language resources association (ELRA), pp 2214–2218

  47. Tsai MF, Chen HH, Wang YT (2011) Learning a merge model for multilingual information retrieval. Inf Process Manag 47(5):635–646

    Article  Google Scholar 

  48. Tsai MF, Wang YT, Chen HH (2008) A study of learning a merge model for multilingual information retrieval. In: Proceedings of the 31st international ACM SIGIR conference on research and development in information retrieval. ACM, pp 195–202

  49. Ture F, Lin J (2014) Exploiting representations from statistical machine translation for cross-language information retrieval. ACM Trans Inf Syst 32(4):19–32

    Article  Google Scholar 

  50. Usunier N, Amini MR, Goutte C (2011) Multiview semi-supervised learning for ranking multilingual documents. In: Proceedings of the 2011 European conference on machine learning and knowledge discovery in databases. Springer, pp 443–458

  51. Voorhees EM, Harman DK (2005) TREC: experiment and evaluation in information retrieval. The MIT Press, Cambridge

    Google Scholar 

  52. Vulic I, francine Moens M (2015) Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. ACM, pp 363–372

  53. Xu J, Li H (2007) AdaRank: a boosting algorithm for information retrieval. In: Proceedings of the 30th international ACM SIGIR conference on research and development in information retrieval. ACM, pp 391–398

  54. Zhai C (2007) Statistical language models for information retrieval—a critical review. Found Trends® Inf Retr 2(3):137–213

    Article  Google Scholar 

  55. Zhai C, Lafferty J (2004) A study of smoothing methods for language models applied to information retrieval. ACM Trans Inf Syst 22(2):179–214

    Article  Google Scholar 

  56. Zhao Y, Scholer F, Tsegay Y (2008) Effective pre-retrieval query performance prediction using similarity and variability evidence. In: Proceedings of the 30th European conference on IR research. Springer, pp 52–64

Download references

Acknowledgements

We are grateful to the anonymous reviewers for their constructive comments. This research was supported in part by a grant from the school of computer science, Institute for Research in Fundamental Sciences (No. CS 1397-4-55).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Azadeh Shakery.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Ghanbari, E., Shakery, A. Query-dependent learning to rank for cross-lingual information retrieval. Knowl Inf Syst 59, 711–743 (2019). https://doi.org/10.1007/s10115-018-1232-8

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-018-1232-8

Keywords

Navigation