Generalizing Translation Models in the Probabilistic Relevance Framework

Authors:

Navid Rekabsaz,

Guido Zuccon

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

Pages 711 - 720

https://doi.org/10.1145/2983323.2983833

Published: 24 October 2016

Abstract

A recurring question in information retrieval is whether term associations can be properly integrated into traditional information retrieval models while preserving their robustness and effectiveness. In this paper, we revisit a wide spectrum of existing models (Pivoted Document Normalization, BM25, BM25 Verboseness Aware, Multi-Aspect TF, and Language Modelling) by introducing a generalisation of the idea of the translation model. This generalisation is a de facto transformation of the translation models from Language Modelling to the probabilistic models. In doing so, we observe a potential limitation of these generalised translation models: they only affect the term-frequency-based components of all the models, ignoring changes in document and collection statistics. We correct this limitation by extending the translation models with the statistics of term associations and provide extensive experimental results to demonstrate the benefit of the newly proposed methods. Additionally, we compare the translation models with query expansion methods based on the same term association resources, as well as with methods based on Pseudo-Relevance Feedback (PRF). We observe that translation models always outperform the former, but provide complementary information to the latter, such that by using PRF and our translation models together we observe results better than the current state of the art.
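The idea of carrying a translation model over to a term-frequency-based scorer can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the additive form of `translated_tf`, the `related` map of term association weights, the BM25 variant, and all parameter values below are assumptions chosen for illustration only.

```python
import math
from collections import Counter

def translated_tf(term, doc_tf, related):
    """Count of `term` plus association-weighted counts of related terms.

    `related` is a hypothetical map: term -> {related term: weight in [0, 1]}.
    """
    neighbours = related.get(term, {})
    return doc_tf.get(term, 0) + sum(w * doc_tf.get(t, 0) for t, w in neighbours.items())

def bm25_score(query, doc_tokens, related, df, n_docs, avg_len, k1=1.2, b=0.75):
    """BM25 in which only the term-frequency component sees term associations."""
    doc_tf = Counter(doc_tokens)
    length_norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
    score = 0.0
    for term in query:
        tf = translated_tf(term, doc_tf, related)
        if tf == 0:
            continue
        # Document frequency, document length and average length are left
        # unchanged here, which mirrors the limitation noted in the abstract.
        term_df = df.get(term, 0)
        idf = math.log((n_docs - term_df + 0.5) / (term_df + 0.5) + 1.0)
        score += idf * tf * (k1 + 1) / (tf + length_norm)
    return score

# Usage: the query term never occurs in the document, but a related term does.
doc = ["car", "engine", "repair"]
related = {"automobile": {"car": 0.8}}   # assumed association weight
df = {"automobile": 2, "car": 10}        # assumed document frequencies
print(bm25_score(["automobile"], doc, related, df, n_docs=100, avg_len=3.0))
```

In this sketch the document receives a non-zero score for "automobile" only through the related term "car", while the inverse document frequency and length normalisation still use the unmodified statistics. Extending those statistics with term association information is the correction the abstract describes; how that extension is defined is not specified here and is left out of the sketch.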

Index Terms

Generalizing Translation Models in the Probabilistic Relevance Framework
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Published In

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

October 2016

2566 pages

ISBN:9781450340731

DOI:10.1145/2983323

General Chairs: Snehasis Mukhopadhyay (Indiana University Purdue University Indianapolis, USA), ChengXiang Zhai (University of Illinois at Urbana-Champaign, USA)

Program Chairs: Elisa Bertino (Purdue University), Fabio Crestani (University of Lugano), Javed Mostafa (University of North Carolina), Jie Tang (Tsinghua University), Luo Si (Alibaba Group Inc & Purdue University), Xiaofang Zhou (University of Queensland), Yi Chang (Yahoo Research), Yunyao Li (IBM Research - Almaden), Parikshit Sondhi (WalmartLabs)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Austrian Research Promotion Agency (FFG)
Austrian Science Fund (FWF)

Conference

CIKM'16

CIKM'16: ACM Conference on Information and Knowledge Management

October 24 - 28, 2016

Indianapolis, Indiana, USA

Acceptance Rates

CIKM '16 Paper Acceptance Rate 160 of 701 submissions, 23%;

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
