A parallel derivation of probabilistic information retrieval models

Published: 06 August 2006

Abstract

This paper investigates, in a stringent mathematical formalism, the parallel derivation of three grand probabilistic retrieval models: binary independent retrieval (BIR), Poisson model (PM), and language modelling (LM). The investigation has been motivated by a number of questions. Firstly, though sharing the same origin, namely the probability of relevance, the models differ with respect to event spaces. How can this be captured in a consistent notation, and can we relate the event spaces? Secondly, BIR and PM are closely related, but how does LM fit in? Thirdly, how are tf-idf and probabilistic models related? The parallel investigation of the models leads to a number of formalised results:
  • BIR and PM assume the collection to be a set of non-relevant documents, whereas LM assumes the collection to be a set of terms from relevant documents.
  • PM can be viewed as a bridge connecting BIR and LM.
  • A BIR-LM equivalence explains BIR as a special LM case.
  • PM explains tf-idf, and both BIR and LM probabilities express tf-idf in a dual way.
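For orientation, the three models named above can be sketched in their common textbook forms. This is a minimal illustration, not the paper's notation or derivation: the toy collection, the function names, the 0.5 smoothing constant, the default `p_rel = 0.5`, and the mixture weight `mix = 0.8` are all assumptions made here. The BIR estimate uses the whole collection as a stand-in for the non-relevant documents, which with `p_rel = 0.5` yields an idf-like weight; the LM estimate uses collection-wide term counts as the background model.

```python
import math
from collections import Counter

# Hypothetical toy collection, for illustration only.
DOCS = {
    "d1": "sailing boats sailing greece yacht".split(),
    "d2": "sailing boats".split(),
    "d3": "greece holiday".split(),
    "d4": "holiday boats greece greece".split(),
}


def bir_weight(term, docs, p_rel=0.5):
    """Textbook BIR term weight log[p(1-q) / (q(1-p))].

    q = P(term | non-relevant) is estimated from the whole collection,
    i.e. the collection is treated as a set of non-relevant documents.
    The 0.5 smoothing and p_rel = 0.5 are assumptions of this sketch.
    """
    n_docs = len(docs)
    n_t = sum(1 for words in docs.values() if term in words)
    q = (n_t + 0.5) / (n_docs + 1.0)
    return math.log(p_rel * (1.0 - q) / (q * (1.0 - p_rel)))


def poisson_prob(k, lam):
    """Poisson probability of observing a term k times, given mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def avg_term_frequency(term, docs):
    """Average number of occurrences of a term per document."""
    total = sum(words.count(term) for words in docs.values())
    return total / len(docs)


def lm_query_likelihood(query, doc_id, docs, mix=0.8):
    """Query likelihood under a linear mixture of document and collection
    term probabilities (mix is an assumed smoothing parameter)."""
    doc_counts = Counter(docs[doc_id])
    coll_counts = Counter(w for words in docs.values() for w in words)
    doc_len = len(docs[doc_id])
    coll_len = sum(coll_counts.values())
    score = 1.0
    for term in query:
        p_doc = doc_counts[term] / doc_len      # P(term | document)
        p_coll = coll_counts[term] / coll_len   # P(term | collection)
        score *= mix * p_doc + (1.0 - mix) * p_coll
    return score


if __name__ == "__main__":
    for term in ("yacht", "sailing", "boats"):
        print(term, round(bir_weight(term, DOCS), 3))
    lam = avg_term_frequency("greece", DOCS)
    print("P(tf=2 | greece):", round(poisson_prob(2, lam), 3))
    for doc_id in DOCS:
        score = lm_query_likelihood(["sailing", "greece"], doc_id, DOCS)
        print(doc_id, round(score, 4))
```

In this sketch BIR works with document counts (in how many documents a term occurs), LM works with term counts (how often it occurs among all term positions), and the Poisson mean is a term count divided by a document count, which gives some intuition for why PM might sit between the other two.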

    Published In

    SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
    August 2006
    768 pages
    ISBN:1595933697
    DOI:10.1145/1148170
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Poisson
    2. binary retrieval model
    3. language modelling

    Qualifiers

    • Article

    Conference

    SIGIR06
    SIGIR06: The 29th Annual International SIGIR Conference
    August 6 - 11, 2006
    Seattle, Washington, USA

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%
