A goodness of fit test approach in information retrieval

Fragos, Kostas; Maistros, Yannis

doi:10.1007/s10791-006-3609-7

Published: June 2006

Volume 9, pages 331–342, (2006)

Information Retrieval

Kostas Fragos¹ &
Yannis Maistros¹

Abstract

In many probabilistic modeling approaches to Information Retrieval, we are interested in estimating how well a document model “fits” the user’s information need (the query model). In statistics, goodness of fit tests are well-established techniques for assessing assumptions about the underlying distribution of a data set. Supposing that the query terms are randomly distributed across the documents of the collection, we want to know whether the query terms occur in a particular document more frequently than chance would predict. This can be quantified by goodness of fit tests. In this paper, we present a new document ranking technique based on Chi-square goodness of fit tests. Given the null hypothesis that there is no association between the query terms q and the document d beyond chance occurrences, we perform a Chi-square goodness of fit test to assess this hypothesis and calculate the corresponding Chi-square values. Our retrieval formula ranks the documents in the collection according to these Chi-square values. The method was evaluated over the entire TREC test collection on disks 4 and 5, using the topics of the TREC-7 and TREC-8 conferences (50 topics each). It performs well, consistently outperforming the classical OKAPI term frequency weighting formula, but falls below the KL-divergence method from the language modeling approach. Despite this, we believe that the technique is an important non-parametric way of thinking about retrieval: it makes it possible to try simple alternative retrieval formulas within the framework of goodness of fit statistical tests, modeling the data in various ways by estimating or assigning an arbitrary theoretical distribution over terms.
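The abstract does not state the retrieval formula itself, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that the expected count of a query term in a document is the document length times the term's collection-wide relative frequency, and it scores each document with the standard Pearson Chi-square statistic summed over the query terms. The function names, the toy documents, and the expected-count model are all illustrative assumptions.

```python
from collections import Counter


def chi_square_score(query_terms, doc_tokens, collection_counts, collection_length):
    """Pearson Chi-square statistic over the query terms for one document.

    Assumption (not from the paper): under the null hypothesis, a term occurs
    in the document at its collection-wide relative frequency.
    """
    doc_counts = Counter(doc_tokens)
    doc_length = len(doc_tokens)
    score = 0.0
    for term in set(query_terms):
        expected = doc_length * collection_counts[term] / collection_length
        if expected == 0:
            continue  # term never occurs in the collection: skip it
        observed = doc_counts[term]
        score += (observed - expected) ** 2 / expected
    return score


def rank_documents(query_terms, docs):
    """Return (doc_id, score) pairs sorted by decreasing Chi-square value."""
    collection_counts = Counter(tok for toks in docs.values() for tok in toks)
    collection_length = sum(collection_counts.values())
    scores = {
        doc_id: chi_square_score(query_terms, toks, collection_counts, collection_length)
        for doc_id, toks in docs.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical three-document collection, for illustration only.
docs = {
    "d1": "chi square test for goodness of fit".split(),
    "d2": "language models for retrieval".split(),
    "d3": "okapi term weighting for retrieval of documents".split(),
}
ranking = rank_documents(["chi", "square"], docs)
print(ranking[0][0])  # the only document containing the query terms ranks first
```

Note that the plain Pearson statistic is two-sided: a document in which a query term occurs less often than expected also receives a positive contribution. A practical ranker might count only terms that are over-represented; that refinement is a design choice of this sketch's author and is not taken from the source.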

Author information

Authors and Affiliations

Department of Electrical and Computer Engineers, National Technical University of Athens, Iroon Polytechniou 9, 15780 Zografou, Greece
Kostas Fragos & Yannis Maistros

Corresponding author

Correspondence to Kostas Fragos.

About this article

Cite this article

Fragos, K., Maistros, Y. A goodness of fit test approach in information retrieval. Inf Retrieval 9, 331–342 (2006). https://doi.org/10.1007/s10791-006-3609-7

Received: 12 July 2004
Revised: 25 May 2005
Accepted: 13 July 2005
Issue Date: June 2006
DOI: https://doi.org/10.1007/s10791-006-3609-7
