Abstract
In many probabilistic modeling approaches to Information Retrieval, we are interested in estimating how well a document model “fits” the user’s information need (the query model). In statistics, meanwhile, goodness of fit tests are well-established techniques for assessing assumptions about the underlying distribution of a data set. Supposing that the query terms are randomly distributed across the documents of the collection, we want to know whether the query terms occur in a particular document more frequently than chance alone would predict. This can be quantified by goodness of fit tests. In this paper, we present a new document ranking technique based on Chi-square goodness of fit tests. Under the null hypothesis that there is no association between the query terms q and the document d beyond chance occurrences, we perform a Chi-square goodness of fit test to assess this hypothesis and calculate the corresponding Chi-square values. Our retrieval formula ranks the documents in the collection according to these calculated Chi-square values. The method was evaluated on the full TREC test collection on disks 4 and 5, using the topics of the TREC-7 and TREC-8 conferences (50 topics each). It performs well, consistently outperforming the classical OKAPI term frequency weighting formula, although it falls below the KL-divergence method from the language modeling approach. Even so, we believe the technique is an important non-parametric way of thinking about retrieval: it makes it possible to try simple alternative retrieval formulas within the framework of goodness of fit tests, modeling the data in various ways by estimating or assigning an arbitrary theoretical distribution to the terms.
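As a rough illustration of the idea, the following Python sketch scores a document by a Chi-square statistic over the query terms. The abstract does not give the exact retrieval formula, so the details here are assumptions: the expected count of a term under the null hypothesis is taken to be the document length times the term's relative frequency in the collection, and the names `chi_square_score`, `rank_documents`, `collection_tf`, and `collection_len` are hypothetical.

```python
from collections import Counter


def chi_square_score(query_terms, doc_tokens, collection_tf, collection_len):
    """Sum of (observed - expected)^2 / expected over the query terms.

    Assumption: under the null hypothesis a term occurs in the document
    at the same rate as in the whole collection.
    """
    observed_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query_terms:
        # Expected count if the term were spread over the collection by chance.
        expected = doc_len * collection_tf.get(term, 0) / collection_len
        if expected == 0:
            continue  # term unseen in the collection: no expectation to test against
        observed = observed_counts[term]
        score += (observed - expected) ** 2 / expected
    return score


def rank_documents(query_terms, docs, collection_tf, collection_len):
    """Return document ids sorted by descending Chi-square score."""
    scores = {
        doc_id: chi_square_score(query_terms, tokens, collection_tf, collection_len)
        for doc_id, tokens in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data, invented purely for illustration.
docs = {
    "d1": ["chi", "square", "test", "chi"],
    "d2": ["test", "test", "data", "data"],
}
collection_tf = {"chi": 10, "square": 10, "test": 20, "data": 30}
collection_len = 1000

ranking = rank_documents(["chi"], docs, collection_tf, collection_len)
```

Note that the plain statistic is two-sided: a document containing a query term less often than expected also receives a positive score, so a practical ranking function may need to handle that case separately.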
Cite this article
Fragos, K., Maistros, Y. A goodness of fit test approach in information retrieval. Inf Retrieval 9, 331–342 (2006). https://doi.org/10.1007/s10791-006-3609-7
DOI: https://doi.org/10.1007/s10791-006-3609-7