ABSTRACT
The relative performance of retrieval systems when evaluated on one part of a test collection may bear little or no similarity to the relative performance measured on a different part of the collection. In this paper we report the results of a detailed study of the impact that different sub-collections have on retrieval effectiveness, analyzing the effect over many collections and with different approaches to sub-dividing the collections. The effect is shown to be substantial, affecting even comparisons between retrieval runs that are statistically significant. Some possible causes of the effect are investigated, and the implications of this work are examined, both for test collection design and for the strength of conclusions that can be drawn from experimental results.
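As a rough illustration of the kind of comparison the paper studies, the sketch below evaluates two hypothetical systems ("sys1" and "sys2") on two sub-collections of the same test collection, using per-topic average precision and a paired t-test over topics. This is not the authors' code or data; the scores are synthetic and the sub-collection split, system names, and score distributions are all assumptions made purely for illustration.

```python
# Illustrative sketch (not the paper's method): compare two hypothetical
# systems on two sub-collections of one test collection, using per-topic
# average precision (AP). All scores below are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_topics = 50

# Synthetic per-topic AP scores for each system, computed separately over the
# documents in sub-collection A and sub-collection B.
ap = {
    "A": {"sys1": rng.beta(2, 5, n_topics), "sys2": rng.beta(2, 6, n_topics)},
    "B": {"sys1": rng.beta(2, 6, n_topics), "sys2": rng.beta(2, 5, n_topics)},
}

for sub, scores in ap.items():
    map1, map2 = scores["sys1"].mean(), scores["sys2"].mean()
    # Paired t-test over topics: is the difference between the two systems
    # statistically significant on this sub-collection?
    t, p = stats.ttest_rel(scores["sys1"], scores["sys2"])
    print(f"sub-collection {sub}: MAP sys1={map1:.3f}, MAP sys2={map2:.3f}, "
          f"t={t:.2f}, p={p:.3f}")

# The point of the sketch: the sign (and significance) of the MAP difference
# can differ between sub-collections drawn from the same underlying collection.
```

Run with NumPy and SciPy installed, the script prints the mean average precision and significance result per sub-collection, making the possible reversal of system ordering across sub-collections visible at a glance.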