ABSTRACT
Information retrieval effectiveness is usually evaluated using measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP), and Precision at a fixed cutoff (Precision@k), computed over a set of judged queries. Recent research has suggested an alternative: evaluating information retrieval systems based on observed user behavior. Particularly promising are experiments that interleave two rankings and track user clicks. According to a recent study, interleaving experiments can identify large differences in retrieval effectiveness with much better reliability than other click-based methods.
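To make the interleaving setup concrete, here is a minimal Python sketch of team-draft interleaving, one widely used variant of this experiment design. The function names and the simple per-impression credit rule are illustrative assumptions, not necessarily the exact procedure evaluated in the paper:

```python
import random

def team_draft_interleave(ranking_a, ranking_b):
    # Merge two rankings team-draft style: in each round a coin flip decides
    # which ranker picks first, then each ranker contributes its highest
    # not-yet-selected document to the interleaved list.
    interleaved, team = [], {}
    i = j = 0
    while i < len(ranking_a) or j < len(ranking_b):
        order = ('A', 'B') if random.random() < 0.5 else ('B', 'A')
        for side in order:
            if side == 'A':
                while i < len(ranking_a) and ranking_a[i] in team:
                    i += 1
                if i < len(ranking_a):
                    team[ranking_a[i]] = 'A'
                    interleaved.append(ranking_a[i])
                    i += 1
            else:
                while j < len(ranking_b) and ranking_b[j] in team:
                    j += 1
                if j < len(ranking_b):
                    team[ranking_b[j]] = 'B'
                    interleaved.append(ranking_b[j])
                    j += 1
    return interleaved, team

def score_impression(team, clicked):
    # Credit the impression to whichever team's documents drew more clicks.
    a = sum(1 for doc in clicked if team.get(doc) == 'A')
    b = sum(1 for doc in clicked if team.get(doc) == 'B')
    return 'A' if a > b else ('B' if b > a else 'tie')
```

Aggregated over many impressions, the fraction of non-tied impressions won by each ranker gives the interleaving preference between the two systems.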
We study interleaving in more detail, comparing it with traditional measures in terms of reliability, sensitivity, and agreement. We find that reliably detecting very small differences in retrieval effectiveness requires about 5,000 judged queries with standard metrics, which is roughly as reliable as interleaving with 50,000 user impressions. Among the traditional measures, NDCG has the strongest correlation with interleaving. Finally, we present some new forms of analysis, including an approach to enhance the sensitivity of interleaving.
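For reference, a minimal sketch of NDCG@k, the offline measure that correlates best with interleaving here. This uses the common exponential-gain, log-discount formulation; the exact gain and discount choices in the study are an assumption on our part:

```python
import math

def ndcg_at_k(gains, k):
    # gains: graded relevance judgments of the returned documents, in the
    # order the system ranked them. DCG@k discounts each gain by log2(rank+1);
    # NDCG@k divides by the DCG of the ideal (relevance-sorted) ranking.
    # For simplicity, the ideal here is computed from the retrieved documents
    # only, rather than from all judged documents for the query.
    def dcg(rels):
        return sum((2 ** rel - 1) / math.log2(rank + 2)
                   for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

For example, `ndcg_at_k([3, 2, 0, 1], 3)` scores a ranking whose top three results have graded judgments 3, 2, and 0.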