ABSTRACT
Web search tools are used on a daily basis by billions of people. The commercial providers of these services spend large amounts of money measuring their own effectiveness and benchmarking against their competitors; nothing less than their corporate survival is at stake. Techniques for offline or "batch" evaluation of search quality have received considerable attention, spanning ways of constructing relevance judgments; ways of using them to generate numeric scores; and ways of inferring system "superiority" from sets of such scores.
Our purpose in this paper is to consider these mechanisms as a chain of inter-dependent activities, in order to explore some of the ramifications of alternative components. By disaggregating the different activities, and asking what the ultimate objective of the measurement process is, we provide new insights into evaluation approaches, and are able to suggest new combinations that might prove fruitful avenues for exploration. Our observations are examined with reference to data collected from a user study covering 34 users, each undertaking six search tasks, using two systems of markedly different quality.
We hope to foster broader awareness of the many factors that go into an evaluation of search effectiveness, and of the implications of these choices, and to encourage researchers to carefully report all aspects of the evaluation process when describing their system performance experiments.
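To make the evaluation chain concrete, the following is a minimal sketch of one possible instantiation, assuming binary relevance judgments, average precision as the per-topic score, and a paired t-test over per-topic scores as the inference step. These particular components, and the toy topics and document identifiers, are illustrative assumptions rather than the configuration studied in the paper.

```python
# Sketch of a batch-evaluation chain: judgments -> per-topic scores ->
# paired significance test. All data below is hypothetical.
from statistics import mean
from scipy.stats import ttest_rel  # paired t-test over per-topic scores


def average_precision(ranking, relevant):
    """AP of one ranked list of doc ids against a set of judged-relevant ids."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def evaluate(run, qrels):
    """Per-topic AP scores for one system; run maps topic -> ranked doc ids."""
    return [average_precision(run[t], qrels[t]) for t in sorted(qrels)]


# Toy judgments and two toy systems over three topics.
qrels = {"t1": {"d1", "d4"}, "t2": {"d2"}, "t3": {"d3", "d5"}}
system_a = {"t1": ["d1", "d2", "d4"], "t2": ["d2", "d9"], "t3": ["d3", "d5"]}
system_b = {"t1": ["d7", "d1", "d4"], "t2": ["d9", "d2"], "t3": ["d5", "d8"]}

scores_a, scores_b = evaluate(system_a, qrels), evaluate(system_b, qrels)
print("MAP A =", mean(scores_a), "MAP B =", mean(scores_b))
print("paired t-test:", ttest_rel(scores_a, scores_b))
```

Swapping in a different metric (say, precision at depth k) or a different significance test changes only one link of the chain, which is precisely the kind of component substitution the paper sets out to examine.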