ABSTRACT
Web search tools are used on a daily basis by billions of people. The commercial providers of these services spend large amounts of money measuring their own effectiveness and benchmarking against their competitors; nothing less than their corporate survival is at stake. Techniques for offline or "batch" evaluation of search quality have received considerable attention, spanning ways of constructing relevance judgments; ways of using them to generate numeric scores; and ways of inferring system "superiority" from sets of such scores.
Our purpose in this paper is to consider these mechanisms as a chain of inter-dependent activities, in order to explore some of the ramifications of alternative components. By disaggregating the different activities, and asking what the ultimate objective of the measurement process is, we provide new insights into evaluation approaches, and are able to suggest new combinations that might prove fruitful avenues for exploration. Our observations are examined with reference to data collected from a user study covering 34 users, each undertaking six search tasks, using two systems of markedly different quality.
We hope to foster broader awareness of the many factors that go into an evaluation of search effectiveness, and of the implications of these choices, and to encourage researchers to carefully report all aspects of the evaluation process when describing their system performance experiments.
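To make the evaluation chain concrete, the following is a minimal sketch of one possible instantiation, assuming binary relevance judgments, average precision as the per-topic score, and a paired t-test over per-topic scores as the inference step. These particular components, and the toy topics and document identifiers, are illustrative assumptions rather than the configuration studied in the paper.

```python
# Sketch of a batch-evaluation chain: judgments -> per-topic scores ->
# paired significance test. All data below is hypothetical.
from statistics import mean
from scipy.stats import ttest_rel  # paired t-test over per-topic scores


def average_precision(ranking, relevant):
    """AP of one ranked list of doc ids against a set of judged-relevant ids."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def evaluate(run, qrels):
    """Per-topic AP scores for one system; run maps topic -> ranked doc ids."""
    return [average_precision(run[t], qrels[t]) for t in sorted(qrels)]


# Toy judgments and two toy systems over three topics.
qrels = {"t1": {"d1", "d4"}, "t2": {"d2"}, "t3": {"d3", "d5"}}
system_a = {"t1": ["d1", "d2", "d4"], "t2": ["d2", "d9"], "t3": ["d3", "d5"]}
system_b = {"t1": ["d7", "d1", "d4"], "t2": ["d9", "d2"], "t3": ["d5", "d8"]}

scores_a, scores_b = evaluate(system_a, qrels), evaluate(system_b, qrels)
print("MAP A =", mean(scores_a), "MAP B =", mean(scores_b))
print("paired t-test:", ttest_rel(scores_a, scores_b))
```

Swapping in a different metric (say, precision at depth k) or a different significance test changes only one link of the chain, which is precisely the kind of component substitution the paper sets out to examine.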