
Choices in batch information retrieval evaluation

Published: 05 December 2013
DOI: 10.1145/2537734.2537745

ABSTRACT

Web search tools are used on a daily basis by billions of people. The commercial providers of these services spend large amounts of money measuring their own effectiveness and benchmarking against their competitors; nothing less than their corporate survival is at stake. Techniques for offline or "batch" evaluation of search quality have received considerable attention, spanning ways of constructing relevance judgments; ways of using them to generate numeric scores; and ways of inferring system "superiority" from sets of such scores.
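To make that chain concrete, the sketch below strings the three stages together for two hypothetical systems: a set of graded relevance judgments, a metric (NDCG@10 here) that converts each ranked list plus judgments into a per-topic score, and a paired significance test over the two score vectors. The data structures, the metric and cutoff, and the t-test are illustrative assumptions only; none of them is prescribed by the paper.

```python
# A minimal sketch of the batch-evaluation chain: relevance judgments,
# per-topic effectiveness scores, and an inference about which system is
# "superior". All names and structures here are illustrative assumptions.

import math
from scipy import stats  # paired t-test for the final comparison step


def ndcg_at_k(ranking, qrels, k=10):
    """NDCG@k for one topic: `ranking` is an ordered list of document ids,
    `qrels` maps document id -> graded relevance (unjudged counts as 0)."""
    gains = [qrels.get(doc, 0) for doc in ranking[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def score_run(run, judgments, k=10):
    """Turn one system's run into per-topic scores. `run` maps topic id ->
    ranked document ids; `judgments` maps topic id -> {document id: relevance}."""
    return [ndcg_at_k(run[topic], judgments.get(topic, {}), k)
            for topic in sorted(run)]


def compare_systems(run_a, run_b, judgments, k=10):
    """Compare two systems over the same topics: mean scores plus the p-value
    of a two-sided paired t-test on the per-topic score vectors."""
    scores_a = score_run(run_a, judgments, k)   # assumes both runs cover
    scores_b = score_run(run_b, judgments, k)   # exactly the same topics
    result = stats.ttest_rel(scores_a, scores_b)
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    return mean_a, mean_b, result.pvalue
```

Every choice hard-coded in this sketch (who produces the judgments, which metric and evaluation depth, how per-topic scores are aggregated, which test licenses a claim of superiority) is one of the interdependent decisions the paper asks evaluators to make, and to report, explicitly.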

Our purpose in this paper is to consider these mechanisms as a chain of inter-dependent activities, in order to explore some of the ramifications of alternative components. By disaggregating the different activities, and asking what the ultimate objective of the measurement process is, we provide new insights into evaluation approaches, and are able to suggest new combinations that might prove fruitful avenues for exploration. Our observations are examined with reference to data collected from a user study covering 34 users undertaking a total of six search tasks each, using two systems of markedly different quality.

We hope to encourage broader awareness of the many factors that go into an evaluation of search effectiveness and of the implications of these choices, and to encourage researchers to carefully report all aspects of the evaluation process when describing their system performance experiments.


Published in

ADCS '13: Proceedings of the 18th Australasian Document Computing Symposium
December 2013, 126 pages
ISBN: 9781450325240
DOI: 10.1145/2537734
Copyright © 2013 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates

ADCS '13 paper acceptance rate: 12 of 23 submissions (52%). Overall acceptance rate: 30 of 57 submissions (53%).
