Research Article
DOI: 10.1145/1835449.1835560

Comparing the sensitivity of information retrieval metrics

Published: 19 July 2010

ABSTRACT

Information retrieval effectiveness is usually evaluated using measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP) and Precision at some cutoff (Precision@k) on a set of judged queries. Recent research has suggested an alternative, evaluating information retrieval systems based on user behavior. Particularly promising are experiments that interleave two rankings and track user clicks. According to a recent study, interleaving experiments can identify large differences in retrieval effectiveness with much better reliability than other click-based methods.
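For reference, the three traditional measures named above are computed from per-query relevance judgments. The sketch below gives only the standard textbook definitions; it is an illustration, not code from the paper, and the graded judgments, the 2^rel − 1 gain used in DCG, and the simplification of normalizing average precision by the relevant results present in the list are assumptions made for brevity.

```python
# Minimal sketch of Precision@k, Average Precision and NDCG for the judged
# ranked results of one query (illustrative, not the paper's implementation).
import math
from typing import Sequence


def precision_at_k(rels: Sequence[int], k: int) -> float:
    """Fraction of the top-k results judged relevant (grade > 0)."""
    return sum(1 for r in rels[:k] if r > 0) / k


def average_precision(rels: Sequence[int]) -> float:
    """Precision@i averaged over the ranks i that hold a relevant result.
    Normalized here by the relevant results present in the list, which equals
    full AP only when all relevant documents are retrieved."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0


def dcg(rels: Sequence[int], k: int) -> float:
    """Discounted cumulative gain with the common (2^rel - 1) gain."""
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(rels[:k], start=1))


def ndcg(rels: Sequence[int], k: int) -> float:
    """DCG normalized by the DCG of an ideal (relevance-sorted) ranking."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal else 0.0


# Hypothetical graded judgments (0-3) for the top five results of one query.
judged = [3, 0, 2, 0, 1]
print(precision_at_k(judged, 5))   # 3 relevant in top 5 -> 0.6
print(average_precision(judged))   # (1/1 + 2/3 + 3/5) / 3
print(ndcg(judged, 5))
```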

We study interleaving in more detail, comparing it with traditional measures in terms of reliability, sensitivity and agreement. To detect very small differences in retrieval effectiveness, a reliable outcome with standard metrics requires about 5,000 judged queries, and this is about as reliable as interleaving with 50,000 user impressions. Amongst the traditional measures, NDCG has the strongest correlation with interleaving. Finally, we present some new forms of analysis, including an approach to enhance interleaving sensitivity.
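The abstract does not restate the interleaving algorithm itself, so the sketch below illustrates one common variant, team-draft interleaving, as an assumed example rather than the paper's exact procedure. In each round the two rankings contribute their highest-ranked not-yet-shown documents in a random order, and the ranking whose contributions attract more clicks is credited with winning that impression; aggregating these per-impression outcomes over many impressions yields the click-based comparison whose reliability is discussed above.

```python
# Illustrative sketch of team-draft interleaving (an assumption, not the
# paper's exact procedure). Document IDs and clicked positions are made up.
import random
from typing import List, Sequence, Tuple


def team_draft_interleave(a: Sequence[str], b: Sequence[str],
                          rng: random.Random) -> Tuple[List[str], List[str]]:
    """Interleave rankings a and b; record which 'team' supplied each slot."""
    interleaved: List[str] = []
    teams: List[str] = []
    seen = set()
    ia = ib = 0
    while ia < len(a) or ib < len(b):
        # Each round, both rankers contribute one document; a coin flip
        # decides which of them picks first in that round.
        for team in rng.sample(["A", "B"], 2):
            src, idx = (a, ia) if team == "A" else (b, ib)
            while idx < len(src) and src[idx] in seen:
                idx += 1  # skip documents the other ranker already placed
            if idx < len(src):
                interleaved.append(src[idx])
                teams.append(team)
                seen.add(src[idx])
            if team == "A":
                ia = idx + 1
            else:
                ib = idx + 1
    return interleaved, teams


def impression_outcome(teams: Sequence[str], clicks: Sequence[int]) -> int:
    """+1 if ranker A's contributions got more clicks, -1 for B, 0 for a tie."""
    a_clicks = sum(1 for pos in clicks if teams[pos] == "A")
    b_clicks = len(clicks) - a_clicks
    return (a_clicks > b_clicks) - (a_clicks < b_clicks)


rng = random.Random(0)
ranking_a = ["d1", "d2", "d3", "d4"]  # hypothetical ranking from system A
ranking_b = ["d3", "d1", "d5", "d6"]  # hypothetical ranking from system B
shown, teams = team_draft_interleave(ranking_a, ranking_b, rng)
print(shown, teams, impression_outcome(teams, clicks=[0, 2]))
```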


Published in

SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
July 2010, 944 pages
ISBN: 9781450301534
DOI: 10.1145/1835449
Copyright © 2010 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      • Published: 19 July 2010


      Acceptance Rates

SIGIR '10 paper acceptance rate: 87 of 520 submissions (17%). Overall acceptance rate: 792 of 3,983 submissions (20%).
