Metric and Relevance Mismatch in Retrieval Evaluation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5839)

Abstract

Recent investigations of search performance have shown that, even when presented with two systems, one superior and one inferior according to a Cranfield-style batch experiment, real users may perform equally well with either system. In this paper, we explore how these evaluation paradigms may be reconciled. First, we investigate the DCG@1 and P@1 metrics and their relationship with user performance on a common web search task. Our results show that batch-experiment predictions based on P@1 or DCG@1 translate directly to user search effectiveness. However, marginally relevant documents are not strongly differentiable from non-relevant documents. Therefore, when folding multiple relevance levels into a binary scale, marginally relevant documents should be grouped with non-relevant documents, rather than with highly relevant documents as is currently done in standard IR evaluations.
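To make the binarisation point concrete, the sketch below (ours, not the authors' code) computes DCG@1 and P@1 from graded judgments on an assumed three-level scale (0 = non-relevant, 1 = marginally relevant, 2 = highly relevant). A flag controls whether marginally relevant documents are folded into the relevant or the non-relevant class; the function names and the linear gain used for DCG are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only. Assumes a three-level graded scale:
# 0 = non-relevant, 1 = marginally relevant, 2 = highly relevant.

def dcg_at_1(graded_judgments):
    """DCG@1 with a linear gain reduces to the gain of the top-ranked
    document, since no rank discount applies at position 1."""
    return float(graded_judgments[0])

def p_at_1(graded_judgments, fold_marginal_with_nonrelevant=True):
    """P@1 under a binary fold of the graded scale.

    With fold_marginal_with_nonrelevant=True, only highly relevant
    documents (level 2) count as relevant, as the paper's findings
    suggest; with False, levels 1 and 2 both count, as in the
    conventional fold.
    """
    threshold = 2 if fold_marginal_with_nonrelevant else 1
    return 1.0 if graded_judgments[0] >= threshold else 0.0

# Example: a ranking whose top document is only marginally relevant.
ranking = [1, 2, 0]
print(dcg_at_1(ranking))                                      # 1.0
print(p_at_1(ranking, fold_marginal_with_nonrelevant=True))   # 0.0 (suggested fold)
print(p_at_1(ranking, fold_marginal_with_nonrelevant=False))  # 1.0 (conventional fold)
```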

We then investigate relevance mismatch, classifying users based on their relevance profiles: the likelihood with which they judge documents of different relevance levels to be useful. When relevance profiles can be estimated well, this classification scheme can offer further insight into the transferability of batch results to real user search tasks.
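As an illustration of what a relevance profile might look like operationally, the hedged sketch below estimates, for a single user, the empirical probability that documents of each graded level are judged useful. The data layout and function name are hypothetical and introduced only for illustration.

```python
# Illustrative sketch (hypothetical data layout): estimating a user's
# "relevance profile" as the empirical probability that they judge a
# document of each graded relevance level to be useful.
from collections import defaultdict

def relevance_profile(observations):
    """observations: iterable of (graded_level, judged_useful) pairs
    for one user, e.g. [(2, True), (1, False), ...]."""
    counts = defaultdict(lambda: [0, 0])  # level -> [useful_count, total_count]
    for level, useful in observations:
        counts[level][1] += 1
        if useful:
            counts[level][0] += 1
    return {level: useful / total for level, (useful, total) in counts.items()}

# Example: this user almost always finds highly relevant documents useful,
# but treats marginally relevant ones much like non-relevant ones.
obs = [(2, True), (2, True), (2, False),
       (1, False), (1, True), (1, False),
       (0, False), (0, False)]
print(relevance_profile(obs))  # {2: ~0.67, 1: ~0.33, 0: 0.0}
```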




Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Scholer, F., Turpin, A. (2009). Metric and Relevance Mismatch in Retrieval Evaluation. In: Lee, G.G., et al. Information Retrieval Technology. AIRS 2009. Lecture Notes in Computer Science, vol 5839. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04769-5_5

  • DOI: https://doi.org/10.1007/978-3-642-04769-5_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04768-8

  • Online ISBN: 978-3-642-04769-5
