Metric and Relevance Mismatch in Retrieval Evaluation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5839)

Abstract

Recent investigations of search performance have shown that, even when presented with two systems, one superior and one inferior according to a Cranfield-style batch experiment, real users may perform equally well with either system. In this paper, we explore how these evaluation paradigms may be reconciled. First, we investigate the DCG@1 and P@1 metrics and their relationship with user performance on a common web search task. Our results show that batch-experiment predictions based on P@1 or DCG@1 translate directly to user search effectiveness. However, marginally relevant documents are not strongly differentiable from non-relevant documents. Therefore, when folding multiple relevance levels into a binary scale, marginally relevant documents should be grouped with non-relevant documents, rather than with highly relevant documents as is currently done in standard IR evaluations.
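To make the binarisation point concrete, the sketch below (ours, not the authors' code) computes DCG@1 and P@1 from graded judgments on an assumed three-level scale (0 = non-relevant, 1 = marginally relevant, 2 = highly relevant). A flag controls whether marginally relevant documents are folded into the relevant or the non-relevant class; the function names and the linear gain used for DCG are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only. Assumes a three-level graded scale:
# 0 = non-relevant, 1 = marginally relevant, 2 = highly relevant.

def dcg_at_1(graded_judgments):
    """DCG@1 with a linear gain reduces to the gain of the top-ranked
    document, since no rank discount applies at position 1."""
    return float(graded_judgments[0])

def p_at_1(graded_judgments, fold_marginal_with_nonrelevant=True):
    """P@1 under a binary fold of the graded scale.

    With fold_marginal_with_nonrelevant=True, only highly relevant
    documents (level 2) count as relevant, as the paper's findings
    suggest; with False, levels 1 and 2 both count, as in the
    conventional fold.
    """
    threshold = 2 if fold_marginal_with_nonrelevant else 1
    return 1.0 if graded_judgments[0] >= threshold else 0.0

# Example: a ranking whose top document is only marginally relevant.
ranking = [1, 2, 0]
print(dcg_at_1(ranking))                                      # 1.0
print(p_at_1(ranking, fold_marginal_with_nonrelevant=True))   # 0.0 (suggested fold)
print(p_at_1(ranking, fold_marginal_with_nonrelevant=False))  # 1.0 (conventional fold)
```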

We then investigate relevance mismatch, classifying users based on their relevance profiles: the likelihood with which they judge documents of different relevance levels to be useful. When relevance profiles can be estimated well, this classification scheme can offer further insight into the transferability of batch results to real user search tasks.
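As an illustration of what a relevance profile might look like operationally, the hedged sketch below estimates, for a single user, the empirical probability that documents of each graded level are judged useful. The data layout and function name are hypothetical and introduced only for illustration.

```python
# Illustrative sketch (hypothetical data layout): estimating a user's
# "relevance profile" as the empirical probability that they judge a
# document of each graded relevance level to be useful.
from collections import defaultdict

def relevance_profile(observations):
    """observations: iterable of (graded_level, judged_useful) pairs
    for one user, e.g. [(2, True), (1, False), ...]."""
    counts = defaultdict(lambda: [0, 0])  # level -> [useful_count, total_count]
    for level, useful in observations:
        counts[level][1] += 1
        if useful:
            counts[level][0] += 1
    return {level: useful / total for level, (useful, total) in counts.items()}

# Example: this user almost always finds highly relevant documents useful,
# but treats marginally relevant ones much like non-relevant ones.
obs = [(2, True), (2, True), (2, False),
       (1, False), (1, True), (1, False),
       (0, False), (0, False)]
print(relevance_profile(obs))  # {2: ~0.67, 1: ~0.33, 0: 0.0}
```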




Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Scholer, F., Turpin, A. (2009). Metric and Relevance Mismatch in Retrieval Evaluation. In: Lee, G.G., et al. Information Retrieval Technology. AIRS 2009. Lecture Notes in Computer Science, vol 5839. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04769-5_5

  • DOI: https://doi.org/10.1007/978-3-642-04769-5_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04768-8

  • Online ISBN: 978-3-642-04769-5
