Bootstrap-Based Comparisons of IR Metrics for Finding One Relevant Document

  • Conference paper
Information Retrieval Technology (AIRS 2006)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 4182)

Abstract

This paper compares the sensitivity of IR metrics designed for the task of finding one relevant document, using a method recently proposed at SIGIR 2006. The metrics are: P+-measure, P-measure, O-measure, Normalised Weighted Reciprocal Rank (NWRR) and Reciprocal Rank (RR). All of them except for RR can handle graded relevance. Unlike the ad hoc (but nevertheless useful) “swap” method proposed by Voorhees and Buckley, the new method derives the sensitivity and the performance difference required to guarantee a given significance level directly from Bootstrap Hypothesis Tests. We use four data sets from NTCIR to show that, according to this method, “P(+)-measure ≥ O-measure ≥ NWRR ≥ RR” generally holds, where “≥” means “is at least as sensitive as”. These results generalise and reinforce previously reported ones based on the swap method. Therefore, we recommend the use of P(+)-measure and O-measure for practical tasks such as known-item search where recall is either unimportant or immeasurable.
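
The abstract summarises the bootstrap-based comparison only at a high level; the full procedure is given in the SIGIR 2006 paper [11]. As an illustration only, the following is a minimal Python sketch, under stated assumptions, of (a) Reciprocal Rank, whose standard definition (one over the rank of the first relevant retrieved document, zero if none is retrieved) underlies the baseline metric above, and (b) one common form of a paired bootstrap hypothesis test on per-topic score differences, in the spirit of Efron and Tibshirani [3]. The function names, the unstudentised mean statistic, and the choice of 1000 resamples are assumptions made for this sketch, not details taken from the paper.

    import random

    def reciprocal_rank(relevance_in_rank_order):
        """Reciprocal Rank: 1 / rank of the first relevant document, 0 if none is retrieved.
        `relevance_in_rank_order` is a list of booleans, one per retrieved document."""
        for rank, is_relevant in enumerate(relevance_in_rank_order, start=1):
            if is_relevant:
                return 1.0 / rank
        return 0.0

    def paired_bootstrap_asl(scores_x, scores_y, n_resamples=1000, seed=0):
        """Achieved significance level (ASL) of a simple paired bootstrap test for
        H0: the mean per-topic score difference between two runs is zero.
        The differences are shifted so that H0 holds, then resampled with
        replacement; the ASL is the fraction of resamples whose mean is at
        least as extreme as the observed mean (two-sided)."""
        rng = random.Random(seed)
        diffs = [x - y for x, y in zip(scores_x, scores_y)]
        n = len(diffs)
        observed = sum(diffs) / n
        shifted = [d - observed for d in diffs]   # enforce the null hypothesis
        extreme = 0
        for _ in range(n_resamples):
            resample = [rng.choice(shifted) for _ in range(n)]
            if abs(sum(resample) / n) >= abs(observed):
                extreme += 1
        return extreme / n_resamples              # small ASL => significant difference

    # Toy usage: per-topic RR scores for two hypothetical runs over five topics.
    run_x = [reciprocal_rank(r) for r in ([True], [False, True], [False, False, True],
                                          [True], [False, True])]
    run_y = [reciprocal_rank(r) for r in ([False, True], [False, False, True],
                                          [False, False, False, True], [False, True], [True])]
    print(paired_bootstrap_asl(run_x, run_y))

As the abstract states, the method of [11] derives a metric's sensitivity and the performance difference required to guarantee a given significance level directly from such bootstrap hypothesis tests applied across pairs of runs; the sketch above covers only a single pairwise test, so [11] should be consulted for the exact test statistic and the full comparison procedure.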

References

  1. Buckley, C., Voorhees, E.M.: Evaluating Evaluation Measure Stability. In: ACM SIGIR 2000 Proceedings, pp. 33–40 (2000)

  2. Buckley, C., Voorhees, E.M.: Retrieval Evaluation with Incomplete Information. In: ACM SIGIR 2004 Proceedings, pp. 25–32 (2004)

  3. Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. Chapman and Hall, CRC (1993)

  4. Eguchi, K., et al.: Overview of the Web Retrieval Task at the Third NTCIR Workshop. In: National Institute of Informatics Technical Report NII-2003-002E (2003)

  5. Hawking, D., Craswell, N.: The Very Large Collection and Web Tracks. In: TREC: Experiment and Evaluation in Information Retrieval, pp. 199–231. MIT Press, Cambridge (2005)

  6. Kando, N.: Overview of the Fifth NTCIR Workshop. In: NTCIR-5 Proceedings (2005)

  7. Kekäläinen, J.: Binary and Graded Relevance in IR Evaluations – Comparison of the Effects on Ranking of IR Systems. Information Processing and Management 41, 1019–1033 (2005)

  8. Sakai, T.: The Reliability of Metrics based on Graded Relevance. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.-H. (eds.) AIRS 2005. LNCS, vol. 3689, pp. 1–16. Springer, Heidelberg (2005)

  9. Sakai, T.: The Effect of Topic Sampling on Sensitivity Comparisons of Information Retrieval Metrics. In: NTCIR-5 Proceedings, pp. 505–512 (2005)

  10. Sakai, T.: On the Task of Finding One Highly Relevant Document with High Precision. Information Processing Society of Japan Transactions on Databases (TOD) 29 (2006)

  11. Sakai, T.: Evaluating Evaluation Metrics based on the Bootstrap. In: ACM SIGIR 2006 Proceedings (2006) (to appear)

  12. Sakai, T.: Give Me Just One Highly Relevant Document: P-measure. In: ACM SIGIR 2006 Proceedings (2006) (to appear)

  13. Sakai, T.: A Further Note on Evaluation Metrics for the Task of Finding One Highly Relevant Document. Information Processing Society of Japan SIG Technical Reports FI-82, 69–76 (2006)

  14. Sakai, T.: On the Reliability of Information Retrieval Metrics based on Graded Relevance. Information Processing and Management (2006) (to appear)

  15. Sanderson, M., Zobel, J.: Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability. In: ACM SIGIR 2005 Proceedings, pp. 162–169 (2005)

  16. Soboroff, I.: On Evaluating Web Search with Very Few Relevant Documents. In: ACM SIGIR 2004 Proceedings, pp. 530–531 (2004)

  17. Voorhees, E.M., Buckley, C.: The Effect of Topic Set Size on Retrieval Experiment Error. In: ACM SIGIR 2002 Proceedings, pp. 316–323 (2002)

  18. Voorhees, E.M.: Overview of the TREC 2004 Robust Retrieval Track. In: TREC 2004 Proceedings (2005)

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sakai, T. (2006). Bootstrap-Based Comparisons of IR Metrics for Finding One Relevant Document. In: Ng, H.T., Leong, M.-K., Kan, M.-Y., Ji, D. (eds) Information Retrieval Technology. AIRS 2006. Lecture Notes in Computer Science, vol 4182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880592_29

  • DOI: https://doi.org/10.1007/11880592_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-45780-0

  • Online ISBN: 978-3-540-46237-8
