
Rank-biased precision for measurement of retrieval effectiveness

Published: 23 December 2008

Abstract

A range of methods for measuring the effectiveness of information retrieval systems has been proposed. These are typically intended to provide a quantitative single-value summary of a document ranking relative to a query. However, many of these measures have failings. For example, recall is not well founded as a measure of satisfaction, since the user of an actual system cannot judge recall. Average precision is derived from recall, and suffers from the same problem. In addition, average precision lacks key stability properties that are needed for robust experiments. In this article, we introduce a new effectiveness metric, rank-biased precision, that avoids these problems. Rank-biased precision is derived from a simple model of user behavior, is robust if answer rankings are extended to greater depths, and allows accurate quantification of experimental uncertainty, even when only partial relevance judgments are available.
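The abstract describes the metric without giving its form. As a rough illustration, the sketch below assumes the commonly cited formulation of rank-biased precision, RBP = (1 - p) * sum_i r_i * p^(i-1), where p is a user-persistence parameter and r_i is the relevance of the document at rank i. The function name, the example value p = 0.8, and the treatment of unjudged documents as a residual error bound are illustrative assumptions, not details taken from this page.

```python
def rank_biased_precision(relevance, p=0.8):
    """Rank-biased precision for one ranking (illustrative sketch).

    `relevance` lists, from rank 1 down, the judged relevance of each
    document: 1 (relevant), 0 (not relevant), or None (unjudged).
    Returns the score counted over judged documents together with a
    residual: the most that unjudged documents, including everything
    below the evaluated depth, could add to the score.
    """
    score = 0.0
    residual = 0.0
    weight = 1.0 - p               # rank i carries weight (1 - p) * p**(i - 1)
    for rel in relevance:
        if rel is None:
            residual += weight     # an unjudged document might be relevant
        else:
            score += weight * rel
        weight *= p
    residual += p ** len(relevance)  # total weight of all ranks below the judged depth
    return score, residual


# Example: five ranks, rank 4 unjudged; p = 0.8 models a moderately patient user.
rbp, err = rank_biased_precision([1, 0, 1, None, 0], p=0.8)
print(f"RBP = {rbp:.3f}, residual = {err:.3f}")
```

Because the rank weights sum to one, the score always lies between the value returned and that value plus the residual, which is how partial judgments translate into a quantified uncertainty interval.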


Published In

ACM Transactions on Information Systems, Volume 27, Issue 1
December 2008
208 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/1416950
Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 December 2008
Accepted: 01 April 2008
Revised: 01 September 2007
Received: 01 October 2005
Published in TOIS Volume 27, Issue 1

Author Tags

  1. Recall
  2. average precision
  3. pooling
  4. precision
  5. relevance

Qualifiers

  • Research-article
  • Research
  • Refereed


