
Estimating Measurement Uncertainty for Information Retrieval Effectiveness Metrics

Published: 29 September 2018

Abstract

One typical way of building test collections for offline measurement of information retrieval systems is to pool the ranked outputs of different systems down to some chosen depth d and then form relevance judgments for those documents only. Non-pooled documents—ones that did not appear in the top-d sets of any of the contributing systems—are then deemed to be non-relevant for the purposes of evaluating the relative behavior of the systems. In this article, we use RBP-derived residuals to re-examine the reliability of that process. By fitting the RBP parameter ϕ to maximize similarity between AP- and NDCG-induced system rankings, on the one hand, and RBP-induced rankings, on the other, an estimate can be made as to the potential score uncertainty associated with those two recall-based metrics. We then consider the effect that residual size—as an indicator of possible measurement uncertainty in utility-based metrics—has in connection with recall-based metrics by computing the effect of increasing pool sizes and examining the trends that arise in terms of both metric score and system separability using standard statistical tests. The experimental results show that the confidence levels expressed via the p-values generated by statistical tests are only weakly connected to the size of the residual and to the degree of measurement uncertainty caused by the presence of unjudged documents. Statistical confidence estimates are, however, largely consistent as pooling depths are altered. We therefore recommend that all such experimental results should report, in addition to the outcomes of statistical significance tests, the residual measurements generated by a suitably matched weighted-precision metric, to give a clear indication of measurement uncertainty that arises due to the presence of unjudged documents in test collections with finite pooled judgments.
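The residual computation at the heart of this analysis can be illustrated with a short sketch. The following Python fragment is not the authors' code; the function name and the example ranking are illustrative only. It computes a rank-biased precision (RBP) score together with its residual, following the Moffat and Zobel formulation: the base score counts only judged relevant documents, while the residual bounds the score mass that unjudged documents and the unexamined tail of the ranking could still contribute.

    # Minimal sketch (assumed, not the authors' implementation) of RBP with a residual.
    # judgments: list over ranks 1..d, each entry 1 (relevant), 0 (judged non-relevant),
    # or None (unjudged). Returns (base, residual); the true RBP score lies in
    # [base, base + residual].
    def rbp_with_residual(judgments, phi=0.8):
        base, residual = 0.0, 0.0
        weight = 1.0 - phi                # contribution (1 - phi) * phi**(i - 1) at rank i
        for rel in judgments:
            if rel is None:
                residual += weight        # unjudged document: could contribute fully
            else:
                base += weight * rel
            weight *= phi
        residual += phi ** len(judgments) # ranks beyond the evaluated prefix are also unknown
        return base, residual

    # Example: a run judged to depth 5 with one pooled-but-unjudged document at rank 3.
    score, res = rbp_with_residual([1, 0, None, 1, 0], phi=0.8)
    print(f"RBP in [{score:.3f}, {score + res:.3f}]  (residual {res:.3f})")

In the article's experiments, phi is chosen so that the RBP-induced system ordering most closely matches the AP- or NDCG-induced ordering, for example by maximizing a rank correlation coefficient over a grid of candidate phi values; the residual reported at that fitted phi then serves as the indicator of measurement uncertainty for the corresponding recall-based metric.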





        Published In

        Journal of Data and Information Quality, Volume 10, Issue 3
        Special Issue on Reproducibility in IR: Evaluation Campaigns, Collections and Analyses
        September 2018, 94 pages
        ISSN: 1936-1955
        EISSN: 1936-1963
        DOI: 10.1145/3282439

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 29 September 2018
        Accepted: 01 July 2018
        Revised: 01 April 2018
        Received: 01 October 2017
        Published in JDIQ Volume 10, Issue 3


        Author Tags

        1. effectiveness metric
        2. evaluation
        3. information retrieval
        4. statistical test
        5. test collection

        Qualifiers

        • Research-article
        • Research
        • Refereed

        Funding Sources

        • Australian Research Council
        • Google Faculty Research Grant
