How Do You Test a Test?: A Multifaceted Examination of Significance Tests

Published: 15 February 2022
DOI: 10.1145/3488560.3498406

Abstract

We examine three statistical significance tests -- a recently proposed ANOVA model and two baseline tests -- using a suite of measures to determine which is better suited for offline evaluation. We apply our analysis both to the runs of a whole TREC track and to the runs submitted by six participant groups. The former reveals test behavior in the heterogeneous setting of a large-scale offline evaluation initiative; the latter, almost overlooked in past work (to the best of our knowledge), reveals what happens in the much more restricted case of variants of a single system, i.e., the typical context in which companies and research groups operate. We find the ANOVA test strikingly consistent in large-scale settings, but worryingly inconsistent in some participant experiments. Of greater concern, the participant-only experiments show that one of our baseline tests (a test widely used in research) can produce a substantial number of inconsistent results. We discuss the implications of this inconsistency for possible publication bias.
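
The comparison described in the abstract can be made concrete: take a topic-by-system matrix of effectiveness scores, ask two different significance tests about every pair of systems, and check whether their verdicts agree. The following Python sketch illustrates that idea on synthetic scores; it is not the paper's ANOVA model or its suite of measures, and the run names and score distributions are invented. The paired t-test and the Wilcoxon signed-rank test, both classical baselines in IR evaluation, stand in for the tests being compared.

```python
import itertools

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-topic effectiveness scores (e.g. AP) for three
# hypothetical system variants over 50 topics; rows are topics,
# columns are systems. All names and numbers are illustrative.
systems = ["run_a", "run_b", "run_c"]
n_topics = 50
topic_effect = rng.normal(0.30, 0.10, (n_topics, 1))  # shared topic difficulty
system_effect = np.array([[0.00, 0.02, 0.05]])        # small system differences
noise = rng.normal(0.0, 0.05, (n_topics, len(systems)))
scores = np.clip(topic_effect + system_effect + noise, 0.0, 1.0)

alpha = 0.05
for i, j in itertools.combinations(range(len(systems)), 2):
    a, b = scores[:, i], scores[:, j]
    _, p_t = stats.ttest_rel(a, b)  # paired two-sided t-test
    _, p_w = stats.wilcoxon(a, b)   # Wilcoxon signed-rank test
    verdict = "agree" if (p_t < alpha) == (p_w < alpha) else "DISAGREE"
    print(f"{systems[i]} vs {systems[j]}: t-test p={p_t:.3f}, "
          f"Wilcoxon p={p_w:.3f} -> tests {verdict} at alpha={alpha}")
```

On real data, `scores` would be the topic-by-run matrix of a TREC track or of one group's system variants; the paper's concern is precisely that such pairwise verdicts can differ across tests, and how often they do.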

Supplementary Material

MP4 File (3488560.3498406 - wsdmfp238.mp4)
A presentation by Mark Sanderson for the WSDM 2022 paper "How do you Test a Test? A Multifaceted Examination of Significance Tests"

Cited By

  • (2024) How much freedom does an effectiveness metric really have? Journal of the Association for Information Science and Technology, 75(6), 686-703. DOI: 10.1002/asi.24874
  • (2023) Report on the Dagstuhl Seminar on Frontiers of Information Access Experimentation for Research and Education. ACM SIGIR Forum, 57(1), 1-28. DOI: 10.1145/3636341.3636351
  • (2023) How Many Crowd Workers Do I Need? On Statistical Power when Crowdsourcing Relevance Judgments. ACM Transactions on Information Systems, 42(1), 1-26. DOI: 10.1145/3597201
  • (2023) How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 1960-1970. DOI: 10.1145/3583780.3614916

    Published In

    WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining
    February 2022
    1690 pages
    ISBN: 9781450391320
    DOI: 10.1145/3488560

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. anova
    2. comparing tests
    3. prediction
    4. statistical significance testing

    Qualifiers

    • Research-article

    Conference

    WSDM '22

    Acceptance Rates

    Overall acceptance rate: 498 of 2,863 submissions (17%)

    Article Metrics

    • Downloads (last 12 months): 21
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 20 Jan 2025
