How Do You Test a Test?: A Multifaceted Examination of Significance Tests

Published: 15 February 2022
DOI: 10.1145/3488560.3498406

Abstract

We examine three statistical significance tests -- a recently proposed ANOVA model and two baseline tests -- using a suite of measures to determine which is better suited for offline evaluation. We apply our analysis both to the runs of a whole TREC track and to the runs submitted by six participant groups. The former reveals test behavior in the heterogeneous setting of a large-scale offline evaluation initiative; the latter, almost overlooked in past work (to the best of our knowledge), reveals what happens in the much more restricted case of variants of a single system, i.e., the typical context in which companies and research groups operate. We find the ANOVA test strikingly consistent in large-scale settings, but worryingly inconsistent in some participant experiments. Of greater concern, the participant-only experiments show that one of our baseline tests (a test widely used in research) can produce a substantial number of inconsistent results. We discuss the implications of this inconsistency for possible publication bias.
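
The comparison described in the abstract can be made concrete: take a topic-by-system matrix of effectiveness scores, ask two different significance tests about every pair of systems, and check whether their verdicts agree. The following Python sketch illustrates that idea on synthetic scores; it is not the paper's ANOVA model or its suite of measures, and the run names and score distributions are invented. The paired t-test and the Wilcoxon signed-rank test, both classical baselines in IR evaluation, stand in for the tests being compared.

```python
import itertools

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-topic effectiveness scores (e.g. AP) for three
# hypothetical system variants over 50 topics; rows are topics,
# columns are systems. All names and numbers are illustrative.
systems = ["run_a", "run_b", "run_c"]
n_topics = 50
topic_effect = rng.normal(0.30, 0.10, (n_topics, 1))  # shared topic difficulty
system_effect = np.array([[0.00, 0.02, 0.05]])        # small system differences
noise = rng.normal(0.0, 0.05, (n_topics, len(systems)))
scores = np.clip(topic_effect + system_effect + noise, 0.0, 1.0)

alpha = 0.05
for i, j in itertools.combinations(range(len(systems)), 2):
    a, b = scores[:, i], scores[:, j]
    _, p_t = stats.ttest_rel(a, b)  # paired two-sided t-test
    _, p_w = stats.wilcoxon(a, b)   # Wilcoxon signed-rank test
    verdict = "agree" if (p_t < alpha) == (p_w < alpha) else "DISAGREE"
    print(f"{systems[i]} vs {systems[j]}: t-test p={p_t:.3f}, "
          f"Wilcoxon p={p_w:.3f} -> tests {verdict} at alpha={alpha}")
```

On real data, `scores` would be the topic-by-run matrix of a TREC track or of one group's system variants; the paper's concern is precisely that such pairwise verdicts can differ across tests, and how often they do.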

Supplementary Material

MP4 File (3488560.3498406 - wsdmfp238.mp4)
A presentation by Mark Sanderson for the WSDM 2022 paper "How do you Test a Test? A Multifaceted Examination of Significance Tests"

Cited By

  • (2024) How much freedom does an effectiveness metric really have? Journal of the Association for Information Science and Technology, 75(6), 686-703. DOI: 10.1002/asi.24874
  • (2023) Report on the Dagstuhl Seminar on Frontiers of Information Access Experimentation for Research and Education. ACM SIGIR Forum, 57(1), 1-28. DOI: 10.1145/3636341.3636351
  • (2023) How Many Crowd Workers Do I Need? On Statistical Power when Crowdsourcing Relevance Judgments. ACM Transactions on Information Systems, 42(1), 1-26. DOI: 10.1145/3597201
  • (2023) How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 1960-1970. DOI: 10.1145/3583780.3614916

    Published In

    WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining
    February 2022
    1690 pages
    ISBN: 9781450391320
    DOI: 10.1145/3488560

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. anova
    2. comparing tests
    3. prediction
    4. statistical significance testing

    Qualifiers

    • Research-article

    Conference

    WSDM '22

    Acceptance Rates

    Overall acceptance rate: 498 of 2,863 submissions (17%)

    Article Metrics

    • Downloads (last 12 months): 21
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 20 Jan 2025
