ABSTRACT
We evaluate statistical inference procedures for small-scale IR experiments that involve multiple comparisons against a baseline. These procedures adjust for multiple comparisons by ensuring that the probability of observing at least one false positive in the experiment stays below a given threshold. We use only publicly available test collections and make our software available for download. In particular, we employ TREC runs and runs constructed from the Microsoft learning-to-rank (MSLR) data set. Our focus is on non-parametric statistical procedures: the Holm-Bonferroni adjustment of permutation-test p-values, the MaxT permutation test, and permutation-based closed testing. In TREC-based simulations, these procedures retain 66% to 92% of individually significant results (i.e., those obtained without taking other comparisons into account). Similar retention rates are observed in the MSLR simulations. For the largest evaluated query set size (6,400 queries), procedures that adjust for multiplicity find at most 5% fewer true differences than unadjusted tests, while unadjusted tests produce many more false positives.
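As a concrete illustration of the simplest procedure named above, the following sketch (not the paper's actual implementation; function names and the sign-flipping variant of the paired permutation test are our assumptions) computes a two-sided permutation p-value for each system-versus-baseline comparison and then applies the Holm-Bonferroni step-down adjustment, which controls the familywise error rate under arbitrary dependence between the tests:

```python
import numpy as np

def permutation_pvalue(baseline, system, n_perm=10000, rng=None):
    """Two-sided paired permutation (sign-flip) test on per-query
    score differences between a system and the baseline."""
    rng = np.random.default_rng(rng)
    d = np.asarray(system, dtype=float) - np.asarray(baseline, dtype=float)
    observed = abs(d.mean())
    # Randomly flip the sign of each per-query difference n_perm times.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    perm_means = np.abs((signs * d).mean(axis=1))
    # Count the observed statistic itself to avoid zero p-values.
    return (1 + np.sum(perm_means >= observed)) / (n_perm + 1)

def holm_adjust(pvalues):
    """Holm-Bonferroni step-down adjustment of m p-values."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # k-th smallest p-value is multiplied by (m - k); enforce
        # monotonicity so adjusted p-values never decrease.
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted
```

For example, `holm_adjust([0.01, 0.02, 0.04])` yields `[0.03, 0.04, 0.04]`: the smallest p-value is tripled, and later ones are capped from below by earlier adjusted values. A comparison is declared significant when its adjusted p-value falls below the chosen threshold.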
- Anonymous. Guidance for Industry - E9 Statistical Principles for Clinical Trials. Technical report, U.S. Department of Health and Human Services - Food and Drug Administration, Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research, ICH, 1998.
- R. Bender and S. Lange. Adjusting for multiple testing--when and how? Journal of Clinical Epidemiology, 54(4):343--349, 2001.
- Y. Benjamini and Y. Hochberg. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289--300, 1995.
- R. Blanco and H. Zaragoza. Beware of relatively large but meaningless improvements. Technical report YL-2011-001, Yahoo! Research, 2011.
- C. Buckley, D. Dimmick, I. Soboroff, and E. Voorhees. Bias and the limits of pooling for large collections. Information Retrieval, 10:491--508, 2007.
- R. J. Cabin and R. J. Mitchell. Bonferroni or not Bonferroni: when and how are the questions. Bulletin of the Ecological Society of America, 81(3):246--248, 2000.
- B. A. Carterette. Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM Trans. Inf. Syst., 30(1):4:1--4:34, 2012.
- O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM conference on Information and knowledge management, CIKM '09, pages 621--630, New York, NY, USA, 2009. ACM.
- C. L. A. Clarke, N. Craswell, I. Soboroff, and G. V. Cormack. Overview of TREC 2010 Web track. In TREC-19: Proceedings of the Nineteenth Text REtrieval Conference, 2010.
- G. V. Cormack and T. R. Lynam. Validity and power of t-test for comparing MAP and GMAP. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '07, pages 753--754, New York, NY, USA, 2007. ACM.
- S. Dudoit, J. P. Shaffer, and J. C. Boldrick. Multiple hypothesis testing in microarray experiments. Statistical Science, 18(1):71--103, 2003.
- B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability. Chapman & Hall, 1993.
- S. Holm. A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics, 6:65--70, 1979.
- Y. Huang, H. Xu, V. Calian, and J. C. Hsu. To permute or not to permute. Bioinformatics, 22(18):2244--2248, 2006.
- E. L. Lehmann and J. P. Romano. Generalizations of the familywise error rate. Annals of Statistics, 33(3):1138--1154, 2005.
- R. Marcus, E. Peritz, and K. R. Gabriel. On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63(3):655--660, 1976.
- M. Matsumoto and T. Nishimura. Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans. Model. Comput. Simul., 8(1):3--30, 1998.
- E. Pitman. Significance tests which may be applied to samples from any population. Royal Statistical Society, Supplement, 4:119--130, 1937.
- J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '98, pages 275--281, New York, NY, USA, 1998. ACM.
- S. Robertson. Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation, 60:503--520, 2004.
- Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507--2517, 2007.
- M. Sanderson and J. Zobel. Information retrieval system evaluation: effort, sensitivity, and reliability. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '05, pages 162--169, New York, NY, USA, 2005. ACM.
- J. Savoy. Statistical inference in retrieval effectiveness evaluation. Information Processing & Management, 33(4):495--512, 1997.
- H. Scheffé. A method for judging all contrasts in the analysis of variance. Biometrika, 40(1--2):87--110, 1953.
- F. Scholer, A. Turpin, and M. Sanderson. Quantifying test collection quality based on the consistency of relevance judgements. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, SIGIR '11, pages 1063--1072, New York, NY, USA, 2011. ACM.
- J. P. Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46(1):561--584, 1995.
- M. D. Smucker, J. Allan, and B. Carterette. A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, CIKM '07, pages 623--632, New York, NY, USA, 2007. ACM.
- J. Sunklodas. Approximation of distributions of sums of weakly dependent random variables by the normal distribution. In Y. Prokhorov and V. Statulevičius, editors, Limit Theorems of Probability Theory, pages 113--165. Springer Berlin Heidelberg, 2000.
- J. Tague-Sutcliffe and J. Blustein. A statistical analysis of TREC-3 data. In Overview of the Third Text REtrieval Conference (TREC-3), pages 385--398, 1994.
- J. Urbano, J. S. Downie, B. McFee, and M. Schedl. How significant is statistically significant? The case of audio music similarity and retrieval. In Proceedings of the 13th International Society for Music Information Retrieval Conference, pages 181--186, Porto, Portugal, 2012.
- W. Webber, A. Moffat, and J. Zobel. Statistical power in retrieval experimentation. In Proceedings of the 17th ACM conference on Information and knowledge management, CIKM '08, pages 571--580, New York, NY, USA, 2008. ACM.
- P. H. Westfall and J. F. Troendle. Multiple testing with minimal assumptions. Biometrical Journal, 50(5):745--755, 2008.
- P. H. Westfall and S. S. Young. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. Wiley-Interscience, 1 edition, 1993.
- W. J. Wilbur. Non-parametric significance tests of retrieval performance comparisons. J. Inf. Sci., 20:270--284, 1994.
- H. Xu and J. C. Hsu. Applying the generalized partitioning principle to control the generalized familywise error rate. Biometrical Journal, 49(1):52--67, 2007.
- C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '01, pages 334--342, New York, NY, USA, 2001. ACM.
- J. Zhou, D. P. Foster, R. A. Stine, and L. H. Ungar. Streamwise feature selection. Journal of Machine Learning Research, 7:1861--1885, 2006.
- J. Zobel, W. Webber, M. Sanderson, and A. Moffat. Principles for robust evaluation infrastructure. In Proceedings of the 2011 workshop on Data infrastructures for supporting information retrieval evaluation, DESIRE '11, pages 3--6, New York, NY, USA, 2011. ACM.
Deciding on an adjustment for multiplicity in IR experiments