research-article
DOI: 10.1145/2484028.2484034

Deciding on an adjustment for multiplicity in IR experiments

Published: 28 July 2013

ABSTRACT

We evaluate statistical inference procedures for small-scale IR experiments that involve multiple comparisons against the baseline. These procedures adjust for multiple comparisons by ensuring that the probability of observing at least one false positive in the experiment is below a given threshold. We use only publicly available test collections and make our software available for download. In particular, we employ the TREC runs and runs constructed from the Microsoft learning-to-rank (MSLR) data set. Our focus is on non-parametric statistical procedures that include the Holm-Bonferroni adjustment of the permutation test p-values, the MaxT permutation test, and the permutation-based closed testing. In TREC-based simulations, these procedures retain from 66% to 92% of individually significant results (i.e., those obtained without taking other comparisons into account). Similar retention rates are observed in the MSLR simulations. For the largest evaluated query set size (i.e., 6400), procedures that adjust for multiplicity find at most 5% fewer true differences compared to unadjusted tests. At the same time, unadjusted tests produce many more false positives.
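
Below is a minimal sketch, assuming only numpy, of the kind of procedures discussed above: a paired (sign-flipping) permutation test for per-query score differences against a baseline, the Holm-Bonferroni step-down adjustment of the resulting p-values, and a maxT-style permutation adjustment in the spirit of Westfall and Young. This is not the authors' released software; all scores, query-set sizes, and the alpha level below are hypothetical.

import numpy as np

def paired_permutation_pvalue(baseline, system, n_perm=10000, rng=None):
    # Two-sided paired permutation (sign-flipping) test on per-query differences.
    rng = np.random.default_rng(rng)
    diffs = np.asarray(system) - np.asarray(baseline)
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    perm_means = np.abs((signs * diffs).mean(axis=1))
    return (np.sum(perm_means >= observed) + 1) / (n_perm + 1)

def holm_bonferroni(pvalues, alpha=0.05):
    # Holm's step-down procedure; returns a boolean rejection mask (FWER <= alpha).
    p = np.asarray(pvalues)
    m = p.size
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one hypothesis is retained, all remaining ones are retained too
    return reject

def maxt_adjusted_pvalues(baseline, systems, n_perm=10000, rng=None):
    # maxT-style adjustment: compare each observed |mean difference| to the
    # permutation distribution of the maximum statistic over all comparisons,
    # applying the same sign flips to every system so correlations are preserved.
    rng = np.random.default_rng(rng)
    diffs = np.asarray(systems) - np.asarray(baseline)   # shape: (n_systems, n_queries)
    observed = np.abs(diffs.mean(axis=1))
    max_null = np.empty(n_perm)
    for b in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diffs.shape[1])
        max_null[b] = np.abs((diffs * signs).mean(axis=1)).max()
    return (np.array([(max_null >= t).sum() for t in observed]) + 1) / (n_perm + 1)

# Hypothetical example: three systems compared against one baseline over 50 queries.
rng = np.random.default_rng(0)
baseline = rng.uniform(0.2, 0.6, size=50)
systems = [baseline + rng.normal(0.02, 0.05, size=50) for _ in range(3)]
pvals = [paired_permutation_pvalue(baseline, s, rng=rng) for s in systems]
print("raw permutation p-values:    ", np.round(pvals, 4))
print("rejected after Holm (a=0.05):", holm_bonferroni(pvals))
print("maxT-adjusted p-values:      ", np.round(maxt_adjusted_pvalues(baseline, systems, rng=rng), 4))

The permutation-based closed testing procedure also evaluated in the paper is more involved (it tests every intersection of the elementary hypotheses) and is not sketched here.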

Published in

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
July 2013, 1188 pages
ISBN: 9781450320344
DOI: 10.1145/2484028
Copyright © 2013 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Acceptance Rates

SIGIR '13 paper acceptance rate: 73 of 366 submissions (20%). Overall SIGIR acceptance rate: 792 of 3,983 submissions (20%).