Should significance testing be abandoned in machine learning?

Regular Paper, International Journal of Data Science and Analytics

Abstract

Significance testing has become a mainstay in machine learning, with the p value being firmly embedded in the current research practice. Significance tests are widely believed to lend scientific rigor to the interpretation of empirical findings; however, their problems have received only scant attention in the machine learning literature so far. Here, we investigate one particular problem, the Jeffreys–Lindley paradox. This paradox describes a statistical conundrum: the p value can be close to zero, convincing us that there is overwhelming evidence against the null hypothesis. At the same time, however, the posterior probability of the null hypothesis being true can be close to 1, convincing us of the exact opposite. In experiments with synthetic data sets and a subsequent thought experiment, we demonstrate that this paradox can have severe repercussions for the comparison of multiple classifiers over multiple benchmark data sets. Our main result suggests that significance tests should not be used in such comparative studies. We caution that the reliance on significance tests might lead to a situation that is similar to the reproducibility crisis in other fields of science. We offer for debate four avenues that might alleviate the looming crisis.
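
The numerical core of the paradox can be reproduced in a few lines. The following R snippet is a minimal sketch under a standard textbook setup, not the paper's own experimental configuration: a two-sided z-test of the point null H0: theta = 0 with known sigma, a N(0, tau^2) prior on theta under H1, and prior probability 1/2 on each hypothesis; the function name and parameter values are illustrative. Holding the observed z-score fixed while the sample size n grows, the p value stays constant and nominally significant, while the posterior probability of H0 approaches 1.

```r
# Minimal numerical sketch of the Jeffreys-Lindley paradox (illustrative setup,
# not the paper's experiments): two-sided z-test of H0: theta = 0 with known
# sigma, a N(0, tau^2) prior on theta under H1, and prior P(H0) = 1/2.
jeffreys_lindley <- function(n, z = 2.5, sigma = 1, tau = 1) {
  xbar <- z * sigma / sqrt(n)         # sample mean chosen so the z-score stays fixed
  p    <- 2 * pnorm(-abs(z))          # two-sided p value (does not depend on n)
  se2  <- sigma^2 / n
  m0   <- dnorm(xbar, mean = 0, sd = sqrt(se2))          # likelihood under H0
  m1   <- dnorm(xbar, mean = 0, sd = sqrt(tau^2 + se2))  # marginal likelihood under H1
  c(n = n, p.value = p, posterior.H0 = m0 / (m0 + m1))   # posterior P(H0 | data), equal prior odds
}

# p stays at about 0.012 for every n, yet P(H0 | data) climbs toward 1.
t(sapply(c(1e2, 1e4, 1e6, 1e8), jeffreys_lindley))
```

The effect arises because the marginal likelihood under H1 spreads its mass over the fixed prior scale tau, so the ever-smaller observed effect (xbar = z * sigma / sqrt(n)) is eventually explained better by the point null than by the diffuse alternative.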


References

1. Baker, M.: Is there a reproducibility crisis? Nature 533, 452–454 (2016)
2. Bartlett, M.: A comment on D.V. Lindley's statistical paradox. Biometrika 44, 533–534 (1957)
3. Bayarri, M., Berger, J.: \(P\) values for composite null models. J. Am. Stat. Assoc. 95(452), 1127–1142 (2000)
4. Begley, C., Ioannidis, J.: Reproducibility in science: improving the standard for basic and preclinical research. Circ. Res. 116(1), 116–126 (2015)
5. Benavoli, A., Corani, G., Demšar, J., Zaffalon, M.: Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. J. Mach. Learn. Res. 18(77), 1–36 (2017)
6. Benavoli, A., Corani, G., Mangili, F.: Should we really use post-hoc tests based on mean-ranks? J. Mach. Learn. Res. 17(5), 1–10 (2016)
7. Berger, J., Berry, D.: Statistical analysis and the illusion of objectivity. Am. Sci. 76, 159–165 (1988)
8. Berger, J., Delampady, M.: Testing precise hypotheses. Stat. Sci. 2(3), 317–352 (1987)
9. Berrar, D.: Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers. Mach. Learn. 106(6), 911–949 (2017)
10. Berrar, D., Dubitzky, W.: Jeffreys–Lindley Paradox in Machine Learning (2017). http://doi.org/10.17605/OSF.IO/SNXWJ. Accessed 23 July 2018
11. Berrar, D., Dubitzky, W.: On the Jeffreys–Lindley paradox and the looming reproducibility crisis in machine learning. In: Proceedings of the 2017 IEEE International Conference on Data Science and Advanced Analytics, pp. 334–340 (2017)
12. Berrar, D., Lopes, P., Dubitzky, W.: Caveats and pitfalls in crowdsourcing research: the case of soccer referee bias. Int. J. Data Sci. Anal. 4(2), 143–151 (2017)
13. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
14. Cohen, J.: The earth is round (\(p < .05\)). Am. Psychol. 49(12), 997–1003 (1994)
15. Cousins, R.D.: The Jeffreys–Lindley paradox and discovery criteria in high energy physics. Synthese 194(2), 395–432 (2017)
16. Cox, D., Hinkley, D.: Theoretical Statistics. Chapman and Hall/CRC, London (1974)
17. Cumming, G.: Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge, New York (2012)
18. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
19. Fisher, R.: Statistical methods and scientific induction. J. R. Stat. Soc. Ser. B 17(1), 69–78 (1955)
20. Foster, E., Deardorff, A.: Open Science Framework (OSF). J. Med. Libr. Assoc. 105(2), 203–206 (2017). https://doi.org/10.5195/jmla.2017.88. Accessed 23 July 2018
21. Gelman, A., Loken, E.: The garden of forking paths: why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time (2013). http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf. Accessed 23 July 2018
22. Gigerenzer, G.: Mindless statistics. J. Socio-Econ. 33, 587–606 (2004)
23. Goodman, S.: Toward evidence-based medical statistics. 1: the \(P\) value fallacy. Ann. Intern. Med. 130(12), 995–1004 (1999)
24. Goodman, S.: A dirty dozen: twelve \(P\)-value misconceptions. Semin. Hematol. 45(3), 135–140 (2008)
25. Goodman, S., Royall, R.: Evidence and scientific research. Am. J. Public Health 78(12), 1568–1574 (1988)
26. Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.B., Poole, C., Goodman, S.N., Altman, D.G.: Statistical tests, \(p\) values, confidence intervals, and power: a guide to misinterpretations. Eur. J. Epidemiol. 31(4), 337–350 (2016)
27. Hays, W.: Statistics for the Social Sciences. Holt, Rinehart & Winston, New York (1973)
28. Hubbard, R.: Alphabet soup: blurring the distinctions between \(p\)'s and \(\alpha\)'s in psychological research. Theory Psychol. 14(3), 295–327 (2004)
29. Hubbard, R., Armstrong, J.: Why we don't really know what "statistical significance" means: a major educational failure. J. Mark. Educ. 28(2), 114–120 (2006)
30. Hubbard, R., Lindsay, R.: Why \(p\) values are not a useful measure of evidence in statistical significance testing. Theory Psychol. 18(1), 69–88 (2008)
31. Ioannidis, J.: Why most published research findings are false. PLoS Med. 2(8), e124 (2005)
32. Jeffreys, H.: Theory of Probability, 3rd edn. Clarendon Press, Oxford (1961). (Reprinted 2003)
33. Leek, J., McShane, B., Gelman, A., Colquhoun, D., Nuijten, M., Goodman, S.: Five ways to fix statistics. Nature 551, 557–559 (2017)
34. Levin, J.: What if there were no more bickering about statistical significance tests? Res. Sch. 5(2), 43–53 (1998)
35. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002). http://CRAN.R-project.org/doc/Rnews/. Accessed 23 July 2018
36. Lindley, D.: A statistical paradox. Biometrika 44, 187–192 (1957)
37. Lu, M., Ishwaran, H.: A prediction-based alternative to \(P\) values in regression models. J. Thorac. Cardiovasc. Surg. 155(3), 1130–1136.e4 (2018)
38. Matthews, R., Wasserstein, R., Spiegelhalter, D.: The ASA's \(p\)-value statement, one year on. Significance 14(2), 38–41 (2017)
39. McShane, B.B., Gal, D., Gelman, A., Robert, C., Tackett, J.L.: Abandon statistical significance. arXiv preprint arXiv:1709.07588 (2017)
40. Nuzzo, R.: Statistical errors. Nature 506, 150–152 (2014)
41. Poole, C.: Beyond the confidence interval. Am. J. Public Health 77(2), 195–199 (1987)
42. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2017). https://www.R-project.org/. Accessed 23 July 2018
43. Rosenthal, R.: The file drawer problem and tolerance for null results. Psychol. Bull. 86(3), 638–641 (1979)
44. Rothman, K.: Writing for epidemiology. Epidemiology 9(3), 333–337 (1998)
45. Rothman, K., Greenland, S., Lash, T.: Modern Epidemiology, 3rd edn. Wolters Kluwer, Alphen aan den Rijn (2008)
46. Savalei, V., Dunn, E.: Is the call to abandon \(p\)-values the red herring of the replicability crisis? Front. Psychol. 6, Article 245, 1–4 (2015)
47. Schervish, M.: \(P\) values: what they are and what they are not. Am. Stat. 50(3), 203–206 (1996)
48. Schmidt, F.: Statistical significance testing and cumulative knowledge in psychology: implications for training of researchers. Psychol. Methods 1(2), 115–129 (1996)
49. Schmidt, F., Hunter, J.: Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In: Harlow, L., Mulaik, S., Steiger, J. (eds.) What If There Were No Significance Tests?, pp. 37–64. Psychology Press, Hove (1997)
50. Sellke, T., Bayarri, M., Berger, J.: Calibration of \(p\) values for testing precise null hypotheses. Am. Stat. 55(1), 62–71 (2001)
51. Senn, S.: Two cheers for \(p\)-values? J. Epidemiol. Biostat. 6, 193–204 (2001)
52. Simmons, J., Nelson, L., Simonsohn, U.: False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22(11), 1359–1366 (2011)
53. Trafimow, D., Marks, M.: Editorial. Basic Appl. Soc. Psychol. 37, 1–2 (2015)
54. Wasserstein, R., Lazar, N.: The ASA's statement on \(p\)-values: context, process, and purpose (editorial). Am. Stat. 70(2), 129–133 (2016)
55. Webb, G.I., Boughton, J.R., Zheng, F., Ting, K.M., Salem, H.: Learning by extrapolation from marginal to full-multivariate probability distributions: decreasingly naive Bayesian classification. Mach. Learn. 86(2), 233–272 (2012)


Author information

Corresponding author

Correspondence to Daniel Berrar.

Additional information

This paper is an extended version of the DSAA2017 Research Track paper titled “On the Jeffreys–Lindley paradox and the looming reproducibility crisis in machine learning” [11].


About this article

Cite this article

Berrar, D., Dubitzky, W.: Should significance testing be abandoned in machine learning? Int. J. Data Sci. Anal. 7, 247–257 (2019). https://doi.org/10.1007/s41060-018-0148-4

