Time to dispense with the p-value in OR?

Hofmann, Marko; Meyer-Nieberg, Silja

doi:10.1007/s10100-017-0484-9

Rationale and implications of the statement of the American Statistical Association (ASA) on p-values

Original Paper
Published: 28 July 2017

Volume 26, pages 193–214 (2018)

Central European Journal of Operations Research

Abstract

Null hypothesis significance testing is the standard procedure of statistical decision making, and p-values are the most widespread decision criteria of inferential statistics, both in science in general and in operations research in particular. p-values are of paramount importance in the life and human sciences, and they dominate statistical summaries in the natural and technical sciences as well as in operations research, a domain in which the p-value seems to be a common denominator for decision making based on samples. Yet the use of significance testing in the analysis of research data has been criticized by numerous statisticians, continuously, for almost 100 years. This criticism was recently (March 7, 2016) given official status by a statement from the American Statistical Association on p-values. Is it time to dispense with the p-value in OR? The answer depends on many factors, including the research objective, the research domain, and, especially, the amount of information provided in addition to the p-value. Despite this dependence on context, three conclusions can be drawn that should concern the operational analyst. First, p-values can perfectly well cast doubt on a null hypothesis or its underlying assumptions, but they are only a first step of analysis, which, on its own, lacks expressive power. Second, the statistical layman almost inescapably misinterprets the evidentiary value of p-values. Third and foremost, p-values are an inadequate choice for a succinct executive summary of statistical evidence for or against a research question. In statistical summaries, confidence intervals of standardized effect sizes provide much more information than p-values without requiring much more space.
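
The closing claim of the abstract can be illustrated with a small sketch. It is not the authors' method: the function name `summarize_two_groups`, the example numbers, and the choice of Cohen's d with a large-sample normal approximation for both the interval and the p-value are assumptions made here for illustration only. The point is that, with large samples, a negligible standardized effect can still yield a very small p-value, whereas the effect size with its confidence interval shows the magnitude directly.

```python
import math
from statistics import NormalDist


def summarize_two_groups(mean1, sd1, n1, mean2, sd2, n2, level=0.95):
    """Return (d, ci_low, ci_high, p) for two independent groups.

    Hypothetical helper: uses large-sample normal approximations,
    so it is only a rough sketch for small samples.
    """
    # Pooled standard deviation (assumes roughly equal variances).
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_var)

    # Standardized effect size (Cohen's d).
    d = (mean1 - mean2) / pooled_sd

    # Large-sample approximation to the standard error of d.
    se_d = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    ci_low, ci_high = d - z_crit * se_d, d + z_crit * se_d

    # Two-sided p-value for "no difference", normal approximation.
    z = (mean1 - mean2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return d, ci_low, ci_high, p


# Invented example: a half-point difference on a scale with SD 10,
# measured on 20,000 observations per group.
d, lo, hi, p = summarize_two_groups(100.5, 10.0, 20000, 100.0, 10.0, 20000)
print(f"d = {d:.3f}, {95}% CI [{lo:.3f}, {hi:.3f}], p = {p:.2g}")
```

In this invented example the p-value is far below 0.001, yet d is about 0.05 with an interval of roughly 0.03 to 0.07, well under the value of 0.2 that Cohen's convention labels a small effect. The interval takes one line to report and conveys both the size and the precision of the effect, which the p-value alone does not.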

Author information

Authors and Affiliations

Fakultät für Informatik, Universität der Bundeswehr München, Werner-Heisenbergweg 39, 85577, Neubiberg, Germany
Marko Hofmann & Silja Meyer-Nieberg

Corresponding author

Correspondence to Marko Hofmann.

About this article

Cite this article

Hofmann, M., Meyer-Nieberg, S. Time to dispense with the p-value in OR?. Cent Eur J Oper Res 26, 193–214 (2018). https://doi.org/10.1007/s10100-017-0484-9

Issue Date: March 2018
