Abstract
Null hypothesis significance testing is the standard procedure of statistical decision making, and p-values are the most widespread decision criteria of inferential statistics, both in science in general and in operations research (OR) in particular. p-values are of paramount importance in the life and human sciences, and they dominate statistical summaries in the natural and technical sciences as well as in operations research, a domain in which the p-value seems to be a common denominator for decision making based on samples. Yet the use of significance testing in the analysis of research data has been criticized by numerous statisticians continuously for almost 100 years. This criticism was recently (March 7, 2016) given official status by a statement from the American Statistical Association on p-values. Is it time to dispense with the p-value in OR? The answer depends on many factors, including the research objective, the research domain, and, especially, the amount of information provided in addition to the p-value. Despite this dependence on context, three conclusions can be drawn that should concern the operational analyst. First, p-values can perfectly well cast doubt on a null hypothesis or its underlying assumptions, but they are only a first step of analysis, which, on its own, lacks expressive power. Second, the statistical layman almost inescapably misinterprets the evidentiary value of p-values. Third and foremost, p-values are an inadequate choice for a succinct executive summary of statistical evidence for or against a research question. In statistical summaries, confidence intervals of standardized effect sizes provide much more information than p-values without requiring much more space.
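As an illustration of the closing recommendation, the following minimal sketch computes a standardized effect size (Cohen's d) for two independent samples together with an approximate confidence interval. It is not taken from the article: the function name `cohens_d_ci`, the sample data, and the choice of the large-sample normal approximation for the standard error of d are all assumptions made for this example. An exact interval would rest on the noncentral t distribution instead.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d_ci(x, y, level=0.95):
    """Return (d, lower, upper): Cohen's d with an approximate confidence interval.

    Hypothetical helper for illustration; uses a large-sample normal
    approximation, not an exact noncentral-t interval.
    """
    n1, n2 = len(x), len(y)
    # Pooled standard deviation from the two unbiased sample variances.
    pooled_sd = sqrt(((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2))
    d = (mean(x) - mean(y)) / pooled_sd
    # Approximate standard error of d for two independent groups.
    se = sqrt((n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2)))
    # Two-sided normal quantile for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d, d - z * se, d + z * se


# Made-up measurements, for illustration only.
a = [12.1, 11.4, 13.0, 12.7, 11.9, 12.5, 13.2, 12.0]
b = [11.2, 10.9, 12.1, 11.5, 11.0, 11.8, 12.3, 11.1]

d, lo, hi = cohens_d_ci(a, b)
print(f"d = {d:.2f}, {95}% CI [{lo:.2f}, {hi:.2f}]")
```

Reported this way, a single line conveys both the magnitude of the effect and the precision of its estimate, whereas a p-value alone conveys neither.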
Hofmann, M., Meyer-Nieberg, S. Time to dispense with the p-value in OR?. Cent Eur J Oper Res 26, 193–214 (2018). https://doi.org/10.1007/s10100-017-0484-9
DOI: https://doi.org/10.1007/s10100-017-0484-9