Time to dispense with the p-value in OR?

Rationale and implications of the statement of the American Statistical Association (ASA) on p-values

  • Original Paper
Central European Journal of Operations Research

Abstract

Null hypothesis significance testing is the standard procedure of statistical decision making, and p-values are the most widespread decision criteria of inferential statistics, both in science in general and in operations research in particular. p-values are of paramount importance in the life and human sciences, and they dominate statistical summaries in the natural and technical sciences as well as in operations research, a domain in which the p-value seems to be a common denominator for decision making based on samples. Yet the use of significance testing in the analysis of research data has been criticized by numerous statisticians, continuously, for almost 100 years. This criticism was recently (March 7, 2016) given official status by a statement from the American Statistical Association on p-values. Is it time to dispense with the p-value in OR? The answer depends on many factors, including the research objective, the research domain, and, especially, the amount of information provided in addition to the p-value. Despite this dependence on context, three conclusions can be drawn that should concern the operational analyst. First, p-values can perfectly well cast doubt on a null hypothesis or its underlying assumptions, but they are only a first step of analysis, which, on its own, lacks expressive power. Second, the statistical layman almost inescapably misinterprets the evidentiary value of p-values. Third and foremost, p-values are an inadequate choice for a succinct executive summary of statistical evidence for or against a research question. In statistical summaries, confidence intervals of standardized effect sizes provide much more information than p-values without requiring much more space.
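To make the closing recommendation concrete, the following is a minimal sketch of a confidence interval for a standardized effect size. It is not taken from the article. It assumes two independent samples, uses Cohen's d with a pooled standard deviation, and builds the interval from a large-sample normal approximation to the standard error of d rather than from an exact noncentral-t computation. The function name `cohens_d_with_ci` and the sample run times are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d_with_ci(x, y, level=0.95):
    """Return (d, lower, upper) for two independent samples.

    d is Cohen's d with a pooled standard deviation. The interval uses a
    large-sample normal approximation to the standard error of d; it is an
    illustrative shortcut, not an exact noncentral-t interval.
    """
    n1, n2 = len(x), len(y)
    # Pooled variance from the two unbiased sample variances.
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    d = (mean(x) - mean(y)) / sqrt(pooled_var)
    # Approximate standard error of d for independent groups.
    se = sqrt((n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2)))
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d, d - z * se, d + z * se


if __name__ == "__main__":
    # Hypothetical run times (seconds) of two heuristics; illustrative data only.
    a = [12.1, 11.8, 12.6, 12.0, 11.5, 12.3, 12.4, 11.9]
    b = [11.2, 11.6, 11.0, 11.9, 11.4, 11.1, 11.7, 11.3]
    d, lo, hi = cohens_d_with_ci(a, b)
    print(f"d = {d:.2f}, {95}% CI [{lo:.2f}, {hi:.2f}]")
```

Reported this way, a single line conveys the direction of the effect, its magnitude in standard deviation units, and the precision of the estimate, whereas a p-value alone conveys none of these separately. For small samples the normal approximation can be noticeably off, so an exact method would be preferable in practice.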

Author information

Corresponding author

Correspondence to Marko Hofmann.

About this article

Cite this article

Hofmann, M., Meyer-Nieberg, S. Time to dispense with the p-value in OR?. Cent Eur J Oper Res 26, 193–214 (2018). https://doi.org/10.1007/s10100-017-0484-9
