ABSTRACT
When presenting visualizations of experimental results, scientists often choose to display either inferential uncertainty (e.g., uncertainty in the estimate of a population mean) or outcome uncertainty (e.g., variation of outcomes around that mean). How does this choice impact readers' beliefs about the size of treatment effects? We investigate this question in two experiments comparing 95% confidence intervals (means and standard errors) to 95% prediction intervals (means and standard deviations). The first experiment finds that participants are willing to pay more for, and overestimate the effect of, a treatment when shown confidence intervals rather than prediction intervals. The second experiment evaluates how alternative visualizations compare to standard visualizations across different effect sizes. We find that axis rescaling reduces error, but not as well as prediction intervals or animated hypothetical outcome plots (HOPs), and that depicting inferential uncertainty causes participants to underestimate variability in individual outcomes.
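The contrast at the heart of the abstract can be made concrete with a small numerical sketch. Assuming approximately normal outcomes and the usual z-critical value of 1.96, a 95% confidence interval (inferential uncertainty) is mean ± 1.96·sd/√n, while a 95% prediction interval (outcome uncertainty) is roughly mean ± 1.96·sd; the sample values and parameters below are illustrative, not from the paper:

```python
# Sketch: 95% confidence interval (uncertainty about the mean) vs.
# 95% prediction interval (variation of individual outcomes),
# under a normal approximation with z = 1.96.
import math
import random

random.seed(0)
# Hypothetical outcomes for one treatment group.
sample = [random.gauss(100, 15) for _ in range(500)]

n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
z = 1.96

# Confidence interval: shrinks with sample size (divides by sqrt(n)).
ci = (mean - z * sd / math.sqrt(n), mean + z * sd / math.sqrt(n))
# Prediction interval: stays wide regardless of n.
pi = (mean - z * sd, mean + z * sd)

print(f"95% CI for the mean: [{ci[0]:.1f}, {ci[1]:.1f}]")
print(f"95% PI for outcomes: [{pi[0]:.1f}, {pi[1]:.1f}]")
```

The prediction interval is √n times wider than the confidence interval, which illustrates why readers shown only confidence intervals may underestimate how much individual outcomes vary.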
Supplemental Material
The supplement includes a PDF that shows the stimuli and screenshots used in our experiments.
Index Terms
- How Visualizing Inferential Uncertainty Can Mislead Readers About Treatment Effects in Scientific Results