DOI: 10.1145/3025453.3025912

Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing

Published: 02 May 2017

Abstract

Datasets that are identical across a number of statistical properties, yet produce dissimilar graphs, are frequently used to illustrate the importance of graphical representations when exploring data. This paper presents a novel method for generating such datasets, along with several examples. Our technique differs from previous approaches in that new datasets are iteratively generated from a seed dataset through random perturbations of individual data points and can be directed towards a desired outcome through a simulated annealing optimization strategy. Our method has the benefit of being agnostic to the particular statistical properties that are to remain constant between the datasets, and it allows for control over the graphical appearance of the resulting output.
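
The loop the abstract describes (perturb one point of a seed dataset, steer towards a target appearance with simulated annealing, keep the chosen statistics fixed) might be sketched as follows. This is a minimal illustration and not the authors' implementation. The circle target, the linear cooling schedule, the step size, the two-decimal rounding tolerance, and every function and parameter name are assumptions introduced here.

```python
import math
import random


def summary(points, digits=2):
    """Means, standard deviations, and Pearson r of (x, y) points, rounded.

    Assumption: "identical statistics" is taken to mean equal after rounding.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in points) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in points) / (n - 1))
    cov = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return tuple(round(v, digits) for v in (mx, my, sx, sy, cov / (sx * sy)))


def distance_to_circle(point, cx=50.0, cy=50.0, radius=30.0):
    """Hypothetical target shape: distance from a point to a circle outline."""
    return abs(math.hypot(point[0] - cx, point[1] - cy) - radius)


def anneal(seed_points, steps=20000, step_size=0.1, t_start=0.4, t_end=0.01,
           stats=summary, cost=distance_to_circle, rng=None):
    """Perturb one point per step while keeping stats(points) unchanged."""
    rng = rng or random.Random(0)
    points = list(seed_points)
    target = stats(points)  # statistics that must stay constant
    for i in range(steps):
        # Assumed schedule: temperature falls linearly from t_start to t_end.
        temperature = t_start + (t_end - t_start) * i / steps
        k = rng.randrange(len(points))
        old = points[k]
        new = (old[0] + rng.gauss(0, step_size),
               old[1] + rng.gauss(0, step_size))
        improves = cost(new) < cost(old)
        # Annealing rule: always take improvements; take a worse move
        # only with probability equal to the current temperature.
        if not improves and rng.random() >= temperature:
            continue
        points[k] = new
        if stats(points) != target:
            points[k] = old  # the move changed the statistics, so undo it
    return points
```

Because `stats` and `cost` are passed in as plain functions, the loop itself does not depend on which statistical properties are held constant or on which shape is targeted. A call such as `anneal(seed)` on a list of `(x, y)` tuples returns a new list whose rounded summary equals that of the seed.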

Supplementary Material

suppl.mov (pn3600-file3.mp4)
Supplemental video
suppl.mov (pn3600p.mp4)
Supplemental video

    Published In

    CHI '17: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems
    May 2017
    7138 pages
    ISBN: 9781450346559
    DOI: 10.1145/3025453
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Badges

    • Honorable Mention

    Author Tags

    1. anscombe
    2. scatter plots
    3. visualization

    Qualifiers

    • Research-article

    Conference

    CHI '17

    Acceptance Rates

    CHI '17 Paper Acceptance Rate 600 of 2,400 submissions, 25%;
    Overall Acceptance Rate 6,199 of 26,314 submissions, 24%

    Cited By

    • (2025) Generation of Penetrometric Profile of the Soil Applying Machine Learning to Measure While Drilling Data from Deep Foundation Machinery. Applied Sciences 15:3 (1331). DOI: 10.3390/app15031331. Online publication date: 27-Jan-2025
    • (2025) How aggregated opinions shape beliefs. Nature Reviews Psychology 4:2 (81-95). DOI: 10.1038/s44159-024-00398-7. Online publication date: 6-Jan-2025
    • (2025) Quantifying and relating the completeness and diversity of process representations using species estimation. Information Systems 130 (102512). DOI: 10.1016/j.is.2024.102512. Online publication date: Apr-2025
    • (2024) Technical Note: The divide and measure nonconformity – how metrics can mislead when we evaluate on different data partitions. Hydrology and Earth System Sciences 28:15 (3665-3673). DOI: 10.5194/hess-28-3665-2024. Online publication date: 13-Aug-2024
    • (2024) Disentangling decision errors from action execution in mouse-tracking studies: The case of effect-based action control. Attention, Perception, & Psychophysics. DOI: 10.3758/s13414-024-02974-8. Online publication date: 20-Nov-2024
    • (2024) Exploring Topological Information Beyond Persistent Homology to Detect Geospatial Objects. Remote Sensing 16:21 (3989). DOI: 10.3390/rs16213989. Online publication date: 27-Oct-2024
    • (2024) (Mis)estimation of the modal number of desired sexual partners. PLOS ONE 19:12 (e0315291). DOI: 10.1371/journal.pone.0315291. Online publication date: 31-Dec-2024
    • (2024) Systematic Review of Generative Modelling Tools and Utility Metrics for Fully Synthetic Tabular Data. ACM Computing Surveys 57:4 (1-38). DOI: 10.1145/3704437. Online publication date: 14-Nov-2024
    • (2024) Beware of Validation by Eye: Visual Validation of Linear Trends in Scatterplots. IEEE Transactions on Visualization and Computer Graphics 31:1 (787-797). DOI: 10.1109/TVCG.2024.3456305. Online publication date: 10-Sep-2024
    • (2024) Investigating the Visual Utility of Differentially Private Scatterplots. IEEE Transactions on Visualization and Computer Graphics 30:8 (5370-5385). DOI: 10.1109/TVCG.2023.3292391. Online publication date: Aug-2024
