DOI: 10.1145/3077257.3077266
research-article

What you see is not what you get!: Detecting Simpson's Paradoxes during Data Exploration

Published: 14 May 2017

Abstract

Visual data exploration tools, such as Vizdom or Tableau, significantly simplify data exploration for domain experts and, more importantly, novice users. These tools allow users to discover complex correlations and to test hypotheses and differences between various populations in an entirely visual manner with just a few clicks. Unfortunately, they often ignore even the most basic statistical rules. For example, there are many statistical pitfalls that a user can "tap" into when exploring data sets.
As a result of this experience, we started to build QUDE [1], the first system to Quantify the Uncertainty in Data Exploration, which is part of Brown's Interactive Data Exploration Stack (called IDES). The goal of QUDE is to automatically warn and, if possible, protect users from common mistakes during the data exploration process. In this paper, we focus on a different type of error, Simpson's Paradox, in which a high-level aggregate or visualization leads to the wrong conclusion because a trend reverses when the visualized data set is split into multiple subgroups (i.e., when executing a drill-down).
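
The reversal described above can be illustrated with a minimal Python sketch. It is not QUDE's detection method, which the abstract does not specify. It only checks whether the ordering of two options in the aggregate flips in every subgroup after a drill-down. The function names, the row layout `(subgroup, option, successes, trials)`, and the toy counts are all assumptions made for illustration.

```python
from collections import defaultdict


def success_rates(rows):
    """Return {option: successes / trials}, summed over the given rows."""
    totals = defaultdict(lambda: [0, 0])
    for _subgroup, option, successes, trials in rows:
        totals[option][0] += successes
        totals[option][1] += trials
    return {opt: s / n for opt, (s, n) in totals.items()}


def is_simpsons_paradox(rows, a, b):
    """True if the aggregate ordering of a vs. b is reversed in every subgroup.

    Assumes every subgroup contains rows for both options (hypothetical layout).
    """
    overall = success_rates(rows)
    aggregate_diff = overall[a] - overall[b]

    # Drill down: partition the rows by their subgroup label.
    by_subgroup = defaultdict(list)
    for row in rows:
        by_subgroup[row[0]].append(row)

    subgroup_diffs = []
    for subgroup_rows in by_subgroup.values():
        rates = success_rates(subgroup_rows)
        subgroup_diffs.append(rates[a] - rates[b])

    # A reversal means each subgroup difference has the opposite sign
    # of the aggregate difference.
    return aggregate_diff != 0 and all(
        d * aggregate_diff < 0 for d in subgroup_diffs
    )


# Toy counts chosen for illustration only; they are not from the paper.
ROWS = [
    ("small", "A", 81, 87),
    ("small", "B", 234, 270),
    ("large", "A", 192, 263),
    ("large", "B", 55, 80),
]

print(is_simpsons_paradox(ROWS, "A", "B"))
```

With these counts, option B has the higher success rate in the aggregate, while option A has the higher rate in both the "small" and the "large" subgroup, so the check reports a reversal. A real tool would also need to decide which attributes to drill down on and whether the differences are statistically meaningful, which this sketch does not attempt.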

References

[1] C. Binnig et al. Toward sustainable insights, or why polygamy is bad for you. In CIDR, 2017.
[2] Y. Chung et al. Estimating the impact of unknown unknowns on aggregate query results. In SIGMOD, pages 861--876, 2016.
[3] A. Crotty et al. Vizdom: Interactive analytics through pen and touch. PVLDB, 8(12), 2015.
[4] C. C. Fabris et al. Discovering surprising instances of Simpson's paradox in hierarchical multidimensional data. IJDWM, 2(1):27--49, 2006.
[5] C. Glymour et al. Statistical themes and lessons for data mining. Data Min. Knowl. Discov., 1(1):11--28, 1997.
[6] R. Kievit et al. Simpson's paradox in psychological science: a practical guide. Frontiers in Psychology, 4, 2013.
[7] R. Kievit et al. Simpson's paradox in psychological science: a practical guide. Frontiers in Psychology, 4:513, 2013.
[8] Y. Kuo et al. Mining surprising patterns and their explanations in clinical data. Applied Artificial Intelligence, 28(2):111--138, 2014.
[9] Z. Liu et al. The effects of interactive latency on exploratory visual analysis. IEEE Trans. Vis. Comput. Graph., 20(12), 2014.
[10] MIMIC II data set. https://mimic.physionet.org/. Accessed: 2014-11-03.
[11] B. Padmanabhan et al. A belief-driven method for discovering unexpected patterns. In KDD, pages 94--100, 1998.
[12] A. Silberschatz et al. What makes patterns interesting in knowledge discovery systems. IEEE Trans. Knowl. Data Eng., 8(6):970--974, 1996.
[13] E. Suzuki et al. Discovery of surprising exception rules based on intensity of implication. In PKDD, pages 10--18, 1998.
[14] Z. Zhao et al. Controlling false discoveries during interactive data exploration. In SIGMOD, 2017.

Published In

HILDA '17: Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics
May 2017
89 pages
ISBN:9781450350297
DOI:10.1145/3077257

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS'17
Acceptance Rates

Overall Acceptance Rate 28 of 56 submissions, 50%

Article Metrics

  • Downloads (last 12 months): 30
  • Downloads (last 6 weeks): 2
Reflects downloads up to 07 Jan 2025

Cited By

  • (2023) Learning to Discover Various Simpson's Paradoxes. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5092-5103. DOI: 10.1145/3580305.3599859. Online publication date: 6-Aug-2023
  • (2023) CrowdIDEA: Blending Crowd Intelligence and Data Analytics to Empower Causal Reasoning. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1-17. DOI: 10.1145/3544548.3581021. Online publication date: 19-Apr-2023
  • (2022) Reliability at multiple stages in a data analysis pipeline. Communications of the ACM, 65:11, pages 118-128. DOI: 10.1145/3500923. Online publication date: 20-Oct-2022
  • (2022) ProS: data series progressive k-NN similarity search and classification with probabilistic quality guarantees. The VLDB Journal (The International Journal on Very Large Data Bases), 32:4, pages 763-789. DOI: 10.1007/s00778-022-00771-z. Online publication date: 30-Nov-2022
  • (2021) DashBot. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 4696-4700. DOI: 10.1145/3459637.3481968. Online publication date: 26-Oct-2021
  • (2021) Datamations: Animated Explanations of Data Analysis Pipelines. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1-14. DOI: 10.1145/3411764.3445063. Online publication date: 6-May-2021
  • (2020) Data Series Progressive Similarity Search with Probabilistic Quality Guarantees. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1857-1873. DOI: 10.1145/3318464.3389751. Online publication date: 11-Jun-2020
  • (2020) Surfacing Visualization Mirages. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1-16. DOI: 10.1145/3313831.3376420. Online publication date: 21-Apr-2020
  • (2020) VisuaLint: Sketchy In Situ Annotations of Chart Construction Errors. Computer Graphics Forum, 39:3, pages 219-228. DOI: 10.1111/cgf.13975. Online publication date: 18-Jul-2020
  • (2020) Aggregation Bias: A Proposal to Raise Awareness Regarding Inclusion in Visual Analytics. Trends and Innovations in Information Systems and Technologies, pages 409-417. DOI: 10.1007/978-3-030-45697-9_40. Online publication date: 18-May-2020
