research-article

Summarized Causal Explanations For Aggregate Views

Authors:
Brit Youngmann, Technion - Israel Institute of Technology, Haifa, Israel (ORCID: 0000-0002-0031-5550)
Michael Cafarella, CSAIL MIT, Cambridge, MA, USA (ORCID: 0000-0001-6122-0590)
Amir Gilad, Hebrew University, Jerusalem, Israel (ORCID: 0000-0002-3764-1958)
Sudeepa Roy, Duke University, Durham, USA (ORCID: 0009-0002-8300-7891)

Proceedings of the ACM on Management of Data, Volume 2, Issue 1, Article No. 71, pp. 1–27. https://doi.org/10.1145/3639328

Published: 26 March 2024

Abstract

SQL queries with group-by and average are frequently used and plotted as bar charts in several data analysis applications. Understanding the reasons behind the results in such an aggregate view may be a highly nontrivial and time-consuming task, especially for large datasets with multiple attributes. Hence, generating automated explanations for aggregate views can allow users to gain better insights into the results while saving time in data analysis. When providing explanations for such views, it is paramount to ensure that they are succinct yet comprehensive, reveal different types of insights that hold for different aggregate answers in the view, and, most importantly, they reflect reality and arm users to make informed data-driven decisions, i.e., the explanations do not only consider correlations but are causal. In this paper, we present CauSumX, a framework for generating summarized causal explanations for the entire aggregate view. Using background knowledge captured in a causal DAG, CauSumX finds the most effective causal treatments for different groups in the view. We formally define the framework and the optimization problem, study its complexity, and devise an efficient algorithm using the Apriori algorithm, LP rounding, and several optimizations. We experimentally show that our system generates useful summarized causal explanations compared to prior work and scales well for large high-dimensional data.
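
To make the setting concrete, here is a minimal sketch of the two ingredients the abstract refers to: a group-by-average view, and a causal rather than purely correlational effect estimate for one group in that view. It is not the CauSumX algorithm. It assumes a toy table with hypothetical attributes (country, age band, degree indicator, salary) and assumes a causal DAG in which the age band is the only confounder of degree and salary, so that stratifying on it (backdoor adjustment) is valid. All function names and numbers are illustrative.

```python
from collections import defaultdict

# Toy rows: (country, age_band, has_degree, salary). Illustrative data only.
ROWS = [
    ("US", "young", 0, 50.0),
    ("US", "young", 1, 60.0),
    ("US", "young", 0, 52.0),
    ("US", "old", 1, 90.0),
    ("US", "old", 1, 94.0),
    ("US", "old", 0, 80.0),
    ("DE", "young", 0, 40.0),
    ("DE", "old", 1, 70.0),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def aggregate_view(rows):
    """Equivalent of: SELECT country, AVG(salary) FROM rows GROUP BY country."""
    groups = defaultdict(list)
    for country, _age, _degree, salary in rows:
        groups[country].append(salary)
    return {country: mean(salaries) for country, salaries in groups.items()}


def naive_difference(rows, country):
    """Correlational estimate: mean(treated) - mean(control) within one group."""
    treated = [s for c, _a, t, s in rows if c == country and t == 1]
    control = [s for c, _a, t, s in rows if c == country and t == 0]
    return mean(treated) - mean(control)


def adjusted_effect(rows, country):
    """Backdoor adjustment by stratification on the assumed confounder (age_band).

    Per-stratum treated-minus-control differences are averaged, weighted by
    stratum size. Strata missing treated or control rows are skipped.
    """
    strata = defaultdict(lambda: {0: [], 1: []})
    for c, age, t, s in rows:
        if c == country:
            strata[age][t].append(s)
    total, weighted = 0, 0.0
    for cells in strata.values():
        if cells[0] and cells[1]:
            n = len(cells[0]) + len(cells[1])
            weighted += n * (mean(cells[1]) - mean(cells[0]))
            total += n
    if total == 0:
        raise ValueError("no stratum contains both treated and control rows")
    return weighted / total


if __name__ == "__main__":
    print(aggregate_view(ROWS))
    print(naive_difference(ROWS, "US"))
    print(adjusted_effect(ROWS, "US"))
```

On this toy data the naive difference for the US group (about 20.7) is roughly twice the adjusted estimate (10.5), because older respondents both hold degrees more often and earn more. This gap between a correlation and an adjusted effect is what motivates causal explanations.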

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database views
    2. Query languages
      1. Relational database query languages
  2. Information systems applications
    1. Decision support systems
      1. Data analytics

Published in
Proceedings of the ACM on Management of Data (PACMMOD), Volume 2, Issue 1
February 2024
1874 pages
EISSN: 2836-6573
DOI: 10.1145/3654807
Editor: Divyakant Agrawal, UC Santa Barbara, United States
Copyright © 2024 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
causal inference
group-by SQL queries
query results explanation