Abstract
SQL queries with group-by and average are widely used in data analysis applications, and their results are often plotted as bar charts. Understanding the reasons behind the results in such an aggregate view can be a nontrivial and time-consuming task, especially for large datasets with many attributes. Automatically generated explanations for aggregate views can therefore help users gain better insight into the results while saving analysis time. Such explanations should be succinct yet comprehensive, should reveal the different types of insights that hold for different aggregate answers in the view, and, most importantly, should reflect reality and equip users to make informed, data-driven decisions; that is, the explanations should be causal rather than merely correlational. In this paper, we present CauSumX, a framework for generating summarized causal explanations for an entire aggregate view. Using background knowledge captured in a causal DAG, CauSumX finds the most effective causal treatments for different groups in the view. We formally define the framework and its optimization problem, study its complexity, and devise an efficient algorithm that combines the Apriori algorithm, LP rounding, and several optimizations. We experimentally show that our system generates useful summarized causal explanations compared to prior work and scales well to large, high-dimensional data.
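To make the setting concrete, the following is a minimal sketch of a group-by-average view and of the difference between a correlational and a causal reading of a candidate treatment. It is not the CauSumX algorithm: the toy rows, the attribute names (`country`, `degree`, `senior`, `salary`), and the helper functions are all hypothetical, and the causal estimate uses plain stratification on a single confounder, assuming a causal DAG in which `senior` influences both `degree` and `salary`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data; every attribute name and value is an assumption.
ROWS = [
    {"country": "US", "degree": True,  "senior": True,  "salary": 120},
    {"country": "US", "degree": True,  "senior": True,  "salary": 120},
    {"country": "US", "degree": True,  "senior": False, "salary": 90},
    {"country": "US", "degree": False, "senior": True,  "salary": 110},
    {"country": "US", "degree": False, "senior": False, "salary": 80},
    {"country": "DE", "degree": True,  "senior": False, "salary": 70},
    {"country": "DE", "degree": False, "senior": False, "salary": 60},
]


def group_by_avg(rows, key, outcome):
    """Aggregate view: SELECT key, AVG(outcome) FROM rows GROUP BY key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[outcome])
    return {group: mean(values) for group, values in groups.items()}


def naive_difference(rows, treatment, outcome):
    """Correlational reading: treated mean minus untreated mean."""
    treated = [r[outcome] for r in rows if r[treatment]]
    control = [r[outcome] for r in rows if not r[treatment]]
    return mean(treated) - mean(control)


def adjusted_effect(rows, treatment, outcome, confounder):
    """Stratified estimate of the average effect of a binary treatment.

    Compares treated and untreated rows within each confounder stratum,
    then averages the differences weighted by stratum size. Strata that
    lack either treated or untreated rows are skipped.
    """
    strata = defaultdict(lambda: {True: [], False: []})
    for row in rows:
        strata[row[confounder]][bool(row[treatment])].append(row[outcome])
    weighted_sum, total = 0.0, 0
    for cell in strata.values():
        if cell[True] and cell[False]:
            size = len(cell[True]) + len(cell[False])
            weighted_sum += size * (mean(cell[True]) - mean(cell[False]))
            total += size
    return weighted_sum / total if total else None


view = group_by_avg(ROWS, "country", "salary")
naive = naive_difference(ROWS, "degree", "salary")
causal = adjusted_effect(ROWS, "degree", "salary", "senior")
```

On this toy data the view maps `US` to 104 and `DE` to 65. The naive difference for `degree` is about 16.7, while the stratified estimate is 10.0; the gap arises because senior rows are more likely to hold a degree. Restricting `ROWS` to a single group before calling `adjusted_effect` gives a per-group estimate, which is the kind of quantity a summarized explanation would report for each bar in the chart.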
Index Terms
- Summarized Causal Explanations For Aggregate Views