Summarized Causal Explanations For Aggregate Views

Published: 26 March 2024

Abstract

SQL queries with group-by and average are frequently used and plotted as bar charts in several data analysis applications. Understanding the reasons behind the results in such an aggregate view may be a highly nontrivial and time-consuming task, especially for large datasets with multiple attributes. Hence, generating automated explanations for aggregate views can allow users to gain better insights into the results while saving time in data analysis. When providing explanations for such views, it is paramount to ensure that they are succinct yet comprehensive, reveal different types of insights that hold for different aggregate answers in the view, and, most importantly, they reflect reality and arm users to make informed data-driven decisions, i.e., the explanations do not only consider correlations but are causal. In this paper, we present CauSumX, a framework for generating summarized causal explanations for the entire aggregate view. Using background knowledge captured in a causal DAG, CauSumX finds the most effective causal treatments for different groups in the view. We formally define the framework and the optimization problem, study its complexity, and devise an efficient algorithm using the Apriori algorithm, LP rounding, and several optimizations. We experimentally show that our system generates useful summarized causal explanations compared to prior work and scales well for large high-dimensional data.
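To make the setting concrete, the following is a minimal sketch of the kind of aggregate view the abstract refers to: a SQL query with group-by and average whose result would typically be plotted as a bar chart. The table name, column names, and values are hypothetical and chosen only for illustration; they are not taken from the paper, and the sketch does not implement CauSumX itself.

```python
import sqlite3

# Hypothetical toy records (country, education, salary). These values are
# illustrative assumptions, not data from the paper.
rows = [
    ("US", "PhD", 120.0),
    ("US", "BSc", 100.0),
    ("DE", "PhD", 90.0),
    ("DE", "BSc", 70.0),
    ("DE", "BSc", 80.0),
]


def aggregate_view(records):
    """Return {group: average outcome} for a group-by-average SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE survey (country TEXT, education TEXT, salary REAL)"
    )
    conn.executemany("INSERT INTO survey VALUES (?, ?, ?)", records)
    # The aggregate view: one average per group, i.e. one bar per country.
    cursor = conn.execute(
        "SELECT country, AVG(salary) FROM survey "
        "GROUP BY country ORDER BY country"
    )
    view = dict(cursor.fetchall())
    conn.close()
    return view


view = aggregate_view(rows)
print(view)
```

Each key in `view` corresponds to one bar in the chart. An explanation framework of the kind described above would then ask which treatments (for example, a change in the hypothetical `education` attribute) causally affect the average for each group.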

Published in

Proceedings of the ACM on Management of Data (PACMMOD), Volume 2, Issue 1
February 2024
1874 pages
EISSN: 2836-6573
DOI: 10.1145/3654807

Copyright © 2024 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
