
Large-scale Data Exploration Using Explanatory Regression Functions

Published: 28 September 2020


Analysts wishing to explore multivariate data spaces typically issue queries involving selection operators, i.e., range or equality predicates, which define data subspaces of potential interest. They then apply aggregation functions, whose results determine a subspace’s interestingness for further exploration and deeper analysis. However, Aggregate Query (AQ) results are scalars: they convey limited information about the queried subspaces and offer little explainability for exploratory analysis. Analysts have no way of identifying how these results are derived or how they change with respect to the query (input) parameter values. We address this shortcoming with a novel explanation mechanism based on machine learning that helps analysts explore and understand data subspaces. We explain AQ results using functions that take the form of explainable piecewise-linear regression functions and are obtained by solving a three-fold joint optimization problem. A key feature of the proposed solution is that the explanation functions are estimated from previously executed queries. These queries provide a coarse-grained overview of the underlying aggregate function (the one generating the AQ results) to be learned. Explanations for future, previously unseen AQs can be computed without accessing the underlying data and can be used to further explore the queried data subspaces without issuing more queries to the backend analytics engine. We evaluate the explanation accuracy and efficiency through theoretically grounded metrics over real-world and synthetic datasets and query workloads.
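To make the idea of learning explanation functions from past queries more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes one-dimensional range queries described by a hypothetical (center, radius) pair, and it assumes that the region prototypes and the radius breakpoints are given in advance, whereas the abstract states that the explanation functions are obtained through a joint optimization. All names here (`fit_explanations`, `explain`, `toy_aggregate`, `PROTOTYPES`, `BREAKPOINTS`) are illustrative, and the "past" workload is synthetic.

```python
from bisect import bisect_right


def fit_line(points):
    """Ordinary least squares for y = slope * t + intercept over (t, y) pairs."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:  # only one distinct t value: fall back to a flat line
        return 0.0, mean_y
    cov = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = cov / var_t
    return slope, mean_y - slope * mean_t


def nearest(prototypes, center):
    """Index of the prototype closest to the query center."""
    return min(range(len(prototypes)), key=lambda i: abs(center - prototypes[i]))


def fit_explanations(past_queries, prototypes, breakpoints):
    """Fit one local line per (prototype, radius segment) from past queries.

    past_queries: iterable of (center, radius, aggregate_result) triples.
    Returns a dict mapping (prototype index, segment index) to (slope, intercept).
    """
    buckets = {}
    for center, radius, result in past_queries:
        key = (nearest(prototypes, center), bisect_right(breakpoints, radius))
        buckets.setdefault(key, []).append((radius, result))
    return {key: fit_line(points) for key, points in buckets.items()}


def explain(model, prototypes, breakpoints, center, radius):
    """Estimate the aggregate for an unseen query without touching the data."""
    key = (nearest(prototypes, center), bisect_right(breakpoints, radius))
    slope, intercept = model[key]  # KeyError if no past query fell in this bucket
    return slope * radius + intercept


def toy_aggregate(center, radius):
    """Synthetic stand-in for the backend engine (an assumption, not real data)."""
    base = 10.0 if center < 5 else 40.0
    return base * radius if radius < 1.0 else base + 3 * base * (radius - 1.0)


PROTOTYPES = [1.5, 8.5]   # assumed region prototypes
BREAKPOINTS = [1.0]       # assumed radius breakpoint between the two segments

# Synthetic log of previously executed queries and their results.
past = [(c, r, toy_aggregate(c, r))
        for c in (1.0, 2.0, 8.0, 9.0)
        for r in (0.2, 0.4, 0.6, 0.8, 1.2, 1.4, 1.6, 1.8)]

model = fit_explanations(past, PROTOTYPES, BREAKPOINTS)
estimate = explain(model, PROTOTYPES, BREAKPOINTS, center=1.7, radius=0.5)
print(round(estimate, 3))
```

In this sketch the fitted slopes play the role of the explanation: within a region, they indicate how quickly the aggregate grows as the query radius widens, and `explain` can be evaluated for new radii without re-querying the backend. The segments are fitted independently, so the resulting function is not guaranteed to be continuous at the breakpoints.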








Published In

ACM Transactions on Knowledge Discovery from Data  Volume 14, Issue 6
December 2020
376 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 September 2020
Accepted: 01 July 2020
Revised: 01 April 2020
Received: 01 November 2019
Published in TKDD Volume 14, Issue 6



Author Tags

  1. Explainability
  2. aggregate query explanation
  3. data exploration
  4. range query explanation


  • Research-article
  • Research
  • Refereed

Cited By

  • (2025) Task-Aware Data Selectivity in Pervasive Edge Computing Environments. IEEE Transactions on Knowledge and Data Engineering 37:1 (513-525). DOI: 10.1109/TKDE.2024.3485531. Online publication date: Jan-2025
  • (2024) Node and relevant data selection in distributed predictive analytics: A query-centric approach. Journal of Network and Computer Applications 232 (104029). DOI: 10.1016/j.jnca.2024.104029. Online publication date: Dec-2024
  • (2023) Query-driven Edge Node Selection in Distributed Learning Environments. 2023 IEEE 39th International Conference on Data Engineering Workshops (ICDEW) (146-153). DOI: 10.1109/ICDEW58674.2023.00029. Online publication date: Apr-2023
  • (2022) A Survey on Advancements of Real-Time Analytics Architecture Components. Computational Methods and Data Engineering (547-559). DOI: 10.1007/978-981-19-3015-7_41. Online publication date: 9-Sep-2022
