Abstract
While the benefits of data exploration are becoming increasingly prominent, factors such as data volume and complexity and user unfamiliarity with the database contents make querying data a non-trivial, time-consuming process. The main challenge for users is deciding which query to ask at any point. PyExplore is a data exploration framework that helps users explore datasets by providing SQL query recommendations. The user provides an initial SQL query, and PyExplore then returns new SQL queries with an augmented WHERE clause. In this paper, we extend PyExplore with four new workflows: one for approximate query recommendations, one for explainable query completions, one for combined explainable and approximate recommendations, and finally a sampled decision tree workflow that is similar to PyExplore’s original workflow but processes only a small portion of the dataset. The purpose of the explainable workflows is to provide recommendations that are intuitive to the end user, while the purpose of the approximate workflows is to significantly reduce execution time compared to the full workflow. We evaluated the four workflows in terms of execution time and speedup compared to the full workflow. We found that (a) the quality of the approximate recommendations is on par with that of the full workflow, (b) the explainable workflow is faster than using a decision tree classifier to produce the queries, and (c) the approximate workflow is significantly faster than the full workflow.
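To make the idea of an augmented WHERE clause concrete, the following is a minimal sketch, not PyExplore's actual implementation. It assumes a much simpler stand-in technique: a predicate is appended to the initial query, and the split point is the median of one numeric column computed on a random sample of the rows, so only a small portion of the data is processed. The function names `augment_where` and `recommend`, the `cars` table, and the `price` column are hypothetical and do not come from the paper.

```python
import random
import statistics


def augment_where(query: str, predicate: str) -> str:
    """Append a predicate to a SQL query, adding WHERE if the query has none."""
    base = query.strip().rstrip(";")
    # Naive keyword check; adequate for simple single-table queries only.
    has_where = "where" in base.lower().split()
    joiner = " AND " if has_where else " WHERE "
    return f"{base}{joiner}{predicate}"


def recommend(query, rows, column, sample_frac=0.1, seed=0):
    """Return two augmented queries that split `column` at the sample median.

    Assumption: a median split stands in for the clustering and
    decision-tree steps described in the abstract.
    """
    rng = random.Random(seed)
    k = max(1, int(len(rows) * sample_frac))
    sample = rng.sample(rows, k)  # only a small portion of the rows is read
    threshold = statistics.median(r[column] for r in sample)
    return [
        augment_where(query, f"{column} <= {threshold}"),
        augment_where(query, f"{column} > {threshold}"),
    ]


# Hypothetical data: 100 rows with a single numeric column.
rows = [{"price": p} for p in range(100)]
for rec in recommend("SELECT * FROM cars", rows, "price"):
    print(rec)
```

Each returned string is the initial query with one extra predicate, which is the general shape of the recommendations the abstract describes. A real workflow would derive the predicates from learned structure in the data rather than from a single median.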
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Glenis, A. (2024). PyExplore 2.0: Explainable, Approximate and Combined Clustering Based SQL Query Recommendations. In: Chbeir, R., Benslimane, D., Zervakis, M., Manolopoulos, Y., Nguyen, N.T., Tekli, J. (eds) Management of Digital EcoSystems. MEDES 2023. Communications in Computer and Information Science, vol 2022. Springer, Cham. https://doi.org/10.1007/978-3-031-51643-6_7
DOI: https://doi.org/10.1007/978-3-031-51643-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-51642-9
Online ISBN: 978-3-031-51643-6
eBook Packages: Computer Science, Computer Science (R0)