PyExplore 2.0: Explainable, Approximate and Combined Clustering Based SQL Query Recommendations

  • Conference paper
Management of Digital EcoSystems (MEDES 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 2022)

Abstract

While the benefits of data exploration are becoming increasingly prominent, factors such as data volume and complexity and user unfamiliarity with the database contents make querying data a non-trivial, time-consuming process. The main challenge for users is deciding which query to ask at any point. PyExplore is a data exploration framework that helps users explore datasets by providing SQL query recommendations: the user provides an initial SQL query, and PyExplore returns new SQL queries with an augmented WHERE clause. In this paper, we extend PyExplore with four new workflows: one for approximate query recommendations, one for explainable query completions, one for combined explainable and approximate recommendations, and a sampled decision tree workflow that resembles PyExplore's original workflow but processes only a small portion of the dataset. The explainable workflows aim to provide recommendations that are intuitive to the end user, while the approximate workflows aim to significantly reduce execution time compared to the full workflow. We evaluated the four workflows in terms of execution time and speedup compared to the full workflow. We found that (a) the quality of the approximate recommendations is on par with that of the full workflow, (b) the explainable workflow is faster than using a decision tree classifier to produce the queries, and (c) the approximate workflow is significantly faster than the full workflow.
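To make the idea of "augmenting the WHERE clause" concrete, the following is a minimal sketch and not PyExplore's implementation. It assumes a single numeric column, uses a simple one-dimensional two-means split as a stand-in for the clustering step, and optionally works on a random sample to mimic the approximate workflows. The function names, the `movies` table, and the `rating` column are all hypothetical.

```python
import random


def two_means_threshold(values, iterations=20):
    """Split 1-D values into two clusters and return the midpoint
    between the two centroids (None if no split is possible)."""
    if not values:
        return None
    lo, hi = min(values), max(values)
    if lo == hi:
        return None  # all values identical: nothing to split on
    for _ in range(iterations):
        # Assign each value to the nearer of the two centroids.
        left = [v for v in values if abs(v - lo) <= abs(v - hi)]
        right = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo = sum(left) / len(left)
        new_hi = sum(right) / len(right)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def recommend(initial_query, rows, column, sample_fraction=1.0, seed=0):
    """Return the initial query extended with one predicate per cluster.

    sample_fraction < 1.0 clusters only a random subset of the rows,
    an illustrative stand-in for sampling-based approximation.
    """
    rng = random.Random(seed)
    k = max(2, int(len(rows) * sample_fraction))
    sample = rows if k >= len(rows) else rng.sample(rows, k)
    threshold = two_means_threshold([row[column] for row in sample])
    if threshold is None:
        return [initial_query]
    # Append to an existing WHERE clause, or start a new one.
    joiner = " AND " if " where " in initial_query.lower() else " WHERE "
    return [
        f"{initial_query}{joiner}{column} <= {threshold:g}",
        f"{initial_query}{joiner}{column} > {threshold:g}",
    ]


# Hypothetical result set of the user's initial query.
rows = [{"rating": r} for r in [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]]
full = recommend("SELECT * FROM movies", rows, "rating")
sampled = recommend("SELECT * FROM movies", rows, "rating", sample_fraction=0.5)
```

With the full data the split falls between the low and high ratings, so each recommended query selects one of the two groups. The sampled call clusters only three rows, so its threshold may differ; that gap between cheaper computation and recommendation quality is the trade-off the abstract's evaluation addresses.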

Notes

  1. https://www.kaggle.com/rounakbanik/the-movies-dataset/version/7

  2. https://raw.githubusercontent.com/vkrit/data-science-class/master/WA_Fn-UseC_-Sales-Win-Loss.csv

  3. http://db.csail.mit.edu/labdata/labdata.html

Author information

Corresponding author

Correspondence to Apostolos Glenis.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Glenis, A. (2024). PyExplore 2.0: Explainable, Approximate and Combined Clustering Based SQL Query Recommendations. In: Chbeir, R., Benslimane, D., Zervakis, M., Manolopoulos, Y., Nguyen, N.T., Tekli, J. (eds) Management of Digital EcoSystems. MEDES 2023. Communications in Computer and Information Science, vol 2022. Springer, Cham. https://doi.org/10.1007/978-3-031-51643-6_7

  • DOI: https://doi.org/10.1007/978-3-031-51643-6_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-51642-9

  • Online ISBN: 978-3-031-51643-6

  • eBook Packages: Computer Science, Computer Science (R0)
