Abstract
Data exploration, the problem of extracting knowledge from a database even when we do not know exactly what we are looking for, is important for data discovery and analysis. However, precisely specifying SQL queries is not always practical, as in “finding and ranking off-road cars based on a combination of Price, Make, Model, Age, Mileage, etc.” This is due not only to query complexity (e.g., the queries may contain many if-then-else, and, or, and not logic), but also to the fact that the user typically does not know all data instances (and their variants). We propose DExPlorer, a system for interactive data exploration. From the user's perspective, we provide a simple and user-friendly interface that allows the user to (1) confirm whether a tuple is desired or not, and (2) decide whether one tuple is preferred over another. Behind the scenes, we jointly use multiple ML models to learn from these two types of user feedback. Moreover, to effectively involve the human in the loop, we need to select a set of tuples for each user interaction so as to solicit feedback. Therefore, we devise question selection algorithms that consider not only the estimated benefit of each tuple, but also the possible partial orders between any two suggested tuples. Experiments on real-world datasets show that DExPlorer outperforms existing approaches in effectiveness.
Notes
The logarithm is taken to base 2, so \(u(t) \in [0,1]\) for \(e \in [0,1]\), with \(u(t)=1\) when \(e=0.5\).
We tell the workers that we need university students to participate in a user study, and ask them to provide their “.edu” email addresses. We then send an email containing the link to the user study to each of these addresses. Through this link, users can use DExPlorer to find their ranked desired tuples.
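The bounds stated in the first note can be checked with a small sketch. The definition of \(u(t)\) does not appear in this excerpt, so the code below assumes it is the binary Shannon entropy of \(e\), with \(e\) read as the estimated probability that tuple \(t\) is desired. The function name `uncertainty` and this reading of \(e\) are illustrative assumptions, not the authors' implementation.

```python
import math


def uncertainty(e: float) -> float:
    """Binary Shannon entropy (base 2) of a probability e in [0, 1].

    Assumption: this stands in for u(t); the excerpt does not give
    the exact formula.
    """
    if not 0.0 <= e <= 1.0:
        raise ValueError("e must lie in [0, 1]")
    # By convention 0 * log2(0) = 0, so both endpoints carry no uncertainty.
    if e in (0.0, 1.0):
        return 0.0
    return -e * math.log2(e) - (1.0 - e) * math.log2(1.0 - e)


if __name__ == "__main__":
    # The value peaks at e = 0.5 and is symmetric around it.
    for e in (0.0, 0.1, 0.5, 0.9, 1.0):
        print(f"e={e:.1f}  u={uncertainty(e):.3f}")
```

Under this assumption, a tuple whose estimated probability is close to 0.5 is the one the model is least sure about, which is consistent with the note's statement that the maximum value 1 is reached at \(e=0.5\).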
Acknowledgements
This work is supported by NSF of China (61925205, 61632016, 62102215), Huawei, TAL education, China National Postdoctoral Program for Innovative Talents (BX2021155), China Postdoctoral Science Foundation (2021M691784), Shuimu Tsinghua Scholar and Zhejiang Lab’s International Talent Fund for Young Professionals.
Cite this article
Qin, X., Chai, C., Luo, Y. et al. Interactively discovering and ranking desired tuples by data exploration. The VLDB Journal 31, 753–777 (2022). https://doi.org/10.1007/s00778-021-00714-0
DOI: https://doi.org/10.1007/s00778-021-00714-0