Interactively discovering and ranking desired tuples by data exploration

Qin, Xuedi; Chai, Chengliang; Luo, Yuyu; Zhao, Tianyu; Tang, Nan; Li, Guoliang; Feng, Jianhua; Yu, Xiang; Ouzzani, Mourad

doi:10.1007/s00778-021-00714-0

Regular Paper
Published: 18 January 2022

Volume 31, pages 753–777, (2022)
The VLDB Journal

Xuedi Qin¹,
Chengliang Chai¹,
Yuyu Luo¹,
Tianyu Zhao¹,
Nan Tang²,
Guoliang Li ORCID: orcid.org/0000-0002-1398-0621¹,
Jianhua Feng¹,
Xiang Yu¹ &
…
Mourad Ouzzani²

Abstract

Data exploration, the problem of extracting knowledge from a database even when we do not know exactly what we are looking for, is important for data discovery and analysis. However, precisely specifying SQL queries is not always practical, for example for a request such as “finding and ranking off-road cars based on a combination of Price, Make, Model, Age, Mileage, etc.” This is due not only to query complexity (e.g., the queries may involve many if-then-else, and, or, and not conditions), but also to the fact that the user typically does not know all data instances (and their variants). We propose DExPlorer, a system for interactive data exploration. From the user's perspective, it offers a simple, user-friendly interface that allows the user to (1) confirm whether a tuple is desired or not, and (2) decide whether one tuple is preferred over another. Behind the scenes, we jointly use multiple ML models to learn from these two types of user feedback. Moreover, to involve the human in the loop effectively, we need to select a set of tuples for each user interaction so as to solicit feedback. We therefore devise question selection algorithms that consider not only the estimated benefit of each tuple, but also the possible partial orders between any two suggested tuples. Experiments on real-world datasets show that DExPlorer outperforms existing approaches in effectiveness.
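The two feedback types described in the abstract can be illustrated with a small sketch. It is not DExPlorer's implementation: a plain perceptron stands in for the ML models, the pairwise-difference transform is one common way to learn from preferences, and all feature values, function names, and variable names below are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def dot(w: Vector, x: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(examples: List[Tuple[Vector, int]],
                     epochs: int = 50, lr: float = 0.1) -> Vector:
    """Train a linear perceptron on (features, label) pairs, labels in {+1, -1}."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            if y * dot(w, x) <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w


def pairwise_examples(preferences: List[Tuple[Vector, Vector]]) -> List[Tuple[Vector, int]]:
    """Turn 'better is preferred over worse' into labeled difference vectors."""
    examples = []
    for better, worse in preferences:
        diff = [b - w for b, w in zip(better, worse)]
        examples.append((diff, +1))
        examples.append(([-d for d in diff], -1))
    return examples


# Hypothetical normalized features per tuple: [price, mileage].
# Feedback type (1): is this tuple desired (+1) or not (-1)?
desired_feedback = [([0.2, 0.3], +1), ([0.3, 0.1], +1),
                    ([0.9, 0.8], -1), ([0.8, 0.9], -1)]
# Feedback type (2): the first tuple is preferred over the second.
preference_feedback = [([0.3, 0.1], [0.2, 0.3]),
                       ([0.25, 0.2], [0.2, 0.4])]

# Classifier gets a constant bias feature; the ranker works on differences.
w_cls = train_perceptron([([1.0] + x, y) for x, y in desired_feedback])
w_rank = train_perceptron(pairwise_examples(preference_feedback))


def is_desired(x: Vector) -> bool:
    return dot(w_cls, [1.0] + x) > 0


candidates = [[0.3, 0.1], [0.2, 0.3], [0.9, 0.8], [0.25, 0.2]]
# Keep tuples predicted as desired, then order them by the learned ranking score.
ranked = sorted((t for t in candidates if is_desired(t)),
                key=lambda t: dot(w_rank, t), reverse=True)
print(ranked)
```

In this toy setup the classifier filters out the expensive, high-mileage tuple, and the ranker orders the remaining tuples in line with the stated preferences for lower mileage.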

Notes

The logarithm is taken to base 2, so \(u(t) \in [0,1]\) for \(e \in [0,1]\), and \(u(t)=1\) when \(e=0.5\).
https://www.djangoproject.com/.
http://tabulator.info/.
https://www.kaggle.com/orgesleka/used-cars-database.
https://www.acm.org/publications/digital-library.
https://relational.fit.cvut.cz/dataset/TPCH.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
https://sourceforge.net/p/lemur/wiki/RankLib/.
http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html.
We tell the workers that we need university students to participate in a user study and ask them to provide their “.edu” email addresses. We then send an email to each of these addresses with a link to the user study, through which users can use DExPlorer to find their ranked desired tuples.
https://appen.com.
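
The first note above gives only the logarithm base and the range of \(u(t)\); the formula itself is not shown here. One function consistent with those properties is the binary Shannon entropy of the estimate \(e\). The sketch below assumes that form; the function name `uncertainty` and the sample estimates are hypothetical.

```python
import math


def uncertainty(e: float) -> float:
    """Binary entropy (base 2) of a probability estimate e in [0, 1].

    Assumed form of u(t): it lies in [0, 1] and equals 1 at e = 0.5.
    """
    if e <= 0.0 or e >= 1.0:
        return 0.0  # limit of e*log2(e) as e -> 0 is 0
    return -(e * math.log2(e) + (1.0 - e) * math.log2(1.0 - e))


# Hypothetical estimated probabilities that each tuple is desired.
estimates = {"t1": 0.9, "t2": 0.55, "t3": 0.1}

# The tuple whose estimate is closest to 0.5 is the most uncertain one.
most_uncertain = max(estimates, key=lambda t: uncertainty(estimates[t]))
print(most_uncertain, round(uncertainty(estimates[most_uncertain]), 3))
```

Under this assumption, tuples with estimates near 0.5 score highest, which matches the note's statement that \(u(t)=1\) when \(e=0.5\).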

Acknowledgements

This work is supported by NSF of China (61925205, 61632016, 62102215), Huawei, TAL education, China National Postdoctoral Program for Innovative Talents (BX2021155), China Postdoctoral Science Foundation (2021M691784), Shuimu Tsinghua Scholar and Zhejiang Lab’s International Talent Fund for Young Professionals.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, China
Xuedi Qin, Chengliang Chai, Yuyu Luo, Tianyu Zhao, Guoliang Li, Jianhua Feng & Xiang Yu
Qatar Computing Research Institute, HBKU, Doha, Qatar
Nan Tang & Mourad Ouzzani

Corresponding authors

Correspondence to Chengliang Chai or Guoliang Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Qin, X., Chai, C., Luo, Y. et al. Interactively discovering and ranking desired tuples by data exploration. The VLDB Journal 31, 753–777 (2022). https://doi.org/10.1007/s00778-021-00714-0

Received: 26 February 2021
Revised: 19 October 2021
Accepted: 26 October 2021
Published: 18 January 2022
Issue Date: July 2022
DOI: https://doi.org/10.1007/s00778-021-00714-0
