POLYTOPE: a flexible sampling system for answering exploratory queries

Wu, Zhigang; Jing, Yinan; He, Zhenying; Guo, Chenghao; Wang, X. Sean

doi:10.1007/s11280-019-00685-x

Published: 15 May 2019

World Wide Web, volume 23, pages 1–22 (2020)

Zhigang Wu^1,2,
Yinan Jing ORCID: orcid.org/0000-0002-1169-8032^1,2,
Zhenying He^1,2,
Chenghao Guo^1,2 &
X. Sean Wang^1,2,3

Abstract

Data exploration tasks are usually time-consuming. Analysts who want to find interesting patterns or verify hypotheses often prefer a lower response time and will tolerate a bounded error in exchange. Approximate query processing (AQP) achieves this by using pre-computed samples to speed up query answering. Existing sampling-based AQP systems usually apply a single sampling strategy to the whole dataset. During data exploration, however, potential interests may be distributed across different parts of the dataset, so the queries that users submit vary widely from one sub-dataset to another. A single sampling strategy therefore cannot serve all queries that access different sub-datasets. In this paper, we propose POLYTOPE, a flexible and effective sampling system designed specifically for data exploration tasks. It rests on three key ideas: (1) split the dataset into sampling blocks according to user query patterns, (2) generate a set of optimized samples for each sampling block individually, and (3) automatically select an optimal sample at run time. We use both user query patterns and the underlying data distribution to realize these ideas. We have implemented our system on the Spark platform, and our comprehensive experimental results show that it improves accuracy by up to 46% under the same time constraint for data exploration tasks.
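
The three ideas in the abstract can be illustrated with a toy sketch in plain Python. This is not the POLYTOPE implementation, which runs on Spark and whose algorithms are not given in the abstract. Every name here (`split_into_blocks`, `build_samples`, `select_sample`, `approximate_avg`) is hypothetical. The sketch also makes simplifying assumptions: blocks are formed from a single attribute that queries are assumed to filter on, the candidate samples of a block are uniform samples that differ only in size, and the time constraint is modeled as a budget on the number of rows scanned.

```python
import random
from collections import defaultdict


def split_into_blocks(rows, block_key):
    """Idea 1 (simplified): group rows into sampling blocks.

    block_key stands in for a user query pattern, e.g. an attribute
    that past queries are assumed to filter on.
    """
    blocks = defaultdict(list)
    for row in rows:
        blocks[block_key(row)].append(row)
    return dict(blocks)


def build_samples(blocks, sizes, seed=0):
    """Idea 2 (simplified): precompute several candidate samples per block.

    Each candidate is a uniform sample without replacement, keyed by its
    nominal size. A block smaller than the nominal size is kept whole.
    """
    rng = random.Random(seed)
    return {
        block_id: {
            size: rng.sample(block_rows, min(size, len(block_rows)))
            for size in sizes
        }
        for block_id, block_rows in blocks.items()
    }


def select_sample(samples, block_id, row_budget):
    """Idea 3 (simplified): choose a sample for one block at run time.

    Picks the largest candidate whose nominal size fits the row budget,
    falling back to the smallest candidate if none fits.
    """
    candidates = samples[block_id]
    fitting = [size for size in candidates if size <= row_budget]
    chosen = max(fitting) if fitting else min(candidates)
    return candidates[chosen]


def approximate_avg(sample, value):
    """Estimate an average over a block from one of its uniform samples."""
    return sum(value(row) for row in sample) / len(sample)


if __name__ == "__main__":
    # Synthetic data: two regions with 500 rows each.
    rows = [
        {"region": "east" if i % 2 == 0 else "west", "price": float(i)}
        for i in range(1000)
    ]
    blocks = split_into_blocks(rows, lambda row: row["region"])
    samples = build_samples(blocks, sizes=(10, 50, 200))
    sample = select_sample(samples, "east", row_budget=60)
    print(len(sample), approximate_avg(sample, lambda row: row["price"]))
```

A query restricted to one block reads only that block's chosen sample, so the cost is bounded by the budget and the estimate is not diluted by rows from other blocks. A fuller design would presumably let the candidates differ in sampling strategy as well as size, and would use the data distribution when choosing among them, as the abstract indicates.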

Acknowledgments

We thank the anonymous reviewers for their invaluable feedback and suggestions, which have greatly improved this work. This work was partially supported by the National Key R&D Program of China (No. 2018YFB1004404) and the NSFC (No. 61732004).

Author information

Authors and Affiliations

Shanghai Key Laboratory of Data Science, Shanghai, China
Zhigang Wu, Yinan Jing, Zhenying He, Chenghao Guo & X. Sean Wang
School of Computer Science, Fudan University, Shanghai, China
Zhigang Wu, Yinan Jing, Zhenying He, Chenghao Guo & X. Sean Wang
Shanghai Institute of Intelligent Electronics and Systems, Shanghai, China
X. Sean Wang

Corresponding authors

Correspondence to Yinan Jing, Zhenying He or X. Sean Wang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wu, Z., Jing, Y., He, Z. et al. POLYTOPE: a flexible sampling system for answering exploratory queries. World Wide Web 23, 1–22 (2020). https://doi.org/10.1007/s11280-019-00685-x

Received: 09 July 2018
Revised: 28 February 2019
Accepted: 25 April 2019
Published: 15 May 2019
Issue Date: January 2020
DOI: https://doi.org/10.1007/s11280-019-00685-x
