Abstract
Data exploration tasks are usually time-consuming. Analysts who want to find items of interest or verify hypotheses often prefer a lower response time and can tolerate a bounded error. Approximate query processing (AQP) achieves this goal by using pre-computed samples to speed up query answering. Existing sampling-based AQP systems usually apply a single sampling strategy to the whole dataset. However, during data exploration, potential items of interest may be distributed across different parts of the dataset, so the queries users submit to explore them vary widely across sub-datasets. A single sampling strategy therefore cannot serve all queries that access different sub-datasets well. In this paper, we propose POLYTOPE, a flexible and effective sampling system designed specifically for data exploration tasks. It rests on three key ideas: (1) split the dataset into sampling blocks according to user query patterns, (2) generate a set of optimized samples for each sampling block individually, and (3) automatically select an optimal sample at run time. We use both user query patterns and the underlying data distribution to realize these ideas. We have implemented our system on the Spark platform, and our comprehensive experimental results show that it improves accuracy by up to 46% under the same time constraint for data exploration tasks.
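As a rough illustration of the three ideas listed in the abstract, the following Python sketch splits rows into blocks, pre-computes several samples per block, and picks one sample at query time. It is not the POLYTOPE implementation: the function names, the key-based block split, the uniform sampling, and the row budget standing in for a time constraint are all assumptions made for this example. The abstract states that the actual system derives blocks and optimized samples from user query patterns and the data distribution, which this sketch does not attempt.

```python
import random
from collections import defaultdict


def split_into_blocks(rows, block_key):
    """Idea (1): group rows into sampling blocks.

    Assumption: blocks are defined by a caller-supplied key function,
    standing in for a split derived from user query patterns.
    """
    blocks = defaultdict(list)
    for row in rows:
        blocks[block_key(row)].append(row)
    return dict(blocks)


def build_samples(block, rates, rng):
    """Idea (2): pre-compute several samples for one block.

    Assumption: plain uniform samples at a few sampling rates (each <= 1),
    rather than samples optimized for the block's data distribution.
    """
    samples = {}
    for rate in rates:
        size = max(1, round(len(block) * rate))
        samples[rate] = rng.sample(block, size)
    return samples


def pick_sample(samples, row_budget):
    """Idea (3): choose a sample at run time.

    Assumption: the time constraint is modeled as a row budget, and the
    largest sample that fits is chosen (the smallest one if none fits).
    """
    fitting = [rate for rate, rows in samples.items() if len(rows) <= row_budget]
    rate = max(fitting) if fitting else min(samples)
    return rate, samples[rate]


def approx_sum(block_size, sample, value):
    """Estimate SUM over a block by scaling the sample mean to the block size."""
    return block_size * sum(value(row) for row in sample) / len(sample)
```

A query restricted to one block would call `pick_sample` on that block's pre-computed samples and pass the result to `approx_sum`; a query spanning several blocks would add up the per-block estimates.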
Acknowledgments
We thank the anonymous reviewers for their invaluable feedback and suggestions, which have greatly improved this work. This work was partially supported by the National Key R&D Program of China (No. 2018YFB1004404) and the NSFC (No. 61732004).
Cite this article
Wu, Z., Jing, Y., He, Z. et al. POLYTOPE: a flexible sampling system for answering exploratory queries. World Wide Web 23, 1–22 (2020). https://doi.org/10.1007/s11280-019-00685-x
DOI: https://doi.org/10.1007/s11280-019-00685-x