POLYTOPE: a flexible sampling system for answering exploratory queries

World Wide Web

Abstract

Data exploration is usually time-consuming. Analysts who want to find items of interest or verify their hypotheses may prefer a lower response time and tolerate a bounded error in return. Approximate query processing (AQP) achieves this by using pre-computed samples to speed up query answering. Existing sampling-based AQP systems usually apply a single sampling strategy to the whole dataset. During data exploration, however, potential interests may be distributed across different parts of the dataset, so the queries that users submit vary widely across sub-datasets. A single sampling strategy therefore cannot serve all queries accessing different sub-datasets well. In this paper, we propose POLYTOPE, a flexible and effective sampling system designed for data exploration tasks. It rests on three key ideas: (1) split the dataset into sampling blocks according to user query patterns, (2) generate a set of optimized samples for each sampling block individually, and (3) automatically select an optimal sample at run time. We use both user query patterns and the underlying data distribution to realize these ideas. We have implemented our system on the Spark platform, and our comprehensive experimental results show that it improves accuracy by up to 46% under the same time constraint for data exploration tasks.
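
The three ideas in the abstract can be illustrated with a toy, in-memory sketch. It is not the POLYTOPE implementation: the abstract does not specify the algorithms, so this sketch assumes blocks are formed by grouping on a single attribute that queries filter on, that each block's samples are plain uniform samples at a few fixed rates, and that the run-time choice is simply the largest sample that fits a row budget (a stand-in for the time constraint). All function names, parameters, and data below are hypothetical.

```python
import random
from collections import defaultdict


def split_into_blocks(rows, block_key):
    """Idea 1 (simplified): group rows by an attribute assumed to appear in query predicates."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[block_key]].append(row)
    return dict(blocks)


def build_samples(block, rates, rng):
    """Idea 2 (simplified): pre-compute one uniform sample per sampling rate for a block."""
    samples = {}
    for rate in rates:
        k = max(1, round(len(block) * rate))
        samples[rate] = rng.sample(block, k)
    return samples


def select_sample(samples, row_budget):
    """Idea 3 (simplified): pick the largest sample within the row budget,
    falling back to the smallest sample if none fits."""
    fitting = [rate for rate, s in samples.items() if len(s) <= row_budget]
    rate = max(fitting) if fitting else min(samples)
    return rate, samples[rate]


def approx_sum(sample, block_size, value_key):
    """Scale the sample sum up to the block size (valid for uniform sampling)."""
    return sum(row[value_key] for row in sample) * block_size / len(sample)


# Hypothetical data: 1000 rows, two regions, constant amount so the estimate is exact.
rng = random.Random(0)
rows = [{"region": "east" if i % 2 else "west", "amount": 10} for i in range(1000)]

blocks = split_into_blocks(rows, "region")
samples = {name: build_samples(b, (0.01, 0.1, 0.5), rng) for name, b in blocks.items()}

# Answer "SUM(amount) WHERE region = 'east'" using at most 60 sampled rows.
rate, sample = select_sample(samples["east"], row_budget=60)
estimate = approx_sum(sample, len(blocks["east"]), "amount")
print(rate, len(sample), estimate)
```

A real system would replace each simplification: block boundaries would come from observed query patterns, the per-block samples would be tuned to the data distribution, and selection would weigh expected error against response time rather than a raw row count.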

Acknowledgments

We thank the anonymous reviewers for their invaluable feedback and suggestions, which have greatly improved this work. This work was partially supported by the National Key R&D Program of China (No. 2018YFB1004404) and the NSFC (No. 61732004).

Author information

Corresponding authors

Correspondence to Yinan Jing, Zhenying He or X. Sean Wang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wu, Z., Jing, Y., He, Z. et al. POLYTOPE: a flexible sampling system for answering exploratory queries. World Wide Web 23, 1–22 (2020). https://doi.org/10.1007/s11280-019-00685-x

  • DOI: https://doi.org/10.1007/s11280-019-00685-x
