Sampling-based estimators for subset-based queries

  • Regular Paper
The VLDB Journal

Abstract

We consider the problem of using sampling to estimate the result of an aggregation operation over a subset-based SQL query, in which a subquery is correlated with an outer query by a NOT EXISTS, NOT IN, EXISTS, or IN clause. We design an estimator for such queries and prove that it is unbiased. We then provide a second, biased estimator that uses the superpopulation concept from statistics to minimize the mean squared error of the resulting estimate. Both estimators are evaluated in an extensive set of experiments.
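
To make the query class concrete, the following minimal sketch runs a COUNT over a NOT EXISTS correlated subquery, first on full data and then with the inner table replaced by a sample. It is an illustration only and does not implement either estimator from the paper. The tables `customer` and `orders`, the helper `count_without_orders`, and the data are hypothetical. The sketch shows why a naive approach is problematic: rows missing from the sampled inner table make NOT EXISTS true for outer rows that do have a match in the full data.

```python
import sqlite3

def count_without_orders(customers, orders):
    """COUNT(*) over a NOT EXISTS correlated subquery (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE orders (customer_id INTEGER)")
    conn.executemany("INSERT INTO customer VALUES (?)", [(c,) for c in customers])
    conn.executemany("INSERT INTO orders VALUES (?)", [(o,) for o in orders])
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM customer c "
        "WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)"
    ).fetchone()
    conn.close()
    return n

# Hypothetical data: 100 customers; customers 0..79 each placed one order.
customers = list(range(100))
orders = list(range(80))

# Exact answer on the full data.
exact = count_without_orders(customers, orders)

# Naive approach: run the same query against a 50% sample of the inner table.
# (A fixed every-other-row sample is used here so the result is reproducible.)
sampled_orders = orders[::2]
naive = count_without_orders(customers, sampled_orders)

print(exact, naive)
```

Here the query over the sampled inner table counts customers whose only order was dropped by the sample, so the naive result exceeds the exact one. No simple rescaling of the sampled result is obviously correct for this kind of query, which is the difficulty the estimators in the paper address.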

Author information

Corresponding author

Correspondence to Christopher Jermaine.

Additional information

Material in this paper is based upon work supported by the National Science Foundation via grants 0347408 and 0612170.

About this article

Cite this article

Joshi, S., Jermaine, C. Sampling-based estimators for subset-based queries. The VLDB Journal 18, 181–202 (2009). https://doi.org/10.1007/s00778-008-0095-0

  • DOI: https://doi.org/10.1007/s00778-008-0095-0
