Abstract
We consider the problem of using sampling to estimate the result of an aggregation operation over a subset-based SQL query, where a subquery is correlated to an outer query by a NOT EXISTS, NOT IN, EXISTS or IN clause. We design an estimator for this class of queries and prove that it is unbiased. We then provide a second, biased estimator that makes use of the superpopulation concept from statistics to minimize the mean squared error of the resulting estimate. Both estimators are evaluated in an extensive set of experiments.
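To make the query class concrete, the following is a minimal sketch of a SUM over rows selected by a NOT EXISTS condition, together with a simple scale-up estimate computed from a sample of the outer relation. It is not either estimator from the paper. The table contents, the names `ORDERS`, `RETURNED_IDS`, `qualifies`, `scaled_estimate` and `mean_over_all_samples` are all hypothetical, and the sketch assumes that the inner relation can be consulted in full, so only the outer relation is sampled. Under that simplifying assumption the scale-up estimate is unbiased, which the sketch checks by averaging over every possible sample of a fixed size.

```python
import itertools
import random

# Hypothetical toy data: orders (outer relation) and the ids of returned
# orders (inner relation). These names are illustrative assumptions.
ORDERS = [{"id": i, "amount": 10 * (i + 1)} for i in range(8)]
RETURNED_IDS = {1, 4, 6}


def qualifies(order):
    # Plays the role of:
    #   NOT EXISTS (SELECT * FROM returns r WHERE r.order_id = o.id)
    # Assumption: the inner relation is checked in full, not sampled.
    return order["id"] not in RETURNED_IDS


def exact_sum(orders):
    # Exact answer: SUM(amount) over orders that have no matching return.
    return sum(o["amount"] for o in orders if qualifies(o))


def scaled_estimate(sample, population_size):
    # Scale the qualifying total of the sample by N / n.
    partial = sum(o["amount"] for o in sample if qualifies(o))
    return partial * population_size / len(sample)


def mean_over_all_samples(orders, n):
    # Average the estimate over every size-n subset of the outer relation.
    # For an unbiased estimator this average equals the exact answer.
    estimates = [
        scaled_estimate(subset, len(orders))
        for subset in itertools.combinations(orders, n)
    ]
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = rng.sample(ORDERS, 4)  # simple random sample without replacement
    print("exact:", exact_sum(ORDERS))
    print("one estimate:", scaled_estimate(sample, len(ORDERS)))
    print("mean over all samples:", mean_over_all_samples(ORDERS, 3))
```

If the inner relation were sampled as well, a missing match in the sample would no longer show that no match exists in the full relation, so this simple scale-up would not carry over unchanged.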
Additional information
Material in this paper is based upon work supported by the National Science Foundation via grants 0347408 and 0612170.
About this article
Cite this article
Joshi, S., Jermaine, C. Sampling-based estimators for subset-based queries. The VLDB Journal 18, 181–202 (2009). https://doi.org/10.1007/s00778-008-0095-0
DOI: https://doi.org/10.1007/s00778-008-0095-0