Sampling-based estimators for subset-based queries

Joshi, Shantanu; Jermaine, Christopher

doi:10.1007/s00778-008-0095-0

Regular Paper
Published: 04 April 2008

Volume 18, pages 181–202, (2009)

The VLDB Journal

Shantanu Joshi¹ &
Christopher Jermaine²

Abstract

We consider the problem of using sampling to estimate the result of an aggregation operation over a subset-based SQL query, where a subquery is correlated to an outer query by a NOT EXISTS, NOT IN, EXISTS or IN clause. We design an unbiased estimator for our query and prove that it is indeed unbiased. We then provide a second, biased estimator that makes use of the superpopulation concept from statistics to minimize the mean squared error of the resulting estimate. The two estimators are tested over an extensive set of experiments.
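To illustrate why subset-based queries are awkward for sampling, here is a minimal sketch, not the estimators from the paper. It assumes hypothetical tables `emps(emp_id, salary)` and `sales(emp_id, sale_no)`, and a query of the form `SELECT SUM(salary) FROM emps e WHERE NOT EXISTS (SELECT * FROM sales s WHERE s.emp_id = e.emp_id)`. If only the inner table is sampled, an employee whose sales all fall outside the sample wrongly passes the NOT EXISTS test, so with non-negative salaries the naive answer can never fall below the exact one and usually exceeds it.

```python
import random


def exact_not_exists_sum(emps, sales):
    """SUM(salary) over employees that have no matching row in sales."""
    sellers = {emp_id for emp_id, _ in sales}
    return sum(salary for emp_id, salary in emps if emp_id not in sellers)


def naive_sampled_sum(emps, sales, fraction, rng):
    """Run the same query against a Bernoulli sample of the inner table only."""
    sample = [row for row in sales if rng.random() < fraction]
    return exact_not_exists_sum(emps, sample)


# Hypothetical data: employees 0..499 have one to three sales each,
# employees 500..999 have none. Every salary is 100.0.
rng = random.Random(0)
emps = [(i, 100.0) for i in range(1000)]
sales = [(i, k) for i in range(500) for k in range(1 + i % 3)]

exact = exact_not_exists_sum(emps, sales)
naive = [naive_sampled_sum(emps, sales, 0.1, rng) for _ in range(20)]

print("exact answer:", exact)
print("smallest naive answer over 20 trials:", min(naive))
```

Every naive answer is at least as large as the exact one because removing rows from `sales` can only enlarge the set of employees that satisfy NOT EXISTS. This one-sided error is the kind of bias a dedicated estimator has to correct.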

Author information

Authors and Affiliations

Server Manageability, Oracle, 400 Oracle Parkway, Redwood Shores, CA, 94065, USA
Shantanu Joshi
Computer and Information Science and Engineering, University of Florida, Gainesville, FL, 32611, USA
Christopher Jermaine

Corresponding author

Correspondence to Christopher Jermaine.

Additional information

Material in this paper is based upon work supported by the National Science Foundation via grants 0347408 and 0612170.

About this article

Cite this article

Joshi, S., Jermaine, C. Sampling-based estimators for subset-based queries. The VLDB Journal 18, 181–202 (2009). https://doi.org/10.1007/s00778-008-0095-0

Received: 06 July 2006
Revised: 19 December 2007
Accepted: 04 January 2008
Published: 04 April 2008
Issue Date: January 2009
DOI: https://doi.org/10.1007/s00778-008-0095-0
