Histograms based on the minimum description length principle

Wang, Hai; Sevcik, Kenneth C.

doi:10.1007/s00778-006-0015-0

Regular Paper
Published: 14 December 2006

Volume 17, pages 419–442 (2008)

The VLDB Journal

Abstract

Histograms are widely used for selectivity estimation in query optimization and for fast approximate query answering in many OLAP, data mining, and data visualization applications. This paper presents a new family of histograms, the Hierarchical Model Fitting (HMF) histograms, based on the Minimum Description Length principle. Rather than describing every bucket of a histogram with the same type of model, HMF histograms employ a locally optimal model for each bucket. The improved effectiveness of the locally chosen models more than offsets the overhead of keeping track of the representation of each individual bucket. Through a set of experiments, we show that HMF histograms can provide more accurate approximations than previously proposed techniques for many real and synthetic data sets across a variety of query workloads.
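
As a rough illustration of the per-bucket idea only (this is not the paper's HMF construction algorithm), the Python sketch below picks, for a single bucket, whichever of two candidate models gives the shorter two-part description: bits for the model parameters plus bits for the residuals. Everything specific here is an assumption made for the example: the two candidate models (constant and linear), the 32-bit parameter cost `PARAM_BITS`, the Elias-gamma-like residual cost, and all function names.

```python
import math

PARAM_BITS = 32  # assumed cost, in bits, of storing one model parameter


def residual_bits(actual, predicted):
    """Rough cost in bits of encoding the rounded residuals (assumed code)."""
    return sum(
        1 + 2 * math.log2(1 + abs(round(a - p)))
        for a, p in zip(actual, predicted)
    )


def fit_constant(freqs):
    """One-parameter model: every value in the bucket has the mean frequency."""
    mean = sum(freqs) / len(freqs)
    return 1, [mean] * len(freqs)


def fit_linear(freqs):
    """Two-parameter model: least-squares line over positions 0..n-1."""
    n = len(freqs)
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(freqs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, freqs))
    slope = sxy / sxx if sxx else 0.0
    intercept = y_mean - slope * x_mean
    return 2, [intercept + slope * x for x in xs]


# Hypothetical candidate set; a real system could offer more model types.
MODELS = {"constant": fit_constant, "linear": fit_linear}


def best_model(freqs):
    """Return (name of the cheapest model, description length of each model)."""
    costs = {}
    for name, fit in MODELS.items():
        n_params, predicted = fit(freqs)
        costs[name] = n_params * PARAM_BITS + residual_bits(freqs, predicted)
    return min(costs, key=costs.get), costs


flat_bucket = [10, 10, 10, 10]
ramp_bucket = list(range(0, 200, 10))
print(best_model(flat_bucket)[0])
print(best_model(ramp_bucket)[0])
```

Under these assumed costs, a flat bucket is cheapest to describe with the one-parameter model, while a steadily rising bucket justifies the extra parameter of the line because the residuals shrink by more than the parameter costs. That trade-off is the sense in which a better local model can pay for its own bookkeeping overhead.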

Author information

Authors and Affiliations

Sobey School of Business, Saint Mary’s University, Halifax, NS, B3H 3C3, Canada
Hai Wang
Department of Computer Science, University of Toronto, Toronto, ON, M5S 3G4, Canada
Kenneth C. Sevcik

Corresponding author

Correspondence to Hai Wang.

About this article

Cite this article

Wang, H., Sevcik, K.C. Histograms based on the minimum description length principle. The VLDB Journal 17, 419–442 (2008). https://doi.org/10.1007/s00778-006-0015-0

Published: 14 December 2006
Issue Date: May 2008
DOI: https://doi.org/10.1007/s00778-006-0015-0
