Histograms based on the minimum description length principle

  • Regular Paper
The VLDB Journal

Abstract

Histograms have been widely used for selectivity estimation in query optimization, as well as for fast approximate query answering in many OLAP, data mining, and data visualization applications. This paper presents a new family of histograms, the Hierarchical Model Fitting (HMF) histograms, based on the Minimum Description Length principle. Rather than describing every bucket of a histogram with the same type of model, HMF histograms employ a locally optimal model for each bucket. The improved effectiveness of the locally chosen models more than offsets the overhead of keeping track of the representation of each individual bucket. Through a set of experiments, we show that HMF histograms can provide more accurate approximations than previously proposed techniques for many real and synthetic data sets across a variety of query workloads.
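
The idea of picking a model per bucket by description length can be illustrated with a small sketch. This is not the HMF construction algorithm from the paper; it is a toy example under stated assumptions. The candidate model set (a constant "uniform" model and a least-squares "linear" model), the function names, and the cost formula (a common two-part approximation: (k/2) log2 n bits for k parameters plus (n/2) log2(RSS/n) for the residuals) are all assumptions chosen for illustration.

```python
import math

# Toy illustration only: the models, names, and cost formula below are
# assumptions for this sketch, not the method described in the paper.

def fit_uniform(freqs):
    """Constant model: every value in the bucket gets the mean frequency."""
    mean = sum(freqs) / len(freqs)
    return [mean] * len(freqs), 1  # fitted values, number of parameters

def fit_linear(freqs):
    """Least-squares line over positions 0..n-1 within the bucket."""
    n = len(freqs)
    x_mean = (n - 1) / 2
    y_mean = sum(freqs) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    if sxx == 0:  # single-value bucket: the line degenerates to a constant
        return [y_mean] * n, 2
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(freqs))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [intercept + slope * x for x in range(n)], 2

def description_length(freqs, fitted, n_params, noise_floor=1e-9):
    """Approximate two-part cost: parameter cost plus residual cost.

    The residual term is a relative score and can be negative; only
    comparisons between models on the same bucket are meaningful.
    """
    n = len(freqs)
    rss = sum((y - f) ** 2 for y, f in zip(freqs, fitted))
    model_cost = 0.5 * n_params * math.log2(n)
    data_cost = 0.5 * n * math.log2(rss / n + noise_floor)
    return model_cost + data_cost

def choose_model(freqs):
    """Return the name of the candidate with the smallest description length."""
    candidates = {"uniform": fit_uniform, "linear": fit_linear}
    costs = {}
    for name, fit in candidates.items():
        fitted, n_params = fit(freqs)
        costs[name] = description_length(freqs, fitted, n_params)
    return min(costs, key=costs.get)

if __name__ == "__main__":
    print(choose_model([5, 5, 5, 5, 5, 5]))        # flat bucket
    print(choose_model([1, 2, 3, 4, 5, 6, 7, 8]))  # steadily rising bucket
```

In this sketch the extra parameter of the linear model is only "paid for" when it reduces the residual cost by more than its own encoding cost, which mirrors the trade-off the abstract describes between per-bucket model overhead and accuracy.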

Author information

Corresponding author

Correspondence to Hai Wang.

About this article

Cite this article

Wang, H., Sevcik, K.C. Histograms based on the minimum description length principle. The VLDB Journal 17, 419–442 (2008). https://doi.org/10.1007/s00778-006-0015-0

  • DOI: https://doi.org/10.1007/s00778-006-0015-0
