Abstract
Histograms have been widely used for selectivity estimation in query optimization, as well as for fast approximate query answering in many OLAP, data mining, and data visualization applications. This paper presents a new family of histograms, the Hierarchical Model Fitting (HMF) histograms, based on the Minimum Description Length principle. Rather than describing every bucket of a histogram with the same type of model, the HMF histograms employ a locally optimal model for each bucket. The improved effectiveness of the locally chosen models more than offsets the overhead of keeping track of the representation of each individual bucket. Through a set of experiments, we show that the HMF histograms are capable of providing more accurate approximations than previously proposed techniques for many real and synthetic data sets across a variety of query workloads.
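The idea of choosing a model per bucket by description length can be illustrated with a minimal sketch. This is not the HMF construction from the paper: the two candidate models (constant and linear), the per-parameter bit cost, the model-tag cost, and the residual cost formula are all assumptions chosen only to show the trade-off between model overhead and fit quality.

```python
import math

# Assumed costs, for illustration only (not taken from the paper).
PARAM_BITS = 32      # bits to store one model parameter
MODEL_TAG_BITS = 1   # bits to record which model type a bucket uses


def sse_constant(freqs):
    """Sum of squared errors when the bucket is summarized by its mean."""
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs)


def sse_linear(freqs):
    """Sum of squared errors of a least-squares line over positions 0..n-1."""
    n = len(freqs)
    if n < 2:
        return 0.0
    mx = (n - 1) / 2
    my = sum(freqs) / n
    sxx = sum((x - mx) ** 2 for x in range(n))
    sxy = sum((x - mx) * (y - my) for x, y in enumerate(freqs))
    slope = sxy / sxx
    return sum((y - (my + slope * (x - mx))) ** 2 for x, y in enumerate(freqs))


def description_length(sse, n, n_params):
    """Two-part code length: model bits plus an assumed residual cost."""
    model_bits = MODEL_TAG_BITS + n_params * PARAM_BITS
    residual_bits = 0.5 * n * math.log2(sse / n + 1.0)
    return model_bits + residual_bits


def choose_model(freqs):
    """Return the name of the candidate with the shortest description."""
    n = len(freqs)
    candidates = {
        "constant": description_length(sse_constant(freqs), n, 1),
        "linear": description_length(sse_linear(freqs), n, 2),
    }
    return min(candidates, key=candidates.get)
```

Under these assumed costs, a flat bucket keeps the cheaper constant model, while a bucket with a steep trend pays for the extra parameter because the residual cost drops by far more than the added model bits.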
Cite this article
Wang, H., Sevcik, K.C. Histograms based on the minimum description length principle. The VLDB Journal 17, 419–442 (2008). https://doi.org/10.1007/s00778-006-0015-0
DOI: https://doi.org/10.1007/s00778-006-0015-0