Consistent selectivity estimation via maximum entropy

Markl, V.; Haas, P. J.; Kutsch, M.; Megiddo, N.; Srivastava, U.; Tran, T. M.

doi:10.1007/s00778-006-0030-1

Special Issue Paper
Published: 15 September 2006

The VLDB Journal, Volume 16, pages 55–76 (2007)

V. Markl¹,
P. J. Haas¹,
M. Kutsch²,
N. Megiddo¹,
U. Srivastava³ &
T. M. Tran⁴

Abstract

Cost-based query optimizers need to estimate the selectivity of conjunctive predicates when comparing alternative query execution plans. To this end, advanced optimizers use multivariate statistics to improve information about the joint distribution of attribute values in a table. The joint distribution for all columns is almost always too large to store completely, and the resulting use of partial distribution information raises the possibility that multiple, non-equivalent selectivity estimates may be available for a given predicate. Current optimizers use cumbersome ad hoc methods to ensure that selectivities are estimated in a consistent manner. These methods ignore valuable information and tend to bias the optimizer toward query plans for which the least information is available, often yielding poor results. In this paper we present a novel method for consistent selectivity estimation based on the principle of maximum entropy (ME). Our method exploits all available information and avoids the bias problem. In the absence of detailed knowledge, the ME approach reduces to standard uniformity and independence assumptions. Experiments with our prototype implementation in DB2 UDB show that use of the ME approach can improve the optimizer’s cardinality estimates by orders of magnitude, resulting in better plan quality and significantly reduced query execution times. For almost all queries, these improvements are obtained while adding only tens of milliseconds to the overall time required for query optimization.
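As a rough illustration of the idea in the abstract, the sketch below fits a maximum-entropy distribution over the truth assignments ("atoms") of a few predicates, given known selectivities for some conjunctions. It uses plain iterative proportional fitting from a uniform start, which is an assumption of this sketch and not necessarily the algorithm used in the paper or in DB2 UDB. The function names `fit_max_entropy` and `selectivity`, the predicate indices, and all numeric selectivities are hypothetical.

```python
from itertools import product


def fit_max_entropy(n, known, sweeps=500):
    """Fit a maximum-entropy distribution over the 2**n atoms of n predicates.

    known maps a frozenset of predicate indices to the (hypothetical) known
    selectivity of their conjunction. Values must lie strictly in (0, 1) and
    must be mutually consistent, otherwise the loop will not settle.
    """
    for subset, s in known.items():
        if not subset or not 0.0 < s < 1.0:
            raise ValueError("need non-empty subsets with selectivity in (0, 1)")
    atoms = list(product((0, 1), repeat=n))
    # With no knowledge at all, the uniform distribution has maximum entropy.
    prob = {a: 1.0 / len(atoms) for a in atoms}
    for _ in range(sweeps):
        for subset, s in known.items():
            inside = {a for a in atoms if all(a[i] for i in subset)}
            mass = sum(prob[a] for a in inside)
            # Rescale both sides of the partition so the conjunction gets
            # probability s while the total probability stays 1.
            for a in atoms:
                if a in inside:
                    prob[a] *= s / mass
                else:
                    prob[a] *= (1.0 - s) / (1.0 - mass)
    return prob


def selectivity(prob, subset):
    """Selectivity of the conjunction of the predicates in subset."""
    return sum(p for a, p in prob.items() if all(a[i] for i in subset))


# Only single-predicate selectivities known: the estimate for the
# conjunction comes out close to the independence product 0.1 * 0.2.
singles = fit_max_entropy(2, {frozenset({0}): 0.1, frozenset({1}): 0.2})
print(selectivity(singles, {0, 1}))

# A joint statistic on predicates 0 and 1 is also known: every estimate is
# now derived from one distribution, so the estimates agree with each other.
joint = fit_max_entropy(
    3,
    {
        frozenset({0}): 0.1,
        frozenset({1}): 0.2,
        frozenset({2}): 0.25,
        frozenset({0, 1}): 0.05,
    },
)
print(selectivity(joint, {0, 1, 2}))
```

In the first call nothing is known beyond the individual selectivities, so the fitted distribution behaves like the usual independence assumption. In the second call the known joint selectivity for predicates 0 and 1 is honored, and predicate 2, about which no joint information exists, is treated as independent of the other two. The sketch enumerates all 2**n atoms, so it is only practical for a handful of predicates.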

Author information

Authors and Affiliations

IBM Almaden Research Center, San Jose, CA, USA
V. Markl, P. J. Haas & N. Megiddo
IBM Germany, Boeblingen, Germany
M. Kutsch
Stanford University, Stanford, CA, USA
U. Srivastava
IBM Silicon Valley Lab, San Jose, CA, USA
T. M. Tran

Corresponding author

Correspondence to P. J. Haas.

About this article

Cite this article

Markl, V., Haas, P.J., Kutsch, M. et al. Consistent selectivity estimation via maximum entropy. The VLDB Journal 16, 55–76 (2007). https://doi.org/10.1007/s00778-006-0030-1

Received: 15 January 2006
Accepted: 03 August 2006
Published: 15 September 2006
Issue Date: January 2007
DOI: https://doi.org/10.1007/s00778-006-0030-1
