
Consistent selectivity estimation via maximum entropy

  • Special Issue Paper
  • The VLDB Journal

Abstract

Cost-based query optimizers need to estimate the selectivity of conjunctive predicates when comparing alternative query execution plans. To this end, advanced optimizers use multivariate statistics to improve information about the joint distribution of attribute values in a table. The joint distribution for all columns is almost always too large to store completely, and the resulting use of partial distribution information raises the possibility that multiple, non-equivalent selectivity estimates may be available for a given predicate. Current optimizers use cumbersome ad hoc methods to ensure that selectivities are estimated in a consistent manner. These methods ignore valuable information and tend to bias the optimizer toward query plans for which the least information is available, often yielding poor results. In this paper we present a novel method for consistent selectivity estimation based on the principle of maximum entropy (ME). Our method exploits all available information and avoids the bias problem. In the absence of detailed knowledge, the ME approach reduces to standard uniformity and independence assumptions. Experiments with our prototype implementation in DB2 UDB show that use of the ME approach can improve the optimizer’s cardinality estimates by orders of magnitude, resulting in better plan quality and significantly reduced query execution times. For almost all queries, these improvements are obtained while adding only tens of milliseconds to the overall time required for query optimization.
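To make the idea concrete, here is a minimal sketch of maximum-entropy selectivity estimation for a small conjunctive predicate. It is not the authors' DB2 implementation: it assumes a plain iterative proportional scaling loop over the 2^n "atoms" (every true/false combination of the n predicates), and the names `max_entropy_atoms` and `selectivity` are hypothetical. It also assumes the known selectivities are mutually consistent and lie strictly between 0 and 1.

```python
from itertools import product


def max_entropy_atoms(n, known, iters=500):
    """Illustrative sketch only (hypothetical helper, not the paper's code).

    n     : number of predicates p0 .. p(n-1)
    known : dict mapping a frozenset of predicate indices to the known
            selectivity of the conjunction of those predicates
    Returns a dict mapping each atom (tuple of 0/1 flags, one per
    predicate) to its estimated probability.
    """
    atoms = list(product((0, 1), repeat=n))
    # With no knowledge at all, the maximum-entropy distribution is uniform.
    x = {a: 1.0 / len(atoms) for a in atoms}
    for _ in range(iters):
        for preds, s in known.items():
            # Atoms in which every predicate of this conjunction is true.
            inside = {a for a in atoms if all(a[i] for i in preds)}
            mass = sum(x[a] for a in inside)
            # Rescale both sides so the constraint holds and the total stays 1.
            for a in atoms:
                if a in inside:
                    x[a] *= s / mass
                else:
                    x[a] *= (1.0 - s) / (1.0 - mass)
    return x


def selectivity(x, preds):
    """Estimated selectivity of the conjunction of the predicates in preds."""
    return sum(p for a, p in x.items() if all(a[i] for i in preds))


if __name__ == "__main__":
    # Assumed example statistics: three single-column selectivities plus
    # one multivariate statistic on the pair (p0, p1).
    stats = {
        frozenset({0}): 0.10,
        frozenset({1}): 0.20,
        frozenset({2}): 0.25,
        frozenset({0, 1}): 0.05,
    }
    x = max_entropy_atoms(3, stats)
    print(selectivity(x, {0, 1, 2}))
```

In this sketch every estimate is read off a single fitted distribution, so two different ways of combining the same statistics cannot disagree. When only single-predicate selectivities are supplied, the result coincides with the independence assumption (for example, 0.1 and 0.2 give 0.02 for the pair), which matches the behaviour the abstract describes for the case where detailed knowledge is absent.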



Author information

Corresponding author

Correspondence to P. J. Haas.

About this article

Cite this article

Markl, V., Haas, P.J., Kutsch, M. et al. Consistent selectivity estimation via maximum entropy. The VLDB Journal 16, 55–76 (2007). https://doi.org/10.1007/s00778-006-0030-1

  • DOI: https://doi.org/10.1007/s00778-006-0030-1
