General Database Statistics Using Entropy Maximization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 5708))

Abstract

We propose a framework in which query sizes can be estimated from arbitrary statistical assertions on the data. In its most general form, a statistical assertion states that the size of the output of a conjunctive query over the data is a given number. A very simple example is a histogram, which makes assertions about the sizes of the outputs of several range queries. Our model also allows much more complex assertions that include joins and projections. To model such complex statistical assertions we propose to use the Entropy-Maximization (EM) probability distribution. In this model any consistent set of statistics has a precise semantics, and every query has a precise size estimate. We show that several classes of statistics can be solved in closed form.
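
The idea in the abstract can be illustrated with a minimal sketch, assuming each statistical assertion is read as a constraint on the expected output size of its query under a probability distribution over possible instances. The toy domain, the two histogram buckets with their target sizes, and the names `query_size`, `maxent_distribution`, and `estimate` are hypothetical illustrations and do not come from the paper. Brute-force enumeration with plain gradient steps stands in here for the closed-form solutions the paper derives, and is only practical for tiny domains.

```python
import itertools
import math


def query_size(instance, pred):
    """Size of the output of a selection query (given as a predicate) on an instance."""
    return sum(1 for t in instance if pred(t))


def maxent_distribution(domain, stats, steps=500, lr=0.5):
    """Brute-force maximum-entropy distribution over all instances (subsets of domain).

    stats is a list of (predicate, target_expected_size) pairs. The distribution has
    the exponential form P(I) proportional to exp(sum_j lam_j * |q_j(I)|); the
    multipliers lam_j are fitted by gradient steps on the expected sizes.
    """
    # Every subset of the domain is a possible instance.
    instances = [frozenset(c)
                 for r in range(len(domain) + 1)
                 for c in itertools.combinations(domain, r)]
    # counts[i][j] = output size of query j on instance i.
    counts = [[query_size(inst, pred) for pred, _ in stats] for inst in instances]

    def probabilities(lam):
        weights = [math.exp(sum(l * c for l, c in zip(lam, row))) for row in counts]
        z = sum(weights)  # normalizing constant
        return [w / z for w in weights]

    lam = [0.0] * len(stats)
    for _ in range(steps):
        probs = probabilities(lam)
        for j, (_, target) in enumerate(stats):
            expected = sum(p * row[j] for p, row in zip(probs, counts))
            lam[j] += lr * (target - expected)  # move expected size toward target
    return dict(zip(instances, probabilities(lam)))


def estimate(dist, pred):
    """Expected output size of a query under the fitted distribution."""
    return sum(p * query_size(inst, pred) for inst, p in dist.items())


# Hypothetical histogram: two range-query assertions over the values 0..5.
domain = list(range(6))
stats = [
    (lambda v: v <= 2, 2.0),  # assumed: bucket [0, 2] holds 2 tuples on average
    (lambda v: v >= 3, 0.5),  # assumed: bucket [3, 5] holds 0.5 tuples on average
]
dist = maxent_distribution(domain, stats)

# Size estimate for a range query that straddles both buckets.
size_estimate = estimate(dist, lambda v: v in (2, 3))
print(round(size_estimate, 4))
```

For this toy histogram the fitted distribution treats tuples independently and spreads each bucket's mass uniformly, so the straddling query is estimated at 2/3 + 1/6 = 5/6, matching the usual uniformity assumption within histogram buckets.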

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kaushik, R., Ré, C., Suciu, D. (2009). General Database Statistics Using Entropy Maximization. In: Gardner, P., Geerts, F. (eds) Database Programming Languages. DBPL 2009. Lecture Notes in Computer Science, vol 5708. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03793-1_6

  • DOI: https://doi.org/10.1007/978-3-642-03793-1_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03792-4

  • Online ISBN: 978-3-642-03793-1

  • eBook Packages: Computer Science, Computer Science (R0)
