Abstract
We propose a framework in which query sizes can be estimated from arbitrary statistical assertions on the data. In its most general form, a statistical assertion states that the size of the output of a conjunctive query over the data is a given number. A very simple example is a histogram, which makes assertions about the output sizes of several range queries. Our model also allows much more complex assertions that include joins and projections. To model such complex statistical assertions, we propose to use the Entropy-Maximization (EM) probability distribution. In this model, any consistent set of statistics has a precise semantics, and every query has a precise size estimate. We show that several classes of statistics can be solved in closed form.
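The histogram case mentioned in the abstract can be illustrated with a minimal sketch. It assumes, beyond what the abstract states, a unary relation over an integer domain, disjoint histogram buckets, and statistics read as expected output sizes; under those assumptions the entropy-maximizing distribution treats each possible tuple independently and gives every value in a bucket the same probability, namely the bucket count divided by the bucket width. The function names and the sample histogram are hypothetical and are not taken from the paper.

```python
from fractions import Fraction


def tuple_probabilities(buckets):
    """Map each domain value to its tuple probability.

    buckets: list of (lo, hi, expected_count) over disjoint integer
    ranges [lo, hi]. Assumption: with disjoint buckets, the maximum-entropy
    solution spreads each count uniformly across its bucket.
    """
    probs = {}
    for lo, hi, count in buckets:
        width = hi - lo + 1
        if count > width:
            # A bucket cannot be expected to hold more tuples than values.
            raise ValueError("inconsistent statistic: count exceeds bucket width")
        p = Fraction(count, width)
        for value in range(lo, hi + 1):
            probs[value] = p
    return probs


def estimate_range(probs, lo, hi):
    """Expected output size of the range query lo <= A <= hi."""
    return sum(probs.get(value, Fraction(0)) for value in range(lo, hi + 1))


# Hypothetical histogram: 5 tuples with A in [1, 10], 2 tuples with A in [11, 20].
HISTOGRAM = [(1, 10, 5), (11, 20, 2)]
PROBS = tuple_probabilities(HISTOGRAM)

# A query that straddles both buckets: 5 values at 1/2 plus 5 values at 1/5.
ESTIMATE = estimate_range(PROBS, 6, 15)
print(ESTIMATE)  # 7/2
```

Exact fractions are used so that the estimates reproduce the input statistics without rounding error. Statistics involving joins or projections do not decompose this way and would need the general machinery the paper develops.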
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kaushik, R., Ré, C., Suciu, D. (2009). General Database Statistics Using Entropy Maximization. In: Gardner, P., Geerts, F. (eds) Database Programming Languages. DBPL 2009. Lecture Notes in Computer Science, vol 5708. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03793-1_6
DOI: https://doi.org/10.1007/978-3-642-03793-1_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03792-4
Online ISBN: 978-3-642-03793-1
eBook Packages: Computer Science (R0)