article

Synopses for query optimization: A space-complexity perspective

Authors:
Raghav Kaushik

Microsoft Research, Redmond, WA

Microsoft Research, Redmond, WA
View Profile

,
Jeffrey F. Naughton

University of Wisconsin-Madison, Madison, WI

University of Wisconsin-Madison, Madison, WI
View Profile

,
Raghu Ramakrishnan

University of Wisconsin-Madison, Madison, WI

University of Wisconsin-Madison, Madison, WI
View Profile

,
Venkatesan T. Chakravarthy

IBM India Research Lab, New Delhi, India

IBM India Research Lab, New Delhi, India
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 30 Issue 4pp 1102–1127https://doi.org/10.1145/1114244.1114251

Published:01 December 2005Publication History

ACM Transactions on Database Systems

Abstract

Database systems use precomputed synopses of data to estimate the cost of alternative plans during query optimization. A number of alternative synopsis structures have been proposed, but histograms are by far the most commonly used. While histograms have proved to be very effective in (cost estimation for) single-table selections, queries with joins have long been seen as a challenge; under a model where histograms are maintained for individual tables, a celebrated result of Ioannidis and Christodoulakis [1991] observes that errors propagate exponentially with the number of joins in a query.In this article, we make two main contributions. First, we study the space complexity of using synopses for query optimization from a novel information-theoretic perspective. In particular, we offer evidence in support of histograms for single-table selections, including an analysis over data distributions known to be common in practice, and illustrate their limitations for join queries. Second, for a broad class of common queries involving joins (specifically, all queries involving only key-foreign key joins) we show that the strategy of storing a small precomputed sample of the database yields probabilistic guarantees that are almost space-optimal, which is an important property if these samples are to be used as database statistics. This is the first such optimality result, to our knowledge, and suggests that precomputed samples might be an effective way to circumvent the error propagation problem for queries with key-foreign key joins. We support this result empirically through an experimental study that demonstrates the effectiveness of precomputed samples, and also shows the increasing difference in the effectiveness of samples versus multidimensional histograms as the number of joins in the query grows.

References

Acharya, S., Gibbons, P., Poosala, V., and Ramaswamy, S. 1999. Join synopses for approximate query answering. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Alon, N., Gibbons, P., Matias, Y., and Szegedy, M. 1999. Tracking join and self-join sizes in limited storage. In Proceedings of the 11th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'99). ACM, New York. Google ScholarDigital Library
Alon, N., Matias, Y., and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the Symposium on Theory of Computing, ACM, New York. Google ScholarDigital Library
Bar-Yossef, Z. 2002. The complexity of massive data set computations. Ph.D. thesis, Department of Computer Science, University of California-Berkeley. Google ScholarDigital Library
Blake, C. and Merz, C. 1998. UCI repository of machine learning databases.Google Scholar
Bruno, N. and Chaudhuri, S. 2002. Statistics on query expressions. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Bruno, N., Chaudhuri, S., and Gravano, L. 2001. A multi-dimensional workload-aware histogram. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Chakrabarti, K., Garofalakis, M., Rastogi, R., and Shim, K. 2000. Approximate query answering using wavelets. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
Chatterji, S., Evani, S. S. K., Ganguly, S., and Yemmanuru, M. D. 2002. On the complexity of approximate query optimization. In Proceedings of the 11th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'99). ACM, New York. Google ScholarDigital Library
Chaudhuri, S. 1998. An overview of query optimization in relational systems. In Proceedings of the 11th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'99). ACM, New York. Google ScholarDigital Library
Chaudhuri, S., Motwani, R., and Narasayya, V. 1999. On random sampling over joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Chaudhuri, S., Motwani, R., and Narasayya, V. R. 1998. Random sampling for histogram construction: How much is enough? In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Christodoulakis, S. 1984. Implications of certain assumptions in databaser performance evaluation. Trans. Datab. Syst. Google ScholarDigital Library
Conte, S. D. and de Boor, C. 1972. Elementary Numerical Analysis: An Algorithmic Approach. McGraw Hill, New York. Google ScholarDigital Library
Dobra, A., Garofalakis, M., Gehrke, J. E., and Rastogi, R. 2002. Processing complex aggregate queries over data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Faloutsos, C. and Jagadish, H. 1992. On B-tree indices for skewed distributions. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
Gasser, T., Engel, J., and Seifert, B. 1985. Non-parametric density estimation. Ann. Stat.Google Scholar
Gibbons, P., Matias, Y., and Poosala, V. 1997. Fast incremental maintenance of approximate histograms. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
Gunopulos, D., Kollios, G., Tsotras, V., and Domeniconi, C. 2000. Approximating multidimensional aggregate range queries over real attributes. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Haas, P. and Swami, A. 1992. Sequential sampling procedures for query size estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Hastad, J., Guruswami, V., Sudan, M., and Zuckerman, D. 2001. Combinatorial bounds for list decoding. In IEEE Trans. Inf. Theory. Google ScholarDigital Library
Ibaraki, T. and Kameda, T. 1984. On the optimal nesting order of computing n-relational joins. Trans. Datab. Syst. Google ScholarDigital Library
Ioannidis, Y. E. 1993. Universality of serial histograms. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
Ioannidis, Y. E. 2003. The history of histograms. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
Ioannidis, Y. E. and Christodoulakis, S. 1991. On the propagation of errors in the size of join results. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Ioannidis, Y. E. and Christodoulakis, S. 1993. Optimal histograms for limiting worst-case error propagation in the size of join results. Trans. Datab. Syst. Google ScholarDigital Library
Jagadish, H., Poosala, V., Koudas, N., Sevcik, K., Muthukrishnan, S., and Suel, T. 1998. Optimal histograms with quality guarantees. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
Koudas, N., Muthukrishnan, S., and Srivastava, D. 2000. Optimal histograms for hierarchical range queries. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, New York. Google ScholarDigital Library
Kushilevitz, E. and Nisan, N. 1997. Communication Complexity. Cambridge University Press. Google ScholarDigital Library
Lipton, R., Naughton, J., and Schneider, D. 1990. Practical selectivity estimation through adaptive sampling. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Muralikrishna, M. and DeWitt, D. 1988. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Olken, F. and Rotem, D. 1986. Simple random sampling from relational databases. In Proceedings of the International Conference on Very Large Databases. Google ScholarDigital Library
Piatetsky-Shapiro, G. and Connell, C. 1984. Accurate estimation of the number of tuples satisfying a condition. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Poosala, V. and Ioannidis, Y. E. 1995. Balancing histogram optimality and practicality for query result size estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Poosala, V., Ioannidis, Y. E., Haas, P. J., and Shekita, E. J. 1996. Improved histograms for selectivity estimation of range predicates. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., and Price, T. G. 1979. Access path selection in a relational DBMS. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
Vitter, J. and Wang, M. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarDigital Library
William, S., Press, H., Flannery, B., and Vetterling, W. 1993. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press. Google ScholarDigital Library
Yao, A. 1979. Some complexity questions related to distributive computing. In Proceedings of the ACM Symposium on Theory of Computing, ACM, New York. Google ScholarDigital Library

Index Terms

Synopses for query optimization: A space-complexity perspective
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data layout
        Data compression
    2. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Recommendations

Synopses for query optimization: a space-complexity perspective
PODS '04: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Database systems use precomputed synopses of data to estimate the cost of alternative plans during query optimization. A number of alternative synopsis structures have been proposed, but histograms are by far the most commonly used. While histograms ...
Read More
Join synopses for approximate query answering

In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good ...
Read More
Join synopses for approximate query answering
SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data

In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data. In this paper, we demonstrate the difficulty of providing good ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Database Systems Volume 30, Issue 4
Special Issue: SIGMOD/PODS 2004
December 2005
241 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1114244
Issue’s Table of Contents

Copyright © 2005 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 December 2005
Published in tods Volume 30, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Cardinality estimation
histograms
sampling
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 8
  Total Citations
  View Citations
- 685
  Total Downloads
- Downloads (Last 12 months)3
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Synopses for query optimization: A space-complexity perspective

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Synopses for query optimization: a space-complexity perspective

Join synopses for approximate query answering

Join synopses for approximate query answering

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Synopses for query optimization: A space-complexity perspective

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Synopses for query optimization: a space-complexity perspective

Join synopses for approximate query answering

Join synopses for approximate query answering

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media