skip to main content
article

Synopses for query optimization: A space-complexity perspective

Published:01 December 2005Publication History
Skip Abstract Section

Abstract

Database systems use precomputed synopses of data to estimate the cost of alternative plans during query optimization. A number of alternative synopsis structures have been proposed, but histograms are by far the most commonly used. While histograms have proved to be very effective in (cost estimation for) single-table selections, queries with joins have long been seen as a challenge; under a model where histograms are maintained for individual tables, a celebrated result of Ioannidis and Christodoulakis [1991] observes that errors propagate exponentially with the number of joins in a query.In this article, we make two main contributions. First, we study the space complexity of using synopses for query optimization from a novel information-theoretic perspective. In particular, we offer evidence in support of histograms for single-table selections, including an analysis over data distributions known to be common in practice, and illustrate their limitations for join queries. Second, for a broad class of common queries involving joins (specifically, all queries involving only key-foreign key joins) we show that the strategy of storing a small precomputed sample of the database yields probabilistic guarantees that are almost space-optimal, which is an important property if these samples are to be used as database statistics. This is the first such optimality result, to our knowledge, and suggests that precomputed samples might be an effective way to circumvent the error propagation problem for queries with key-foreign key joins. We support this result empirically through an experimental study that demonstrates the effectiveness of precomputed samples, and also shows the increasing difference in the effectiveness of samples versus multidimensional histograms as the number of joins in the query grows.

References

  1. Acharya, S., Gibbons, P., Poosala, V., and Ramaswamy, S. 1999. Join synopses for approximate query answering. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Alon, N., Gibbons, P., Matias, Y., and Szegedy, M. 1999. Tracking join and self-join sizes in limited storage. In Proceedings of the 11th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'99). ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Alon, N., Matias, Y., and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the Symposium on Theory of Computing, ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Bar-Yossef, Z. 2002. The complexity of massive data set computations. Ph.D. thesis, Department of Computer Science, University of California-Berkeley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Blake, C. and Merz, C. 1998. UCI repository of machine learning databases.Google ScholarGoogle Scholar
  6. Bruno, N. and Chaudhuri, S. 2002. Statistics on query expressions. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Bruno, N., Chaudhuri, S., and Gravano, L. 2001. A multi-dimensional workload-aware histogram. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Chakrabarti, K., Garofalakis, M., Rastogi, R., and Shim, K. 2000. Approximate query answering using wavelets. In Proceedings of the International Conference on Very Large Databases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Chatterji, S., Evani, S. S. K., Ganguly, S., and Yemmanuru, M. D. 2002. On the complexity of approximate query optimization. In Proceedings of the 11th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'99). ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Chaudhuri, S. 1998. An overview of query optimization in relational systems. In Proceedings of the 11th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'99). ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Chaudhuri, S., Motwani, R., and Narasayya, V. 1999. On random sampling over joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Chaudhuri, S., Motwani, R., and Narasayya, V. R. 1998. Random sampling for histogram construction: How much is enough? In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Christodoulakis, S. 1984. Implications of certain assumptions in databaser performance evaluation. Trans. Datab. Syst. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Conte, S. D. and de Boor, C. 1972. Elementary Numerical Analysis: An Algorithmic Approach. McGraw Hill, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Dobra, A., Garofalakis, M., Gehrke, J. E., and Rastogi, R. 2002. Processing complex aggregate queries over data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Faloutsos, C. and Jagadish, H. 1992. On B-tree indices for skewed distributions. In Proceedings of the International Conference on Very Large Databases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Gasser, T., Engel, J., and Seifert, B. 1985. Non-parametric density estimation. Ann. Stat.Google ScholarGoogle Scholar
  18. Gibbons, P., Matias, Y., and Poosala, V. 1997. Fast incremental maintenance of approximate histograms. In Proceedings of the International Conference on Very Large Databases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Gunopulos, D., Kollios, G., Tsotras, V., and Domeniconi, C. 2000. Approximating multidimensional aggregate range queries over real attributes. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Haas, P. and Swami, A. 1992. Sequential sampling procedures for query size estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Hastad, J., Guruswami, V., Sudan, M., and Zuckerman, D. 2001. Combinatorial bounds for list decoding. In IEEE Trans. Inf. Theory. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Ibaraki, T. and Kameda, T. 1984. On the optimal nesting order of computing n-relational joins. Trans. Datab. Syst. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Ioannidis, Y. E. 1993. Universality of serial histograms. In Proceedings of the International Conference on Very Large Databases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Ioannidis, Y. E. 2003. The history of histograms. In Proceedings of the International Conference on Very Large Databases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Ioannidis, Y. E. and Christodoulakis, S. 1991. On the propagation of errors in the size of join results. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Ioannidis, Y. E. and Christodoulakis, S. 1993. Optimal histograms for limiting worst-case error propagation in the size of join results. Trans. Datab. Syst. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Jagadish, H., Poosala, V., Koudas, N., Sevcik, K., Muthukrishnan, S., and Suel, T. 1998. Optimal histograms with quality guarantees. In Proceedings of the International Conference on Very Large Databases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Koudas, N., Muthukrishnan, S., and Srivastava, D. 2000. Optimal histograms for hierarchical range queries. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Kushilevitz, E. and Nisan, N. 1997. Communication Complexity. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Lipton, R., Naughton, J., and Schneider, D. 1990. Practical selectivity estimation through adaptive sampling. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Muralikrishna, M. and DeWitt, D. 1988. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Olken, F. and Rotem, D. 1986. Simple random sampling from relational databases. In Proceedings of the International Conference on Very Large Databases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Piatetsky-Shapiro, G. and Connell, C. 1984. Accurate estimation of the number of tuples satisfying a condition. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Poosala, V. and Ioannidis, Y. E. 1995. Balancing histogram optimality and practicality for query result size estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Poosala, V., Ioannidis, Y. E., Haas, P. J., and Shekita, E. J. 1996. Improved histograms for selectivity estimation of range predicates. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., and Price, T. G. 1979. Access path selection in a relational DBMS. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Vitter, J. and Wang, M. 1999. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. William, S., Press, H., Flannery, B., and Vetterling, W. 1993. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Yao, A. 1979. Some complexity questions related to distributive computing. In Proceedings of the ACM Symposium on Theory of Computing, ACM, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Synopses for query optimization: A space-complexity perspective

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image ACM Transactions on Database Systems
          ACM Transactions on Database Systems  Volume 30, Issue 4
          Special Issue: SIGMOD/PODS 2004
          December 2005
          241 pages
          ISSN:0362-5915
          EISSN:1557-4644
          DOI:10.1145/1114244
          Issue’s Table of Contents

          Copyright © 2005 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 1 December 2005
          Published in tods Volume 30, Issue 4

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader