skip to main content
10.1145/3448016.3452805acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Small Selectivities Matter: Lifting the Burden of Empty Samples

Published: 18 June 2021 Publication History

Abstract

Every year more and more advanced approaches to cardinality estimation are published, using learned models or other data and workload specific synopses. In contrast, the majority of commercial in-memory systems still relies on sampling. It is arguably the most general and easiest estimator to implement. While most methods do not seem to improve much over sampling-based estimators in the presence of non-selective queries, sampling struggles with highly selective queries due to limitations of the sample size. Especially in situations where no sample tuple qualifies, optimizers fall back to basic heuristics that ignore attribute correlations and lead to large estimation errors. In this work, we present a novel approach, dealing with these 0-Tuple Situations. It is ready to use in any DBMS capable of sampling, showing a negligible impact on optimization time. Our experiments on real world and synthetic data sets demonstrate up to two orders of magnitude reduced estimation errors. Enumerating single filter predicates according to our estimates reveals 1.3 to 1.8 times faster query responses for complex filters.

Supplementary Material

MP4 File (3448016.3452805.mp4)
Every year more and more advanced approaches to cardinality estimation are published, using learned models or other data and workload specific synopses. In contrast, the majority of commercial in-memory systems still relies on sampling. It is arguably the most general and easiest estimator to implement. While most methods do not seem to improve much over sampling-based estimators in the presence of non-selective queries, sampling struggles with highly selective queries due to limitations of the sample size. Especially in situations where no sample tuple qualifies, optimizers fall back to basic heuristics that ignore attribute correlations and lead to large estimation errors. In this work, we present a novel approach, dealing with these 0-Tuple Situations. It is ready to use in any DBMS capable of sampling, showing a negligible impact on optimization time. Our experiments on real world and synthetic data sets demonstrate up to two orders of magnitude reduced estimation errors. Enumerating single filter predicates according to our estimates reveals 1.3 to 1.8 times faster query responses for complex filters.

References

[1]
[2]
Boost library . https://www.boost.org. Accessed: 2020-07-02.
[3]
Public Bi Benchmark. https://github.com/peterboncz/public_bi_benchmark-1. Accessed: 2020-05--21.
[4]
B. Babcock and S. Chaudhuri. Towards a robust query optimizer: a principled and practical approach. In SIGMOD, pages 119--130. ACM, 2005.
[5]
S. Babu, R. Motwani, K. Munagala, I. Nishizawa, and J. Widom. Adaptive ordering of pipelined stream filters. In SIGMOD, pages 407--418. ACM, 2004.
[6]
J. M. Bernardo and A. F. Smith. Bayesian theory, volume 405. John Wiley & Sons, 2009.
[7]
S. Bianco, R. Cadene, L. Celona, and P. Napoletano. Benchmark analysis of representative deep neural network architectures. IEEE Access, 6:64270--64277, 2018.
[8]
A. Birler, B. Radke, and T. Neumann. Concurrent online sampling for all, for free. In DaMoN, pages 5:1--5:8. ACM, 2020.
[9]
R. P. Brent. An algorithm with guaranteed convergence for finding a zero of a function. The Computer Journal, 14(4):422--425, 1971.
[10]
N. Bruno, S. Chaudhuri, and L. Gravano. STHoles: A multidimensional workload-aware histogram. In SIGMOD, pages 211--222. ACM, 2001.
[11]
S. Chakkappen, S. Budalakoti, R. Krishnamachari, S. R. Valluri, A. Wood, and M. Zait. Adaptive statistics in Oracle 12C. In PVLDB, volume 10, pages 1813--1824. VLDB Endowment, 2017.
[12]
G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1--3):1--294, 2011.
[13]
F. M. Dekking, C. Kraaikamp, H. P. Lopuha"a, and L. E. Meester. In A Modern Introduction to Probability and Statistics: Understanding why and how, pages 45--47. Springer Science & Business Media, 2005.
[14]
A. Dutt, C. Wang, A. Nazi, S. Kandula, V. Narasayya, and S. Chaudhuri. Selectivity estimation for range predicates using lightweight models. In PVLDB, volume 12, pages 1044--1057. VLDB Endowment, 2019.
[15]
F. F"a rber, N. May, W. Lehner, P. Große, I. Mü ller, H. Rauhe, and J. Dees. The SAP HANA database -- an architecture overview. IEEE Data Eng. Bull., 35(1):28--33, 2012.
[16]
M. Heimel, M. Kiefer, and V. Markl. Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation. In SIGMOD, pages 1477--1492. ACM, 2015.
[17]
J. M. Hellerstein. Optimization techniques for queries with expensive methods. In TODS, volume 23, page 121. ACM, 1998.
[18]
A. Hertzschuch, C. Hartmann, D. Habich, and W. Lehner. Simplicity done right for join ordering. CIDR, 2021.
[19]
Y. Ioannidis. The history of histograms (abridged). In PVLDB, pages 19--30. VLDB Endowment, 2003.
[20]
R. Kaas and J. M. Buhrman. Mean, median and mode in binomial distributions. Statistica Neerlandica, 34(1):13--18, 1980.
[21]
F. Kastrati and G. Moerkotte. Optimization of conjunctive predicates for main memory column stores. In PVLDB, volume 9, pages 1125--1136. VLDB Endowment, 2016.
[22]
J. Kerman. A closed-form approximation for the median of the beta distribution. arXiv preprint arXiv:1111.0433, 2011.
[23]
A. Kipf, T. Kipf, B. Radke, V. Leis, P. A. Boncz, and A. Kemper. Learned cardinalities: Estimating correlated joins with deep learning. CIDR, 2019.
[24]
A. Kipf, D. Vorona, J. Mü ller, T. Kipf, B. Radke, V. Leis, P. A. Boncz, T. Neumann, and A. Kemper. Estimating cardinalities with deep sketches. In SIGMOD, pages 1937--1940. ACM, 2019.
[25]
P.-A. Larson, W. Lehner, J. Zhou, and P. Zabback. Cardinality estimation using sample views with quality assurance. In SIGMOD, pages 175--186. ACM, 2007.
[26]
V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann. How good are query optimizers, really? In PVLDB, volume 9, pages 204--215. VLDB Endowment, 2015.
[27]
V. Leis, B. Radke, A. Gubichev, A. Kemper, and T. Neumann. Cardinality estimation done right: Index-based join sampling. In CIDR, 2017.
[28]
M. Lichman et al. UCI machine learning repository, 2013.
[29]
G. Lohmann. Is query optimization a "solved" problem? https://wp.sigmod.org/?p=1075, 2014. Accessed: 2019-09--23.
[30]
V. Markl, P. J. Haas, M. Kutsch, N. Megiddo, U. Srivastava, and T. M. Tran. Consistent selectivity estimation via maximum entropy. The VLDB journal, 16(1):55--76, 2007.
[31]
M. J. Menne, I. Durre, R. S. Vose, B. E. Gleason, and T. G. Houston. An overview of the global historical climatology network-daily database. Journal of Atmospheric and Oceanic Technology, 29(7):897--910, 2012.
[32]
G. Moerkotte, D. DeHaan, N. May, A. Nica, and A. Boehm. Exploiting ordered dictionaries to efficiently construct histograms with q-error guarantees in SAP HANA. In SIGMOD, pages 361--372. ACM, 2014.
[33]
G. Moerkotte and A. Hertzschuch. α to $ømega$: the g(r)eek alphabet of sampling. CIDR, 2020.
[34]
G. Moerkotte, T. Neumann, and G. Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. In PVLDB, volume 2, pages 982--993. VLDB Endowment, 2009.
[35]
MonetDB Team. https://www.monetdb.org/. Accessed: 2020--10-05.
[36]
M. Müller, G. Moerkotte, and O. Kolb. Improved selectivity estimation by combining knowledge from sampling and synopses. In PVLDB, volume 11, pages 1016--1028. VLDB Endowment, 2018.
[37]
Reference for Oracle DB 18c. https://docs.oracle.com/en/database/oracle/oracle-database/18/tgsql/options-for-optimizer-statistics-gathering.html#GUID-DEE2AF8B-5F4B-4FE7--9F0E-7D188921EBCC. Accessed: 2020-07-02.
[38]
V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In PVLDB. VLDB Endowment, 1997.
[39]
J. A. Rice. Mathematical statistics and data analysis. Cengage Learning, 2006.
[40]
H. Robbins. A remark on stirling's formula. The American mathematical monthly, 62(1):26--29, 1955.
[41]
Reference for SAP HANA Platform 2.0 SPS 04. https://help.sap.com/viewer/4fe29514fd584807ac9f2a04f6754767/2.0.02/en-US/4ba9edce1f2347a0b9fcda99879c17a1.html. Accessed: 2020-07-02.
[42]
D. W. Scott. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.
[43]
R. Sherkat, C. Florendo, M. Andrei, R. Blanco, A. Dragusanu, A. Pathak, P. Khadilkar, N. Kulkarni, C. Lemke, S. Seifert, et al. Native store extension for sap hana. PVLDB, 12(12):2047--2058, 2019.
[44]
The new and improved cardinality estimator in SQL Server 2014. https://cloudblogs.microsoft.com/sqlserver/2014/03/17/the-new-and-improved-cardinality-estimator-in-sql-server-2014/. Accessed: 2020-07-02.
[45]
J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37--57, 1985.
[46]
A. Vogelsgesang, M. Haubenschild, J. Finis, A. Kemper, V. Leis, T. Mü hlbauer, T. Neumann, and M. Then. Get real: How benchmarks fail to represent the real world. In A. Bö hm and T. Rabl, editors, DBTest, pages 1:1--1:6. ACM, 2018.
[47]
L. Woltmann, C. Hartmann, M. Thiele, D. Habich, and W. Lehner. Cardinality estimation with local deep learning models. In AiDM. ACM, 2019.
[48]
Z. Yang, E. Liang, A. Kamsetty, C. Wu, Y. Duan, X. Chen, P. Abbeel, J. M. Hellerstein, S. Krishnan, and I. Stoica. Deep unsupervised cardinality estimation. In PVLDB, volume 13, pages 279--292. VLDB Endowment, 2019.
[49]
X. Yu, N. Koudas, and C. Zuzarte. Hase: A hybrid approach to selectivity estimation for conjunctive predicates. In EDBT, pages 460--477. Springer, 2006.
[50]
H. .Zoładek et al. The topological proof of abel-ruffini theorem. Topological Methods in Nonlinear Analysis, 16(2):253--265, 2000.

Cited By

View all
  • (2024)PARQO: Penalty-Aware Robust Plan Selection in Query OptimizationProceedings of the VLDB Endowment10.14778/3704965.370497117:13(4627-4640)Online publication date: 1-Sep-2024
  • (2024)Accurate Sampling-Based Cardinality Estimation for Complex Graph QueriesACM Transactions on Database Systems10.1145/368920949:3(1-46)Online publication date: 17-Sep-2024
  • (2024)Robust Query Optimization in the Era of Machine Learning: State-of-the-Art and Future Directions2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00408(5371-5375)Online publication date: 13-May-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN:9781450383431
DOI:10.1145/3448016
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2021

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. beta distribution
  2. filter predicate ordering
  3. in-memory
  4. olap
  5. sampling
  6. small selectivity

Qualifiers

  • Research-article

Funding Sources

Conference

SIGMOD/PODS '21
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)50
  • Downloads (Last 6 weeks)7
Reflects downloads up to 05 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2024)PARQO: Penalty-Aware Robust Plan Selection in Query OptimizationProceedings of the VLDB Endowment10.14778/3704965.370497117:13(4627-4640)Online publication date: 1-Sep-2024
  • (2024)Accurate Sampling-Based Cardinality Estimation for Complex Graph QueriesACM Transactions on Database Systems10.1145/368920949:3(1-46)Online publication date: 17-Sep-2024
  • (2024)Robust Query Optimization in the Era of Machine Learning: State-of-the-Art and Future Directions2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00408(5371-5375)Online publication date: 13-May-2024
  • (2022)Turbo-charging SPJ query plans with learned physical join operator selectionsProceedings of the VLDB Endowment10.14778/3551793.355182515:11(2706-2718)Online publication date: 1-Jul-2022

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media