research-article

Small Selectivities Matter: Lifting the Burden of Empty Samples

Authors:

Axel Hertzschuch,

Guido Moerkotte,

Wolfgang Lehner,

Lars FrickeAuthors Info & Claims

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

Pages 697 - 709

https://doi.org/10.1145/3448016.3452805

Published: 18 June 2021 Publication History

Abstract

Every year more and more advanced approaches to cardinality estimation are published, using learned models or other data and workload specific synopses. In contrast, the majority of commercial in-memory systems still relies on sampling. It is arguably the most general and easiest estimator to implement. While most methods do not seem to improve much over sampling-based estimators in the presence of non-selective queries, sampling struggles with highly selective queries due to limitations of the sample size. Especially in situations where no sample tuple qualifies, optimizers fall back to basic heuristics that ignore attribute correlations and lead to large estimation errors. In this work, we present a novel approach, dealing with these 0-Tuple Situations. It is ready to use in any DBMS capable of sampling, showing a negligible impact on optimization time. Our experiments on real world and synthetic data sets demonstrate up to two orders of magnitude reduced estimation errors. Enumerating single filter predicates according to our estimates reveals 1.3 to 1.8 times faster query responses for complex filters.

Supplementary Material

MP4 File (3448016.3452805.mp4)

Every year more and more advanced approaches to cardinality estimation are published, using learned models or other data and workload specific synopses. In contrast, the majority of commercial in-memory systems still relies on sampling. It is arguably the most general and easiest estimator to implement. While most methods do not seem to improve much over sampling-based estimators in the presence of non-selective queries, sampling struggles with highly selective queries due to limitations of the sample size. Especially in situations where no sample tuple qualifies, optimizers fall back to basic heuristics that ignore attribute correlations and lead to large estimation errors. In this work, we present a novel approach, dealing with these 0-Tuple Situations. It is ready to use in any DBMS capable of sampling, showing a negligible impact on optimization time. Our experiments on real world and synthetic data sets demonstrate up to two orders of magnitude reduced estimation errors. Enumerating single filter predicates according to our estimates reveals 1.3 to 1.8 times faster query responses for complex filters.

Download
1024.01 MB

References

[1]

Ctexttt

[2]

Boost library . https://www.boost.org. Accessed: 2020-07-02.

[3]

Public Bi Benchmark. https://github.com/peterboncz/public_bi_benchmark-1. Accessed: 2020-05--21.

[4]

B. Babcock and S. Chaudhuri. Towards a robust query optimizer: a principled and practical approach. In SIGMOD, pages 119--130. ACM, 2005.

Digital Library

[5]

S. Babu, R. Motwani, K. Munagala, I. Nishizawa, and J. Widom. Adaptive ordering of pipelined stream filters. In SIGMOD, pages 407--418. ACM, 2004.

Digital Library

[6]

J. M. Bernardo and A. F. Smith. Bayesian theory, volume 405. John Wiley & Sons, 2009.

[7]

S. Bianco, R. Cadene, L. Celona, and P. Napoletano. Benchmark analysis of representative deep neural network architectures. IEEE Access, 6:64270--64277, 2018.

[8]

A. Birler, B. Radke, and T. Neumann. Concurrent online sampling for all, for free. In DaMoN, pages 5:1--5:8. ACM, 2020.

[9]

R. P. Brent. An algorithm with guaranteed convergence for finding a zero of a function. The Computer Journal, 14(4):422--425, 1971.

[10]

N. Bruno, S. Chaudhuri, and L. Gravano. STHoles: A multidimensional workload-aware histogram. In SIGMOD, pages 211--222. ACM, 2001.

Digital Library

[11]

S. Chakkappen, S. Budalakoti, R. Krishnamachari, S. R. Valluri, A. Wood, and M. Zait. Adaptive statistics in Oracle 12C. In PVLDB, volume 10, pages 1813--1824. VLDB Endowment, 2017.

Digital Library

[12]

G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1--3):1--294, 2011.

[13]

F. M. Dekking, C. Kraaikamp, H. P. Lopuha"a, and L. E. Meester. In A Modern Introduction to Probability and Statistics: Understanding why and how, pages 45--47. Springer Science & Business Media, 2005.

[14]

A. Dutt, C. Wang, A. Nazi, S. Kandula, V. Narasayya, and S. Chaudhuri. Selectivity estimation for range predicates using lightweight models. In PVLDB, volume 12, pages 1044--1057. VLDB Endowment, 2019.

Digital Library

[15]

F. F"a rber, N. May, W. Lehner, P. Große, I. Mü ller, H. Rauhe, and J. Dees. The SAP HANA database -- an architecture overview. IEEE Data Eng. Bull., 35(1):28--33, 2012.

[16]

M. Heimel, M. Kiefer, and V. Markl. Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation. In SIGMOD, pages 1477--1492. ACM, 2015.

Digital Library

[17]

J. M. Hellerstein. Optimization techniques for queries with expensive methods. In TODS, volume 23, page 121. ACM, 1998.

[18]

A. Hertzschuch, C. Hartmann, D. Habich, and W. Lehner. Simplicity done right for join ordering. CIDR, 2021.

[19]

Y. Ioannidis. The history of histograms (abridged). In PVLDB, pages 19--30. VLDB Endowment, 2003.

[20]

R. Kaas and J. M. Buhrman. Mean, median and mode in binomial distributions. Statistica Neerlandica, 34(1):13--18, 1980.

[21]

F. Kastrati and G. Moerkotte. Optimization of conjunctive predicates for main memory column stores. In PVLDB, volume 9, pages 1125--1136. VLDB Endowment, 2016.

Digital Library

[22]

J. Kerman. A closed-form approximation for the median of the beta distribution. arXiv preprint arXiv:1111.0433, 2011.

[23]

A. Kipf, T. Kipf, B. Radke, V. Leis, P. A. Boncz, and A. Kemper. Learned cardinalities: Estimating correlated joins with deep learning. CIDR, 2019.

[24]

A. Kipf, D. Vorona, J. Mü ller, T. Kipf, B. Radke, V. Leis, P. A. Boncz, T. Neumann, and A. Kemper. Estimating cardinalities with deep sketches. In SIGMOD, pages 1937--1940. ACM, 2019.

Digital Library

[25]

P.-A. Larson, W. Lehner, J. Zhou, and P. Zabback. Cardinality estimation using sample views with quality assurance. In SIGMOD, pages 175--186. ACM, 2007.

Digital Library

[26]

V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann. How good are query optimizers, really? In PVLDB, volume 9, pages 204--215. VLDB Endowment, 2015.

Digital Library

[27]

V. Leis, B. Radke, A. Gubichev, A. Kemper, and T. Neumann. Cardinality estimation done right: Index-based join sampling. In CIDR, 2017.

[28]

M. Lichman et al. UCI machine learning repository, 2013.

[29]

G. Lohmann. Is query optimization a "solved" problem? https://wp.sigmod.org/?p=1075, 2014. Accessed: 2019-09--23.

[30]

V. Markl, P. J. Haas, M. Kutsch, N. Megiddo, U. Srivastava, and T. M. Tran. Consistent selectivity estimation via maximum entropy. The VLDB journal, 16(1):55--76, 2007.

Digital Library

[31]

M. J. Menne, I. Durre, R. S. Vose, B. E. Gleason, and T. G. Houston. An overview of the global historical climatology network-daily database. Journal of Atmospheric and Oceanic Technology, 29(7):897--910, 2012.

[32]

G. Moerkotte, D. DeHaan, N. May, A. Nica, and A. Boehm. Exploiting ordered dictionaries to efficiently construct histograms with q-error guarantees in SAP HANA. In SIGMOD, pages 361--372. ACM, 2014.

Digital Library

[33]

G. Moerkotte and A. Hertzschuch. α to $ømega$: the g(r)eek alphabet of sampling. CIDR, 2020.

[34]

G. Moerkotte, T. Neumann, and G. Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. In PVLDB, volume 2, pages 982--993. VLDB Endowment, 2009.

Digital Library

[35]

MonetDB Team. https://www.monetdb.org/. Accessed: 2020--10-05.

[36]

M. Müller, G. Moerkotte, and O. Kolb. Improved selectivity estimation by combining knowledge from sampling and synopses. In PVLDB, volume 11, pages 1016--1028. VLDB Endowment, 2018.

Digital Library

[37]

Reference for Oracle DB 18c. https://docs.oracle.com/en/database/oracle/oracle-database/18/tgsql/options-for-optimizer-statistics-gathering.html#GUID-DEE2AF8B-5F4B-4FE7--9F0E-7D188921EBCC. Accessed: 2020-07-02.

[38]

V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In PVLDB. VLDB Endowment, 1997.

[39]

J. A. Rice. Mathematical statistics and data analysis. Cengage Learning, 2006.

[40]

H. Robbins. A remark on stirling's formula. The American mathematical monthly, 62(1):26--29, 1955.

[41]

Reference for SAP HANA Platform 2.0 SPS 04. https://help.sap.com/viewer/4fe29514fd584807ac9f2a04f6754767/2.0.02/en-US/4ba9edce1f2347a0b9fcda99879c17a1.html. Accessed: 2020-07-02.

[42]

D. W. Scott. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.

[43]

R. Sherkat, C. Florendo, M. Andrei, R. Blanco, A. Dragusanu, A. Pathak, P. Khadilkar, N. Kulkarni, C. Lemke, S. Seifert, et al. Native store extension for sap hana. PVLDB, 12(12):2047--2058, 2019.

Digital Library

[44]

The new and improved cardinality estimator in SQL Server 2014. https://cloudblogs.microsoft.com/sqlserver/2014/03/17/the-new-and-improved-cardinality-estimator-in-sql-server-2014/. Accessed: 2020-07-02.

[45]

J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37--57, 1985.

Digital Library

[46]

A. Vogelsgesang, M. Haubenschild, J. Finis, A. Kemper, V. Leis, T. Mü hlbauer, T. Neumann, and M. Then. Get real: How benchmarks fail to represent the real world. In A. Bö hm and T. Rabl, editors, DBTest, pages 1:1--1:6. ACM, 2018.

[47]

L. Woltmann, C. Hartmann, M. Thiele, D. Habich, and W. Lehner. Cardinality estimation with local deep learning models. In AiDM. ACM, 2019.

Digital Library

[48]

Z. Yang, E. Liang, A. Kamsetty, C. Wu, Y. Duan, X. Chen, P. Abbeel, J. M. Hellerstein, S. Krishnan, and I. Stoica. Deep unsupervised cardinality estimation. In PVLDB, volume 13, pages 279--292. VLDB Endowment, 2019.

Digital Library

[49]

X. Yu, N. Koudas, and C. Zuzarte. Hase: A hybrid approach to selectivity estimation for conjunctive predicates. In EDBT, pages 460--477. Springer, 2006.

Digital Library

[50]

H. .Zoładek et al. The topological proof of abel-ruffini theorem. Topological Methods in Nonlinear Analysis, 16(2):253--265, 2000.

Cited By

Xiu HAgarwal PYang J(2024)PARQO: Penalty-Aware Robust Plan Selection in Query OptimizationProceedings of the VLDB Endowment10.14778/3704965.370497117:13(4627-4640)Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.14778/3704965.3704971
Hu PMotik B(2024)Accurate Sampling-Based Cardinality Estimation for Complex Graph QueriesACM Transactions on Database Systems10.1145/368920949:3(1-46)Online publication date: 17-Sep-2024
https://dl.acm.org/doi/10.1145/3689209
Kamali AKantere VZuzarte C(2024)Robust Query Optimization in the Era of Machine Learning: State-of-the-Art and Future Directions2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00408(5371-5375)Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00408
Show More Cited By

Index Terms

Small Selectivities Matter: Lifting the Burden of Empty Samples
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
        Query optimization

Recommendations

Sampling-Based AQP in Modern Analytical Engines
DaMoN '22: Proceedings of the 18th International Workshop on Data Management on New Hardware

As the data volume grows, reducing the query execution times remains an elusive goal. While approximate query processing (AQP) techniques present a principled method to trade off accuracy for faster queries in analytics, the sample creation is often ...
LAQy: Efficient and Reusable Query Approximations via Lazy Sampling
PACMMOD

Modern analytical engines rely on Approximate Query Processing (AQP) to provide faster response times than the hardware allows for exact query answering. However, existing AQP methods impose steep performance penalties as workload unpredictability ...
Synopses for query optimization: A space-complexity perspective
Special Issue: SIGMOD/PODS 2004

Database systems use precomputed synopses of data to estimate the cost of alternative plans during query optimization. A number of alternative synopsis structures have been proposed, but histograms are by far the most commonly used. While histograms ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data

June 2021

2969 pages

ISBN:9781450383431

DOI:10.1145/3448016

General Chairs:
Guoliang Li
Tsinghua University (China)
,
Zhanhuai Li
Northwestern Polytechnical University (China)
,
Program Chairs:
Stratos Idreos
Harvard University (USA)
,
Divesh Srivastava
AT&T (USA)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2021

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Deutsche Forschungsgemeinschaft

Conference

SIGMOD/PODS '21

Sponsor:

SIGMOD

SIGMOD/PODS '21: International Conference on Management of Data

June 20 - 25, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

4
Total Citations
View Citations
417
Total Downloads

Downloads (Last 12 months)50
Downloads (Last 6 weeks)7

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Xiu HAgarwal PYang J(2024)PARQO: Penalty-Aware Robust Plan Selection in Query OptimizationProceedings of the VLDB Endowment10.14778/3704965.370497117:13(4627-4640)Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.14778/3704965.3704971
Hu PMotik B(2024)Accurate Sampling-Based Cardinality Estimation for Complex Graph QueriesACM Transactions on Database Systems10.1145/368920949:3(1-46)Online publication date: 17-Sep-2024
https://dl.acm.org/doi/10.1145/3689209
Kamali AKantere VZuzarte C(2024)Robust Query Optimization in the Era of Machine Learning: State-of-the-Art and Future Directions2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00408(5371-5375)Online publication date: 13-May-2024
https://doi.org/10.1109/ICDE60146.2024.00408
Hertzschuch AHartmann CHabich DLehner W(2022)Turbo-charging SPJ query plans with learned physical join operator selectionsProceedings of the VLDB Endowment10.14778/3551793.355182515:11(2706-2718)Online publication date: 1-Jul-2022
https://dl.acm.org/doi/10.14778/3551793.3551825

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten