skip to main content
research-article

Identifying the Root Causes of DBMS Suboptimality

Published: 28 February 2024 Publication History

Abstract

The query optimization phase within a database management system (DBMS) ostensibly finds the fastest query execution plan from a potentially large set of enumerated plans, all of which correctly compute the same result of the specified query. Sometimes the cost-based optimizer selects a slower plan, for a variety of reasons. Previous work has focused on increasing the performance of specific components, often a single operator, within an individual DBMS. However, that does not address the fundamental question: from where does this suboptimality arise, across DBMSes generally? In particular, the contribution of each of many possible factors to DBMS suboptimality is currently unknown. To identify the root causes of DBMS suboptimality, we first introduce the notion of empirical suboptimality of a query plan chosen by the DBMS, indicated by the existence of a query plan that performs more efficiently than the chosen plan, for the same query. A crucial aspect is that this can be measured externally to the DBMS, and thus does not require access to its source code. We then propose a novel predictive model to explain the relationship between various factors in query optimization and empirical suboptimality. Our model associates suboptimality with the factors of complexity of the schema, of the underlying data on which the query is evaluated, of the query itself, and of the DBMS optimizer. The model also characterizes concomitant interactions among these factors. This model induces a number of specific hypotheses that were tested on multiple DBMSes. We performed a series of experiments that examined the plans for thousands of queries run on four popular DBMSes. We tested the model on over a million of these query executions, using correlational analysis, regression analysis, and causal analysis, specifically Structural Equation Modeling (SEM). We observed that the dependent construct of empirical suboptimality prevalence correlates positively with nine specific constructs characterizing four identified factors that explain in concert much of the variance of suboptimality of two extensive benchmarks, across these disparate DBMSes. This predictive model shows that it is the common aspects of these DBMSes that predict suboptimality, not the particulars embedded in the inordinate complexity of each of these DBMSes. This paper thus provides a new methodology to study mature query optimizers, identifies underlying DBMS-independent causes for the observed suboptimality, and quantifies the relative contribution of each of these causes to the observed suboptimality. This work thus provides a roadmap for fundamental improvements of cost-based query optimizers.

References

[1]
R. Angles and C. Gutierrez. 2008. Survey of graph database models. Computing Surveys 40 (February2008), 1–39. Issue 1.
[2]
Derek Ajesam Asoh, Salvatore Belardo, and Jakov Crnkovic. 2007. Assessing knowledge management: Refining and cross validating the knowledge management index using structural equation modeling techniques. International Journal of Knowledge Management (IJKM) 3, 2 (2007), 1–30.
[3]
Ron Avnur and Joseph M. Hellerstein. 2000. Eddies: Continuously adaptive query processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, Dallas, TX, 261–272.
[4]
Brian Babcock and Surajit Chaudhuri. 2005. Towards a robust query optimizer: A principled and practical approach. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Baltimore). ACM, New York, 119–130.
[5]
Reuben M. Baron and David A. Kenny. 1986. The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology 51, 6 (1986), 1173–1182.
[6]
Srikanth Bellamkonda, Hua-Gang Li, Unmesh Jagtap, Yali Zhu, Vince Liang, and Thierry Cruanes. 2013. Adaptive and big data scale parallel execution in oracle. Proceedings of the VLDB Endowment 6, 11 (2013), 1102–1113.
[7]
Donald J. Campbell. 1988. Task complexity: A review and analysis. Academy of Management Review 13, 1 (1988), 40–52.
[8]
James Cheney. 2010. Causality and the semantics of provenance. In Proceedings of the Workshop on Developments in Computational Models: Causality, Computation, and Physics(EPTCS, Vol. 26), S. Barry Cooper, Prakash Panangaden, and Elham Kashefi (Eds.). Open Publishing Association, Waterloo, Australia, 63–74.
[9]
Paul Cohen. 1995. Empirical Methods for Artificial Intelligence. MIT Press, Cambridge, MA,.
[10]
Sabah Currim, Richard T. Snodgrass, Young-Kyoon Suh, and Rui Zhang. 2017. DBMS metrology: Measuring query time. ACM Transactions on Database Systems 42, 1 (March2017), 3:1–3:42.
[11]
Sabah Currim, Richard T. Snodgrass, Young-Kyoon Suh, Rui Zhang, Matthew Wong Johnson, and Cheng Yi. 2013. DBMS metrology: Measuring query time. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 421–432.
[12]
Ronald A. Fisher. 1971. The Design of Experiments (9 ed.). Macmillan, New York.
[13]
Goetz Graefe. 1993. Query evaluation techniques for large databases. Comput. Surveys 25, 2 (June1993), 73–169.
[14]
Hazar Hamouch and Felix Naumann. 2018. Cardinality estimation: An experimental survey. Proceedings of the VLDB Endowment 11, 4 (Aug.2018), 499–512.
[15]
D. Harish, Pooja N. Darera, and Jayant R. Haritsa. 2007. On the production of anorexic plan diagrams. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, Vienna, Austria, 1081–1092.
[16]
Jayant R. Haritsa. 2010. The Picasso database query optimizer visualizer. Proceedings of the International Conference on Very Large Data Bases 3, 1-2 (Sept.2010), 1517–1520.
[17]
Surajit Chaudhuri. 1998. An overview of query optimization in relational systems. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, Seattle, Washington, 34–43.
[18]
Yannis E. Ioannidis. 1996. Query optimization. Comput. Surveys 28, 1 (March1996), 121–123.
[19]
Alan R. Hevner, Salvatore T. March, Jinsoo Park, and Sudha Ram. 2004. Design science in information systems research. MIS Quarterly 28, 1 (March2004), 75–105.
[20]
Li-tze Hu and Peter M. Bentler. 1999. Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal 6, 1 (1999), 1–55. DOI: arXiv:
[21]
Arvind Hulgeri and Sundar Sudarshan. 2002. Parametric query optimization for linear and piecewise linear cost functions. In Proceedings of the International Conference on Very Large Data Bases (VLDB). VLDB Endowment, Hong Kong, China, 167–178.
[22]
IMDb.com. 2020. IMDb Datasets. URL: https://www.imdb.com/interfaces/.
[23]
Yannis E. Ioannidis. 2003. The history of histograms (abridged). In Proceedings of the 29th International Conference on Very Large Data Bases. VLDB Endowment, Berlin, Germany, 19–30.
[24]
ISO. 2008. ISO SQL:2008 International Standard. Technical Report. International Organization for Standardization, Geneva, Switzerland.
[25]
Matthias Jarke and Jurgen Koch. 1984. Query optimization in database systems. Comput. Surveys 16, 2 (June1984), 111–152.
[26]
Navin Kabra and David J. DeWitt. 1998. Efficient mid-query re-optimization of sub-optimal query execution plans. In Proceedings of the SIGMOD ACM International Conference. ACM, Seattle, Washington, 106–117.
[27]
Nodira Khoussainova, Magdalena Balazinska, and Dan Suciu. 2012. PerfXplain: Debugging MapReduce job performance. Proceedings of the VLDB Endowment 5, 7 (Mar. 2012), 598–609.
[28]
Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? Proceedings of the VLDB Endowment 9, 3 (Nov.2015), 204–215.
[29]
Quanzhong Li, Minglong Shao, Volker Markl, Kevin Beyer, Latha Colby, and Guy Lohman. 2007. Adaptively reordering joins during query execution. In Proceedings of the IEEE International Conference on Data Engineering. IEEE, Istanbul, Turkey, 26–35.
[30]
Michael V. Mannino, Paicheng Chu, and Thomas Sager. 1988. Statistical profile estimation in database systems. Comput. Surveys 20, 3 (Sept.1988), 191–221.
[31]
Alexandra Meliou, Wolfgang Gatterbauer, Joseph Y. Halpern, Christoph Koch, Katherine F. Moore, and Dan Suciu. 2010. Causality in databases. IEEE Data Eng. Bull. 33, 3 (2010), 59–67.
[32]
Alexandra Meliou, Sudeepa Roy, and Dan Suciu. 2014. Causality and explanations in databases. Proceedings of the VLDB Endowment 7, 13 (2014), 1715–1716.
[33]
Jim Melton. 2003. Advanced SQL: 1999. Morgan Kaufmann, Burlington, MA.
[34]
Daniel L. Moody. 1998. Metrics for evaluating the quality of entity relationship models. In Proceedings of the 17th International Conference on Conceptual Modeling. Springer-Verlag, London, UK, 211–225.
[35]
Patrick O’Neil, Edward Chang, Dieter Gawlick, and Elizabeth O’Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 4 (June1996), 351–385.
[36]
Judea Pearl. 2010. An introduction to causal inference. The International Journal of Biostatistics 6, 2 (2010), 1–62.
[37]
Judea Pearl. 2012. The causal foundations of structural equation modeling. In Handbook of Structural Equation Modeling. The Guilford Press, New York, NY, US, 68–91.
[38]
Christopher Ré and Dan Suciu. 2008. Managing probabilistic data with MystiQ: The can-do, the could-do, and the can’t-do. In Scalable Uncertainty Management, 2nd International Conference, SUM, Sergio Greco and Thomas Lukasiewicz (Eds.), Vol. 5291. Springer, Naples, Italy, 5–18.
[39]
Naveen Reddy and Jayant R. Haritsa. 2005. Analyzing plan diagrams of database query optimizers. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, Trondheim, Norway, 1228–1239.
[40]
Frederick R. Reiss and Tapas Kanungo. 2003. A characterization of the sensitivity of query optimization to storage access cost parameters. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, San Diego, California, 385–396.
[41]
Yves Rosseel. 2012. Lavaan: An R package for structural equation modeling. Journal of Statistical Software 48, 2 (2012), 1–36. https://www.jstatsoft.org/v48/i02/
[42]
Babak Salimi, Leopoldo Bertossi, Dan Suciu, and Guy Van Den Broeck. 2016. Quantifying causal effects on query answering in databases. In Proceedings of the 8th USENIX Conference on Theory and Practice of Provenance (TaPP’16). USENIX Association, Washington, DC, 7–10.
[43]
Sunita Sarawagi and Gayatri Sathe. 2000. i\({}^{\mbox{3}}\): Intelligent, interactive investigaton of OLAP data cubes. ACM SIGMOD Record 29, 2 (2000), 589.
[44]
Patricia Selinger, Morton Astrahan, Donald Chamberlin, Raymond Lorie, and Thomas Price. 1979. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, Boston, Massachusetts, 23–34.
[45]
Young-Kyoon Suh, Jun Young An, Byungchul Tak, and Gap-Joo Na. 2022. A comprehensive empirical study of query performance across GPU DBMSes. Proceedings of the ACM on Measurement and Analysis of Computing Systems 6, 1 (2022), 4:1–4:29.
[46]
Young-Kyoon Suh, Richard Snodgrass, and Sabah Currim. 2017. An empirical study of transaction throughput thrashing across multiple relational DBMSes. Information Systems 66, 6 (2017), 119–136.
[47]
Young-Kyoon Suh, Richard T. Snodgrass, and Rui Zhang. 2014. AZDBLab: A laboratory information system for large-scale empirical DBMS studies. Proceedings of the VLDB Endowment 7, 13 (2014), 1641–1644.
[48]
Benjamin R. Warner. 2018. Causality. In The SAGE Encyclopedia of Communication Research Methods, Mike Allen (Ed.). SAGE Publications, Inc, Thousand Oaks, CA, US, 122–124.
[49]
Marianne Winslett. 2002. David DeWitt speaks out. SIGMOD Record 31, 2 (June2002), 50–62.
[50]
Eugene Wu, Sammuel Madden, and Michael Stonebraker. 2013. SubZero: A fine-grained lineage system for scientific databases. In Proceedings of theIEEE International Conference on Data Engineering. IEEE, Brisbane, QLD, Australia, 865–876.

Cited By

View all
  • (2024)Selectivity Estimation for Spatial Filters Using Optimizer Feedback: A Machine Learning PerspectiveWeb Information Systems Engineering – WISE 202410.1007/978-981-96-0573-6_8(101-115)Online publication date: 27-Nov-2024

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems
ACM Transactions on Database Systems  Volume 49, Issue 1
March 2024
176 pages
EISSN:1557-4644
DOI:10.1145/3613511
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 February 2024
Online AM: 10 January 2024
Accepted: 04 October 2023
Revised: 11 January 2023
Received: 29 November 2021
Published in TODS Volume 49, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Query optimization
  2. database performance evaluation
  3. empirical studies
  4. query suboptimality

Qualifiers

  • Research-article

Funding Sources

  • NSF
  • NRF

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)360
  • Downloads (Last 6 weeks)28
Reflects downloads up to 07 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Selectivity Estimation for Spatial Filters Using Optimizer Feedback: A Machine Learning PerspectiveWeb Information Systems Engineering – WISE 202410.1007/978-981-96-0573-6_8(101-115)Online publication date: 27-Nov-2024

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Full Text

View this article in Full Text.

Full Text

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media