research-article

Modeling Shifting Workloads for Learned Database Systems

Authors:
Peizhi Wu, University of Pennsylvania, Philadelphia, PA, USA (ORCID: 0009-0006-4765-8733)
Zachary G. Ives, University of Pennsylvania, Philadelphia, PA, USA (ORCID: 0000-0001-7527-2957)
Proceedings of the ACM on Management of Data, Volume 2, Issue 1, Article No. 38, pp. 1–27
https://doi.org/10.1145/3639293

Published: 26 March 2024

Abstract

Learned database systems address several weaknesses of traditional cost estimation techniques in query optimization: they learn a model of a database instance, e.g., as queries are executed. However, when the database instance has skew and correlation, it is nontrivial to create an effective training set that anticipates workload shifts, where query structure changes and/or different regions of the data contribute to query answers. Our predictive model may perform poorly on these out-of-distribution inputs. In this paper, we study how a replay buffer can be managed through online algorithms to build a concise yet representative model of the workload distribution, allowing for rapid adaptation and effective prediction of cardinalities and costs. We experimentally validate our methods over several data domains.
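
To make the idea of a replay buffer concrete, the following is a minimal sketch of a fixed-capacity buffer maintained online with classic reservoir sampling. It is a generic baseline, not the management algorithm studied in the paper, which the abstract does not specify. The class name `ReservoirReplayBuffer`, its methods, and the (query features, cardinality) record layout are hypothetical and chosen only for illustration.

```python
import random
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class ReservoirReplayBuffer(Generic[T]):
    """Fixed-capacity buffer holding a uniform random sample of a stream.

    Hypothetical illustration: plain reservoir sampling, not the
    paper's buffer-management method.
    """

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: List[T] = []
        self.seen = 0  # number of stream items observed so far
        self._rng = random.Random(seed)

    def add(self, item: T) -> None:
        """Observe one stream item, keeping each seen item with equal probability."""
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the item.
            self.items.append(item)
            return
        # Draw a slot uniformly from [0, seen); replace only if it falls in the buffer.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item

    def sample(self, k: int) -> List[T]:
        """Return up to k buffered items, drawn without replacement, for retraining."""
        return self._rng.sample(self.items, min(k, len(self.items)))


# Usage: stream hypothetical (query_features, true_cardinality) records.
buffer: ReservoirReplayBuffer = ReservoirReplayBuffer(capacity=50, seed=0)
for i in range(1000):
    buffer.add(((i % 7, i % 11), i))  # made-up features and label
batch = buffer.sample(10)
```

Uniform sampling keeps the buffer representative of the whole stream seen so far, but it does not favor recent or rare query regions. That limitation is one reason a workload-aware buffer policy may be preferable when the workload shifts.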

Index Terms

1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
        Query optimization

Published in
Proceedings of the ACM on Management of Data (PACMMOD), Volume 2, Issue 1
February 2024
1874 pages
EISSN: 2836-6573
DOI: 10.1145/3654807
Editor: Divyakant Agrawal, UC Santa Barbara, United States
Copyright © 2024 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher: Association for Computing Machinery, New York, NY, United States

Author Tags
learned database systems
online algorithms
replay buffer
workload shifts