Abstract
Learned database systems address several weaknesses of traditional cost estimation techniques in query optimization by learning a model of a database instance, e.g., as queries are executed. However, when the database instance has skew and correlation, it is nontrivial to create an effective training set that anticipates workload shifts, in which the query structure changes and/or different regions of the data contribute to query answers. The predictive model may perform poorly on these out-of-distribution inputs. In this paper, we study how a replay buffer can be managed through online algorithms to build a concise yet representative model of the workload distribution, allowing for rapid adaptation and effective prediction of cardinalities and costs. We experimentally validate our methods over several data domains.
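To make the idea of an online-managed replay buffer concrete, the following is a minimal sketch rather than the method studied in the paper. The abstract does not say which online algorithms are used, so this example substitutes classic reservoir sampling (Algorithm R), which keeps a fixed-size uniform sample of everything seen so far. The class name `ReservoirReplayBuffer`, its methods, and the `(predicate, cardinality)` record layout are all hypothetical and chosen for illustration only.

```python
import random
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class ReservoirReplayBuffer(Generic[T]):
    """Fixed-capacity buffer maintained online with reservoir sampling.

    Assumption: this is a generic stand-in technique, not the paper's
    algorithm. Every item seen so far has the same probability of being
    in the buffer, regardless of when it arrived.
    """

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: List[T] = []
        self.seen = 0  # total number of items offered to the buffer
        self._rng = random.Random(seed)

    def add(self, item: T) -> None:
        """Offer one observed query record to the buffer."""
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the item.
            self.items.append(item)
            return
        # Buffer full: pick a slot index uniformly from [0, seen).
        # The new item replaces an old one only if the index is in range,
        # which happens with probability capacity / seen.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item

    def sample(self, k: int) -> List[T]:
        """Draw up to k distinct records, e.g. for a retraining batch."""
        return self._rng.sample(self.items, min(k, len(self.items)))


# Hypothetical usage: records are (predicate, observed cardinality) pairs.
buffer: ReservoirReplayBuffer = ReservoirReplayBuffer(capacity=5, seed=0)
for i in range(1000):
    buffer.add((f"col_a < {i}", i))
batch = buffer.sample(3)
```

Uniform sampling bounds memory but does not favor under-represented regions of the workload, so a buffer aiming for a "concise yet representative" model would need a more selective admission and eviction policy than this sketch provides.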
Index Terms
- Modeling Shifting Workloads for Learned Database Systems