research-article

Modeling Shifting Workloads for Learned Database Systems

Published: 26 March 2024

Abstract

Learned database systems address several weaknesses of traditional cost estimation techniques in query optimization: they learn a model of a database instance, e.g., as queries are executed. However, when the database instance has skew and correlation, it is nontrivial to create an effective training set that anticipates workload shifts, in which the query structure changes and/or different regions of the data contribute to query answers. The predictive model may perform poorly on these out-of-distribution inputs. In this paper, we study how a replay buffer can be managed through online algorithms to build a concise yet representative model of the workload distribution, allowing for rapid adaptation and effective prediction of cardinalities and costs. We experimentally validate our methods over several data domains.
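As a rough illustration of what an online-managed replay buffer can look like, the sketch below keeps a fixed-size, uniformly random sample of a query stream using classic reservoir sampling (Vitter's Algorithm R). This is a stand-in technique chosen for brevity: the abstract does not specify the paper's own buffer-management algorithms, and the class name `ReservoirReplayBuffer` and its methods are hypothetical.

```python
import random
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class ReservoirReplayBuffer(Generic[T]):
    """Hypothetical fixed-capacity buffer holding a uniform sample of a stream.

    Assumption: plain reservoir sampling (Algorithm R), not the paper's method.
    """

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: List[T] = []
        self.seen = 0  # number of stream elements observed so far
        self._rng = random.Random(seed)

    def add(self, item: T) -> None:
        """Observe one stream element (e.g., an executed query and its cost)."""
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the element.
            self.items.append(item)
            return
        # Buffer full: keep the new element with probability capacity / seen,
        # overwriting a uniformly chosen slot.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item

    def sample(self, k: int) -> List[T]:
        """Draw up to k distinct buffered elements, e.g., as a retraining batch."""
        return self._rng.sample(self.items, min(k, len(self.items)))


# Usage: stream 100 query identifiers through a buffer of size 5.
buf = ReservoirReplayBuffer(capacity=5, seed=0)
for query_id in range(100):
    buf.add(query_id)
```

A uniform sample treats every past query as equally important, so it only loosely tracks a shifting workload; a buffer meant to stay representative under shift would need a policy that weighs which regions of the workload each retained query covers.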


Published in

Proceedings of the ACM on Management of Data (PACMMOD), Volume 2, Issue 1
February 2024
1874 pages
EISSN: 2836-6573
DOI: 10.1145/3654807

Copyright © 2024 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States
