DOI: 10.1145/3514221.3526179

Warper: Efficiently Adapting Learned Cardinality Estimators to Data and Workload Drifts

Published: 11 June 2022

Abstract

Recent learned cardinality estimation (CE) models are vulnerable when query predicates or the underlying datasets drift away from what the models were trained on. We propose a system, Warper, that accelerates model adaptation to such drifts: Warper generates additional queries when only limited examples are available from the new workload, and it carefully picks which queries to use to update the CE model. We show that Warper can be used to adapt different CE models, including ones that support queries over single tables and ones that support join expressions. Experiments with different drifts suggest that Warper has a small computational cost and adapts much faster than state-of-the-art solutions. We also show that faster model adaptation improves query performance by shortening the period during which a query optimizer picks imperfect query plans because of incorrect cardinality estimates.
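The abstract does not say how queries are picked for the model update. As a purely illustrative sketch, and not Warper's actual algorithm, one simple heuristic is to rank the labeled queries from the new workload by how wrong the current estimator is on them and keep the worst offenders. The function names `q_error` and `pick_queries_for_update`, the `budget` parameter, and the use of a multiplicative error as the ranking score are all assumptions made for this example.

```python
def q_error(estimate: float, actual: float) -> float:
    """Multiplicative error between an estimated and a true cardinality (>= 1)."""
    est = max(estimate, 1.0)  # clamp to 1 row to avoid division by zero
    act = max(actual, 1.0)
    return max(est / act, act / est)


def pick_queries_for_update(labeled_queries, estimator, budget):
    """Return the `budget` queries on which `estimator` is currently most wrong.

    labeled_queries: list of (query, true_cardinality) pairs from the new workload.
    estimator: callable mapping a query to an estimated cardinality.
    Hypothetical helper for illustration; not taken from the paper.
    """
    scored = [(q_error(estimator(q), card), q, card) for q, card in labeled_queries]
    scored.sort(key=lambda item: item[0], reverse=True)  # largest error first
    return [(q, card) for _, q, card in scored[:budget]]


# Usage: a stale model that always predicts 100 rows.
stale_model = lambda query: 100.0
new_workload = [("a < 5", 100), ("a < 50", 1000), ("b = 1", 5), ("b = 2", 90)]
picked = pick_queries_for_update(new_workload, stale_model, budget=2)
```

Under these assumptions, `picked` holds the two queries where the stale model is furthest off, which would then be used as training examples for the update.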


Published In

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
June 2022
2597 pages
ISBN: 9781450392495
DOI: 10.1145/3514221

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adaptation
  2. cardinality estimation
  3. data shift
  4. database optimization

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '22

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Cited By

  • PolyCard: A learned cardinality estimator for intersection queries on spatial polygons. Journal of Intelligent Information Systems (2025). DOI: 10.1007/s10844-025-00921-z. Online publication date: 22-Jan-2025.
  • Breaking It Down: An In-Depth Study of Index Advisors. Proceedings of the VLDB Endowment 17:10 (2024), 2405-2418. DOI: 10.14778/3675034.3675035. Online publication date: 6-Aug-2024.
  • Eraser: Eliminating Performance Regression on Learned Query Optimizer. Proceedings of the VLDB Endowment 17:5 (2024), 926-938. DOI: 10.14778/3641204.3641205. Online publication date: 2-May-2024.
  • Deferred Continuous Batching in Resource-Efficient Large Language Model Serving. Proceedings of the 4th Workshop on Machine Learning and Systems (2024), 98-106. DOI: 10.1145/3642970.3655835. Online publication date: 22-Apr-2024.
  • Machine Unlearning in Learned Databases: An Experimental Analysis. Proceedings of the ACM on Management of Data 2:1 (2024), 1-26. DOI: 10.1145/3639304. Online publication date: 26-Mar-2024.
  • ASM: Harmonizing Autoregressive Model, Sampling, and Multi-dimensional Statistics Merging for Cardinality Estimation. Proceedings of the ACM on Management of Data 2:1 (2024), 1-27. DOI: 10.1145/3639300. Online publication date: 26-Mar-2024.
  • Modeling Shifting Workloads for Learned Database Systems. Proceedings of the ACM on Management of Data 2:1 (2024), 1-27. DOI: 10.1145/3639293. Online publication date: 26-Mar-2024.
  • PACE: Poisoning Attacks on Learned Cardinality Estimation. Proceedings of the ACM on Management of Data 2:1 (2024), 1-27. DOI: 10.1145/3639292. Online publication date: 26-Mar-2024.
  • A Cause-Focused Query Optimizer Alert System. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (2024), 2981-2990. DOI: 10.1145/3627673.3679771. Online publication date: 21-Oct-2024.
  • Precision Meets Resilience: Cross-Database Generalization with Uncertainty Quantification for Robust Cost Estimation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (2024), 581-590. DOI: 10.1145/3627673.3679632. Online publication date: 21-Oct-2024.
