research-article

Warper: Efficiently Adapting Learned Cardinality Estimators to Data and Workload Drifts

Authors:

Srikanth Kandula

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data

Pages 1920 - 1933

https://doi.org/10.1145/3514221.3526179

Published: 11 June 2022

Abstract

Recent learned cardinality estimation (CE) models are vulnerable when query predicates or the underlying datasets drift from what the models were trained upon. We propose a system Warper that accelerates model adaptation to drifts; Warper generates additional queries when limited examples are available from the new workload and carefully picks which queries to use to update the CE model. We show that Warper can be used to adapt different CE models including ones that support queries over single tables and join expressions. Experiments with different drifts suggest that Warper has a small computational cost and adapts much faster compared to state-of-the-art solutions. We also show that faster model adaptation improves query performance by shortening the period for which imperfect query plans are picked by a query optimizer due to incorrect cardinality estimates.
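
The two steps the abstract names, generating additional queries from a few observed ones and picking which queries to use for the model update, can be illustrated with a toy sketch. This is not Warper's algorithm: the jitter-based generator, the q-error ranking, and every name below (`augment`, `pick_hardest`, `q_error`, `true_cardinality`) are assumptions made for illustration on one-dimensional range queries.

```python
import random


def true_cardinality(data, lo, hi):
    """Count the values that fall in the closed range [lo, hi]."""
    return sum(1 for v in data if lo <= v <= hi)


def q_error(estimate, actual):
    """Symmetric ratio error; both sides are clamped to 1 to avoid division by zero."""
    e, a = max(estimate, 1.0), max(actual, 1.0)
    return max(e / a, a / e)


def augment(queries, n_new, jitter, rng):
    """Hypothetical generator: perturb the bounds of observed range queries."""
    out = []
    for _ in range(n_new):
        lo, hi = rng.choice(queries)
        new_lo = lo + rng.uniform(-jitter, jitter)
        new_hi = hi + rng.uniform(-jitter, jitter)
        # Reorder the bounds in case the perturbation crossed them.
        out.append((min(new_lo, new_hi), max(new_lo, new_hi)))
    return out


def pick_hardest(queries, data, model, k):
    """Hypothetical picker: keep the k queries with the largest q-error under the current model."""
    scored = [
        (q_error(model(lo, hi), true_cardinality(data, lo, hi)), (lo, hi))
        for lo, hi in queries
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [q for _, q in scored[:k]]


# Usage: a stale model that assumes values are uniform on [0, 100],
# while the data has drifted to a bell shape around 50.
rng = random.Random(0)
data = [rng.gauss(50, 10) for _ in range(1000)]
stale_model = lambda lo, hi: len(data) * (hi - lo) / 100.0
observed = [(40.0, 60.0), (70.0, 90.0)]
candidates = observed + augment(observed, 20, 5.0, rng)
update_batch = pick_hardest(candidates, data, stale_model, 5)
```

In this sketch the selected batch would then be used to retrain or fine-tune the estimator; labeling each candidate requires running it against the data, which is the cost a real system would need to keep small.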

Index Terms

Warper: Efficiently Adapting Learned Cardinality Estimators to Data and Workload Drifts
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Design and analysis of algorithms
    1. Online algorithms
      1. Online learning algorithms
  2. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Information & Contributors

Information

Published In

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data

June 2022

2597 pages

ISBN: 9781450392495

DOI: 10.1145/3514221

General Chair: Zachary Ives, University of Pennsylvania (USA)

Program Chairs: Angela Bonifati, Lyon 1 University (France); Amr El Abbadi, University of California, Santa Barbara (USA)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS '22

Sponsor:

SIGMOD

SIGMOD/PODS '22: International Conference on Management of Data

June 12 - 17, 2022

Philadelphia, PA, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 19

Total Downloads: 423

Downloads (Last 12 months): 103

Downloads (Last 6 weeks): 6

Reflects downloads up to 05 Mar 2025

Citations

Cited By

Ji Y, Amagata D, Sasaki Y, Hara T. (2025). PolyCard: A learned cardinality estimator for intersection queries on spatial polygons. Journal of Intelligent Information Systems. Online publication date: 22-Jan-2025.
https://doi.org/10.1007/s10844-025-00921-z
Zhou W, Lin C, Zhou X, Li G. (2024). Breaking It Down: An In-Depth Study of Index Advisors. Proceedings of the VLDB Endowment 17:10 (2405-2418). Online publication date: 6-Aug-2024.
https://dl.acm.org/doi/10.14778/3675034.3675035
Weng L, Zhu R, Wu D, Ding B, Zheng B, Zhou J. (2024). Eraser: Eliminating Performance Regression on Learned Query Optimizer. Proceedings of the VLDB Endowment 17:5 (926-938). Online publication date: 2-May-2024.
https://dl.acm.org/doi/10.14778/3641204.3641205
He Y, Lu Y, Alonso G. (2024). Deferred Continuous Batching in Resource-Efficient Large Language Model Serving. Proceedings of the 4th Workshop on Machine Learning and Systems (98-106). Online publication date: 22-Apr-2024.
https://dl.acm.org/doi/10.1145/3642970.3655835
Kurmanji M, Triantafillou E, Triantafillou P. (2024). Machine Unlearning in Learned Databases: An Experimental Analysis. Proceedings of the ACM on Management of Data 2:1 (1-26). Online publication date: 26-Mar-2024.
https://dl.acm.org/doi/10.1145/3639304
Kim K, Lee S, Kim I, Han W. (2024). ASM: Harmonizing Autoregressive Model, Sampling, and Multi-dimensional Statistics Merging for Cardinality Estimation. Proceedings of the ACM on Management of Data 2:1 (1-27). Online publication date: 26-Mar-2024.
https://dl.acm.org/doi/10.1145/3639300
Wu P, Ives Z. (2024). Modeling Shifting Workloads for Learned Database Systems. Proceedings of the ACM on Management of Data 2:1 (1-27). Online publication date: 26-Mar-2024.
https://dl.acm.org/doi/10.1145/3639293
Zhang J, Zhang C, Li G, Chai C. (2024). PACE: Poisoning Attacks on Learned Cardinality Estimation. Proceedings of the ACM on Management of Data 2:1 (1-27). Online publication date: 26-Mar-2024.
https://dl.acm.org/doi/10.1145/3639292
Ye R, Liang Z, Chen X, Liu S, Zheng K, Serra E, Spezzano F. (2024). A Cause-Focused Query Optimizer Alert System. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (2981-2990). Online publication date: 21-Oct-2024.
https://dl.acm.org/doi/10.1145/3627673.3679771
Fan S, Hou M, Xi R, Ma W, Serra E, Spezzano F. (2024). Precision Meets Resilience: Cross-Database Generalization with Uncertainty Quantification for Robust Cost Estimation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (581-590). Online publication date: 21-Oct-2024.
https://dl.acm.org/doi/10.1145/3627673.3679632
