research-article

Expand your Training Limits! Generating Training Data for ML-based Data Management

Authors:
Francesco Ventura, Politecnico di Torino, Turin, Italy
Zoi Kaoudi, TU Berlin & DFKI GmbH, Berlin, Germany
Jorge Arnulfo Quiané-Ruiz, TU Berlin & DFKI GmbH, Berlin, Germany
Volker Markl, TU Berlin & DFKI GmbH, Berlin, Germany

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data, June 2021, Pages 1865–1878. https://doi.org/10.1145/3448016.3457286

Published: 18 June 2021

ABSTRACT

Machine Learning (ML) is quickly becoming a prominent method in many data management components, including query optimizers, which have recently shown very promising results. However, the scarcity of training data (i.e., large query workloads with execution time or output cardinality as labels) severely limits further research progress and hampers technology transfer from research to industry. Collecting a labeled query workload is very costly in time and money because it requires developing and executing thousands of realistic queries/jobs.

In this work, we address the problem of generating training data for data management components, tailored to users' needs. We present DataFarm, an innovative framework for efficiently generating and labeling large query workloads. We follow a data-driven white-box approach to learn from pre-existing small workload patterns, input data, and computational resources. Our framework allows users to produce a large heterogeneous set of realistic jobs with their labels, which can be used by any ML-based data management component. We show that our framework outperforms the current state of the art in both query generation and label estimation on synthetic and real datasets. It achieves up to 9x better labeling performance in terms of R² score. More importantly, it allows users to reduce the cost of obtaining labeled query workloads by 54x (and up to an estimated factor of 104x) compared to standard approaches.

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning settings
      1. Active learning settings
2. Information systems
  1. Data management systems
    1. Database administration
      1. Database utilities and tools
    2. Database management system engines

Published in
SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN: 9781450383431
DOI: 10.1145/3448016
General Chairs: Guoliang Li, Tsinghua University (China); Zhanhuai Li, Northwestern Polytechnical University (China)
Program Chairs: Stratos Idreos, Harvard University (USA); Divesh Srivastava, AT&T (USA)
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
active learning
data augmentation
learned data management
machine learning
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%