ABSTRACT
Machine Learning (ML) is quickly becoming a prominent method in many data management components, including query optimizers, which have recently shown very promising results. However, the scarcity of training data (i.e., large query workloads with execution time or output cardinality as labels) severely limits further advancement in research and hinders technology transfer from research to industry. Collecting a labeled query workload is very costly in both time and money, because it requires developing and executing thousands of realistic queries/jobs.
In this work, we address the problem of generating training data for data management components tailored to users' needs. We present DataFarm, an innovative framework for efficiently generating and labeling large query workloads. We follow a data-driven white-box approach to learn from pre-existing small workload patterns, input data, and computational resources. Our framework allows users to produce a large heterogeneous set of realistic jobs with their labels, which can be used by any ML-based data management component. We show that our framework outperforms the current state of the art in both query generation and label estimation, using synthetic and real datasets. It achieves up to 9x better labeling performance in terms of R² score. More importantly, it allows users to reduce the cost of obtaining labeled query workloads by 54x (and up to an estimated factor of 10⁴x) compared to standard approaches.
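The R² (coefficient of determination) score used above to report labeling quality can be illustrated with a minimal sketch. The function below implements the standard textbook definition, 1 - SS_res / SS_tot. It is not DataFarm's own evaluation code, and the execution times are hypothetical values chosen only for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the estimated labels.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean label.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true labels are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical job execution times in seconds (not taken from the paper).
measured = [10.0, 20.0, 30.0, 40.0]   # labels obtained by running the jobs
estimated = [12.0, 18.0, 33.0, 39.0]  # labels produced by an estimator

score = r2_score(measured, estimated)
print(f"R^2 = {score:.3f}")
```

A score of 1 means the estimated labels match the measured ones exactly, while a score of 0 means the estimator does no better than always predicting the mean label.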
Index Terms
- Expand your Training Limits! Generating Training Data for ML-based Data Management