DOI: 10.1145/3448016.3457286
research-article

Expand your Training Limits! Generating Training Data for ML-based Data Management

Published: 18 June 2021

ABSTRACT

Machine Learning (ML) is quickly becoming a prominent method in many data management components, including query optimizers, which have recently shown very promising results. However, the low availability of training data (i.e., large query workloads with execution time or output cardinality as labels) widely limits further advancement in research and compromises the technology transfer from research to industry. Collecting a labeled query workload has a very high cost in terms of time and money due to the development and execution of thousands of realistic queries/jobs.

In this work, we face the problem of generating training data for data management components tailored to users' needs. We present DataFarm, an innovative framework for efficiently generating and labeling large query workloads. We follow a data-driven white-box approach to learn from pre-existing small workload patterns, input data, and computational resources. Our framework allows users to produce a large heterogeneous set of realistic jobs with their labels, which can be used by any ML-based data management component. We show that our framework outperforms the current state-of-the-art both in query generation and label estimation using synthetic and real datasets. It has up to 9x better labeling performance, in terms of R² score. More importantly, it allows users to reduce the cost of getting labeled query workloads by 54x (and up to an estimated factor of 10⁴x) compared to standard approaches.

          Published in

          SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
          June 2021
          2969 pages
          ISBN: 9781450383431
          DOI: 10.1145/3448016

          Copyright © 2021 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          Overall Acceptance Rate: 785 of 4,003 submissions, 20%
