Abstract
Data preparation is crucial to achieving good results in machine learning (ML). However, building a good data preparation pipeline is highly non-trivial for ML practitioners, because a good pipeline is not only domain-specific but also dataset-specific. There are two common practices. Human-generated pipelines (HI-pipelines) typically draw on a wide range of operations and libraries but are highly experience- and heuristic-based. In contrast, machine-generated pipelines (AI-pipelines), a.k.a. AutoML, often adopt a predefined set of sophisticated operations and are search-based and optimized. These two practices are complementary. In this paper, we study a new problem: given an HI-pipeline and an AI-pipeline for the same ML task, can we combine them into a new pipeline (HAI-pipeline) that is better than both? We propose HAIPipe, a framework that addresses this problem by adopting an enumeration-sampling strategy to carefully select the best-performing combined pipeline. We also introduce a reinforcement learning (RL) based approach to search for an optimized AI-pipeline. Extensive experiments using 1400+ real-world HI-pipelines (Jupyter notebooks from Kaggle) verify that HAIPipe can significantly outperform approaches that use either HI-pipelines or AI-pipelines alone.
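To make the idea of combining two pipelines concrete, the following is a minimal sketch, not HAIPipe's actual algorithm. The abstract only states that an enumeration-sampling strategy is used. The sketch assumes that a pipeline is an ordered list of operation names, that candidates are order-preserving interleavings of the two pipelines, and that a caller-supplied `score` function stands in for downstream model accuracy. The names `enumerate_merges`, `select_best`, `budget`, and the toy operations are all hypothetical.

```python
import random
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Pipeline = Tuple[str, ...]  # assumption: an operation is represented by its name


def enumerate_merges(hi: Sequence[str], ai: Sequence[str]) -> List[Pipeline]:
    """Enumerate all interleavings of hi and ai that keep each pipeline's own order."""
    n, m = len(hi), len(ai)
    merged = []
    # Choose which of the n + m positions are filled by HI operations.
    for hi_slots in combinations(range(n + m), n):
        slots = set(hi_slots)
        hi_iter, ai_iter = iter(hi), iter(ai)
        merged.append(
            tuple(next(hi_iter) if i in slots else next(ai_iter) for i in range(n + m))
        )
    return merged


def select_best(
    hi: Sequence[str],
    ai: Sequence[str],
    score: Callable[[Pipeline], float],
    budget: int,
    seed: int = 0,
) -> Pipeline:
    """Sample at most `budget` combined candidates and return the highest-scoring one."""
    candidates = enumerate_merges(hi, ai)
    rng = random.Random(seed)
    sampled = rng.sample(candidates, min(budget, len(candidates)))
    # Also evaluate the two input pipelines, so the result is never
    # worse than either of them under `score`.
    sampled += [tuple(hi), tuple(ai)]
    return max(sampled, key=score)


if __name__ == "__main__":
    hi_pipeline = ["impute", "encode"]   # hypothetical human-written steps
    ai_pipeline = ["scale", "select"]    # hypothetical AutoML-chosen steps

    def toy_score(p: Pipeline) -> float:
        # Hypothetical stand-in for validation accuracy: reward longer
        # pipelines that impute before scaling.
        bonus = 1.0 if "impute" in p and "scale" in p and p.index("impute") < p.index("scale") else 0.0
        return len(p) + bonus

    print(select_best(hi_pipeline, ai_pipeline, toy_score, budget=4))
```

In a real setting, `score` would train and validate the downstream model on the prepared data, which is expensive. That cost is why a sketch like this samples from the enumerated candidates instead of evaluating all of them.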
Index Terms
- HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation