HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation

Published: 30 May 2023

Abstract

Data preparation is crucial for achieving optimized results in machine learning (ML). However, building a good data preparation pipeline is highly non-trivial for ML practitioners, because it is not only domain-specific but also dataset-specific. There are two common practices. Human-generated pipelines (HI-pipelines) typically draw on a wide range of operations and libraries but are highly experience- and heuristic-based. In contrast, machine-generated pipelines (AI-pipelines), a.k.a. AutoML, usually adopt a predefined set of sophisticated operations and are search-based and optimized. These two practices are mutually complementary. In this paper, we study a new problem: given an HI-pipeline and an AI-pipeline for the same ML task, can we combine them into a new pipeline (HAI-pipeline) that is better than both the provided HI-pipeline and AI-pipeline? We propose HAIPipe, a framework that addresses this problem by adopting an enumeration-sampling strategy to carefully select the best performing combined pipeline. We also introduce a reinforcement learning (RL) based approach to search for an optimized AI-pipeline. Extensive experiments using 1400+ real-world HI-pipelines (Jupyter notebooks from Kaggle) verify that HAIPipe significantly outperforms approaches using either HI-pipelines or AI-pipelines alone.
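The combination idea in the abstract can be illustrated with a minimal, self-contained sketch. Everything below is a hypothetical stand-in, not the paper's actual method: the operators, the scoring function, and the exhaustive enumeration are simplified (HAIPipe uses a richer operator space and an enumeration-sampling strategy rather than brute force). The sketch enumerates order-preserving interleavings of an HI-pipeline and an AI-pipeline and keeps the hybrid that scores best on the data:

```python
# Hypothetical data-prep operators; each maps a list of rows to a list of rows.
def fill_missing(rows):
    return [[0.0 if v is None else v for v in r] for r in rows]

def min_max_scale(rows):
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(r, lo, hi)] for r in rows]

def clip_outliers(rows):
    return [[max(-3.0, min(3.0, v)) for v in r] for r in rows]

# A toy HI-pipeline and AI-pipeline: each is an ordered operator sequence.
hi_pipeline = [fill_missing, clip_outliers]
ai_pipeline = [fill_missing, min_max_scale]

def run(pipeline, rows):
    for op in pipeline:
        rows = op(rows)
    return rows

def score(rows):
    # Stand-in for validation performance of a downstream model.
    return -sum(abs(v) for r in rows for v in r)

def enumerate_hybrids(hi, ai):
    # Yield every interleaving that preserves each pipeline's internal
    # operator order -- a brute-force stand-in for HAIPipe's
    # enumeration-sampling over combined pipelines.
    if not hi:
        yield list(ai)
        return
    if not ai:
        yield list(hi)
        return
    for rest in enumerate_hybrids(hi[1:], ai):
        yield [hi[0]] + rest
    for rest in enumerate_hybrids(hi, ai[1:]):
        yield [ai[0]] + rest

data = [[1.0, None], [4.0, 2.0], [100.0, -5.0]]
best = max(enumerate_hybrids(hi_pipeline, ai_pipeline),
           key=lambda p: score(run(p, data)))
```

The selected hybrid contains all operators from both pipelines and, by construction, scores at least as well as any single enumerated interleaving; in a realistic setting the score would come from holding out a validation split and measuring downstream model accuracy.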


Supplemental Material

PACMMOD-V1mod091.mp4 (mp4, 103 MB)



    • Published in

      Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 1, May 2023, 2807 pages
      EISSN: 2836-6573
      DOI: 10.1145/3603164

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • research-article
