HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation

Published: 30 May 2023

Abstract

Data preparation is crucial for achieving optimized results in machine learning (ML). However, building a good data preparation pipeline is highly non-trivial for ML practitioners, because it is not only domain-specific but also dataset-specific. There are two common practices. Human-generated pipelines (HI-pipelines) typically draw on a wide range of operations and libraries but are highly experience- and heuristic-based. In contrast, machine-generated pipelines (AI-pipelines), a.k.a. AutoML, usually adopt a predefined set of sophisticated operations and are search-based and optimized. These two practices are mutually complementary. In this paper, we study a new problem: given an HI-pipeline and an AI-pipeline for the same ML task, can we combine them into a new pipeline (HAI-pipeline) that is better than both the provided HI-pipeline and AI-pipeline? We propose HAIPipe, a framework that addresses this problem by adopting an enumeration-sampling strategy to carefully select the best performing combined pipeline. We also introduce a reinforcement learning (RL) based approach to search for an optimized AI-pipeline. Extensive experiments using 1400+ real-world HI-pipelines (Jupyter notebooks from Kaggle) verify that HAIPipe significantly outperforms approaches using either HI-pipelines or AI-pipelines alone.
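The combination idea in the abstract can be illustrated with a minimal, self-contained sketch. Everything below is a hypothetical stand-in, not the paper's actual method: the operators, the scoring function, and the exhaustive enumeration are simplified (HAIPipe uses a richer operator space and an enumeration-sampling strategy rather than brute force). The sketch enumerates order-preserving interleavings of an HI-pipeline and an AI-pipeline and keeps the hybrid that scores best on the data:

```python
# Hypothetical data-prep operators; each maps a list of rows to a list of rows.
def fill_missing(rows):
    return [[0.0 if v is None else v for v in r] for r in rows]

def min_max_scale(rows):
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(r, lo, hi)] for r in rows]

def clip_outliers(rows):
    return [[max(-3.0, min(3.0, v)) for v in r] for r in rows]

# A toy HI-pipeline and AI-pipeline: each is an ordered operator sequence.
hi_pipeline = [fill_missing, clip_outliers]
ai_pipeline = [fill_missing, min_max_scale]

def run(pipeline, rows):
    for op in pipeline:
        rows = op(rows)
    return rows

def score(rows):
    # Stand-in for validation performance of a downstream model.
    return -sum(abs(v) for r in rows for v in r)

def enumerate_hybrids(hi, ai):
    # Yield every interleaving that preserves each pipeline's internal
    # operator order -- a brute-force stand-in for HAIPipe's
    # enumeration-sampling over combined pipelines.
    if not hi:
        yield list(ai)
        return
    if not ai:
        yield list(hi)
        return
    for rest in enumerate_hybrids(hi[1:], ai):
        yield [hi[0]] + rest
    for rest in enumerate_hybrids(hi, ai[1:]):
        yield [ai[0]] + rest

data = [[1.0, None], [4.0, 2.0], [100.0, -5.0]]
best = max(enumerate_hybrids(hi_pipeline, ai_pipeline),
           key=lambda p: score(run(p, data)))
```

The selected hybrid contains all operators from both pipelines and, by construction, scores at least as well as any single enumerated interleaving; in a realistic setting the score would come from holding out a validation split and measuring downstream model accuracy.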


Supplemental Material

PACMMOD-V1mod091.mp4 (mp4, 103 MB)



    • Published in

      Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 1, May 2023, 2807 pages
      EISSN: 2836-6573
      DOI: 10.1145/3603164

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • research-article
