DOI: 10.1145/3626246.3654727

ChatPipe: Orchestrating Data Preparation Pipelines by Optimizing Human-ChatGPT Interactions

Published: 09 June 2024

Abstract

Orchestrating a high-quality data preparation program is essential for successful machine learning (ML), but it is known to be time- and effort-consuming. Despite the impressive capabilities of large language models like ChatGPT in generating programs through natural language prompts, there are still limitations. Specifically, a user must provide specific prompts to iteratively guide ChatGPT in improving data preparation programs, which requires a certain level of expertise in programming, the dataset used, and the ML task. Moreover, once a program has been generated, it is non-trivial to revisit a previous version or make changes to the program without starting the process over again. In this paper, we present ChatPipe, a novel system designed to facilitate seamless interaction between users and ChatGPT. ChatPipe provides users with effective recommendations on the next data preparation operations and guides ChatGPT to generate programs for those operations. ChatPipe also enables users to easily roll back to previous versions of a program, which facilitates more efficient experimentation and testing. We have developed a web application for ChatPipe and prepared several real-world ML tasks from Kaggle that demonstrate its capabilities.
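The rollback capability described in the abstract can be sketched as a versioned store of generated programs: each prompt/program pair is recorded, and rolling back discards later versions so iteration resumes from an earlier state. This is a minimal illustrative sketch only; the class and method names are hypothetical, not ChatPipe's actual API.

```python
# Hypothetical sketch of version tracking with rollback for
# LLM-generated data preparation programs. All names are illustrative.

class PipelineHistory:
    """Stores successive versions of a generated data-prep program."""

    def __init__(self):
        self._versions = []  # list of (prompt, program) tuples

    def record(self, prompt, program):
        """Save a new program version produced for a user prompt; return its id."""
        self._versions.append((prompt, program))
        return len(self._versions) - 1

    def rollback(self, version_id):
        """Discard all versions after `version_id` and return that program."""
        self._versions = self._versions[: version_id + 1]
        return self._versions[version_id][1]


history = PipelineHistory()
v0 = history.record("impute missing values", "df = df.fillna(df.mean())")
v1 = history.record("also scale features", "df = (df - df.mean()) / df.std()")
restored = history.rollback(v0)  # resume iteration from the first version
```

Keeping the full prompt/program lineage is what lets the user experiment with alternative branches without restarting the conversation from scratch.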


Cited By

  • (2025) Large language model for table processing: a survey. Frontiers of Computer Science 19(2). DOI: 10.1007/s11704-024-40763-6. Online publication date: 1-Feb-2025


    Published In

    SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data
    June 2024
    694 pages
    ISBN:9798400704222
    DOI:10.1145/3626246
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. data preparation
    2. human-in-the-loop
    3. reinforcement learning

    Qualifiers

    • Short-paper

    Funding Sources

• the Outstanding Innovative Talents Cultivation Funded Programs 2024 of Renmin University of China
• the Beijing Natural Science Foundation
• the Natural Science Foundation of China

    Conference

    SIGMOD/PODS '24

    Acceptance Rates

    Overall Acceptance Rate 699 of 3,470 submissions, 20%

