short-paper

ChatPipe: Orchestrating Data Preparation Pipelines by Optimizing Human-ChatGPT Interactions

Authors:

Ju Fan,

Nan Tang

SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data

Pages 484 - 487

https://doi.org/10.1145/3626246.3654727

Published: 09 June 2024

Abstract

Orchestrating a high-quality data preparation program is essential for successful machine learning (ML), but it is known to be time-consuming and labor-intensive. Despite the impressive capabilities of large language models like ChatGPT in generating programs by interacting with users through natural language prompts, there are still limitations. Specifically, a user must provide specific prompts to iteratively guide ChatGPT in improving data preparation programs, which requires a certain level of expertise in programming, the dataset used, and the ML task. Moreover, once a program has been generated, it is non-trivial to revisit a previous version or make changes to the program without starting the process over again. In this paper, we present ChatPipe, a novel system designed to facilitate seamless interaction between users and ChatGPT. ChatPipe provides users with effective recommendations for the next data preparation operations, and guides ChatGPT to generate programs for those operations. ChatPipe also enables users to easily roll back to previous versions of the program, which facilitates more efficient experimentation and testing. We have developed a web application for ChatPipe and prepared several real-world ML tasks from Kaggle to demonstrate its capabilities.

Cited By

Lu W, Zhang J, Fan J, Fu Z, Chen Y, Du X. (2025). Large language model for table processing: a survey. Frontiers of Computer Science: Selected Publications from Chinese Universities 19:2. DOI: 10.1007/s11704-024-40763-6. Online publication date: 1-Feb-2025.
https://dl.acm.org/doi/10.1007/s11704-024-40763-6

Index Terms

ChatPipe: Orchestrating Data Preparation Pipelines by Optimizing Human-ChatGPT Interactions
1. Information systems
  1. Information systems applications
    1. Decision support systems
      1. Data analytics

Information

Published In

SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data

June 2024

694 pages

ISBN: 9798400704222

DOI: 10.1145/3626246

General Chairs:
Pablo Barcelo (Universidad Catolica, Chile), Nayat Sanchez-Pi (INRIA Chile)

Program Chairs:
Alexandra Meliou (University of Massachusetts Amherst, USA), S. Sudarshan (Indian Institute of Technology Bombay)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

The Outstanding Innovative Talents Cultivation Funded Programs 2024 of Renmin University of China
The Beijing Natural Science Foundation
The Natural Science Foundation of China

Conference

SIGMOD/PODS '24

Sponsor:

SIGMOD

SIGMOD/PODS '24: International Conference on Management of Data

June 9 - 15, 2024

Santiago, Chile

Acceptance Rates

Overall Acceptance Rate 699 of 3,470 submissions, 20%

Article Metrics

Total Citations: 1
Total Downloads: 190
Downloads (last 12 months): 190
Downloads (last 6 weeks): 50

Reflects downloads up to 15 Feb 2025
