On Integrating the Data-Science and Machine-Learning Pipelines for Responsible AI

Authors: Andy Yu, Kazem Taghva

GUIDE-AI '24: Proceedings of the Conference on Governance, Understanding and Integration of Data for Effective and Responsible AI

Pages 50 - 53

https://doi.org/10.1145/3665601.3669849

Published: 09 June 2024

Abstract

Herein, we advocate for the integration of the pipelines for data science (e.g., extraction, cleaning, and exploration) and machine learning (e.g., training data collection, feature selection, model selection, and parameter tuning), toward responsible and trustworthy artificial intelligence. We argue that the metadata generated by the machine-learning pipeline, which includes model outputs and model accuracy scores, is best managed and analyzed using data-science tools, thereby obtaining actionable insights into model performance, interpretability, and bias. We illustrate via two examples from our recent work as proof of concept: data summarization for model performance diagnostics; and input and output exploration to understand retrieval-augmented language models.
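
To make the idea of analyzing machine-learning metadata with data-science tools concrete, the following is a minimal sketch in Python. It is not the method from the paper or from the cited systems. It assumes a hypothetical prediction log in which each row holds input attributes together with the model output and the true label, and it uses a plain group-by to summarize accuracy per attribute value. All names (`prediction_log`, `slice_accuracy`, `underperforming_slices`, `min_support`) and all data values are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical prediction log (illustrative values only): one row per test
# example, with input attributes, the model output, and the true label.
prediction_log = [
    {"age_group": "18-30", "region": "north", "predicted": 1, "actual": 1},
    {"age_group": "18-30", "region": "south", "predicted": 0, "actual": 0},
    {"age_group": "18-30", "region": "north", "predicted": 1, "actual": 1},
    {"age_group": "31-50", "region": "south", "predicted": 1, "actual": 0},
    {"age_group": "31-50", "region": "north", "predicted": 0, "actual": 0},
    {"age_group": "51+", "region": "south", "predicted": 0, "actual": 1},
    {"age_group": "51+", "region": "south", "predicted": 1, "actual": 0},
    {"age_group": "51+", "region": "north", "predicted": 1, "actual": 1},
]


def accuracy(rows):
    """Fraction of rows where the model output matches the true label."""
    return sum(r["predicted"] == r["actual"] for r in rows) / len(rows)


def slice_accuracy(rows, attributes):
    """Group rows by each (attribute, value) pair; return size and accuracy."""
    slices = defaultdict(list)
    for row in rows:
        for attr in attributes:
            slices[(attr, row[attr])].append(row)
    return {key: (len(members), accuracy(members))
            for key, members in slices.items()}


def underperforming_slices(rows, attributes, min_support=2):
    """Return overall accuracy and the slices that fall below it, worst first."""
    overall = accuracy(rows)
    summary = slice_accuracy(rows, attributes)
    flagged = [(key, n, acc) for key, (n, acc) in summary.items()
               if n >= min_support and acc < overall]
    return overall, sorted(flagged, key=lambda item: item[2])


overall, flagged = underperforming_slices(prediction_log, ["age_group", "region"])
print(f"overall accuracy: {overall:.3f}")
for (attr, value), n, acc in flagged:
    print(f"{attr}={value}: n={n}, accuracy={acc:.3f}")
```

In this toy log, the summary surfaces slices whose accuracy falls below the overall figure, which is the kind of actionable insight into model performance and possible bias that the abstract refers to. A real pipeline would likely store the log in a database or data frame and use more principled summarization than a single-attribute group-by.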

Index Terms

1. Computing methodologies
  1. Artificial intelligence
  2. Machine learning

Information

Published In

GUIDE-AI '24: Proceedings of the Conference on Governance, Understanding and Integration of Data for Effective and Responsible AI

June 2024

67 pages

ISBN:9798400706943

DOI:10.1145/3665601

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 June 2024

Qualifiers

Research-article
Research
Refereed limited

Conference

SIGMOD/PODS '24

Sponsor:

SIGMOD

SIGMOD/PODS '24: International Conference on Management of Data

June 9 - 15, 2024

Santiago, Chile
