Multi-modal Data Processing for Foundation Models: Practical Guidances and Use Cases
Pages 6414–6415
Abstract
In the foundation models era, efficiently processing multi-modal data is crucial. This tutorial covers key techniques for multi-modal data processing and introduces the open-source Data-Juicer system, designed to tackle the complexities of data variety, quality, and scale. Participants will learn how to use Data-Juicer's operators and tools to format, map, filter, deduplicate, and select multi-modal data efficiently and effectively. They will also become familiar with the Data-Juicer Sandbox Lab, where users can easily experiment with diverse data recipes, i.e., methodical sequences of operators, and streamline the creation of scalable data processing pipelines. This hands-on experience solidifies the concepts discussed, provides a space for innovation and exploration, and highlights how data recipes can be optimized and deployed in high-performance distributed environments.
By the end of this tutorial, attendees will be equipped with the practical knowledge and skills to navigate multi-modal data processing for foundation models. They will leave with actionable experience with an industrial open-source system and an enriched perspective on the importance of high-quality data in AI, poised to implement sustainable and scalable solutions in their own projects. The system and related materials are available at https://github.com/modelscope/data-juicer.
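To make the "data recipe" idea concrete, here is a minimal, self-contained Python sketch that chains a mapper, a filter, and a deduplicator over a toy image-caption corpus. The operator names and pipeline interface below are hypothetical illustrations of the concept only, not Data-Juicer's actual API; see the repository linked above for the real operator set and recipe format.

```python
# Illustrative sketch only: a "data recipe" as an ordered sequence of
# operators applied to a multi-modal (image-caption) corpus. Names and
# interfaces here are hypothetical, not Data-Juicer's real API.
import hashlib

def lowercase_mapper(sample):
    """Mapper: normalize the caption text in place."""
    sample["text"] = sample["text"].lower()
    return sample

def caption_length_filter(sample, min_len=10):
    """Filter: keep samples whose caption is at least min_len characters."""
    return sample if len(sample["text"]) >= min_len else None

def make_deduplicator():
    """Deduplicator: drop samples whose (caption, image) pair was seen before."""
    seen = set()
    def dedup(sample):
        key = hashlib.md5((sample["text"] + sample["image"]).encode()).hexdigest()
        if key in seen:
            return None  # exact duplicate: drop
        seen.add(key)
        return sample
    return dedup

def run_recipe(dataset, recipe):
    """Apply each operator in order; an operator returns None to drop a sample."""
    for op in recipe:
        dataset = [s for s in (op(sample) for sample in dataset) if s is not None]
    return dataset

if __name__ == "__main__":
    corpus = [
        {"text": "A cat sitting on a windowsill", "image": "cat.jpg"},
        {"text": "A CAT SITTING ON A WINDOWSILL", "image": "cat.jpg"},  # duplicate after mapping
        {"text": "dog", "image": "dog.jpg"},                            # too short: filtered
    ]
    recipe = [lowercase_mapper, caption_length_filter, make_deduplicator()]
    print(run_recipe(corpus, recipe))  # one cleaned cat sample survives
```

Running the sketch drops the too-short caption and one of the now-identical duplicates, leaving a single cleaned record; production recipes compose many such operators and, as the abstract notes, can be deployed in high-performance distributed environments.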
Published In
KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 2024, 6901 pages
Copyright © 2024 Owner/Author.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
Published: 24 August 2024
Conference
KDD '24: The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 25–29, 2024
Barcelona, Spain
Acceptance Rates
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%