
MMDataLoader: Reusing Preprocessed Data Among Concurrent Model Training Tasks



Abstract:

Data preprocessing plays an important role in deep learning and directly affects training efficiency. Preprocessing is typically performed on the CPU, and the preprocessed data are then fed to models trained on the GPU. We observe that data preprocessing on the CPU can become a bottleneck for the entire model training task. To tackle this issue, we have developed MMDataLoader, which enables preprocessed data to be reused among multiple model training tasks. MMDataLoader automatically constructs a data preprocessing pipeline based on each task's specific preprocessing workflow, maximizing data reuse and reducing the computing workload on the CPU. Unlike conventional data loaders, which operate at the task level and serve data to a single training task, MMDataLoader operates at the server level and provides data for all concurrently running tasks. Extensive experiments show that, compared with the conventional approach in which concurrent training tasks each preprocess their own data, MMDataLoader significantly increases preprocessing throughput without affecting model convergence. For instance, with three tasks running, preprocessing throughput increases by 1.6x to 3.15x, depending on the tasks being executed and the proportion of preprocessing operations shared among them.
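
The core idea described in the abstract, reusing the preprocessing operations that concurrent tasks have in common, can be illustrated with a minimal sketch. The paper's implementation is not shown on this page, so all names below (SharedPipeline, the example operations) are hypothetical stand-ins rather than MMDataLoader's actual API. The sketch finds the prefix of operations shared by all registered tasks, computes it once per sample, caches the intermediate result, and applies each task's remaining operations to the cached value.

```python
# Hypothetical sketch of preprocessed-data reuse across concurrent
# training tasks; not MMDataLoader's actual API.
from typing import Callable, Dict, List, Tuple

Op = Tuple[str, Callable]  # (operation name, operation function)

class SharedPipeline:
    """Runs the operations shared by all registered tasks once per
    sample, caches the intermediate result, and applies each task's
    remaining (task-specific) operations to the cached value."""

    def __init__(self, task_pipelines: Dict[str, List[Op]]):
        self.tasks = task_pipelines
        self.prefix = self._common_prefix(list(task_pipelines.values()))
        self.cache: Dict[int, object] = {}  # sample id -> shared result

    @staticmethod
    def _common_prefix(pipelines: List[List[Op]]) -> List[Op]:
        # Ops are considered shared when their names match across all tasks.
        prefix = []
        for ops in zip(*pipelines):
            if len({name for name, _ in ops}) != 1:
                break
            prefix.append(ops[0])
        return prefix

    def get(self, task: str, sample_id: int, raw):
        # Shared prefix: computed once per sample, reused by every task.
        if sample_id not in self.cache:
            x = raw
            for _, fn in self.prefix:
                x = fn(x)
            self.cache[sample_id] = x
        x = self.cache[sample_id]
        # Task-specific suffix: computed separately for each task.
        for _, fn in self.tasks[task][len(self.prefix):]:
            x = fn(x)
        return x

# Example: two tasks share "decode" and "resize" but differ afterwards.
decode = ("decode", lambda s: s.upper())  # stand-in for image decoding
resize = ("resize", lambda s: s[:8])      # stand-in for resizing
flip   = ("flip",   lambda s: s[::-1])
gray   = ("gray",   lambda s: s.lower())

pipe = SharedPipeline({
    "taskA": [decode, resize, flip],
    "taskB": [decode, resize, gray],
})
raw = "example-image-bytes"
print(pipe.get("taskA", 0, raw))  # shared prefix computed here
print(pipe.get("taskB", 0, raw))  # shared prefix reused from cache
```

In a real server-level loader the cache would have to be bounded and shared across the processes feeding the GPUs; the sketch only shows the reuse logic that lets the shared portion of the preprocessing work be done once instead of once per task.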
Published in: IEEE Transactions on Computers (Volume: 73, Issue: 2, February 2024)
Page(s): 510 - 522
Date of Publication: 23 November 2023
