Abstract
In deep learning training, CPU-intensive data preprocessing often becomes a bottleneck: expensive GPUs cannot be fully utilized, which degrades end-to-end training performance. Typically, CPUs perform the preprocessing while GPUs train the model. We propose HCache, a new caching algorithm for AI model training that uses a DLT-Informed Caching approach to improve both the reuse of cached data and the utilization of memory. The DLT-Informed Caching approach makes intelligent caching decisions by identifying data discrepancies across different preprocessing stages. Our evaluation shows that HCache achieves about a 2× speedup over the state-of-the-art CoorDL and Quiver when training computer vision models, while maintaining comparable accuracy.
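To illustrate the core idea of stage-aware caching in a DNN input pipeline, the sketch below caches samples after the deterministic preprocessing stages (e.g., decode and resize), whose outputs are identical every epoch, while re-running the random augmentation stage on each access. This is a minimal sketch of the general technique, not the authors' HCache implementation; all class and function names are hypothetical.

```python
# Minimal sketch of stage-aware caching for a DNN input pipeline.
# Deterministic stages (decode + resize) are cached; the random
# augmentation stage is always re-executed so training still sees
# fresh augmentations each epoch. Names are illustrative only.
import random
from collections import OrderedDict


class StageAwareCache:
    """LRU cache holding samples after the deterministic preprocessing stages."""

    def __init__(self, load_and_decode, augment, capacity):
        self.load_and_decode = load_and_decode  # deterministic, expensive stage
        self.augment = augment                  # random stage, never cached
        self.capacity = capacity                # max number of cached samples
        self._cache = OrderedDict()

    def get(self, sample_id):
        if sample_id in self._cache:
            # Reuse the decoded/resized sample; only redo the cheap random stage.
            self._cache.move_to_end(sample_id)
            decoded = self._cache[sample_id]
        else:
            decoded = self.load_and_decode(sample_id)
            self._cache[sample_id] = decoded
            if len(self._cache) > self.capacity:
                self._cache.popitem(last=False)  # evict least recently used
        return self.augment(decoded)


if __name__ == "__main__":
    # Toy usage: decoding is simulated as an expensive pure function,
    # augmentation as a light random transform applied on every access.
    decode = lambda i: [float(i)] * 8                       # stand-in for JPEG decode + resize
    augment = lambda x: [v + random.random() for v in x]    # stand-in for random crop/flip
    cache = StageAwareCache(decode, augment, capacity=2)
    for epoch in range(2):
        for i in [0, 1, 0, 1]:  # cached samples are reused across epochs
            _ = cache.get(i)
```

In a real pipeline the cached entries would be decoded tensors and the augmentation stage would be the usual random crop/flip transforms; the key design choice is caching only the stage boundary whose output is reproducible across epochs.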
References
Cloud Tensor Processing Units (TPU). https://cloud.google.com/tpu/docs/tpus
ImageNet-21k dataset. https://opendatalab.com/OpenDataLab/ImageNet-21k
NVIDIA A100. https://www.nvidia.com/en-us/data-center/a100/
Runtime options with memory, CPUs, and GPUs. https://docs.docker.com/config/containers/resource_constraints/
Choi, D., Passos, A., Shallue, C.J., Dahl, G.E.: Faster neural network training with data echoing. arXiv preprint arXiv:1907.05550 (2019)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Graur, D., Aymon, D., Kluser, D., Albrici, T., Thekkath, C.A., Klimovic, A.: Cachew: machine learning input data processing as a service. In: 2022 USENIX Annual Technical Conference (USENIX ATC 2022), pp. 689–706 (2022)
Gu, R., et al.: Fluid: dataset abstraction and elastic acceleration for cloud-native deep learning training jobs. In: 2022 IEEE 38th International Conference on Data Engineering (ICDE), pp. 2182–2195. IEEE (2022)
Guirao, J.A., et al.: Fast AI data preprocessing with NVIDIA DALI. In: GPU Technology Conference (2019)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Kumar, A.V., Sivathanu, M.: Quiver: an informed storage cache for deep learning. In: 18th USENIX Conference on File and Storage Technologies (FAST 2020), pp. 283–296 (2020)
Lee, G., et al.: Refurbish your training data: reusing partially augmented samples for faster deep neural network training. In: 2021 USENIX Annual Technical Conference (USENIX ATC 2021), pp. 537–550 (2021)
Mohan, J., Phanishayee, A., Raniwala, A., Chidambaram, V.: Analyzing and mitigating data stalls in DNN training. arXiv preprint arXiv:2007.06775 (2020)
Agarwal, N., Anil, R., Koren, T., Talwar, K., Zhang, C.: Stochastic optimization with laggard data pipelines. Adv. Neural Inf. Process. Syst. 33 (2020)
Park, P., Jeong, H., Kim, J.: Trainbox: an extreme-scale neural network training server architecture by systematically balancing operations. In: 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 825–838. IEEE (2020)
Um, T., Oh, B., Seo, B., Kweun, M., Kim, G., Lee, W.Y.: Fastflow: accelerating deep learning model training with smart offloading of input data pipeline. Proc. VLDB Endow. 16(5), 1086–1099 (2023)
Zhao, M., et al.: Understanding data storage and ingestion for large-scale deep recommendation model training: industrial product. In: Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 1042–1057 (2022)
Acknowledgement
This work is supported in part by the National Science and Technology Major Project (2021ZD0114300), the National Natural Science Foundation of China (U22A6001), the National Key R&D Program of China (2022YFB4500405, and 2023YFB4503005), and the Zhejiang provincial “Ten Thousand Talents Program” (2021R52007).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Han, K., Cheng, W., Li, Y., Wu, Y., Zeng, L., Chen, G. (2024). Reusing Your Prepared Data: An Informed Cache for Accelerating DNN Model Training. In: Onizuka, M., et al. Database Systems for Advanced Applications. DASFAA 2024. Lecture Notes in Computer Science, vol 14855. Springer, Singapore. https://doi.org/10.1007/978-981-97-5572-1_34
DOI: https://doi.org/10.1007/978-981-97-5572-1_34
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5571-4
Online ISBN: 978-981-97-5572-1