ABSTRACT
Generative AI models are at the forefront of advancing creative and analytical tasks, pushing the boundaries of what machines can generate and comprehend. Among these, latent diffusion models represent significant advances in generating high-fidelity audio and images. This study introduces a systematic approach to analysing GPU utilisation during the training of such models, leveraging Weights & Biases and the PyTorch Profiler for detailed monitoring and profiling. Our methodology is designed to uncover inefficiencies in GPU resource allocation and to pinpoint bottlenecks in the training pipeline. The resulting insights aim to guide strategies for improving training efficiency, potentially reducing computational costs and accelerating the development cycle of generative AI models. This contribution not only highlights the critical role of resource optimisation in scaling AI technologies but also opens new avenues for research in efficient model training.
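To make the monitoring setup concrete, the sketch below shows one way to wrap a training loop with the PyTorch Profiler while logging metrics to Weights & Biases. This is a minimal illustration under stated assumptions, not the paper's actual pipeline: the model is a small stand-in for a latent diffusion denoiser, the data is synthetic, and the `ldm-gpu-profiling` project name is hypothetical.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, schedule, ProfilerActivity
import wandb

# Stand-in for a latent diffusion denoiser; the real model, data pipeline,
# and hyperparameters are assumptions for illustration only.
model = nn.Sequential(
    nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 4, 3, padding=1),
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

wandb.init(project="ldm-gpu-profiling")  # hypothetical project; requires a W&B login

# Collect CUDA events only when a GPU is actually present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

# wait=1, warmup=1, active=3: the profiler skips step 0, warms up on
# step 1, and records steps 2-4 of the loop.
with profile(
    activities=activities,
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step in range(5):
        x = torch.randn(8, 4, 32, 32, device=device)  # synthetic latents
        loss = model(x).pow(2).mean()                  # dummy objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()                                    # advance the profiler schedule
        wandb.log({"train/loss": loss.item()})

# Summarise the operators that dominate device time; alternatively, pass
# torch.profiler.tensorboard_trace_handler via on_trace_ready to inspect
# the full trace in TensorBoard.
sort_key = "cuda_time_total" if torch.cuda.is_available() else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
wandb.finish()
```

In practice, per-operator tables like the one printed above are what expose whether time is dominated by compute kernels, memory movement, or host-side data loading, which is the kind of bottleneck analysis the abstract describes.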
REFERENCES
- Marcel Aach, Eray Inanc, Rakesh Sarma, Morris Riedel, and Andreas Lintermann. 2023. Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks. Journal of Big Data 10, 1 (6 2023), 96. https://doi.org/10.1186/s40537-023-00765-w
- Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
- Ahmed M. Abdelmoniem and Marco Canini. 2021. DC2: Delay-aware Compression Control for Distributed Machine Learning. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. 1--10. https://doi.org/10.1109/INFOCOM42981.2021.9488810
- Lucas Bellaiche, Rohin Shahi, Martin Harry Turpin, Anya Ragnhildstveit, Shawn Sprockett, Nathaniel Barr, Alexander Christensen, and Paul Seli. 2023. Humans versus AI: whether and why we prefer human-created compared to AI-created artwork. Cognitive Research: Principles and Implications 8, 1 (7 2023), 42. https://doi.org/10.1186/s41235-023-00499-6
- Lukas Biewald. 2020. Experiment Tracking with Weights and Biases. https://www.wandb.com/ Software available from wandb.com.
- Ebubekir Buber and Banu Diri. 2018. Performance Analysis and CPU vs GPU Comparison for Deep Learning. In 2018 6th International Conference on Control Engineering & Information Technology (CEIT). 1--6. https://doi.org/10.1109/CEIT.2018.8751930
- Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. (12 2015). http://arxiv.org/abs/1512.01274
- William Falcon and The PyTorch Lightning team. 2019. PyTorch Lightning. https://doi.org/10.5281/zenodo.3828935
- Attila Farkas, Krisztián Póra, Sándor Szénási, Gábor Kertész, and Róbert Lovas. 2022. Evaluation of a distributed deep learning framework as a reference architecture for a cloud environment. In 2022 IEEE 10th Jubilee International Conference on Computational Cybernetics and Cyber-Medical Systems (ICCC). 000083--000088. https://doi.org/10.1109/ICCC202255925.2022.9922765
- Cong Fu, Keqiang Yan, Limei Wang, Wing Yee Au, Michael McThrow, Tao Komikado, Koji Maruhashi, Kanji Uchino, Xiaoning Qian, and Shuiwang Ji. 2023. A Latent Diffusion Model for Protein Structure Generation. arXiv:2305.04120 [q-bio.BM]
- Dipesh Gyawali. 2023. Comparative Analysis of CPU and GPU Profiling for Deep Learning Models. (9 2023). http://arxiv.org/abs/2309.02521
- Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (Orlando, Florida, USA) (MM '14). Association for Computing Machinery, New York, NY, USA, 675--678. https://doi.org/10.1145/2647868.2654889
- Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. 2019. AudioCaps: Generating Captions for Audios in The Wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 119--132. https://doi.org/10.18653/v1/N19-1011
- Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. (6 2020). http://arxiv.org/abs/2006.15704
- Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D. Plumbley. 2023. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. (1 2023). https://arxiv.org/abs/2301.12503
- Haohe Liu, Qiao Tian, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D. Plumbley. 2023. AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining. (8 2023). http://arxiv.org/abs/2308.05734
- Ahmed M. Abdelmoniem, Ahmed Elzanaty, Mohamed-Slim Alouini, and Marco Canini. 2021. An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems. In Proceedings of Machine Learning and Systems, A. Smola, A. Dimakis, and I. Stoica (Eds.), Vol. 3. 297--322. https://proceedings.mlsys.org/paper_files/paper/2021/file/fea47a8aa372e42f3c84327aec9506cf-Paper.pdf
- Sophie J. Nightingale and Hany Farid. 2022. AI-synthesized faces are indistinguishable from real faces and more trustworthy. Proceedings of the National Academy of Sciences of the United States of America 119, 8 (2 2022). https://doi.org/10.1073/pnas.2120481119
- PyTorch. 2021. Introducing PyTorch Profiler - the new and improved performance tool. https://pytorch.org/docs/stable/profiler.html Software available from https://pytorch.org.
- Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Virtual Event, CA, USA) (KDD '20). Association for Computing Machinery, New York, NY, USA, 3505--3506. https://doi.org/10.1145/3394486.3406703
- Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. (12 2021). http://arxiv.org/abs/2112.10752
- Nadathur Satish, Narayanan Sundaram, and Kurt Keutzer. 2009. Optimizing the use of GPU memory in applications with large data sets. In 16th International Conference on High Performance Computing (HiPC 2009). 408--418. https://doi.org/10.1109/HIPC.2009.5433185
- Frank Seide and Amit Agarwal. 2016. CNTK: Microsoft's Open-Source Deep-Learning Toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD '16). Association for Computing Machinery, New York, NY, USA, 2135. https://doi.org/10.1145/2939672.2945397
- Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. (2 2018). https://arxiv.org/abs/1802.05799
- Shaohuai Shi, Qiang Wang, and Xiaowen Chu. 2017. Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs. (11 2017). http://arxiv.org/abs/1711.05979
- Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning. 2256--2265.
- Mark Stephenson, Siva Kumar Sastry Hari, Yunsup Lee, Eiman Ebrahimi, Daniel R. Johnson, David Nellans, Mike O'Connor, and Stephen W. Keckler. 2015. Flexible software profiling of GPU architectures. SIGARCH Comput. Archit. News 43, 3S (jun 2015), 185--197. https://doi.org/10.1145/2872887.2750375
- Ehsan Yousefzadeh-Asl-Miandoab, Ties Robroek, and Pinar Tozun. 2023. Profiling and Monitoring Deep Learning Training Tasks. In Proceedings of the 3rd Workshop on Machine Learning and Systems (Rome, Italy) (EuroMLSys '23). Association for Computing Machinery, New York, NY, USA, 18--25. https://doi.org/10.1145/3578356.3592589