ABSTRACT
Generative AI models are at the forefront of advancing creative and analytical tasks, pushing the boundaries of what machines can generate and comprehend. Among these, latent diffusion models represent significant advances in generating high-fidelity audio and images. This study introduces a systematic approach to analysing GPU utilisation during the training of such models, leveraging Weights & Biases and the PyTorch Profiler for detailed monitoring and profiling. Our methodology is designed to uncover inefficiencies in GPU resource allocation and to pinpoint bottlenecks in the training pipeline. The resulting insights aim to guide strategies for improving training efficiency, potentially reducing computational costs and accelerating the development cycle of generative AI models. This contribution not only highlights the critical role of resource optimisation in scaling AI technologies but also opens new avenues for research in efficient model training.
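To make the monitoring setup concrete, the sketch below shows one way to wrap a training loop with the PyTorch Profiler while logging metrics to Weights & Biases. This is a minimal illustration under stated assumptions, not the paper's actual pipeline: the model is a small stand-in for a latent diffusion denoiser, the data is synthetic, and the `ldm-gpu-profiling` project name is hypothetical.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, schedule, ProfilerActivity
import wandb

# Stand-in for a latent diffusion denoiser; the real model, data pipeline,
# and hyperparameters are assumptions for illustration only.
model = nn.Sequential(
    nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 4, 3, padding=1),
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

wandb.init(project="ldm-gpu-profiling")  # hypothetical project; requires a W&B login

# Collect CUDA events only when a GPU is actually present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

# wait=1, warmup=1, active=3: the profiler skips step 0, warms up on
# step 1, and records steps 2-4 of the loop.
with profile(
    activities=activities,
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step in range(5):
        x = torch.randn(8, 4, 32, 32, device=device)  # synthetic latents
        loss = model(x).pow(2).mean()                  # dummy objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()                                    # advance the profiler schedule
        wandb.log({"train/loss": loss.item()})

# Summarise the operators that dominate device time; alternatively, pass
# torch.profiler.tensorboard_trace_handler via on_trace_ready to inspect
# the full trace in TensorBoard.
sort_key = "cuda_time_total" if torch.cuda.is_available() else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
wandb.finish()
```

In practice, per-operator tables like the one printed above are what expose whether time is dominated by compute kernels, memory movement, or host-side data loading, which is the kind of bottleneck analysis the abstract describes.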
REFERENCES
- Marcel Aach, Eray Inanc, Rakesh Sarma, Morris Riedel, and Andreas Lintermann. 2023. Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks. Journal of Big Data 10, 1 (6 2023), 96. https://doi.org/10.1186/s40537-023-00765-w
- Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
- Ahmed M. Abdelmoniem and Marco Canini. 2021. DC2: Delay-aware Compression Control for Distributed Machine Learning. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications. 1--10. https://doi.org/10.1109/INFOCOM42981.2021.9488810
- Lucas Bellaiche, Rohin Shahi, Martin Harry Turpin, Anya Ragnhildstveit, Shawn Sprockett, Nathaniel Barr, Alexander Christensen, and Paul Seli. 2023. Humans versus AI: whether and why we prefer human-created compared to AI-created artwork. Cognitive Research: Principles and Implications 8, 1 (7 2023), 42. https://doi.org/10.1186/s41235-023-00499-6
- Lukas Biewald. 2020. Experiment Tracking with Weights and Biases. https://www.wandb.com/ Software available from wandb.com.
- Ebubekir Buber and Banu Diri. 2018. Performance Analysis and CPU vs GPU Comparison for Deep Learning. In 2018 6th International Conference on Control Engineering & Information Technology (CEIT). 1--6. https://doi.org/10.1109/CEIT.2018.8751930
- Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. (12 2015). http://arxiv.org/abs/1512.01274
- William Falcon and The PyTorch Lightning team. 2019. PyTorch Lightning. https://doi.org/10.5281/zenodo.3828935
- Attila Farkas, Krisztián Póra, Sándor Szénási, Gábor Kertész, and Róbert Lovas. 2022. Evaluation of a distributed deep learning framework as a reference architecture for a cloud environment. In 2022 IEEE 10th Jubilee International Conference on Computational Cybernetics and Cyber-Medical Systems (ICCC). 000083--000088. https://doi.org/10.1109/ICCC202255925.2022.9922765
- Cong Fu, Keqiang Yan, Limei Wang, Wing Yee Au, Michael McThrow, Tao Komikado, Koji Maruhashi, Kanji Uchino, Xiaoning Qian, and Shuiwang Ji. 2023. A Latent Diffusion Model for Protein Structure Generation. arXiv:2305.04120 [q-bio.BM]
- Dipesh Gyawali. 2023. Comparative Analysis of CPU and GPU Profiling for Deep Learning Models. (9 2023). http://arxiv.org/abs/2309.02521
- Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (Orlando, Florida, USA) (MM '14). Association for Computing Machinery, New York, NY, USA, 675--678. https://doi.org/10.1145/2647868.2654889
- Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. 2019. AudioCaps: Generating Captions for Audios in The Wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 119--132. https://doi.org/10.18653/v1/N19-1011
- Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. (6 2020). http://arxiv.org/abs/2006.15704
- Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D. Plumbley. 2023. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. (1 2023). https://arxiv.org/abs/2301.12503
- Haohe Liu, Qiao Tian, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D. Plumbley. 2023. AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining. (8 2023). http://arxiv.org/abs/2308.05734
- Ahmed M. Abdelmoniem, Ahmed Elzanaty, Mohamed-Slim Alouini, and Marco Canini. 2021. An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems. In Proceedings of Machine Learning and Systems, A. Smola, A. Dimakis, and I. Stoica (Eds.), Vol. 3. 297--322. https://proceedings.mlsys.org/paper_files/paper/2021/file/fea47a8aa372e42f3c84327aec9506cf-Paper.pdf
- Sophie J. Nightingale and Hany Farid. 2022. AI-synthesized faces are indistinguishable from real faces and more trustworthy. Proceedings of the National Academy of Sciences of the United States of America 119, 8 (2 2022). https://doi.org/10.1073/pnas.2120481119
- PyTorch. 2021. Introducing PyTorch Profiler - the new and improved performance tool. https://pytorch.org/docs/stable/profiler.html Software available from https://pytorch.org.
- Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Virtual Event, CA, USA) (KDD '20). Association for Computing Machinery, New York, NY, USA, 3505--3506. https://doi.org/10.1145/3394486.3406703
- Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. (12 2021). http://arxiv.org/abs/2112.10752
- Nadathur Satish, Narayanan Sundaram, and Kurt Keutzer. 2009. Optimizing the use of GPU memory in applications with large data sets. In 16th International Conference on High Performance Computing (HiPC 2009). 408--418. https://doi.org/10.1109/HIPC.2009.5433185
- Frank Seide and Amit Agarwal. 2016. CNTK: Microsoft's Open-Source Deep-Learning Toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD '16). Association for Computing Machinery, New York, NY, USA, 2135. https://doi.org/10.1145/2939672.2945397
- Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. (2 2018). https://arxiv.org/abs/1802.05799
- Shaohuai Shi, Qiang Wang, and Xiaowen Chu. 2017. Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs. (11 2017). http://arxiv.org/abs/1711.05979
- Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning. 2256--2265.
- Mark Stephenson, Siva Kumar Sastry Hari, Yunsup Lee, Eiman Ebrahimi, Daniel R. Johnson, David Nellans, Mike O'Connor, and Stephen W. Keckler. 2015. Flexible software profiling of GPU architectures. SIGARCH Comput. Archit. News 43, 3S (jun 2015), 185--197. https://doi.org/10.1145/2872887.2750375
- Ehsan Yousefzadeh-Asl-Miandoab, Ties Robroek, and Pinar Tozun. 2023. Profiling and Monitoring Deep Learning Training Tasks. In Proceedings of the 3rd Workshop on Machine Learning and Systems (Rome, Italy) (EuroMLSys '23). Association for Computing Machinery, New York, NY, USA, 18--25. https://doi.org/10.1145/3578356.3592589