Abstract
Hybrid parallelism is widely used for training large language models (LLMs). However, existing efforts focus on optimizing individual strategies within hybrid parallelism, such as pipeline scheduling or device assignment, which limits overall training efficiency. This paper explores the intricate dependencies among four pivotal strategies (model scaling, model splitting, pipeline scheduling, and device assignment) and proposes Quartet, a holistic hybrid parallel framework that optimizes them jointly. The novelty lies in the formulation of parameterized pipeline scheduling and device assignment, together with a pioneering analysis of how model scaling affects throughput. These results provide the basis for orchestrating the four strategies within a unified framework that efficiently maximizes overall training throughput. Evaluation results show that, for representative LLMs, Quartet improves training throughput by up to 2.16× over state-of-the-art synchronous hybrid parallel approaches.
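The abstract does not detail Quartet's optimization procedure, so the sketch below is only a hypothetical illustration of the general idea of joint optimization: enumerate candidate configurations of the four strategies (model scaling, model splitting, pipeline scheduling, device assignment) and rank them with an analytical throughput estimate. The class and function names, the candidate grids, and the toy cost model are all assumptions for illustration, not Quartet's actual formulation.

```python
# Illustrative sketch only: joint search over four hybrid-parallel strategies,
# ranked by a toy analytical throughput model. Not Quartet's real algorithm.
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Plan:
    num_layers: int         # model scaling: depth of the scaled model
    num_stages: int         # model splitting: number of pipeline stages
    schedule: str           # pipeline scheduling: e.g. "1F1B" or "interleaved"
    devices_per_stage: int  # device assignment: data-parallel width per stage


def estimated_throughput(plan: Plan, num_devices: int = 16,
                         microbatches: int = 32) -> float:
    """Toy throughput estimate (samples per time unit); purely illustrative."""
    if plan.num_stages * plan.devices_per_stage != num_devices:
        return 0.0  # infeasible device assignment
    layers_per_stage = plan.num_layers / plan.num_stages
    stage_time = layers_per_stage * 1.0             # time per micro-batch per stage
    bubble = (plan.num_stages - 1) * stage_time     # 1F1B-style pipeline bubble
    if plan.schedule == "interleaved":
        bubble /= 2                                 # assume interleaving halves the bubble
    total_time = microbatches * stage_time + bubble
    samples = microbatches * plan.devices_per_stage
    return samples / total_time


def joint_search(num_devices: int = 16) -> Plan:
    """Exhaustively enumerate the (small) joint space of the four strategies."""
    candidates = (
        Plan(layers, stages, sched, num_devices // stages)
        for layers, stages, sched in product(
            [24, 32, 48],                # model scaling choices (hypothetical)
            [1, 2, 4, 8],                # model splitting choices
            ["1F1B", "interleaved"],     # pipeline schedules
        )
        if num_devices % stages == 0
    )
    return max(candidates, key=lambda p: estimated_throughput(p, num_devices))


if __name__ == "__main__":
    best = joint_search()
    print(best, f"~{estimated_throughput(best):.2f} samples/unit time")
```

In practice the search would be driven by the parameterized scheduling and device-assignment formulations and the model-scaling throughput analysis the paper introduces, rather than a toy grid; the point of the sketch is only the structure of a single objective evaluated over the joint space of all four strategies.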
About this paper
Cite this paper
Zhang, W. et al. (2024). Quartet: A Holistic Hybrid Parallel Framework for Training Large Language Models. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14802. Springer, Cham. https://doi.org/10.1007/978-3-031-69766-1_29