WActiGrad: Structured Pruning for Efficient Finetuning and Inference of Large Language Models on AI Accelerators

Chitty-Venkata, Krishna Teja; Sastry, Varuni Katti; Emani, Murali; Vishwanath, Venkatram; Shanmugavelu, Sanjif; Howland, Sylvia

doi:10.1007/978-3-031-69766-1_22


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14802)

Included in the following conference series:

European Conference on Parallel Processing


Abstract

Large Language Models (LLMs) have shown remarkable performance across various language processing applications. Nevertheless, their extensive computational requirements can hinder their deployment in real-time applications or resource-constrained environments. Pruning is a powerful technique to reduce model size and make the model computationally efficient. In this paper, we propose a structured pruning algorithm, Weight Activation and Gradient (WActiGrad), to obtain smaller LLMs from large pre-trained models. We investigate the level of granularity at which structured pruning techniques can be applied to an LLM and identify the challenges in applying these techniques across different parts of the transformer. Based on these observations, we develop a pruning methodology that is adaptable to various attention and feedforward network modules. We comprehensively assess our WActiGrad method on state-of-the-art LLMs, namely LLaMA (7B and 13B), LLaMA-2 (7B and 13B), and Mistral-7B, across several language benchmarks for post-pretraining. This approach can prune close to 20% of the original model size without significantly compromising the model validation accuracy. We evaluate the hardware performance of our structurally pruned LLMs on different AI accelerators, such as the Nvidia A100 GPU, Groq LPU, Cerebras CS-2, and Graphcore Bow systems, to show the effectiveness of the structured pruning technique. The findings presented in this paper offer insights into integrating structured pruning techniques with deployment on AI accelerators.
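
The abstract names three signals (weights, activations, and gradients) but does not give the scoring formula, so the following is only a minimal sketch of structured pruning under stated assumptions, not the authors' method. It assumes a hypothetical per-row importance score, the sum over inputs of |weight| × activation norm × |gradient|, and removes the lowest-scoring output rows of a single linear layer. The names `row_scores` and `prune_rows`, the toy values, and the score itself are all illustrative.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def row_scores(weights: Matrix, act_norms: List[float], grads: Matrix) -> List[float]:
    """Hypothetical importance per output row: sum_j |w_ij| * a_j * |g_ij|.

    This combination is an assumption for illustration; the abstract does not
    state how WActiGrad combines the three signals.
    """
    return [
        sum(abs(w) * a * abs(g) for w, a, g in zip(w_row, act_norms, g_row))
        for w_row, g_row in zip(weights, grads)
    ]


def prune_rows(
    weights: Matrix, act_norms: List[float], grads: Matrix, ratio: float
) -> Tuple[Matrix, List[int]]:
    """Drop the lowest-scoring fraction `ratio` of whole rows (structured pruning)."""
    scores = row_scores(weights, act_norms, grads)
    n_drop = int(len(weights) * ratio)
    # Indices of the n_drop rows with the smallest scores.
    drop = set(sorted(range(len(scores)), key=scores.__getitem__)[:n_drop])
    kept = [i for i in range(len(weights)) if i not in drop]
    return [weights[i] for i in kept], kept


# Toy layer with 5 output rows and 2 inputs (values are made up).
weights = [[0.5, -0.2], [0.01, 0.02], [1.0, 0.3], [-0.4, 0.4], [0.05, -0.05]]
grads = [[0.1, 0.1], [0.1, 0.1], [0.2, 0.1], [0.1, 0.3], [0.1, 0.1]]
act_norms = [1.0, 2.0]

pruned, kept = prune_rows(weights, act_norms, grads, ratio=0.2)
print(kept)
```

Because whole rows are removed, the pruned layer is a smaller dense matrix rather than a sparse one, which is the property that lets structurally pruned models run on accelerators without sparse-kernel support.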



Acknowledgment

This research was funded in part by and used resources at the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. We thank Sid Raskar, Sam Foreman, Ray Powell, and William Arnold from ALCF; Natalia Vassilieva and Alice Zhang from Cerebras; and Alex Tsyplikhin from Graphcore for their input.

Author information

Authors and Affiliations

Argonne National Laboratory, Lemont, USA
Krishna Teja Chitty-Venkata, Varuni Katti Sastry, Murali Emani & Venkatram Vishwanath
Groq Inc., Mountain View, USA
Sanjif Shanmugavelu
Cerebras Systems, Sunnyvale, USA
Sylvia Howland


Corresponding author

Correspondence to Krishna Teja Chitty-Venkata.

Editor information

Editors and Affiliations

University Carlos III of Madrid, Madrid, Spain
Jesus Carretero
University of Oregon, Eugene, OR, USA
Sameer Shende
University Carlos III of Madrid, Madrid, Spain
Javier Garcia-Blas
TU Wien, Vienna, Austria
Ivona Brandic
Universidad Complutense de Madrid, Madrid, Spain
Katzalin Olcoz
Université Grenoble Alpes, Saint Martin d'Hères, France
Martin Schreiber


Cite this paper

Chitty-Venkata, K.T., Sastry, V.K., Emani, M., Vishwanath, V., Shanmugavelu, S., Howland, S. (2024). WActiGrad: Structured Pruning for Efficient Finetuning and Inference of Large Language Models on AI Accelerators. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14802. Springer, Cham. https://doi.org/10.1007/978-3-031-69766-1_22


DOI: https://doi.org/10.1007/978-3-031-69766-1_22
Published: 26 August 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-69765-4
Online ISBN: 978-3-031-69766-1
eBook Packages: Computer Science, Computer Science (R0)
