Abstract
Large Language Models (LLMs) are highly effective in a wide range of applications but are often limited by their inference-stage performance, in terms of both efficiency and accuracy. This paper introduces a novel compression technique that leverages Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF) within the MLP layers of transformer-based LLMs. By incorporating adaptive batch sizing and various initialization methods, our method significantly enhances the inference efficiency of these models without compromising their accuracy. We present empirical evidence showing that our method improves both model efficiency and accuracy during the inference stage. Specifically, with SVD decomposition, we achieve a 1.6x speedup in inference token processing while retaining over 95% of the original model's accuracy. Additionally, with NMF decomposition, we observe up to a 7% improvement in model accuracy compared with the original model, while maintaining or slightly improving token-processing efficiency. These observations suggest that different matrix decomposition techniques can be strategically employed depending on the application requirements: SVD decomposition to boost efficiency, and NMF decomposition to enhance accuracy.
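As a rough illustration of the SVD side of this idea (a minimal sketch, not the authors' implementation), the snippet below factorizes a dense MLP weight matrix into two thin factors via truncated SVD, so a forward pass costs roughly rank x (d_in + d_out) multiply-adds instead of d_in x d_out. The layer shape, the chosen rank, and the use of NumPy are assumptions made for illustration only.

import numpy as np

def svd_compress(W: np.ndarray, rank: int):
    """Factorize W (d_out x d_in) into A (d_out x rank) and B (rank x d_in)
    via truncated SVD, so that W @ x is approximated by A @ (B @ x)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    B = Vt[:rank, :]
    return A, B

# Toy example with an MLP-like projection (shapes are illustrative, not from the paper)
rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 512)).astype(np.float32)
A, B = svd_compress(W, rank=64)
x = rng.standard_normal(512).astype(np.float32)
y_full = W @ x         # original dense projection
y_low = A @ (B @ x)    # two thin matmuls replace one dense matmul

Whether the low-rank pair is actually faster and how much accuracy it retains depends on the rank chosen relative to the layer dimensions, which is the trade-off the abstract quantifies.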
Copyright information
© 2025 IFIP International Federation for Information Processing
About this paper
Cite this paper
Shi, J., Shi, C. (2025). Improve LLM Inference Performance with Matrix Decomposition Strategies. In: Shi, Z., Witbrock, M., Tian, Q. (eds) Intelligence Science V. ICIS 2024. IFIP Advances in Information and Communication Technology, vol 720. Springer, Cham. https://doi.org/10.1007/978-3-031-71253-1_12
DOI: https://doi.org/10.1007/978-3-031-71253-1_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-71252-4
Online ISBN: 978-3-031-71253-1