Abstract
Large Language Models (LLMs) are highly effective in a wide range of applications but are often limited by their inference-stage performance, in terms of both efficiency and accuracy. This paper introduces a novel compression technique that leverages Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF) within the MLP layers of transformer-based LLMs. By incorporating adaptive batch sizing and various initialization methods, our method significantly enhances the inference efficiency of these models without compromising their accuracy. We present empirical evidence showing that our method improves both model efficiency and accuracy during the inference stage. Specifically, with SVD decomposition, we achieve a 1.6x speedup in inference token processing while retaining over 95% of the original model's accuracy. Additionally, with NMF decomposition, we observe up to a 7% improvement in model accuracy compared with the original model, while maintaining or slightly improving token-processing efficiency. These observations suggest that different matrix decomposition techniques can be strategically employed depending on the application requirements: SVD decomposition to boost efficiency, and NMF decomposition to enhance accuracy.
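As a rough illustration of the SVD side of this idea (a minimal sketch, not the authors' implementation), the snippet below factorizes a dense MLP weight matrix into two thin factors via truncated SVD, so a forward pass costs roughly rank x (d_in + d_out) multiply-adds instead of d_in x d_out. The layer shape, the chosen rank, and the use of NumPy are assumptions made for illustration only.

import numpy as np

def svd_compress(W: np.ndarray, rank: int):
    """Factorize W (d_out x d_in) into A (d_out x rank) and B (rank x d_in)
    via truncated SVD, so that W @ x is approximated by A @ (B @ x)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    B = Vt[:rank, :]
    return A, B

# Toy example with an MLP-like projection (shapes are illustrative, not from the paper)
rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 512)).astype(np.float32)
A, B = svd_compress(W, rank=64)
x = rng.standard_normal(512).astype(np.float32)
y_full = W @ x         # original dense projection
y_low = A @ (B @ x)    # two thin matmuls replace one dense matmul

Whether the low-rank pair is actually faster and how much accuracy it retains depends on the rank chosen relative to the layer dimensions, which is the trade-off the abstract quantifies.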
Copyright information
© 2025 IFIP International Federation for Information Processing
About this paper
Cite this paper
Shi, J., Shi, C. (2025). Improve LLM Inference Performance with Matrix Decomposition Strategies. In: Shi, Z., Witbrock, M., Tian, Q. (eds) Intelligence Science V. ICIS 2024. IFIP Advances in Information and Communication Technology, vol 720. Springer, Cham. https://doi.org/10.1007/978-3-031-71253-1_12
DOI: https://doi.org/10.1007/978-3-031-71253-1_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-71252-4
Online ISBN: 978-3-031-71253-1