Improve LLM Inference Performance with Matrix Decomposition Strategies

  • Conference paper
Intelligence Science V (ICIS 2024)

Part of the book series: IFIP Advances in Information and Communication Technology ((IFIPAICT,volume 720))

Abstract

Large Language Models (LLMs) are highly effective in various applications but are often limited by their performance (both efficiency and accuracy) during the inference stage. This paper introduces a novel compression technique that leverages Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF) within the MLP layers of transformer-based LLMs. By incorporating adaptive batch sizing and various initialization methods, our method significantly enhances the inference efficiency of these models without compromising their accuracy. We present empirical evidence that our method improves both model efficiency and accuracy during the inference stage. Specifically, with SVD decomposition we achieve a 1.6x speedup in inference token processing while retaining over 95% of the original model’s accuracy. Additionally, with NMF decomposition we observe up to a 7% improvement in model accuracy compared with the original model, while maintaining or slightly improving token processing efficiency. These observations suggest that different matrix decomposition techniques can be strategically employed depending on the application requirements: SVD decomposition to boost efficiency, and NMF decomposition to enhance accuracy.
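
The details of the method are in the paper itself; as a rough illustration of the core idea, the sketch below (written against PyTorch, with a hypothetical helper name svd_factorize_linear, hypothetical LLaMA-style layer sizes, and an arbitrarily chosen rank) shows how a dense MLP projection can be replaced by two smaller linear layers built from its truncated SVD. It does not reproduce the paper's adaptive batch sizing or initialization schemes.

import torch
import torch.nn as nn

def svd_factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    # Replace W (out_features x in_features) with its rank-r truncated SVD,
    # W ~= (U_r diag(S_r)) V_r, realised as two smaller linear layers.
    W = layer.weight.data
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]    # (out_features, rank), singular values folded in
    V_r = Vh[:rank, :]              # (rank, in_features)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    with torch.no_grad():
        first.weight.copy_(V_r)
        second.weight.copy_(U_r)
        if layer.bias is not None:
            second.bias.copy_(layer.bias)
    # Forward pass computes U_r diag(S_r) V_r x + b, approximating W x + b.
    return nn.Sequential(first, second)

# Hypothetical LLaMA-style up-projection; rank chosen for illustration only.
mlp = nn.Linear(4096, 11008)
compressed = svd_factorize_linear(mlp, rank=256)

x = torch.randn(8, 4096)
rel_err = torch.norm(mlp(x) - compressed(x)) / torch.norm(mlp(x))
print(f"relative output error at rank 256: {rel_err:.3f}")

An analogous two-factor replacement could be built from NMF factors W ~= AB (for example via sklearn.decomposition.NMF), with the caveat that NMF requires a non-negative input matrix, so the weights would first need to be shifted, clipped, or split into positive and negative parts; the paper's NMF variant and its reported accuracy gains are not reproduced by this sketch.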

Author information

Corresponding author

Correspondence to Jiyuan Shi.

Copyright information

© 2025 IFIP International Federation for Information Processing

About this paper

Cite this paper

Shi, J., Shi, C. (2025). Improve LLM Inference Performance with Matrix Decomposition Strategies. In: Shi, Z., Witbrock, M., Tian, Q. (eds) Intelligence Science V. ICIS 2024. IFIP Advances in Information and Communication Technology, vol 720. Springer, Cham. https://doi.org/10.1007/978-3-031-71253-1_12

  • DOI: https://doi.org/10.1007/978-3-031-71253-1_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-71252-4

  • Online ISBN: 978-3-031-71253-1

  • eBook Packages: Computer Science, Computer Science (R0)
