An efficient quantized GEMV implementation for large language models inference with matrix core

Published in: The Journal of Supercomputing

Abstract

The impressive capabilities of Large Language Models (LLMs) have drawn considerable attention to deploying and using these models on devices. However, the enormous number of parameters in LLMs leads to a significant memory footprint and computational burden during inference, severely restricting their potential applications. As an effective model compression technique, quantization lowers the barrier to deploying LLMs and running inference. With quantized weights, quantized GEneral Matrix–Vector multiplication (GEMV) becomes the primary runtime component of the inference process. In practice, the dequantization overhead and low computational density limit the performance of quantized GEMV. This paper proposes an efficient quantized GEMV implementation, consisting of a vectorized pre-fetch scheme, an efficient kernel design based on Matrix Core, and an optimized atomicAdd, to accelerate LLM inference. Comparison experiments were performed on an AMD MI210 GPU. The results show that the proposed method outperforms previous approaches on quantized GEMV across multiple matrix shapes and on end-to-end LLM inference.
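To make the core operation concrete: in weight-only quantized inference, a GEMV dequantizes packed low-bit weights on the fly, multiplies them by the activation vector, and merges partial sums from different splits of the reduction dimension, typically with atomicAdd. The HIP-style sketch below illustrates this structure under assumed parameters (4-bit symmetric quantization, a group size of 128, one block per output row, a split-K grid); it is an illustration only, not the paper's kernel, which additionally employs vectorized pre-fetching and Matrix Core (MFMA) instructions.

```cpp
// Illustrative sketch of a weight-only quantized GEMV (y = W_q * x), NOT the
// paper's Matrix Core kernel. Assumptions: W is N x K, stored row-major as
// packed 4-bit values (8 per uint32_t) with one float scale per GROUP_SIZE
// weights, a symmetric zero-point of 8, and K a multiple of BLOCK * 8 and of
// GROUP_SIZE.
#include <hip/hip_runtime.h>
#include <cstdint>

constexpr int GROUP_SIZE = 128;  // weights sharing one scale (assumption)
constexpr int BLOCK      = 256;  // threads per block (assumption)

__global__ void gemv_q4(const uint32_t* __restrict__ wq,     // packed weights, N*K/8 words
                        const float*    __restrict__ scales, // per-group scales, N*(K/GROUP_SIZE)
                        const float*    __restrict__ x,      // input vector, length K
                        float*          __restrict__ y,      // output vector, length N (zero-initialized)
                        int N, int K)
{
    const int row   = blockIdx.x;              // one block per output row
    const int kBase = blockIdx.y * BLOCK * 8;  // split of the K dimension handled by this block
    if (row >= N) return;

    // Each thread dequantizes 8 packed weights and multiplies by the matching x slice.
    float acc = 0.0f;
    const int k = kBase + threadIdx.x * 8;
    const uint32_t packed = wq[(row * K + k) / 8];
    const float s = scales[row * (K / GROUP_SIZE) + k / GROUP_SIZE];
    for (int i = 0; i < 8; ++i) {
        const int q = (packed >> (4 * i)) & 0xF;  // unsigned 4-bit value in [0, 15]
        acc += (q - 8) * s * x[k + i];            // dequantize, then multiply-accumulate
    }

    // Reduce within the block, then merge the split-K partial sums with one
    // atomicAdd per block; this merging step is what the paper optimizes.
    __shared__ float red[BLOCK];
    red[threadIdx.x] = acc;
    __syncthreads();
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) red[threadIdx.x] += red[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(&y[row], red[0]);
}
```

Under these assumptions the output vector y is zero-initialized and the kernel launched over a dim3(N, K / (BLOCK * 8)) grid with BLOCK threads per block; the paper's implementation replaces the scalar inner loop with Matrix Core tiles fed by vectorized pre-fetched operands.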

Data Availability

No datasets were generated or analyzed during the current study.

Acknowledgements

This work was supported by the Key Field Research and Development Plan of Guangdong Province (2022B0101070001) and the Natural Science Foundation of Guangdong Province (2024A1515010204).

Author information

Contributions

Yu Zhang and Rong Zhao conducted the research and wrote the main manuscript text under the guidance of Lu Lu. Yijie Guo and Zhanyu Yang prepared the tables and figures for the experimental section. All authors reviewed the manuscript.

Corresponding author

Correspondence to Lu Lu.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, Y., Lu, L., Zhao, R. et al. An efficient quantized GEMV implementation for large language models inference with matrix core. J Supercomput 81, 496 (2025). https://doi.org/10.1007/s11227-025-06993-6

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11227-025-06993-6

Keywords