An efficient quantized GEMV implementation for large language models inference with matrix core

Published in: The Journal of Supercomputing

Abstract

The impressive capabilities of Large Language Models (LLMs) have drawn considerable attention to deploying and using these models on devices. However, the enormous number of parameters in LLMs leads to a significant memory footprint and computational burden during inference, severely restricting their potential applications. As an effective model compression technique, quantization lowers the barrier to deploying LLMs and running inference. With quantized weights, quantized GEneral Matrix–Vector multiplication (GEMV) becomes the primary runtime component of the inference process. In practice, the dequantization overhead and low computational density limit the performance of quantized GEMV. This paper proposes an efficient quantized GEMV implementation, consisting of a vectorized pre-fetch scheme, an efficient kernel design based on Matrix Core, and an optimized atomicAdd, to accelerate LLM inference. Comparison experiments were performed on an AMD MI210 GPU. The results show that the proposed method outperforms previous approaches on quantized GEMV across multiple matrix shapes and on end-to-end LLM inference.
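To make the core operation concrete: in weight-only quantized inference, a GEMV dequantizes packed low-bit weights on the fly, multiplies them by the activation vector, and merges partial sums from different splits of the reduction dimension, typically with atomicAdd. The HIP-style sketch below illustrates this structure under assumed parameters (4-bit symmetric quantization, a group size of 128, one block per output row, a split-K grid); it is an illustration only, not the paper's kernel, which additionally employs vectorized pre-fetching and Matrix Core (MFMA) instructions.

```cpp
// Illustrative sketch of a weight-only quantized GEMV (y = W_q * x), NOT the
// paper's Matrix Core kernel. Assumptions: W is N x K, stored row-major as
// packed 4-bit values (8 per uint32_t) with one float scale per GROUP_SIZE
// weights, a symmetric zero-point of 8, and K a multiple of BLOCK * 8 and of
// GROUP_SIZE.
#include <hip/hip_runtime.h>
#include <cstdint>

constexpr int GROUP_SIZE = 128;  // weights sharing one scale (assumption)
constexpr int BLOCK      = 256;  // threads per block (assumption)

__global__ void gemv_q4(const uint32_t* __restrict__ wq,     // packed weights, N*K/8 words
                        const float*    __restrict__ scales, // per-group scales, N*(K/GROUP_SIZE)
                        const float*    __restrict__ x,      // input vector, length K
                        float*          __restrict__ y,      // output vector, length N (zero-initialized)
                        int N, int K)
{
    const int row   = blockIdx.x;              // one block per output row
    const int kBase = blockIdx.y * BLOCK * 8;  // split of the K dimension handled by this block
    if (row >= N) return;

    // Each thread dequantizes 8 packed weights and multiplies by the matching x slice.
    float acc = 0.0f;
    const int k = kBase + threadIdx.x * 8;
    const uint32_t packed = wq[(row * K + k) / 8];
    const float s = scales[row * (K / GROUP_SIZE) + k / GROUP_SIZE];
    for (int i = 0; i < 8; ++i) {
        const int q = (packed >> (4 * i)) & 0xF;  // unsigned 4-bit value in [0, 15]
        acc += (q - 8) * s * x[k + i];            // dequantize, then multiply-accumulate
    }

    // Reduce within the block, then merge the split-K partial sums with one
    // atomicAdd per block; this merging step is what the paper optimizes.
    __shared__ float red[BLOCK];
    red[threadIdx.x] = acc;
    __syncthreads();
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) red[threadIdx.x] += red[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(&y[row], red[0]);
}
```

Under these assumptions the output vector y is zero-initialized and the kernel launched over a dim3(N, K / (BLOCK * 8)) grid with BLOCK threads per block; the paper's implementation replaces the scalar inner loop with Matrix Core tiles fed by vectorized pre-fetched operands.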

Data Availability

No datasets were generated or analyzed during the current study.

Acknowledgements

This work was supported by the Key Field Research and Development Plan of Guangdong Province (2022B0101070001) and the Natural Science Foundation of Guangdong Province (2024A1515010204).

Author information

Contributions

Yu Zhang and Rong Zhao conducted the research and wrote the main manuscript text under the guidance of Lu Lu. Yijie Guo and Zhanyu Yang prepared the tables and figures for the experimental section. All authors reviewed the manuscript.

Corresponding author

Correspondence to Lu Lu.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, Y., Lu, L., Zhao, R. et al. An efficient quantized GEMV implementation for large language models inference with matrix core. J Supercomput 81, 496 (2025). https://doi.org/10.1007/s11227-025-06993-6

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11227-025-06993-6

Keywords