Abstract
The impressive capabilities of Large Language Models (LLMs) have drawn considerable attention to deploying and using these models on devices. However, the enormous number of parameters in LLMs leads to a significant memory footprint and heavy computational load during inference, severely restricting the scenarios in which LLMs can be applied. As an effective model compression technique, quantization lowers the barrier to deploying LLMs and running inference with them. With quantized weights, quantized GEneral Matrix-Vector multiplication (GEMV) becomes the dominant component of the inference runtime. In practice, the dequantization overhead and low computational density limit the performance of quantized GEMV. This paper proposes an efficient quantized GEMV implementation, consisting of a vectorized prefetch scheme, an efficient kernel design based on the Matrix Core, and an optimized atomicAdd reduction, to accelerate LLM inference. Comparison experiments were performed on an AMD MI210 GPU. The results show that the proposed method outperforms previous approaches on quantized GEMV across a range of matrix shapes as well as in end-to-end LLM inference.
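To make the dequantize-and-accumulate pattern concrete, the following is a minimal HIP sketch, not the authors' kernel; the 4-bit packing, zero-point of 8, group size of 128, and 256-thread blocks are assumptions made here for illustration. It shows per-group dequantization of packed weights followed by a block-level reduction that finishes with a single atomicAdd per block, which is the kind of cross-block reduction the proposed atomicAdd optimization addresses; the Matrix Core (MFMA) compute path and the vectorized prefetch scheme described in the paper are omitted.

```cpp
// Minimal HIP sketch of weight-only quantized GEMV (y = W * x). Assumptions
// (not taken from the paper): 4-bit weights packed two per byte, a fixed
// zero-point of 8, per-group FP16 scales with a group size of 128, and
// 256 threads per block. The Matrix Core (MFMA) path and vectorized
// prefetching used by the paper's kernel are omitted.
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>
#include <cstdint>

__global__ void gemv_w4a16(const uint8_t* __restrict__ W,       // packed 4-bit weights, row-major, K/2 bytes per row
                           const __half*  __restrict__ scales,   // per-group scales, K/128 entries per row
                           const __half*  __restrict__ x,        // input vector, length K
                           float*         __restrict__ y,        // output vector, length M, zero-initialized
                           int M, int K) {
  constexpr int GROUP = 128;                   // quantization group size (assumption)
  const int row = blockIdx.x;                  // one output row per blockIdx.x
  const int k   = blockIdx.y * blockDim.x * 8  // this block's slice of the K dimension
                + threadIdx.x * 8;             // 8 weights (4 packed bytes) per thread
  float acc = 0.0f;

  if (row < M && k < K) {
    const uint8_t* wRow = W + (size_t)row * (K / 2);
    const float s = __half2float(scales[(size_t)row * (K / GROUP) + k / GROUP]);
    for (int i = 0; i < 4 && k + 2 * i + 1 < K; ++i) {
      const uint8_t packed = wRow[(k >> 1) + i];
      const float w0 = ((packed & 0xF) - 8) * s;         // low nibble
      const float w1 = (((packed >> 4) & 0xF) - 8) * s;  // high nibble
      acc += w0 * __half2float(x[k + 2 * i]);
      acc += w1 * __half2float(x[k + 2 * i + 1]);
    }
  }

  // Tree reduction of the per-thread partial sums in shared memory,
  // followed by a single atomicAdd per block into the output row.
  __shared__ float smem[256];
  smem[threadIdx.x] = acc;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) smem[threadIdx.x] += smem[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0 && row < M) atomicAdd(&y[row], smem[0]);
}
```

Under these assumptions, a launch such as `gemv_w4a16<<<dim3(M, (K + 2047) / 2048), 256>>>(W, scales, x, y, M, K)` covers the whole weight matrix; because several blocks along the K dimension contribute to the same output element, their partial sums meet at the atomicAdd, which is the reduction step that the paper's atomicAdd optimization targets.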
Data Availability
No datasets were generated or analyzed during the current study.
Acknowledgements
This work was supported by the Key Field Research and Development Plan of Guangdong Province (2022B0101070001) and the Natural Science Foundation of Guangdong Province (2024A1515010204).
Author information
Contributions
Yu Zhang and Rong Zhao conducted the research and wrote the main manuscript text under the guidance of Lu Lu. Yijie Guo and Zhanyu Yang prepared the tables and figures for the experimental section. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Lu, L., Zhao, R. et al. An efficient quantized GEMV implementation for large language models inference with matrix core. J Supercomput 81, 496 (2025). https://doi.org/10.1007/s11227-025-06993-6