Highly parallel GEMV with register blocking method on GPU architecture

https://doi.org/10.1016/j.jvcir.2014.06.002

Highlights

  • We propose a register blocking method for GEMV on GPU.

  • The proposed method improves parallelism and reuses on-chip data at the same time.

  • Different block sizes are tested to find the best block size on a GPU platform.

Abstract

GPUs provide powerful computing ability, especially for data-parallel applications such as video/image processing. However, the complexity of the GPU system makes optimizing even a simple algorithm difficult, and different optimization methods on a GPU often lead to very different performance. The matrix–vector multiplication routine for general dense matrices (GEMV) is an important kernel in video/image processing applications. We find that the implementations of GEMV in CUBLAS and MAGMA are not efficient, especially for small or fat matrices. In this paper, we propose a novel register blocking method to optimize GEMV on the GPU architecture. The new method has three advantages. First, instead of using only one thread, we use a whole warp to compute each element of vector y, so the method can exploit the highly parallel GPU architecture. Second, register blocking reduces the required off-chip memory bandwidth. Third, the memory access order of the threads within a warp is carefully arranged so that coalesced memory access is ensured. The proposed optimization method for GEMV is comprehensively evaluated on different matrix sizes, and the register blocking method is evaluated with different block sizes. Experimental results show that the new method achieves very high speedups for small square matrices and fat matrices compared to CUBLAS and MAGMA, and also achieves higher performance for large square matrices.

Introduction

The matrix–vector multiplication routine for general dense matrices (GEMV) is a very important kernel in video/image processing applications. GEMV computes y = αAx + βy, where A is an M × N dense matrix, x and y are vectors, and α and β are scalars. GEMV is the most common routine in level 2 BLAS [16], which is a building block of dense linear algebra, and it also serves as a building block for other BLAS routines such as SYMV [11].
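For reference, the semantics of this operation can be written as a plain sequential loop. The sketch below is ours (assuming single precision and row-major storage), not code from the paper:

```cuda
// Reference (sequential) semantics of GEMV: y = alpha*A*x + beta*y.
// A is an M x N dense matrix stored row-major; names are illustrative.
void gemv_reference(int M, int N, float alpha, const float* A,
                    const float* x, float beta, float* y) {
    for (int i = 0; i < M; ++i) {
        float dot = 0.0f;
        for (int j = 0; j < N; ++j)
            dot += A[i * N + j] * x[j];   // inner product of row i with x
        y[i] = alpha * dot + beta * y[i];
    }
}
```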

GPUs offer high peak computational throughput and memory bandwidth that substantially exceed those of conventional multi-core platforms. However, the GPU architecture demands new software to exploit this growing computing ability, and the complexity of the GPU platform makes programming and performance tuning more difficult than on a CPU. Different ways of parallelizing or optimizing an algorithm often yield different performance, and different optimization methods often interact with each other. For example, increasing the number of registers per thread can improve the performance of a single thread, but it may limit the number of threads executing on the GPU, which can degrade parallelism and overall performance.

Many-core processors are good candidates for accelerating video/image applications [1], [2], [3], [4], [5], [6], [7]. The GPU is one of the most widely used many-core processors and can efficiently exploit the parallelism in these applications. Video/image applications expose multiple levels of parallelism that GPUs can exploit, such as parallelism among pixels, among image blocks, and among frames. For example, motion estimation, a kernel algorithm in many video applications, exhibits all of these parallelism levels; it is very time-consuming and demands GPU acceleration. Studying how to use GPUs to exploit the parallelism in video/image applications is therefore very meaningful, especially for real-time video applications.

GPU architecture is also very suitable for accelerating the video/image kernel algorithm GEMV, and there are several efforts to optimize GEMV on GPUs, such as CUBLAS [12] and MAGMA [13]. However, current software cannot fully exploit the computing ability of the GPU: the performance of GEMV in CUBLAS and MAGMA is far from the GPU's peak. For example, the performance of single-precision GEMV (SGEMV) is less than 45 GFLOPS for CUBLAS 4.0 (Fig. 1) and less than 70 GFLOPS for MAGMA (Fig. 2) on a GeForce GTX 480, whose single-precision peak exceeds 1.3 TFLOPS. A new hardware architecture demands new algorithms to exploit its new features; for example, the introduction of the Fermi/Kepler GPU architectures calls for new algorithms that adapt to the new hardware features [19].

We test the performance of SGEMV in CUBLAS 4.0 and MAGMA with different matrix sizes. The experimental results are shown in Fig. 1 and Fig. 2, and they lead to three observations:

  • (1) There is a large gap between the achieved performance of SGEMV and the peak performance of the GPU, which calls for further optimization of the SGEMV kernel.

  • (2) The performance of the same algorithm is affected by the shape of the matrix [8]. Fat matrices perform relatively poorly in CUBLAS 4.0 and MAGMA: matrices with 64 rows (64 × N) run at under 5 GFLOPS (Fig. 1), while performance rises with the number of rows, exceeding 40 GFLOPS for some 16,384 × N matrices.

  • (3) Smaller matrices achieve lower performance than larger ones [9]. For example, the 256 × 256 matrix reaches 4.4 GFLOPS while the 1024 × 1024 matrix reaches 10.8 GFLOPS (Fig. 2).

The reason for problem (1) is that GEMV is a memory-bandwidth-bound kernel because of its low arithmetic intensity. For an n × n matrix, GEMV performs O(n²) memory operations for O(n²) computation operations, whereas GEMM performs O(n³) computation operations for the same O(n²) memory operations. To improve the performance of GEMV, we can therefore try to reduce memory traffic by reusing data in on-chip memory.
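To see why bandwidth dominates, a rough roofline estimate (our own back-of-the-envelope figures, counting only the dominant traffic of streaming A once in single precision):

```latex
% Arithmetic intensity of SGEMV on an n x n matrix:
% ~2n^2 flops (one multiply-add per matrix element) against ~4n^2 bytes.
\[
\frac{\text{flops}}{\text{bytes}} \approx \frac{2n^{2}}{4n^{2}} = 0.5~\text{flop/byte}
\]
% With the GTX 480's roughly 177 GB/s of memory bandwidth, SGEMV is
% therefore capped near 0.5 x 177 = 89 GFLOPS, whatever the arithmetic peak.
```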

The reason for problems (2) and (3) is that the number of active threads launched for small or fat matrices is small, because one thread is used to compute one element of vector y. A small number of threads means low parallelism and low occupancy on the GPU, which often leads to low performance. Conversely, launching enough threads hides memory access latency and fully exploits the computational ability of the GPU cores. We therefore need a proper parallelization method that raises the thread count and solves these two problems on the GPU.
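The conventional mapping assigns one thread to each element of y, roughly as in the simplified sketch below (our illustration assuming row-major storage, not the CUBLAS or MAGMA source). For a matrix with M = 64 rows it launches only 64 threads in total, far too few to occupy a Fermi-class GPU:

```cuda
// Naive mapping: one thread computes one element of y.
// Only M threads exist in total, so small or fat matrices (small M)
// leave most of the GPU idle.
__global__ void gemv_one_thread_per_row(int M, int N, float alpha,
                                        const float* A, const float* x,
                                        float beta, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M) return;
    float dot = 0.0f;
    for (int j = 0; j < N; ++j)
        // Adjacent threads read addresses N floats apart here,
        // so the loads of A are uncoalesced for row-major storage.
        dot += A[row * N + j] * x[j];
    y[row] = alpha * dot + beta * y[row];
}
```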

In this paper, we propose a novel register blocking method to optimize GEMV on the GPU. The new method has three advantages. First, instead of using only one thread, we use a whole warp to compute each element of vector y, so the method can exploit the highly parallel GPU architecture. Second, register blocking reduces the required off-chip memory bandwidth. Third, the memory access order of the threads within a warp is carefully arranged so that coalesced memory access is ensured. The proposed optimization method for GEMV is comprehensively evaluated on different matrix sizes, and the register blocking method is tested with different block sizes. Experimental results show that the new method achieves very high speedups for small square matrices and fat matrices compared to CUBLAS and MAGMA, and also achieves higher performance for large square matrices.
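To make the idea concrete, the sketch below shows a warp-per-row kernel with a small register block. It is our minimal illustration of the general technique under stated assumptions (row-major A, remainder rows ignored, block size B chosen arbitrarily), not the authors' Warp_block kernel; it also uses the modern __shfl_down_sync intrinsic (CUDA 9+), whereas Fermi-era code would reduce through shared memory:

```cuda
#define B 4  // register block size: rows per warp (a tunable assumption)

__global__ void gemv_warp_register_block(int M, int N, float alpha,
                                         const float* A, const float* x,
                                         float beta, float* y) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // global warp id
    int lane = threadIdx.x % 32;                              // lane within warp
    int row0 = warp * B;                  // first row of this warp's block
    if (row0 + B > M) return;             // remainder rows handled elsewhere

    float sum[B] = {0.0f};
    // Lane k reads columns k, k+32, k+64, ..., so the 32 lanes touch
    // consecutive addresses of each row: coalesced loads of A and x.
    for (int j = lane; j < N; j += 32) {
        float xj = x[j];                  // one off-chip load of x ...
        #pragma unroll
        for (int b = 0; b < B; ++b)
            sum[b] += A[(row0 + b) * N + j] * xj;  // ... reused for B rows
    }
    // Reduce each of the B partial sums across the warp with shuffles.
    #pragma unroll
    for (int b = 0; b < B; ++b) {
        for (int offset = 16; offset > 0; offset /= 2)
            sum[b] += __shfl_down_sync(0xffffffffu, sum[b], offset);
        if (lane == 0)
            y[row0 + b] = alpha * sum[b] + beta * y[row0 + b];
    }
}
```

Each x element fetched from DRAM is reused for the B rows held in registers, cutting the traffic on x by a factor of B; finding the best B is exactly what the block-size experiments search for.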

The rest of this paper is organized as follows. The new register blocking algorithm is presented in Section 2. Experimental results are shown and analyzed in Section 3. Section 4 discusses related work. Conclusions and future work are presented in Section 5.

Section snippets

Fast GEMV on GPU

In this section, we first introduce the blocking method [20] used in MAGMA. Then a novel register blocking method with an elaborately arranged memory access order within each warp (Warp_block) is presented to improve the performance of GEMV on the GPU.

Introduction to NVIDIA GPU and CUDA

An NVIDIA GPU usually consists of several streaming multiprocessors (SMs), and each SM consists of 8 or 32 streaming processors (SPs). Each thread has private local memory in the form of registers, and each group of threads shares a low-latency on-chip memory. The main memory of a GPU is a high-bandwidth DRAM shared by all threads.

CUDA is a programming model designed for NVIDIA GPUs. A CUDA program consists of a host program running on the CPU and a kernel program running on the GPU. The host program transfers the data from CPU to GPU, the …
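As an illustration of that host/kernel split, a minimal host-side pattern might look like the following (names are ours; it assumes the warp-blocked kernel sketched above):

```cuda
#include <cuda_runtime.h>

// Forward declaration of the (assumed) kernel from the earlier sketch.
__global__ void gemv_warp_register_block(int M, int N, float alpha,
                                         const float* A, const float* x,
                                         float beta, float* y);

void gemv_on_gpu(int M, int N, float alpha, const float* hA,
                 const float* hx, float beta, float* hy) {
    float *dA, *dx, *dy;
    cudaMalloc(&dA, sizeof(float) * M * N);
    cudaMalloc(&dx, sizeof(float) * N);
    cudaMalloc(&dy, sizeof(float) * M);
    cudaMemcpy(dA, hA, sizeof(float) * M * N, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, sizeof(float) * M, cudaMemcpyHostToDevice);  // y is read too

    const int rowsPerWarp = 4;            // must match the kernel's block size B
    int warps   = (M + rowsPerWarp - 1) / rowsPerWarp;
    int threads = 256;                    // 8 warps per thread block
    int blocks  = (warps * 32 + threads - 1) / threads;
    gemv_warp_register_block<<<blocks, threads>>>(M, N, alpha, dA, dx, beta, dy);

    cudaMemcpy(hy, dy, sizeof(float) * M, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
}
```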

Related works

There are many works on optimizing GEMV because of its importance, and optimization methods have been proposed for various platforms. In this section, we only introduce the most closely related work on the GPU platform.

GEMV is implemented in CUBLAS [12], but its performance is not good enough. A cache or shared memory blocking method is used by Fujimoto [14] and MAGMA [11], [13] to reuse vector x in cache or shared memory, which improves performance over CUBLAS.
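The gist of that scheme, as we understand it (a paraphrase, not the MAGMA or Fujimoto source): the threads of a block cooperatively stage a tile of x in shared memory and all of them reuse it, so x is read from DRAM once per block rather than once per row:

```cuda
#define TILE 128  // tile of x kept in shared memory (an illustrative size)

__global__ void gemv_shared_x(int M, int N, float alpha, const float* A,
                              const float* x, float beta, float* y) {
    __shared__ float xs[TILE];
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // still one thread per row
    float dot = 0.0f;
    for (int t = 0; t < N; t += TILE) {
        int len = min(TILE, N - t);
        // All threads cooperate to load the current tile of x once.
        for (int j = threadIdx.x; j < len; j += blockDim.x)
            xs[j] = x[t + j];
        __syncthreads();
        if (row < M)
            for (int j = 0; j < len; ++j)
                dot += A[row * N + (t + j)] * xs[j];
        __syncthreads();
    }
    if (row < M) y[row] = alpha * dot + beta * y[row];
}
```

Note that this still maps one thread to one row of A, so it reduces memory traffic but does not fix the low parallelism for small or fat matrices, which is the gap the register blocking method targets.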

Liu et al. [8] find that the shape …

Conclusion and future work

In this paper, we find that previous GEMV implementations cannot fully exploit the computational ability of the GPU, and that their performance degrades heavily for small or fat matrices. We therefore propose a new algorithm (Warp_block) to improve the performance of GEMV on the GPU. The new method achieves over 10× speedup for small square matrices and fat matrices compared to CUBLAS 4.0 or the cache blocking method on a GeForce GTX 480, and also performs better for large square matrices. We also study …

Acknowledgments

This work is supported in part by the China Major S&T Project (No. 2013ZX01033001-001-003), the International S&T Cooperation Project of China Grant (No. 2012DFA11170), the Tsinghua Indigenous Research Project (No. 20111080997) and the NNSF of China Grant (No. 61274131). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used for this research.

References (20)

  • Y. Liu et al., A cross-input adaptive framework for GPU program optimizations, IEEE Int. Parallel Distributed Process. Sympos. (IPDPS), 2009.
  • Chenggang Yan et al., A highly parallel framework for HEVC coding unit partitioning tree decision on many-core processors, IEEE Signal Process. Lett., 2014.
  • Chenggang Yan et al., Efficient parallel HEVC intra prediction on many-core processor, Electron. Lett., 2014.
  • Chenggang Yan et al., Parallel deblocking filter for HEVC on many-core processor, Electron. Lett., 2014.
  • Chenggang Yan, Yongdong Zhang, Feng Dai, Liang Li, Highly parallel framework for HEVC motion estimation on many-core...
  • Chenggang Yan, Yongdong Zhang, Feng Dai, Liang Li, Efficient parallel framework for HEVC deblocking filter on many-core...
  • Chenggang Yan, Feng Dai, Yongdong Zhang, Yike Ma, Licheng Chen, Lingjun Fan, Yasong Zheng, Parallel deblocking filter...
  • Yongdong Zhang et al., Efficient parallel framework for H.264/AVC deblocking filter on many-core platform, IEEE Trans. Multimedia, 2012.
  • M. Anderson et al., A predictive model for solving small linear algebra problems in GPU registers, IEEE Int. Parallel Distributed Process. Sympos. (IPDPS), 2012.
  • J.W. Choi, A. Singh, R.W. Vuduc, Model-driven autotuning of sparse matrix-vector multiply on GPUs, in: Proceedings of...
There are more references available in the full text version of this article.
