Parallel Computing

Volume 71, January 2018, Pages 23-41

Benchmarking the GPU memory at the warp level

https://doi.org/10.1016/j.parco.2017.11.003

Highlights

  • We propose a warp-based approach, and design two sets of micro-benchmarks to measure the capability of broadcasting and parallel accessing.

  • We benchmark the characteristics of shared memory, constant memory, global memory and texture memory with our approach.

  • We quantify the performance benefits of replacing local memory with registers, avoiding shared memory bank conflicts, and maximizing global memory bandwidth with different data types.

  • We summarize the optimization guidelines for different memory types towards an optimization framework on GPU memories.

  • We demonstrate how to optimize a case study in hyperspectral image dimensionality reduction with the help of our framework.

Abstract

Graphics processing units (GPUs) are widely used in scientific computing because of their high performance and energy efficiency. Nonetheless, GPUs feature a hierarchical memory system, and optimizing code for it requires an in-depth understanding on the programmer's part. For this reason, the capability (latency or bandwidth) of the memory system is often measured with micro-benchmarks. Prior works focus on the latency seen by a single thread to disclose undocumented details. Such per-thread measurements cannot reflect how a program actually executes, because the smallest executable unit of parallelism on a GPU comprises 32 threads (a warp of threads). This motivates us to benchmark the GPU memory system at the warp level.

In this paper, we benchmark the GPU memory system to quantify the capability of parallel accessing and broadcasting. Such warp-level measurements are performed on shared memory, constant memory, global memory and texture memory. Further, we discuss how to replace local memory with registers, how to avoid shared memory bank conflicts, and how to maximize global memory bandwidth with alternative data types. By analyzing the experimental results, we summarize optimization guidelines for each type of memory and build an optimization framework on GPU memories. With a case study of maximum noise fraction rotation in the dimensionality reduction of hyperspectral images, we demonstrate that our framework is applicable and effective.

Our work discloses the characteristics of GPU memories at the warp level and leads to optimization guidelines. The warp-level benchmarking results can facilitate the design of parallel algorithms and the modeling and optimization of GPU programs. To the best of our knowledge, this is the first warp-level benchmarking effort for the GPU memory system.

Introduction

GPUs were initially used for 3D graphics rendering. With their increasing programmability, GPUs have gained more and more attention in scientific computing [1], [2], [3], [4], [5]. A large number of applications written in CUDA, OpenCL and OpenACC run efficiently on modern GPUs. Nevertheless, it is difficult to exploit such many-core processors to their full hardware potential. Among other obstacles, taming the hierarchical memory system on GPUs is a significant one [6], [7], [8]. The information offered by GPU textbooks [9], [10] and/or NVIDIA's official documents [11] is very limited. In terms of GPU memory usage, we have no quantified performance gain/loss for using a specific memory type, let alone the reasons behind it. It is therefore important to disclose the desired information by designing micro-benchmarks and running them on real hardware.

Prior works on benchmarking CPU/GPU memory systems often use the pointer-chasing technique to measure the access latency of a single thread [6], [12], [13], [14], [15], [16]. Unlike on CPUs, threads on GPUs do not run separately in real-world applications. Instead, a group of threads (a warp) is the smallest execution unit on a GPU. Specifically, the threads within a warp run simultaneously in a Single Instruction Multiple Threads (SIMT) fashion; when two threads within a warp follow different code paths, they are serialized. In terms of memory accesses, neighboring threads within a warp should access neighboring cells in the global memory space, i.e., coalesced memory access. Therefore, we argue that the performance behaviors of a warp are more relevant than those of a single thread. To this end, we benchmark the GPU memory system at the warp level to investigate its characteristics and gain new insights.
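
To illustrate the coalescing rule, consider the following pair of hypothetical kernels (our own illustration, not code from the paper): the first lets consecutive threads of a warp read consecutive words, the second spreads the warp's loads across memory.

    // Coalesced: thread i of a warp reads element i, so the 32 loads of a
    // warp fall into a minimal number of memory transactions.
    __global__ void coalesced(const float *in, float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Strided: consecutive threads touch addresses `stride` floats apart,
    // scattering the warp's loads over many transactions (in must hold at
    // least gridDim.x * blockDim.x * stride floats).
    __global__ void strided(const float *in, float *out, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * stride];
    }

On most NVIDIA GPUs, the effective bandwidth of the strided kernel drops sharply as the stride grows, which is exactly the kind of warp-level effect that per-thread latency measurements cannot capture.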

In this paper, we design two sets of warp-level micro-benchmarks to measure the capability of broadcasting and parallel accessing. The basic idea is that a warp of threads accesses a batch of data elements with various patterns. By comparing the differences in execution time (i.e., the warp-level latency) on a given memory type, we infer whether it supports broadcasting and/or parallel accessing. Moreover, we test two memory access constraints (aligned access and contiguous access) to check whether they are required for using a memory type efficiently. The experiments run on the shared memory, constant memory, global memory and texture memory of an NVIDIA Tesla K20c GPU.
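
The following is a minimal sketch of such a warp-level micro-benchmark (our reconstruction under assumptions, not the paper's exact code; production micro-benchmarks typically repeat the access in a dependent chain to defeat compiler reordering). One warp of 32 threads reads shared memory with a configurable stride: stride 0 makes all threads read the same word (the broadcasting test), while stride 1 spreads the warp over the 32 banks (the parallel accessing test).

    // Launched as: warp_latency<<<1, 32>>>(stride, d_cycles, d_sink);
    __global__ void warp_latency(int stride, long long *cycles, int *sink) {
        __shared__ int buf[1024];
        int tid = threadIdx.x;                 // one block of 32 threads = one warp
        for (int i = tid; i < 1024; i += 32)   // populate the shared buffer
            buf[i] = i;
        __syncthreads();

        int idx = (tid * stride) % 1024;
        long long start = clock64();
        int v = buf[idx];                      // the access under measurement
        long long stop = clock64();

        sink[tid] = v;                         // keep the load from being optimized away
        if (tid == 0) *cycles = stop - start;  // warp-level latency in cycles
    }

Comparing the cycle counts for different strides, and for different memory types in place of the shared buffer, reveals whether broadcasting or parallel accessing is supported.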

Furthermore, we discuss how to replace local memory with registers, how to avoid shared memory bank conflicts, and how to maximize global memory bandwidth with alternative data types. By analyzing all the benchmarking results, we summarize a suite of optimization guidelines for each type of memory and build an optimization framework on GPU memories. With a case study on maximum noise fraction rotation in hyperspectral image dimensionality reduction, we show that the optimized code outperforms the CUBLAS version by 1.5×-3×, and achieves a speedup of up to 93× over the serial version. This demonstrates that our framework is applicable and effective. To the best of our knowledge, this is the first warp-level benchmarking effort for the GPU memory system. Although the warp-based approach targets NVIDIA GPUs and we use NVIDIA terms (e.g., warp and shared memory) throughout, we argue that it is equally applicable to other GPUs.

To summarize, we make the following contributions.

  • (1) We propose a warp-based approach, and design two sets of micro-benchmarks to measure the capability of broadcasting and parallel accessing.

  • (2) We benchmark the characteristics of shared memory, constant memory, global memory and texture memory with our approach.

  • (3) We quantify the performance benefits of replacing local memory with registers, avoiding shared memory bank conflicts, and maximizing global memory bandwidth with different data types.

  • (4) We summarize the optimization guidelines for different memory types towards an optimization framework on GPU memories.

  • (5) We demonstrate how to optimize a case study in hyperspectral image dimensionality reduction with the help of our framework.


Related work

In this section, we introduce the related work on benchmarking CPUs, Intel Xeon Phis and GPUs, with a focus on their memory systems. Then we describe GPU memory optimizations, and the concept of warps and SIMT.

Benchmarking CPUs and Phis: Various studies have investigated the microarchitectures of CPUs and Intel Xeon Phis with micro-benchmarks. Smith et al. develop a high-level program to evaluate the cache and the TLB of CPUs [24]. Peng et al. assess the performance of the …

Background: GPU memory and its latency

In this section, we first describe a typical GPU memory system and its organization, and then measure the thread-level access latency.
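
The thread-level latency measurements rely on the pointer-chasing technique mentioned earlier. A generic sketch follows (our illustration, not the paper's code; the host pre-initializes arr so that arr[i] holds the index of the next element to visit, with the stride chosen to hit or miss the cache level under test):

    __global__ void pointer_chase(const unsigned int *arr, int iters,
                                  long long *cycles, unsigned int *sink) {
        unsigned int j = 0;
        long long start = clock64();
        for (int i = 0; i < iters; ++i)
            j = arr[j];                     // each load depends on the previous one
        long long stop = clock64();
        *sink = j;                          // prevent dead-code elimination
        *cycles = (stop - start) / iters;   // average cycles per access
    }

Because each load's address depends on the previous load's result, the accesses cannot overlap, and the average cycle count approximates the latency of the memory level being probed. The kernel is launched with a single thread, which is precisely why such results cannot capture warp-level effects.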

Our warp-level benchmarking approach

In this section, we propose our warp-level latency benchmarking and investigate the data access capability of multiple threads. We then introduce our experiments for warp-level benchmarking: broadcasting and parallel accessing, and the contiguous and aligned access constraints.

The warp-level benchmarking results

With the micro-benchmarks, we measure the warp-level latency of shared memory, constant memory, global memory and texture memory on the NVIDIA Tesla K20c GPU, aiming to disclose more information. For each type of memory, we benchmark the access capability (broadcasting and parallel accessing) by measuring the warp-level latency, and we measure the performance impact of the access constraints.

Performance tradeoffs on GPU memories

In addition to benchmarking the GPU memory system at the warp level, we investigate the performance tradeoffs on GPU memories. In this section, we discuss how to replace local memory with registers for a private array, how to avoid shared memory bank conflicts, and how to maximize the global memory bandwidth with different data types.
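
The bank-conflict tradeoff has a classic remedy: padding a shared-memory tile by one extra column so that threads of a warp reading down a column hit distinct banks. A brief sketch using the standard matrix-transpose idiom, launched with 32×32 thread blocks (our illustration of the technique, not the paper's benchmark; n is assumed to be a multiple of 32):

    #define TILE 32

    __global__ void transpose_tile(const float *in, float *out, int n) {
        __shared__ float tile[TILE][TILE + 1];  // +1 column removes bank conflicts
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced load
        __syncthreads();
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // conflict-free read
    }

Without the padding, all 32 threads of a warp would read words in the same bank during the transposed read (a 32-way conflict); the extra column shifts each row by one bank, so no access is serialized.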

Towards an optimization framework on GPU memories

With the warp-level benchmarking results and analysis, we derive optimization guidelines for the GPU memory system.

Case study: maximum noise fraction rotation

In this section, we apply our memory optimization framework to maximum noise fraction (MNF) rotation for hyperspectral image dimensionality reduction. We describe the MNF algorithm, analyze its hotspots, propose detailed optimizations for each hotspot with our optimization framework, and show the performance improvement by comparing against the state of the art.

Conclusion

In this paper, we propose a warp-level benchmarking approach for GPU memory systems and design a suite of micro-benchmarks. With the micro-benchmarks, we measure the warp-level latency of shared memory, constant memory, global memory and texture memory. The experimental results show that all the memories support broadcasting, and that only constant memory does not support parallel accessing. We observe that the aligned access constraint is not a must for all memory types. Meanwhile, the …

References (31)

  • Y. Liang et al.

    An accurate GPU performance model for effective control flow divergence optimization.

    IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.

    (2016)
  • S.W. Keckler et al.

    GPUs and the future of parallel computing.

    IEEE Micro

    (2011)
  • Y. Li et al.

    Accelerating the scoring module of mass spectrometry-based peptide identification using GPUs.

    BMC Bioinf.

    (2014)
  • S. Ryoo et al.

    Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

    Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

    (2008)
  • P. Micikevicius

    3D finite difference computation on GPUs using CUDA

    Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units

    (2009)
  • K. Zhao et al.

    G-BLASTN: accelerating nucleotide alignment by graphics processors.

    Bioinformatics

    (2014)
  • H. Wong et al.

    Demystifying GPU microarchitecture through microbenchmarking

    Performance Analysis of Systems & Software (ISPASS)

    (2010)
  • B. Jang et al.

    Exploiting memory access patterns to improve memory performance in data-parallel architectures.

    IEEE Transactions on Parallel and Distributed Systems

    (2011)
  • G. Chen et al.

    PORPLE: an extensible optimizer for portable data placement on GPU

    Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture.

    (2014)
  • N. Wilt

    The CUDA Handbook: A Comprehensive Guide to GPU Programming.

    (2014)
  • D.B. Kirk et al.

    Programming Massively Parallel Processors: A Hands-on Approach, Second Edition.

    (2013)
  • NVIDIA Corporation, CUDA C Programming Guide (v8.0), 2016.
  • V. Volkov et al.

    Benchmarking GPUs to tune dense linear algebra

    High Performance Computing, Networking, Storage and Analysis, 2008

    (2008)
  • S.S. Baghsorkhi et al.

    Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors

    ACM SIGPLAN Notices.

    (2012)
  • R. Meltzer et al.

    Micro-benchmarking the C2070

    GPU Technology Conference

    (2013)

This paper was supported by the National Natural Science Foundation of China (Grant Nos. 61602501, 61272146 and 41375113).
