Parallel Computing

Volume 71, January 2018, Pages 23-41

Benchmarking the GPU memory at the warp level

https://doi.org/10.1016/j.parco.2017.11.003

Highlights

  • We propose a warp-based approach, and design two sets of micro-benchmarks to measure the capability of broadcasting and parallel accessing.

  • We benchmark the characteristics of shared memory, constant memory, global memory and texture memory with our approach.

  • We quantify the performance benefits of replacing local memory with registers, avoiding shared memory bank conflicts, and maximizing global memory bandwidth with different data types.

  • We summarize the optimization guidelines for different memory types towards an optimization framework on GPU memories.

  • We demonstrate how to optimize a case study in hyperspectral image dimensionality reduction with the help of our framework.

Abstract

Graphics processing units (GPUs) are widely used in scientific computing because of their high performance and energy efficiency. Nonetheless, GPUs feature a hierarchical memory system, and optimizing code for it requires an in-depth understanding on the programmer's part. For this reason, the capability (latency or bandwidth) of the memory system is often measured with micro-benchmarks. Prior works focus on the latency seen by a single thread to disclose undocumented details. Such per-thread measurements cannot reflect how a program actually executes, because the smallest executable unit of parallelism on a GPU comprises 32 threads (a warp of threads). This motivates us to benchmark the GPU memory system at the warp level.

In this paper, we benchmark the GPU memory system to quantify the capability of parallel accessing and broadcasting. Such warp-level measurements are performed on shared memory, constant memory, global memory and texture memory. Further, we discuss how to replace local memory with registers, how to avoid shared memory bank conflicts, and how to maximize global memory bandwidth with alternative data types. By analyzing the experimental results, we summarize optimization guidelines for each type of memory and build an optimization framework on GPU memories. With a case study of maximum noise fraction rotation in the dimensionality reduction of hyperspectral images, we demonstrate that our framework is applicable and effective.

Our work discloses the characteristics of GPU memories at the warp level and leads to optimization guidelines. The warp-level benchmarking results can facilitate the design of parallel algorithms and the modeling and optimization of GPU programs. To the best of our knowledge, this is the first warp-level benchmarking effort for the GPU memory system.

Introduction

GPUs were initially used for 3D graphics rendering. With their increasing programmability, GPUs have gained more and more attention in scientific computing [1], [2], [3], [4], [5]. A large number of applications written in CUDA, OpenCL and OpenACC run efficiently on modern GPUs. Nevertheless, it is difficult to exploit such many-core processors to their full hardware potential. Among other obstacles, taming the hierarchical memory system on GPUs is a significant one [6], [7], [8]. The information offered by GPU textbooks [9], [10] and/or NVIDIA's official documents [11] is very limited. In terms of GPU memory usage, we have no quantified performance gain/loss for using a specific memory type, let alone the reasons behind it. It is therefore important to disclose the desired information by designing micro-benchmarks and running them on real hardware.

Prior works on benchmarking CPU/GPU memory systems often use the pointer-chasing technique to measure the access latency of a single thread [6], [12], [13], [14], [15], [16]. Unlike on CPUs, threads on GPUs do not run separately in real-world applications. Instead, a group of threads (a warp) is the smallest execution unit on a GPU. Specifically, the threads within a warp run simultaneously in a Single Instruction Multiple Threads (SIMT) fashion; when two threads within a warp follow different code paths, they are serialized. In terms of memory accesses, neighboring threads within a warp should access neighboring cells in the global memory space, i.e., coalesced memory access. Therefore, we argue that the performance behaviors of a warp are more relevant than those of a single thread. To this end, we benchmark the GPU memory system at the warp level to investigate its characteristics and gain new insights.
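
To illustrate the coalescing rule, consider the following pair of hypothetical kernels (our own illustration, not code from the paper): the first lets consecutive threads of a warp read consecutive words, the second spreads the warp's loads across memory.

    // Coalesced: thread i of a warp reads element i, so the 32 loads of a
    // warp fall into a minimal number of memory transactions.
    __global__ void coalesced(const float *in, float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Strided: consecutive threads touch addresses `stride` floats apart,
    // scattering the warp's loads over many transactions (in must hold at
    // least gridDim.x * blockDim.x * stride floats).
    __global__ void strided(const float *in, float *out, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * stride];
    }

On most NVIDIA GPUs, the effective bandwidth of the strided kernel drops sharply as the stride grows, which is exactly the kind of warp-level effect that per-thread latency measurements cannot capture.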

In this paper, we design two sets of warp-level micro-benchmarks to measure the capability of broadcasting and parallel accessing. The basic idea is that a warp of threads accesses a batch of data elements with various patterns. By comparing the differences in execution time (i.e., the warp-level latency) on a given memory type, we infer whether it supports broadcasting and/or parallel accessing. Moreover, we test two memory access constraints (aligned access and contiguous access) to check whether they are required for using a memory type efficiently. The experiments run on the shared memory, constant memory, global memory and texture memory of an NVIDIA Tesla K20c GPU.
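
The following is a minimal sketch of such a warp-level micro-benchmark (our reconstruction under assumptions, not the paper's exact code; production micro-benchmarks typically repeat the access in a dependent chain to defeat compiler reordering). One warp of 32 threads reads shared memory with a configurable stride: stride 0 makes all threads read the same word (the broadcasting test), while stride 1 spreads the warp over the 32 banks (the parallel accessing test).

    // Launched as: warp_latency<<<1, 32>>>(stride, d_cycles, d_sink);
    __global__ void warp_latency(int stride, long long *cycles, int *sink) {
        __shared__ int buf[1024];
        int tid = threadIdx.x;                 // one block of 32 threads = one warp
        for (int i = tid; i < 1024; i += 32)   // populate the shared buffer
            buf[i] = i;
        __syncthreads();

        int idx = (tid * stride) % 1024;
        long long start = clock64();
        int v = buf[idx];                      // the access under measurement
        long long stop = clock64();

        sink[tid] = v;                         // keep the load from being optimized away
        if (tid == 0) *cycles = stop - start;  // warp-level latency in cycles
    }

Comparing the cycle counts for different strides, and for different memory types in place of the shared buffer, reveals whether broadcasting or parallel accessing is supported.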

Furthermore, we discuss how to replace local memory with registers, how to avoid shared memory bank conflicts, and how to maximize global memory bandwidth with alternative data types. By analyzing all the benchmarking results, we summarize a suite of optimization guidelines for each type of memory and build an optimization framework on GPU memories. With a case study on maximum noise fraction rotation in hyperspectral image dimensionality reduction, we show that the optimized code outperforms the CUBLAS version by 1.5×-3×, and achieves a speedup of up to 93× over the serial version. This demonstrates that our framework is applicable and effective. To the best of our knowledge, this is the first warp-level benchmarking effort for the GPU memory system. Although the warp-based approach targets NVIDIA GPUs and we use NVIDIA terms (e.g., warp and shared memory) throughout, we argue that it is equally applicable to other GPUs.

To summarize, we make the following contributions.

  • (1) We propose a warp-based approach, and design two sets of micro-benchmarks to measure the capability of broadcasting and parallel accessing.

  • (2) We benchmark the characteristics of shared memory, constant memory, global memory and texture memory with our approach.

  • (3) We quantify the performance benefits of replacing local memory with registers, avoiding shared memory bank conflicts, and maximizing global memory bandwidth with different data types.

  • (4) We summarize the optimization guidelines for different memory types towards an optimization framework on GPU memories.

  • (5) We demonstrate how to optimize a case study in hyperspectral image dimensionality reduction with the help of our framework.


Related work

In this section, we introduce the related work on benchmarking CPUs, Intel Xeon Phis and GPUs, with a focus on their memory systems. Then we describe GPU memory optimizations, and the concept of warps and SIMT.

Benchmarking CPUs and Phis: Various studies have investigated the microarchitectures of CPUs and Intel Xeon Phis with micro-benchmarks. Smith et al. develop a high-level program to evaluate the cache and the TLB of CPUs [24]. Peng et al. assess the performance of the …

Background: GPU memory and its latency

In this section, we first describe a typical GPU memory system and its organization, and then measure the thread-level access latency.
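
The thread-level latency measurements rely on the pointer-chasing technique mentioned earlier. A generic sketch follows (our illustration, not the paper's code; the host pre-initializes arr so that arr[i] holds the index of the next element to visit, with the stride chosen to hit or miss the cache level under test):

    __global__ void pointer_chase(const unsigned int *arr, int iters,
                                  long long *cycles, unsigned int *sink) {
        unsigned int j = 0;
        long long start = clock64();
        for (int i = 0; i < iters; ++i)
            j = arr[j];                     // each load depends on the previous one
        long long stop = clock64();
        *sink = j;                          // prevent dead-code elimination
        *cycles = (stop - start) / iters;   // average cycles per access
    }

Because each load's address depends on the previous load's result, the accesses cannot overlap, and the average cycle count approximates the latency of the memory level being probed. The kernel is launched with a single thread, which is precisely why such results cannot capture warp-level effects.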

Our warp-level benchmarking approach

In this section, we propose our warp-level latency benchmarking and investigate the data access capability of multiple threads. We then introduce our experiments for warp-level benchmarking: broadcasting and parallel accessing, and the contiguous and aligned access constraints.

The warp-level benchmarking results

With the micro-benchmarks, we measure the warp-level latency of shared memory, constant memory, global memory and texture memory on the NVIDIA Tesla K20c GPU, aiming to disclose more information. For each type of memory, we benchmark the access capability (broadcasting and parallel accessing) by measuring the warp-level latency, and we measure the performance impact of the access constraints.

Performance tradeoffs on GPU memories

In addition to benchmarking the GPU memory system at the warp level, we investigate the performance tradeoffs on GPU memories. In this section, we discuss how to replace local memory with registers for a private array, how to avoid shared memory bank conflicts, and how to maximize the global memory bandwidth with different data types.
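
The bank-conflict tradeoff has a classic remedy: padding a shared-memory tile by one extra column so that threads of a warp reading down a column hit distinct banks. A brief sketch using the standard matrix-transpose idiom, launched with 32×32 thread blocks (our illustration of the technique, not the paper's benchmark; n is assumed to be a multiple of 32):

    #define TILE 32

    __global__ void transpose_tile(const float *in, float *out, int n) {
        __shared__ float tile[TILE][TILE + 1];  // +1 column removes bank conflicts
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced load
        __syncthreads();
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // conflict-free read
    }

Without the padding, all 32 threads of a warp would read words in the same bank during the transposed read (a 32-way conflict); the extra column shifts each row by one bank, so no access is serialized.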

Towards an optimization framework on GPU memories

With the warp-level benchmarking results and analysis, we derive optimization guidelines for the GPU memory system.

Case study: maximum noise fraction rotation

In this section, we apply our memory optimization framework to maximum noise fraction (MNF) rotation for hyperspectral image dimensionality reduction. We describe the MNF algorithm, analyze its hotspots, propose detailed optimizations for each hotspot with our optimization framework, and show the performance improvement by comparing against the state of the art.

Conclusion

In this paper, we propose a warp-level benchmarking approach for GPU memory systems and design a suite of micro-benchmarks. With the micro-benchmarks, we measure the warp-level latency of shared memory, constant memory, global memory and texture memory. The experimental results show that all the memories support broadcasting, and that only constant memory does not support parallel accessing. We observe that the aligned access constraint is not a must for all memory types. Meanwhile, the …

References (31)

  • Y. Liang et al.

    An accurate GPU performance model for effective control flow divergence optimization.

    IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.

    (2016)
  • S.W. Keckler et al.

    GPUs and the future of parallel computing.

    IEEE Micro

    (2011)
  • Y. Li et al.

    Accelerating the scoring module of mass spectrometry-based peptide identification using GPUs.

    BMC Bioinf.

    (2014)
  • S. Ryoo et al.

    Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

    Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

    (2008)
  • P. Micikevicius

    3D finite difference computation on GPUs using CUDA

    Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units

    (2009)
  • K. Zhao et al.

    G-BLASTN: accelerating nucleotide alignment by graphics processors.

    Bioinformatics

    (2014)
  • H. Wong et al.

    Demystifying GPU microarchitecture through microbenchmarking

    Performance Analysis of Systems & Software (ISPASS)

    (2010)
  • B. Jang et al.

    Exploiting memory access patterns to improve memory performance in data-parallel architectures.

    IEEE Transactions on Parallel and Distributed Systems

    (2011)
  • G. Chen et al.

    PORPLE: an extensible optimizer for portable data placement on GPU

    Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture.

    (2014)
  • N. Wilt

    The CUDA Handbook: A Comprehensive Guide to GPU Programming.

    (2014)
  • D.B. Kirk et al.

    Programming Massively Parallel Processors: A Hands-on Approach, Second Edition.

    (2013)
  • NVIDIA Corporation, CUDA C Programming Guide (v8.0), 2016.
  • V. Volkov et al.

    Benchmarking GPUs to tune dense linear algebra

    High Performance Computing, Networking, Storage and Analysis, 2008

    (2008)
  • S.S. Baghsorkhi et al.

    Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors

    ACM SIGPLAN Notices.

    (2012)
  • R. Meltzer et al.

    Micro-benchmarking the C2070

    GPU Technology Conference

    (2013)

This paper was supported by the National Natural Science Foundation of China (Grant Nos. 61602501, 61272146 and 41375113).
