A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity

https://doi.org/10.1016/j.jpdc.2018.11.012

Highlights

  • Recent years have witnessed the emergence of GPUs for general purpose computing due to their efficient performance/power ratio.

  • Various issues need to be addressed before GPGPUs can serve as a compelling general purpose accelerator for the coming power-limited, big-data era.

  • Control Divergence, Memory Bandwidth and Limited Parallelism are the three main bottlenecks that limit GPGPU performance.

  • Enhanced programmability is an important feature for future GPUs, as it simplifies GPGPU software development.

  • The aim of this paper is to provide a survey of architectural advances to improve performance and programmability of GPUs.

Abstract

With the skyrocketing advances of process technology, the increased need to process huge amounts of data, and the pivotal need for power efficiency, the use of Graphics Processing Units (GPUs) for general purpose computing has become a natural trend. GPUs offer high computational power and excellent performance per watt for data-parallel applications, relative to traditional multicore processors. GPUs appear either as discrete devices or integrated with Central Processing Units (CPUs), leading to a scheme of heterogeneous computing. Heterogeneous computing brings as many challenges as it brings opportunities. To get the most out of such systems, we need to guarantee high GPU utilization, deal with the irregular control flow of some workloads, and cope with programming models that are far from friendly. The aim of this paper is to provide a survey of GPUs from two perspectives: (1) architectural advances to improve performance and programmability and (2) advances to enhance CPU–GPU integration in heterogeneous systems. This will help researchers see the opportunities and challenges of using GPUs for general purpose computing, especially in the era of big data and the continuous need for high-performance computing.

Introduction

Graphics Processing Units (GPUs) were used for several years as fixed-function hardware accelerators for 3D graphics applications. Earlier generations of GPUs were designed to implement the conventional 3D rendering pipeline [103], [143]. However, the high computational power that GPUs can achieve compared with traditional multicore Central Processing Units (CPUs) encouraged developers to use GPUs for compute-intensive non-graphics workloads [222]. At that time, the term General Purpose computing using Graphics Processing Units (GPGPU) emerged. Programmers used graphics APIs (e.g. Direct3D or OpenGL) to access the shader cores: they had to map program data appropriately onto the available shader buffers and manage the data carefully through the graphics pipeline. Obviously, using graphics APIs for non-graphics general purpose programming was a very difficult task; nevertheless, with some heroic efforts, considerable speedups were achieved [173]. This trend prompted GPU vendors to build more programmable GPU architectures, known as unified shader architectures (e.g. NVIDIA’s Tesla [144], NVIDIA’s Fermi [161] and AMD’s Evergreen [13]), and to release friendlier high-level APIs that facilitate GPGPU programming (e.g. NVIDIA’s CUDA [166], AMD’s CTM [71] and OpenCL [102]). Since then, a new era of GPGPU architecture and programming has been unleashed and is still evolving to this day [38], [99], [161].

GPU acceleration has been widely adopted in high-performance computing (HPC) applications, such as computer vision, graph processing, biomedical computing, financial analysis, and physical simulation [80], [81]. This is because GPUs achieve tremendous computational power and efficient performance-per-watt compared to conventional multicore CPUs. It is therefore no wonder that a large portion of the supercomputers in the Top500 list rely on GPUs [224]. Moreover, the scope of applications that benefit from GPU acceleration has expanded rapidly during the last decade to include server and cloud workloads [66], [74], database processing [24], [248] and deep machine learning [35]. However, because GPUs were initially designed to execute regular streaming applications, such as graphics workloads, they are still not effective at accelerating some emerging data-intensive workloads, due to their lack of irregular-execution support, the memory bandwidth bottleneck and the complexity of GPGPU programming.

In order to improve the performance, energy efficiency and programmability of GPUs for emerging data-intensive workloads, researchers have been diligently working on enhancing GPU architecture for general purpose computing. Fig. 1 depicts the number of research papers related to GPGPUs that were published during the last decade at the top-tier computer architecture conferences. As shown in the figure, there has been growing interest in improving GPGPU architecture during the last five years: 28 and 29 research papers were published in 2016 and 2017, respectively, representing nearly 16% of the total number of papers.

Fig. 2 characterizes and divides these works into different categories. As we can see, there has been a noticeable interest in improving the performance of GPGPUs by mitigating the impact of control flow divergence [46], [48], [194], [216], [221], alleviating on-chip resource contention [97], [160], [195], [274] and improving memory hierarchy performance [22], [111], [126], [193], [221], [269]. Since GPGPU programming is complex, researchers have worked on enhancing GPU programmability by equipping GPUs with architectural support for improved data sharing and synchronization (e.g. cache coherence [209] and transactional memory [68], [208], [213]). They have also investigated new techniques to boost GPGPU concurrency and multitasking [6], [218], [255], increasing the available thread level parallelism (TLP) and efficiently utilizing the execution resources. Besides, to amortize the increasing chip area, there have been some efforts to integrate the CPU and GPU on the same die [11], [82], [162]. Such designs need to be carefully studied because GPUs execute hundreds of threads that can monopolize on-chip shared resources (e.g. the memory controller [17], on-chip network [128] and last level cache [123]), starving CPU applications. To address this problem, researchers have worked on efficiently and fairly managing the resources shared between the CPU and GPU. Furthermore, they have worked on augmenting CPU–GPU architectures with more powerful communication mechanisms and fine-grained data sharing (e.g. a unified virtual memory space [183] and CPU–GPU cache coherence [184]). In addition, some works have investigated novel techniques to improve energy and power efficiency [156], define accurate models for performance and power [77], [133], create software frameworks that ease GPGPU programming, develop fault tolerance capabilities and improve the 3D rendering pipeline for graphics workloads.
Recently, researchers have started looking into novel architectural techniques to build large, scalable GPUs that are easy to manufacture [16], [154], investigating security breaches on modern GPUs [89], [159], as well as building software frameworks and designing novel hardware to customize GPUs for deep learning acceleration [75], [210].

In this paper, we present a survey of research works that aim to improve GPGPU performance, programmability and heterogeneity (i.e., CPU–GPU integration). Further, we introduce a classification of these works on the basis of their technical approach and key idea. Since it is not possible to review all the research works related to GPGPUs, we limit the scope of the survey to the following areas. We discuss techniques proposed for improving GPGPU performance, including (1) control flow divergence mitigation, (2) alleviating resource contention and efficiently utilizing memory bandwidth across the entire memory hierarchy, including caches, the interconnect and main memory, (3) increasing the available parallelism and concurrency, and (4) improving pipeline execution and exploiting scalarization opportunities. We also include architectural techniques that aim to improve GPGPU programmability, e.g. cache coherence, memory consistency, transactional memory, synchronization, debugging and memory management. We further survey research works that aim to enhance the on-chip integration of CPU–GPU heterogeneous architectures, including on-chip shared resource management and improving CPU–GPU programmability. While our main focus in this work is to discuss micro-architectural approaches, we may also refer to some prominent software- and compiler-based techniques related to our scope. On the other hand, we do not include studies related to performance and energy modeling, emerging memory technologies (e.g. non-volatile memory), the register file, fault tolerance, or works that only focus on improving GPU energy and power efficiency or CPU–GPU power management.1 Additionally, we only include works related to many-thread GPU-like accelerators, while works concerned with other types of accelerators, such as many-core accelerators [62], [201], are not covered in this survey.
Further, we only focus on ideas related to general purpose computing, whereas techniques that aim to improve GPUs for graphics workloads are out of the scope of this work.

The remainder of this paper is organized as follows. Section 2 presents a brief overview of the GPGPU programming model and architecture. Sections 3, 4 and 5 review techniques for improving GPGPU performance by alleviating control flow divergence (Section 3), efficiently utilizing memory bandwidth (Section 4), and increasing parallelism and improving execution pipelining (Section 5). Section 6 reviews studies on enhancing GPGPU programmability. Section 7 reviews works that aim to enhance CPU–GPU integration. Section 8 suggests future research directions, and Section 9 concludes.

Background

In this section, we give a brief overview of the GPGPU programming model and architecture. For more details, we refer the reader to [70], [114], [166].
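The programming model referenced above is not spelled out in this excerpt, so the following is a minimal, illustrative Python sketch (not real CUDA) of the CUDA-style execution hierarchy: a kernel body runs once per logical thread, and threads are grouped into blocks that form a grid. The function names (`vector_add_kernel`, `launch`) are our own for illustration.

```python
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, c):
    """Body executed by every logical thread, as a CUDA kernel would be."""
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(a):                           # bounds guard for partial blocks
        c[i] = a[i] + b[i]

def launch(kernel, grid_dim, block_dim, *args):
    """Simulate a kernel launch by iterating over every (block, thread) pair.
    A real GPU would execute these threads concurrently in hardware."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

n = 10
a = list(range(n))
b = [10 * x for x in a]
c = [0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements
launch(vector_add_kernel, grid_dim, block_dim, a, b, c)
print(c)  # each element is a[i] + b[i]
```

The bounds guard mirrors the common CUDA idiom for handling input sizes that are not a multiple of the block size.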

Control flow divergence

Control flow divergence occurs when threads in the same warp execute different control flow paths, causing significant performance loss for irregular workloads. The drawbacks of control divergence and irregular execution are four-fold. First, GPUs employ a post-dominator (PDOM) stack-based mechanism that serializes the execution of divergent paths. This serialization of divergent paths reduces the available thread level parallelism (i.e., the number of active warps at a time) which
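The serialization described above can be sketched with a toy cost model (our own simplification, not a hardware description): under stack-based reconvergence, a warp executes the taken and not-taken sides of a branch one after the other, with complementary active masks, so a divergent branch consumes the issue slots of both paths.

```python
def run_branch(warp_inputs, then_len, else_len):
    """Return the SIMD issue slots consumed by an if/else whose sides take
    `then_len` and `else_len` instructions, under PDOM-style serialization."""
    taken = [x % 2 == 0 for x in warp_inputs]  # per-thread branch outcome
    slots = 0
    if any(taken):            # execute the 'then' path with its active mask
        slots += then_len
    if not all(taken):        # then execute the 'else' path for the rest
        slots += else_len
    return slots              # the two paths reconverge afterwards

uniform_warp = [0, 2, 4, 6]      # all threads take the same path
divergent_warp = [0, 1, 2, 3]    # threads split between both paths
print(run_branch(uniform_warp, then_len=8, else_len=8))    # 8 slots
print(run_branch(divergent_warp, then_len=8, else_len=8))  # 16 slots: serialized
```

With both sides populated, the warp pays for both paths even though each thread only needs one of them, which is exactly the lost parallelism the text refers to.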

Efficient utilization of memory bandwidth

Due to massive multithreading, GPGPU caches and the memory hierarchy suffer from severe resource contention, which may degrade performance. Memory divergence is the main source of GPU resource contention, especially cache contention [195]. Memory divergence occurs when threads in the same warp access different regions of memory in the same SIMT instruction. Moreover, as we discussed earlier, GPUs are throughput-oriented architectures that run hundreds of threads simultaneously, thus many
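As an illustrative sketch of memory divergence (a simplified model, not tied to any specific GPU), a single warp-wide load generates one memory transaction per distinct cache line touched by the warp, so divergent addresses inflate the transaction count and compete for cache and DRAM bandwidth. The 128-byte line size below is an assumption typical of recent GPUs.

```python
LINE_SIZE = 128  # assumed bytes per cache line

def transactions_for_warp(addresses):
    """Number of cache-line transactions one warp-wide load needs:
    one per distinct line covered by the threads' addresses."""
    return len({addr // LINE_SIZE for addr in addresses})

warp = range(32)
coalesced = [4 * t for t in warp]        # 32 adjacent 4-byte words: one line
divergent = [4096 * t for t in warp]     # large stride: one line per thread
print(transactions_for_warp(coalesced))  # 1
print(transactions_for_warp(divergent))  # 32
```

A 32x difference in transactions per load is why coalescing is central to GPU memory bandwidth utilization.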

Increasing parallelism and improving execution pipelining

GPGPUs achieve their highest performance by running many concurrent threads on their massively parallel architecture. However, some applications have a low number of active thread blocks, due to small input sizes or the unavailability of required resources in the SM (e.g. registers or shared memory). This results in inefficient utilization of the execution units and hinders the GPU's ability to hide long memory latencies. Previous works
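The resource limits mentioned above can be sketched with a toy occupancy calculation (our own simplification; the per-SM limits below are illustrative defaults, not tied to a specific GPU generation): the number of resident thread blocks is the minimum over each resource constraint.

```python
def resident_blocks(regs_per_thread, threads_per_block, smem_per_block,
                    sm_regs=65536, sm_smem=49152, sm_max_blocks=16):
    """Thread blocks resident on one SM: the tightest of the register,
    shared-memory, and hardware block-count constraints."""
    by_regs = sm_regs // (regs_per_thread * threads_per_block)
    by_smem = sm_smem // smem_per_block if smem_per_block else sm_max_blocks
    return min(by_regs, by_smem, sm_max_blocks)

# A register-hungry kernel: only 2 blocks fit, so the available TLP
# (and with it the SM's latency-hiding ability) drops sharply.
print(resident_blocks(regs_per_thread=64, threads_per_block=512,
                      smem_per_block=16384))  # min(2, 3, 16) = 2
```

Here registers are the binding constraint; halving `regs_per_thread` would double the resident blocks until shared memory became the limit instead.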

Enhancing GPGPU programmability

GPGPU programming is hard and complex. Prior works have explored new techniques to enhance GPGPU programmability. In fact, most of these works address the same challenges found in conventional multi-core CPU programming (e.g. cache coherence, memory consistency, synchronization and transactional memory). However, this is not a trivial task for GPUs, since GPUs run thousands of threads concurrently, whereas multi-core CPUs run 4–16 threads. Building a scalable hardware to
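As a toy illustration of one such mechanism (transactional memory), the sketch below shows address-based conflict detection at commit time: a transaction commits only if no already-committed transaction has written to an address in its read set. This is our own minimal model in the spirit of the GPU hardware transactional memory proposals cited in this survey, not a description of any specific design.

```python
def try_commit(read_set, write_set, committed_writes):
    """Commit iff our reads don't overlap addresses already written by
    other committed transactions; on success, publish our writes."""
    if read_set & committed_writes:
        return False                      # conflict detected: abort and retry
    committed_writes |= write_set         # make our writes visible
    return True

committed = set()
# Transaction A reads and writes address 0x10, and commits first.
print(try_commit({0x10}, {0x10}, committed))  # True
# Transaction B also read 0x10 before A committed: conflict, must abort.
print(try_commit({0x10}, {0x20}, committed))  # False
```

Scaling this validation step to the thousands of concurrent transactions a GPU can generate is precisely what makes hardware support for it non-trivial.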

CPU–GPU heterogeneous architecture

In order to amortize the increasing die area, recent years have seen a noticeable trend in industry toward integrating CPU and GPU cores on the same chip, as can be seen in Intel's Haswell [82], AMD's accelerated processing units (APUs), such as Kaveri [11], and NVIDIA's Denver project [162]. In these architectures, concurrent CPU and GPU applications share most of the on-chip resources, such as the memory controller, interconnection network and last level cache. However, GPUs

Future directions

GPUs continue to evolve, both as new applications arise and to make executing current applications more efficient in terms of power and performance. In this section, we look at some advances in GPU research that have started recently and are expected to continue and evolve in the future.

Conclusion

Recent years have witnessed the emergence of GPUs for general purpose computing due to their massive computational power and energy efficiency. The ultimate goal of this growing interest is to make GPUs a true general purpose many-core accelerator that can be used side-by-side with the CPU to improve the performance of compute-intensive workloads and reduce energy and power consumption. That is, to efficiently utilize the emerging CPU–GPU heterogeneous architecture, we need to

Mahmoud Khairy received his B.Sc. and M.Sc. in Computer Engineering from Cairo University, Egypt. He is currently a Ph.D. student with the Electrical and Computer Engineering Department at Purdue University, US. His research interests include GPGPU architecture, FPGAs, heterogeneous architecture and emerging memory technologies.

References (276)

  • Abdolrashidi, Amir Ali, et al., Wireframe: supporting data-dependent parallelism through dependency graph execution in GPUs
  • Adriaens, Jacob T., et al., The case for GPGPU spatial multitasking
  • N. Agarwal, D. Nellans, E. Ebrahimi, T.F. Wenisch, J. Danskin, S.W. Keckler, Selective GPU caches to eliminate CPU–GPU...
  • Neha Agarwal, David Nellans, Mike O’Connor, Stephen W. Keckler, Thomas F. Wenisch, Unlocking bandwidth for GPUs in...
  • Neha Agarwal, David Nellans, Mark Stephenson, Mike O’Connor, S. Keckler, Page placement strategies for GPUs within...
  • P. Aguilera, K. Morrow, Nam Sung Kim, QoS-aware dynamic resource allocation for spatial-multitasking GPUs, in: Design...
  • Aguilera, P., et al., Fair share: allocation of GPU resources for both performance and fairness
  • Jade Alglave, Mark Batty, Alastair F. Donaldson, Ganesh Gopalakrishnan, Jeroen Ketema, Daniel Poetzl, Tyler Sorensen,...
  • J. Alsop, M.S. Orr, B.M. Beckmann, D.A. Wood, Lazy release consistency for GPUs, in: 2016 49th Annual IEEE/ACM...
  • AMD, Graphics Core Next Architecture whitepaper (2013)
  • AMD, AMD Fusion Kaveri (2014)
  • AMD, High Bandwidth Memory (2015)
  • AMD, Evergreen (2009)
  • Jayvant Anantpur, R. Govindarajan, PRO: progress aware GPU warp scheduling algorithm, in: Parallel and Distributed...
  • Arora, Manish, et al., Redefining the role of the CPU in the era of CPU–GPU integration
  • Arunkumar, Akhil, et al., MCM-GPU: multi-chip-module GPUs for continued performance scalability
  • Ausavarungnirun, Rachata, et al., Staged memory scheduling: achieving high performance and scalability in heterogeneous systems
  • Rachata Ausavarungnirun, Saugata Ghose, Onur Kayiran, Gabriel H. Loh, Chita R. Das, Mahmut T. Kandemir, Onur Mutlu,...
  • Ausavarungnirun, Rachata, et al., Mosaic: a GPU memory manager with application-transparent support for multiple page sizes
  • Awatramani, Mihir, et al., Increasing GPU throughput using kernel interleaved thread block scheduling
  • Mihir Awatramani, Xian Zhu, Joseph Zambreno, Diane Rover, Phase aware warp scheduling: Mitigating effects of phase...
  • Bakhoda, Ali, et al., Throughput-effective on-chip networks for manycore accelerators
  • Ali Bakhoda, George L. Yuan, Wilson W.L. Fung, Henry Wong, Tor M. Aamodt, Analyzing CUDA workloads using a detailed GPU...
  • Bakkum, Peter, et al., Accelerating SQL database operations on a GPU with CUDA
  • Brunie, Nicolas, et al., Simultaneous branch and warp interweaving for sustained GPU performance
  • Daniel Cederman, Philippas Tsigas, Muhammad Tayyab Chaudhry, Towards a software transactional memory for graphics...
  • Niladrish Chatterjee, Mike O’Connor, Gabriel H. Loh, Nuwan Jayasena, Rajeev Balasubramonian, Managing DRAM latency...
  • Xuhao Chen, Li-Wen Chang, Christopher I. Rodrigues, Lv Ji, Zhiying Wang, Wen-mei Hwu, Adaptive cache management for...
  • Chen, Zhongliang, et al., Characterizing scalar opportunities in GPGPU applications
  • S. Chen, L. Peng, Improving GPU hardware transactional memory performance via conflict and contention reduction, in:...
  • Chen, Sui, et al., Accelerating GPU hardware transactional memory with snapshot isolation
  • Guoyang Chen, Xipeng Shen, Free launch: Optimizing GPU dynamic kernel launches through thread reuse, in: Proceedings of...
  • Xuhao Chen, Shengzhao Wu, Li-Wen Chang, Wei-Sheng Huang, Carl Pearson, Zhiying Wang, Wen-Mei W. Hwu, Adaptive cache...
  • Hyojin Choi, Jaewoo Ahn, Wonyong Sung, Reducing off-chip memory traffic by selective cache management scheme in GPGPUs,...
  • Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, Andrew Ng, Deep learning with COTS HPC systems, in:...
  • Collange, Sylvain, et al., Dynamic detection of uniform and affine vectors in GPGPU computations
  • Daga, Mayank, et al., On the efficacy of a fused CPU+GPU processor (or APU) for parallel computing
  • William J. Dally, The end of denial architecture and the rise of throughput computing, in: Keynote Speech at Design...
  • Jeffrey R. Diamond, Donald S. Fussell, Stephen W. Keckler, Arbitrary modulus indexing, in: Proceedings of the 47th...
  • Diamos, Gregory, et al., SIMD re-convergence at thread frontiers
  • Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun,...
  • Duong, Nam, et al., Improving cache management policies using dynamic reuse distances
  • A. ElTantawy, T.M. Aamodt, MIMD synchronization on SIMT architectures, in: 2016 49th Annual IEEE/ACM International...
  • Ahmed ElTantawy, Jessica Wenjie Ma, Mike O’Connor, Tor M. Aamodt, A scalable multi-path microarchitecture for efficient...
  • Franey, Sean, et al., Accelerating atomic operations on GPGPUs
  • Fung, Wilson W.L., et al., Thread block compaction for efficient SIMT control flow
  • Fung, Wilson W.L., et al., Energy efficient GPU transactional memory via space–time optimizations
  • Fung, Wilson W.L., et al., Dynamic warp formation and scheduling for efficient GPU control flow
  • Fung, Wilson W.L., et al., Dynamic warp formation: efficient MIMD control flow on SIMD graphics hardware, ACM Trans. Archit. Code Optim. (2009)
  • Fung, Wilson W.L., et al., Hardware transactional memory for GPU architectures

Amr G. Wassal received his Ph.D. degree in Electrical and Computer Engineering from the University of Waterloo, Ontario, Canada, in 2000. He has held several senior technical positions in the industry at SiWare Systems, PMC-Sierra, and IBM Technology Group. He is currently a Professor with the Computer Engineering Department, Cairo University. He has a number of conference and journal papers and patent applications in the areas of multi-core architectures and their applications in DSP and sensor fusion.

Mohamed Zahran received his Ph.D. in Electrical and Computer Engineering from University of Maryland at College Park in 2003. He is currently a faculty member with the Computer Science Department at NYU. His research interest spans several aspects of computer architecture, such as architecture of heterogeneous systems, hardware/software interaction, and biologically-inspired architectures. Zahran is a senior member of IEEE, senior member of ACM, and Sigma Xi scientific honor society.
