A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity
Introduction
Graphics Processing Units (GPUs) have been used for several years as fixed-function hardware accelerators for 3D graphics applications. Earlier generations of GPUs were designed to implement the conventional 3D rendering pipeline [103], [143]. However, the high computational power that GPUs can achieve compared with traditional multicore Central Processing Units (CPUs) encouraged developers to use GPUs for compute-intensive non-graphics workloads [222]. At that time, the term General Purpose computing using Graphics Processing Units (GPGPU) emerged widely. Programmers used graphics APIs (e.g. Direct3D or OpenGL) to access the shader cores; they had to map program data appropriately onto the available shader buffers and shepherd the data accurately through the graphics pipeline. Obviously, using graphics APIs for non-graphics general purpose programming was a very difficult task. However, with some heroic efforts, considerable speedups were achieved [173]. This trend prompted GPU vendors to build more programmable GPU architectures, known as unified shader architectures (e.g. NVIDIA’s Tesla [144], NVIDIA’s Fermi [161] and AMD’s Evergreen [13]), and to release friendlier high-level APIs that facilitate GPGPU programming (e.g. NVIDIA’s CUDA [166], AMD’s CTM [71] and OpenCL [102]). Since then, a new era of GPGPU architecture and programming has been unleashed and is still evolving to this day [38], [99], [161].
GPU acceleration has been widely adopted in high-performance computing (HPC) applications, such as computer vision, graph processing, biomedical, financial analysis and physical simulation [80], [81]. This is because GPUs achieve tremendous computational power and efficient performance-per-watt compared to conventional multicore CPUs. Thus, it is no wonder that a large portion of the supercomputers on the Top500 list rely on GPUs [224]. Moreover, the scope of applications that benefit from GPU acceleration has expanded rapidly during the last decade to include server and cloud workloads [66], [74], database processing [24], [248] and deep machine learning [35]. However, because GPUs were initially designed to execute regular streaming applications, like graphics workloads, they are still not effective at accelerating some emerging data-intensive workloads, due to the lack of irregular execution support, the memory bandwidth bottleneck and the complexity of GPGPU programming.
In order to improve the performance, energy efficiency and programmability of GPUs for emerging data-intensive workloads, researchers have been diligently working on enhancing GPU architecture for general purpose computing. Fig. 1 depicts the number of GPGPU-related research papers published during the last decade at the top-tier computer architecture conferences. As shown in the figure, there has been growing interest in improving GPGPU architecture during the last five years: 28 and 29 research papers were published in 2016 and 2017 respectively, which represents nearly 16% of the total number of papers.
Fig. 2 characterizes and divides these works into different categories. As we can see, there has been noticeable interest in improving the performance of GPGPUs by mitigating the impact of control flow divergence [46], [48], [194], [216], [221], alleviating on-chip resource contention [97], [160], [195], [274] and improving memory hierarchy performance [22], [111], [126], [193], [221], [269]. Since GPGPU programming is complex, researchers have worked on enhancing GPU programmability by equipping GPUs with architectural support for data sharing and synchronization (e.g. cache coherence [209] and transactional memory [68], [208], [213]). They have also investigated new techniques to boost GPGPU concurrency and multitasking [6], [218], [255], increasing the available thread level parallelism (TLP) and efficiently utilizing the execution resources. Besides, to amortize the increasing chip area, there have been efforts to integrate the CPU and GPU on the same die [11], [82], [162]. Such designs need to be studied carefully because GPUs execute hundreds of threads that can monopolize on-chip shared resources (e.g. the memory controller [17], on-chip network [128] and last level cache [123]), starving co-running CPU applications. To address this problem, researchers have worked on efficiently and fairly managing the resources shared between the CPU and GPU. Furthermore, they have worked on augmenting CPU–GPU architectures with more powerful communication mechanisms and fine-grained data sharing (e.g. unified virtual memory space [183] and CPU–GPU cache coherence [184]). In addition, some works have investigated novel techniques to improve energy and power efficiency [156], define accurate performance and power models [77], [133], create software frameworks to ease GPGPU programming, develop fault tolerance capabilities and improve the 3D rendering pipeline for graphics workloads.
Recently, researchers have started looking into novel architectural techniques to build large, scalable GPUs that are easy to manufacture [16], [154], investigating security breaches on modern GPUs [89], [159], as well as building software frameworks and designing novel hardware to customize GPUs for deep learning acceleration [75], [210].
In this paper, we present a survey of research works that aim to improve GPGPU performance, programmability and heterogeneity (i.e., CPU–GPU integration). Further, we introduce a classification of these works on the basis of their technical approach and key idea. Since it is not possible to review all the research works related to GPGPUs, we limit the scope of the survey to the following areas. We discuss techniques proposed for improving GPGPU performance, including (1) control flow divergence mitigation, (2) alleviating resource contention and efficiently utilizing memory bandwidth across the entire memory hierarchy, including caches, interconnection and main memory, (3) increasing the available parallelism and concurrency, and (4) improving pipeline execution and exploiting scalarization opportunities. We also include architectural techniques that aim to improve GPGPU programmability, e.g. cache coherence, memory consistency, transactional memory, synchronization, debugging and memory management. We further survey research works that aim to enhance on-chip integration of CPU–GPU heterogeneous architectures, including on-chip shared resource management and improving CPU–GPU programmability. While our main focus is on micro-architectural approaches, we may also refer to some prominent software- and compiler-based techniques related to our scope. On the other hand, we do not include studies related to performance and energy modeling, emerging memory technologies (e.g. non-volatile memory), the register file, fault tolerance, or works that focus only on improving GPU energy and power efficiency or CPU–GPU power management. Additionally, we only cover works related to many-thread GPU-like accelerators, while works concerned with other types of accelerators, such as many-core accelerators [62], [201], are not covered in this survey.
Further, we only focus on ideas related to general purpose computing, whereas techniques which aim to improve GPUs for graphics workloads are out of scope in this work.
The remainder of this paper is organized as follows. Section 2 presents a brief overview of the GPGPU programming model and architecture. Sections 3 (Control flow divergence), 4 (Efficient utilization of memory bandwidth) and 5 (Increasing parallelism and improving execution pipelining) review techniques for improving GPGPU performance by alleviating control flow divergence, efficiently utilizing memory bandwidth and increasing parallelism, respectively. Section 6 reviews studies on enhancing GPGPU programmability. Section 7 reviews works that aim to enhance CPU–GPU integration. Section 8 suggests future research directions, and Section 9 concludes.
Background
In this section, we give a brief overview of the GPGPU programming model and architecture. For more details, we refer the reader to [70], [114], [166].
Control flow divergence
Control flow divergence occurs when threads in the same warp execute different control flow paths, causing significant performance degradation for irregular workloads. The drawbacks of control flow divergence and irregular execution are four-fold. First, GPUs employ a PDOM stack-based mechanism that serializes the execution of divergent paths. This serialization reduces the available thread level parallelism (i.e., the number of active warps at a time), which degrades the GPU’s ability to hide long latencies.
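The serialization described above can be sketched as follows. This is a minimal, illustrative Python model of a PDOM-style reconvergence stack for a single (hypothetical 4-thread) warp; the names and structure are ours, not those of any real GPU ISA:

```python
# Minimal sketch of PDOM stack-based SIMT execution for one warp.
# A divergent branch pushes both paths onto a reconvergence stack;
# each path then runs with a partial active mask, so the two paths
# execute serially rather than in parallel.

WARP_SIZE = 4  # illustrative warp width (real GPUs use e.g. 32)

def run_divergent_branch(cond):
    """cond[i] is True if thread i takes the 'then' path.
    Returns the execution trace: one (path, active_lanes) entry per pass."""
    taken = [i for i in range(WARP_SIZE) if cond[i]]
    not_taken = [i for i in range(WARP_SIZE) if not cond[i]]
    stack = []
    if taken and not_taken:          # divergence: serialize both paths
        stack.append(("else", not_taken))
        stack.append(("then", taken))
    else:                            # uniform branch: single full-width pass
        stack.append(("then" if taken else "else", taken or not_taken))
    trace = []
    while stack:                     # execute one path at a time
        path, mask = stack.pop()
        trace.append((path, mask))   # only lanes in 'mask' are active
    return trace

# Divergent warp: two serialized passes, each with half the lanes idle.
print(run_divergent_branch([True, True, False, False]))
# Uniform warp: a single pass with all lanes active.
print(run_divergent_branch([True, True, True, True]))
```

The divergent case takes two passes over the pipeline with half the lanes masked off each time, which is exactly the lost thread level parallelism the text refers to.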
Efficient utilization of memory bandwidth
GPGPU caches and the memory hierarchy suffer from severe resource contention, which may degrade performance due to the massive multithreading. Memory divergence is the main source of GPU resource contention, especially cache contention [195]. Memory divergence occurs when threads in the same warp access different regions of memory in the same SIMT instruction. Moreover, as we discussed earlier, GPUs are throughput-oriented architectures that run hundreds of threads simultaneously, thus many threads contend for the limited cache capacity and memory bandwidth.
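The cost of memory divergence can be illustrated with a small sketch that counts how many memory transactions a single warp-wide load generates, assuming (illustratively) 128-byte memory transactions and a 32-thread warp:

```python
# Sketch of memory-divergence cost: a warp's SIMT load is serviced by
# one transaction per distinct 128-byte segment touched by its threads.
# The segment size and warp width are illustrative assumptions.

SEGMENT = 128  # bytes per memory transaction

def transactions(addresses):
    """Number of memory transactions one warp's SIMT load generates."""
    return len({addr // SEGMENT for addr in addresses})

# Coalesced: 32 threads read consecutive 4-byte words -> 1 transaction.
coalesced = [4 * i for i in range(32)]
# Divergent: each thread reads a different 128-byte region -> 32 transactions.
divergent = [128 * i for i in range(32)]
print(transactions(coalesced), transactions(divergent))  # 1 32
```

A fully divergent access pattern thus consumes up to 32x the memory bandwidth of a coalesced one for the same amount of useful data, which is why memory divergence is a primary source of cache and bandwidth contention.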
Increasing parallelism and improving execution pipelining
GPGPUs achieve their highest performance by running many concurrent threads on their massively parallel architecture. However, some applications have a low number of active thread blocks due to a small input size or the unavailability of some required resources in the SM (e.g. registers or shared memory). This results in inefficient utilization of the execution units and hinders the GPU’s ability to hide long memory latencies. Previous works have therefore proposed techniques to increase the available parallelism and improve execution pipelining.
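The resource limits mentioned above can be sketched with a simple residency calculation. The per-SM limits below are illustrative round numbers, not those of a specific GPU; the point is that the tightest resource bounds the number of resident blocks:

```python
# Sketch of how per-SM resources bound the number of resident thread
# blocks. The SM limits are illustrative, not a specific GPU's values.

SM_REGISTERS   = 65536   # 32-bit registers per SM
SM_SHARED_MEM  = 49152   # bytes of shared memory per SM
SM_MAX_THREADS = 2048    # resident threads per SM

def resident_blocks(threads_per_block, regs_per_thread, smem_per_block):
    """Thread blocks that fit on one SM: the tightest resource wins."""
    by_regs    = SM_REGISTERS // (regs_per_thread * threads_per_block)
    by_smem    = SM_SHARED_MEM // smem_per_block if smem_per_block else float("inf")
    by_threads = SM_MAX_THREADS // threads_per_block
    return min(by_regs, by_smem, by_threads)

# A register-hungry kernel: registers cap residency at 2 blocks (512
# threads), far below the 2048-thread limit, leaving the SM underutilized.
print(resident_blocks(threads_per_block=256, regs_per_thread=128, smem_per_block=0))
```

With so few resident warps, the SM has little independent work to switch to while a memory request is outstanding, which is the latency-hiding failure the text describes.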
Enhancing GPGPU programmability
GPGPU programming is hard and complex. Prior works have explored new techniques to enhance GPGPU programmability. In fact, most of these works address the same challenges found in conventional multi-core CPU programming (e.g. cache coherence, memory consistency, synchronization and transactional memory). However, this is not a trivial task for GPUs, since GPUs run thousands of threads concurrently, whereas multi-core CPUs run 4–16 threads, and building scalable hardware to support such features at that scale is a significant challenge.
CPU–GPU heterogeneous architecture
In order to amortize the increasing die area, recent years have seen a noticeable trend in industry toward integrating CPU and GPU cores on the same chip, as can be seen in Intel’s Haswell [82], AMD’s accelerated processing units (APUs), like AMD Fusion Kaveri [11], and NVIDIA’s Denver project [162]. In these architectures, concurrent CPU and GPU applications share most of the on-chip resources, such as the memory controller, interconnection network and last level cache. However, GPUs execute hundreds of threads that can monopolize these shared resources and starve co-running CPU applications.
Future directions
GPUs continue to evolve as new applications arise or to execute current applications more efficiently in terms of power and performance. In this section, we look at some advances in GPU research that have recently begun and are expected to continue and evolve in the future.
Conclusion
Recent years have witnessed the emergence of GPUs for general purpose computing due to their massive computational power and energy efficiency. The ultimate goal of this growing interest is to make GPUs a true general purpose many-core accelerator that can be used side-by-side with the CPU to improve the performance of compute-intensive workloads and reduce energy and power consumption. That is, to efficiently utilize the emerging CPU–GPU heterogeneous architecture, we need to improve GPGPU performance, programmability and CPU–GPU integration in tandem.
Mahmoud Khairy received his B.Sc. and M.Sc. in Computer Engineering from Cairo University, Egypt. He is currently a Ph.D. student with the Electrical and Computer Engineering Department at Purdue University, US. His research interests include GPGPU architecture, FPGAs, heterogeneous architecture and emerging memory technologies.
References (276)
- et al., Wireframe: Supporting data-dependent parallelism through dependency graph execution in GPUs
- et al., The case for GPGPU spatial multitasking
- N. Agarwal, D. Nellans, E. Ebrahimi, T.F. Wenisch, J. Danskin, S.W. Keckler, Selective GPU caches to eliminate CPU–GPU...
- Neha Agarwal, David Nellans, Mike O’Connor, Stephen W. Keckler, Thomas F. Wenisch, Unlocking bandwidth for GPUs in...
- Neha Agarwal, David Nellans, Mark Stephenson, Mike O’Connor, S. Keckler, Page placement strategies for GPUs within...
- P. Aguilera, K. Morrow, Nam Sung Kim, QoS-aware dynamic resource allocation for spatial-multitasking GPUs, in: Design...
- et al., Fair share: Allocation of GPU resources for both performance and fairness
- Jade Alglave, Mark Batty, Alastair F. Donaldson, Ganesh Gopalakrishnan, Jeroen Ketema, Daniel Poetzl, Tyler Sorensen,...
- J. Alsop, M.S. Orr, B.M. Beckmann, D.A. Wood, Lazy release consistency for GPUs, in: 2016 49th Annual IEEE/ACM...
- Graphics Core Next Architecture whitepaper (2013)
- AMD Fusion Kaveri
- High bandwidth memory
- Redefining the role of the CPU in the era of CPU–GPU integration
- MCM-GPU: Multi-chip-module GPUs for continued performance scalability
- Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems
- Mosaic: A GPU memory manager with application-transparent support for multiple page sizes
- Increasing GPU throughput using kernel interleaved thread block scheduling
- Throughput-effective on-chip networks for manycore accelerators
- Accelerating SQL database operations on a GPU with CUDA
- Simultaneous branch and warp interweaving for sustained GPU performance
- Characterizing scalar opportunities in GPGPU applications
- Accelerating GPU hardware transactional memory with snapshot isolation
- Dynamic detection of uniform and affine vectors in GPGPU computations
- On the efficacy of a fused CPU+GPU processor (or APU) for parallel computing
- SIMD re-convergence at thread frontiers
- Improving cache management policies using dynamic reuse distances
- Accelerating atomic operations on GPGPUs
- Thread block compaction for efficient SIMT control flow
- Energy efficient GPU transactional memory via space–time optimizations
- Dynamic warp formation and scheduling for efficient GPU control flow
- Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware, ACM Trans. Archit. Code Optim.
- Hardware transactional memory for GPU architectures
Amr G. Wassal received his Ph.D. degree in Electrical and Computer Engineering from the University of Waterloo, Ontario, Canada, in 2000. He has held several senior technical positions in the industry at SiWare Systems, PMC-Sierra, and IBM Technology Group. He is currently a Professor with the Computer Engineering Department, Cairo University. He has a number of conference and journal papers and patent applications in the areas of multi-core architectures and their applications in DSP and sensor fusion.
Mohamed Zahran received his Ph.D. in Electrical and Computer Engineering from University of Maryland at College Park in 2003. He is currently a faculty member with the Computer Science Department at NYU. His research interest spans several aspects of computer architecture, such as architecture of heterogeneous systems, hardware/software interaction, and biologically-inspired architectures. Zahran is a senior member of IEEE, senior member of ACM, and Sigma Xi scientific honor society.