We would like to welcome you to the proceedings of the 6th Annual Workshop on General Purpose Processing using Graphics Processors. We have another strong program, including a keynote by Robert Geva of Intel on the programming model for the Intel Xeon Phi accelerator and presentations of 15 of the 37 submitted papers.
Proceeding Downloads
Comparison based sorting for systems with multiple GPUs
As a basic building block of many applications, sorting algorithms that run efficiently on modern machines are key to the performance of these applications. With the recent shift to using GPUs for general purpose computing, researchers have proposed ...
Reducing divergence in GPGPU programs with loop merging
Branch divergence can incur a high performance penalty on GPGPU programs. We propose a software optimization, called loop merging, that aims to reduce divergence caused by a loop's trip count varying across the threads of a warp. This optimization merges the ...
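To make the divergence problem concrete, here is a minimal CUDA sketch (not the paper's loop-merging transformation) of a kernel whose per-thread trip count differs within a warp; the kernel and variable names are hypothetical.

    // Hypothetical kernel: each thread loops len[i] times. Threads of the same
    // warp that finish early must wait while the warp keeps iterating for the
    // threads with the largest trip counts -- the divergence penalty that
    // loop merging targets.
    __global__ void varying_trip_count(const int *len, const float *in,
                                       float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float acc = 0.0f;
        for (int k = 0; k < len[i]; ++k)   // trip count varies across the warp
            acc += in[i] * k;
        out[i] = acc;
    }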
Split tiling for GPUs: automatic parallelization using trapezoidal tiles
Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the ...
Formalizing address spaces with application to Cuda, OpenCL, and beyond
Cuda and OpenCL are aimed at programmers developing parallel applications targeting GPUs and embedded micro-processors. These systems often have explicitly managed memories exposed directly through a notion of disjoint address spaces. OpenCL address ...
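As background, the following minimal CUDA sketch (illustrative only, not the paper's formal model) shows three of the address spaces such a formalization has to relate: constant memory, per-block shared memory, and global memory.

    // Hypothetical kernel touching distinct address spaces. A pointer into one
    // space is not generally meaningful in another, which is exactly what a
    // formal treatment of address spaces must capture.
    __constant__ float scale;                  // constant address space

    __global__ void scale_and_reduce(const float *in, float *out, int n)
    {
        __shared__ float tile[256];            // shared (on-chip) address space
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // launch with 256 threads/block

        tile[threadIdx.x] = (i < n) ? in[i] * scale : 0.0f;   // global -> shared
        __syncthreads();

        if (threadIdx.x == 0) {
            float sum = 0.0f;
            for (int k = 0; k < blockDim.x; ++k)
                sum += tile[k];
            out[blockIdx.x] = sum;             // shared -> global
        }
    }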
Memory reuse optimizations in the R-Stream compiler
We propose a new set of automated techniques to optimize memory reuse in programs with explicitly managed memory. Our techniques are inspired by hand-tuned seismic kernels on GPUs. The solutions we develop reduce the cost of transferring data across ...
Valar: a benchmark suite to study the dynamic behavior of heterogeneous systems
Heterogeneous systems have grown in popularity within the commercial platform and application developer communities. We have seen a growing number of systems incorporating CPUs, Graphics Processors (GPUs), and Accelerated Processing Units (APUs), which combine a ...
Input-aware auto-tuning for directive-based GPU programming
The difficulties posed by GPGPU programming and the need to increase productivity have guided research towards directive-based high-level programs for accelerators. This effort has led to the definition of the OpenACC industry standard. It significantly ...
Betweenness centrality on GPUs and heterogeneous architectures
The betweenness centrality metric has long been of interest for graph analysis and is used in various applications. Yet, it is one of the most computationally expensive kernels in graph mining. In this work, we investigate a set of techniques to make the ...
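For reference, the standard definition of the betweenness centrality of a vertex v (general background, not specific to this paper) is

    BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}

where \sigma_{st} is the number of shortest paths between s and t, and \sigma_{st}(v) is the number of those paths passing through v; computing it exactly requires a shortest-path traversal from every source vertex, which is what makes the kernel expensive.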
OpenCL C++
With the success of programming models such as Khronos' OpenCL, heterogeneous computing is going mainstream. However, these models are low-level, even when considering them as systems programming models. For example, OpenCL is effectively an extended ...
Atomic-free irregular computations on GPUs
Atomic instructions are a key ingredient of codes that operate on irregular data structures like trees and graphs. It is well known that atomics can be expensive, especially on massively parallel GPUs, and are often on the critical path of a program. In ...
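The kind of contended atomic update in question can be illustrated with a minimal CUDA sketch (hypothetical names, not the paper's code): each thread handles one edge of a graph and atomically accumulates into its destination vertex.

    // Hypothetical irregular kernel: conflicting updates to vertex_val are
    // serialized by atomicAdd, which can put the atomic on the critical path;
    // atomic-free formulations restructure the computation to avoid this.
    __global__ void scatter_add(const int *dst, const float *weight,
                                float *vertex_val, int num_edges)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e < num_edges)
            atomicAdd(&vertex_val[dst[e]], weight[e]);   // contended atomic update
    }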
Accelerating simulation of agent-based models on heterogeneous architectures
The wide usage of GPGPU programming models and compiler techniques enables the optimization of data-parallel programs on commodity GPUs. However, mapping GPGPU applications running on discrete parts to emerging integrated heterogeneous architectures ...
Fast dynamic memory allocator for massively parallel architectures
Dynamic memory allocation in massively parallel systems often suffers from drastic performance decreases due to the required global synchronization. This is especially true when many allocation or deallocation requests occur in parallel. We propose a ...
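For context, CUDA already exposes a device-side heap, and a minimal sketch of its use (hypothetical kernel, not the proposed allocator) shows where the synchronization cost arises: every thread's request goes through one device-wide allocator.

    // Hypothetical kernel: each thread allocates a small block from the
    // device-wide heap. Under heavy parallel allocation traffic these requests
    // contend on the allocator's internal synchronization, which is the
    // overhead a massively parallel allocator aims to remove.
    __global__ void alloc_per_thread(int **slots, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        int *p = (int *)malloc(16 * sizeof(int));   // device-side malloc
        if (p) {
            p[0] = i;
            slots[i] = p;        // a later kernel would free(slots[i])
        }
    }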
Accelerating financial applications on the GPU
The QuantLib library is a popular library used for many areas of computational finance. In this work, the parallel processing power of the GPU is used to accelerate QuantLib financial applications. Black-Scholes, Monte-Carlo, Bonds, and Repo code paths ...
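Of the code paths named, Black-Scholes is the simplest to picture on a GPU: each option can be priced independently from the closed-form solution. The following minimal CUDA kernel (illustrative only; the parameter layout and names are hypothetical, not QuantLib's API) prices European calls, one option per thread.

    // C = S*N(d1) - K*exp(-r*T)*N(d2), with
    // d1 = (ln(S/K) + (r + 0.5*sigma^2)*T) / (sigma*sqrt(T)) and d2 = d1 - sigma*sqrt(T).
    __device__ float norm_cdf(float x)                 // standard normal CDF via erff
    {
        return 0.5f * (1.0f + erff(x * 0.70710678f));  // 0.70710678 = 1/sqrt(2)
    }

    __global__ void bs_call(const float *S, const float *K, const float *T,
                            float r, float sigma, float *price, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float sqrtT = sqrtf(T[i]);
        float d1 = (logf(S[i] / K[i]) + (r + 0.5f * sigma * sigma) * T[i]) / (sigma * sqrtT);
        float d2 = d1 - sigma * sqrtT;
        price[i] = S[i] * norm_cdf(d1) - K[i] * expf(-r * T[i]) * norm_cdf(d2);
    }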
Exploring GPU architectures to accelerate semantic comparison for intention-based search
Semantic comparison is the basic computational task behind the meaningful search techniques being deployed by most new search engines. This report presents a performance comparison of three GPU architectures implementing semantic comparison. We have ...
Warp size impact in GPUs: large or small?
There are a number of design decisions that impact a GPU's performance. Among them, choosing the right warp size can deeply influence the rest of the design. Small warps reduce the performance penalty associated with branch divergence at the ...
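The trade-off hinges on branch divergence, which a minimal CUDA fragment (hypothetical, for illustration) makes concrete: whenever threads of one warp disagree on a data-dependent branch, the warp executes both paths serially, and wider warps disagree more often.

    // Hypothetical kernel with a data-dependent branch. With a 32-wide warp,
    // any warp whose threads see different flag values runs both branches back
    // to back; smaller warps make such disagreement less likely, at the cost
    // of less lockstep (SIMD) efficiency.
    __global__ void branchy(const int *flag, float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (flag[i])                // data-dependent branch -> possible divergence
            x[i] *= 2.0f;
        else
            x[i] += 1.0f;
    }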
Index Terms
- Proceedings of the 6th Workshop on General Purpose Processing Using Graphics Processing Units