Elsevier

Parallel Computing

Volume 39, Issue 2, February 2013, Pages 79-93

All-pairs computations on many-core graphics processors

https://doi.org/10.1016/j.parco.2013.01.002

Abstract

Developing high-performance applications on emerging multi- and many-core architectures requires efficient mapping techniques and architecture-specific tuning methodologies to realize performance closer to their peak compute capability and memory bandwidth. In this paper, we develop architecture-aware methods to accelerate all-pairs computations on many-core graphics processors. Pairwise computations occur frequently in numerous application areas in scientific computing. While they appear easy to parallelize, since each pairwise interaction can be computed independently of all others, developing techniques that address multi-layered memory hierarchies, map computations within the restrictions imposed by the small, low-latency on-chip memories, and strike the right balance between concurrency, reuse, and memory traffic is crucial to obtaining high performance. We present a hierarchical decomposition scheme for GPUs based on decomposition of the output matrix and input data. We demonstrate that careful tuning of the involved decomposition parameters is essential to achieve high efficiency on GPUs. We also compare the performance of our strategies with an implementation on the STI Cell processor as well as multi-core CPU parallelizations using OpenMP and Intel Threading Building Blocks.

Highlights

► We develop architecture-aware, high-performance methods to accelerate generalized all-pairs computations on graphics processors.
► We perform an in-depth analysis of how to carefully tune the parameters involved.
► We demonstrate the methods through applications in fluid dynamics and materials science.
► We compare performance on graphics processors, Cell processors, and multi-core CPUs.

Introduction

Emerging multi-core and many-core processors are increasingly being used in data- and compute-intensive applications to obtain high performance. Efficient parallelization techniques and software engineering are essential to harness the raw computational potential these architectures offer. Pairwise computations occur in numerous scientific applications across many areas, ranging from molecular dynamics and many-body simulations [1] to systems biology [2] and clustering algorithms [3]. While specifics vary, all these applications involve all-pairs computations between n entities, each described by a d-dimensional vector. Typically, n and/or d can be large, making acceleration of such computations critical to achieving high performance and enabling large-scale applications.

Other than its use in direct form, all-pairs computation, in many cases, forms a part of a more complex algorithmic strategy. The Fast Multipole Method [4] is an example of such an algorithm, where all-pairs computations are restricted to certain “neighborhoods” in a larger scheme of approximations based on a hierarchy of spatial decomposition. While such algorithmic strategies should be used whenever possible, the run-times are often dominated or co-dominated by all-pairs computations.

In this paper, we study the problem of all-pairs computations, generalizing it by assuming a computational kernel F that depends on the application. Pairwise computations have previously been studied on multi-core architectures in the context of specific applications, including on the Cell processor [5], [6], [7], [8] and graphics processors [9], [10]. As the problem arises frequently in many contexts and efficient parallelization rarely depends on the specifics of the application, we consider the problem in its abstract form and develop common architecture-aware algorithmic strategies to extract maximum performance. In earlier work, we developed efficient schemes for this problem on the Cell processor [21], [11], [5]. In this paper, we focus on developing methods for all-pairs computations on many-core graphics processors.

We describe the generalized all-pairs computations problem in Section 2. Given the fine-grain parallelism offered by graphics processors, we propose a hierarchical decomposition scheme in Section 3 to perform all-pairs computations efficiently on graphics processors, taking the memory hierarchy into account. In Section 4 we provide an in-depth analysis of how performance varies as a function of the various decomposition parameter values, an issue many GPU implementations overlook. We show that a careful choice of these parameter values is essential to achieving maximum performance, in some cases improving performance by an order of magnitude, and demonstrate how to tune these parameters for a given GPU architecture. We review our decomposition scheme for the coarse-grained thread parallelism offered by the STI Cell processor in Section 5 and use this implementation as a comparison point for our GPU schemes. Further, we also compare GPU performance with parallel implementations using OpenMP and Intel Threading Building Blocks on general-purpose multi-core CPUs in Section 6.

Generalized pairwise computations

The problem of all-pairs computation can be abstracted as follows: given two input matrices M1 and M2, of sizes n1×d and n2×d respectively, compute an output matrix D of size n1×n2 where D[i,j] = F(M1[i, 0…(d−1)], M2[j, 0…(d−1)]). Here, F is a computational kernel function, and Mk[i, 0…(d−1)] = (Mk[i,0], Mk[i,1], …, Mk[i,d−1]) denotes the i-th row of Mk, a d-dimensional vector. See Fig. 1 for an illustration. In general, n1, n2 and d can be arbitrary, and F can be any binary function.

Computing D requires n1·n2 evaluations of the kernel F, one per output entry.

Developing an efficient scheme on GPUs

We develop our schemes for all-pairs computation based on the NVIDIA GPGPU and the CUDA programming model [12]. GPUs provide fine-grained parallelism, with potentially thousands of threads running simultaneously. The basic architecture of an NVIDIA GPU consists of an array of Streaming Multiprocessors (SMs), or simply multiprocessors, where each SM consists of a number of scalar processors (SPs), such as eight in older GPU chips and 32 in the Fermi architecture. A small on-chip shared memory is

Analyzing performance on GPU

Decomposing the input vectors and output computations into slices, tiles, and subtiles raises the question of choosing optimal values for the various parameters r, c, s, and d to obtain the highest possible performance. In this section, we address this question, leveraging the architectural features and constraints of a generic NVIDIA GPU.

In the following, we conduct experiments on two different NVIDIA GPU architectures. The first platform is a 2.0 GHz quadcore Intel Xeon (Nehalem

An efficient scheme for the Cell processor

The IBM Cell processor [19], [20] is a heterogeneous multi-core processor, offering high performance through specialized vector processing cores (SPEs). In previous work, we developed an optimal scheme for scheduling all-pairs computations on the Cell processor. Here we give a brief high-level overview of the scheme for comparison with the GPU parallelization and to aid in understanding the performance comparisons provided subsequently. Further details can be found in [21], [11], [5].

Performance analysis

In this section, we present performance results of the proposed GPU-based all-pairs computation method for different problem sizes, dimensionality, and precision. We use the parameters c = 16, r = 6, s = 4 and d = 50, which our experiments identified as an optimal choice. We further compare GPU performance with the Cell, using the scheme described in Section 5, and with multi-core implementations using OpenMP and Intel Threading Building Blocks (TBB). For the latter comparison, we

Conclusions

In this paper we developed efficient and scalable architecture-aware techniques for scheduling generalized all-pairs computations on graphics processors. These techniques are based on decomposing the output matrix into tiles, thread blocks, and subtiles, and decomposing the input into dimensional slices. We focus on making the most efficient use of the available memory hierarchies and on minimizing the number of memory transfers. This is crucial for

Acknowledgments

The authors thank Baskar Ganapathysubramanian for providing input data sets from flapping-wing MAV simulations and microstructure samples. The all-pairs computations work on the Cell processor was done previously in collaboration with Jaroslaw Zola. The authors also acknowledge Georgia Institute of Technology, its STI Center of Competence, and the National Science Foundation, for the use of Cell resources that have contributed to this research.

References (25)

  • B. Hendrickson et al., Parallel many-body simulations without all-to-all communication, Journal of Parallel and Distributed Computing (1995)
  • B. Ganapathysubramanian et al., A non-linear dimension reduction methodology for generating data-driven stochastic input models, Journal of Computational Physics (2008)
  • J. Zola, M. Aluru, S. Aluru, Parallel information theory based construction of gene regulatory networks, in:...
  • P. Berkhin, A survey of clustering data mining techniques
  • M. Vikram, A. Baczewzki, B. Shanker, S. Aluru, Parallel accelerated Cartesian expansions for particle dynamics...
  • J. Zola, A. Sarje, S. Aluru, Constructing gene regulatory networks on clusters of Cell processors, in: International...
  • J. Zola et al., Parallel information-theory-based construction of genome-wide gene regulatory networks, IEEE Transactions on Parallel and Distributed Systems (2010)
  • N. Arora, A. Shringarpure, R. Vuduc, Direct N-body kernels for multicore platforms, in: Proceedings of International...
  • A. Wirawan, B. Schmidt, C.K. Kwoh, Pairwise distance matrix computation for multiple sequence alignment on the Cell...
  • S. Barrachina et al., Exploiting the capabilities of modern GPUs for dense matrix computations, Concurrency and Computation: Practice and Experience (2009)
  • D. Chang, A.H. Desoky, M. Ouyang, E.C. Rouchka, Compute pairwise Manhattan distance and Pearson correlation coefficient...
  • A. Sarje et al., Accelerating pairwise computations on Cell processors, IEEE Transactions on Parallel and Distributed Systems (2011)