Elsevier

Parallel Computing

Volume 39, Issue 2, February 2013, Pages 79-93

All-pairs computations on many-core graphics processors

https://doi.org/10.1016/j.parco.2013.01.002

Abstract

Developing high-performance applications on emerging multi- and many-core architectures requires efficient mapping techniques and architecture-specific tuning methodologies to realize performance closer to their peak compute capability and memory bandwidth. In this paper, we develop architecture-aware methods to accelerate all-pairs computations on many-core graphics processors. Pairwise computations occur frequently in numerous application areas in scientific computing. While they appear easy to parallelize, since each pairwise interaction can be computed independently of all others, developing techniques that address multi-layered memory hierarchies, map computations within the restrictions imposed by the small, low-latency on-chip memories, and strike the right balance between concurrency, reuse, and memory traffic is crucial to obtaining high performance. We present a hierarchical decomposition scheme for GPUs based on decomposition of the output matrix and input data. We demonstrate that careful tuning of the involved decomposition parameters is essential to achieve high efficiency on GPUs. We also compare the performance of our strategies with an implementation on the STI Cell processor as well as multi-core CPU parallelizations using OpenMP and Intel Threading Building Blocks.

Highlights

► We develop architecture-aware, high-performance methods to accelerate generalized all-pairs computations on graphics processors.
► We perform an in-depth analysis of how to carefully tune the parameters involved.
► We demonstrate the methods through applications in fluid dynamics and materials science.
► We compare performance on graphics processors, Cell processors, and multi-core CPUs.

Introduction

Emerging multi-core and many-core processors are increasingly being used in data- and compute-intensive applications to obtain high performance. Efficient parallelization techniques and software engineering are essential to harness the raw computational potential these architectures offer. Pairwise computations occur in numerous scientific applications across many areas, ranging from molecular dynamics and many-body simulations [1] to systems biology [2] and clustering algorithms [3]. While specifics vary, all these applications involve all-pairs computations between n entities, each described by a d-dimensional vector. Typically, n and/or d can be large, making acceleration of such computations critical to achieving high performance and enabling large-scale applications.

Other than its use in direct form, all-pairs computation, in many cases, forms a part of a more complex algorithmic strategy. The Fast Multipole Method [4] is an example of such an algorithm, where all-pairs computations are restricted to certain “neighborhoods” in a larger scheme of approximations based on a hierarchy of spatial decomposition. While such algorithmic strategies should be used whenever possible, the run-times are often dominated or co-dominated by all-pairs computations.

In this paper, we study the problem of all-pairs computations, generalizing it by assuming a computational kernel F that depends on the application. Pairwise computations have previously been studied on multi-core architectures in the context of specific applications, including on the Cell processor [5], [6], [7], [8] and graphics processors [9], [10]. As the problem arises frequently in many contexts and efficient parallelization rarely depends on the specifics of the application, we consider the problem in its abstract form and develop common architecture-aware algorithmic strategies to extract maximum performance. In earlier work, we developed efficient schemes for this problem on the Cell processor [21], [11], [5]. In this paper, we focus on developing methods for all-pairs computations on many-core graphics processors.

We describe the generalized all-pairs computations problem in Section 2. Given the fine-grain parallelism offered by graphics processors, we propose a hierarchical decomposition scheme in Section 3 to perform all-pairs computations efficiently on graphics processors, taking the memory hierarchy into account. In Section 4 we provide an in-depth analysis of how performance varies as a function of the various decomposition parameter values, an issue many GPU implementations overlook. We show that a careful choice of these parameter values is essential to achieving maximum performance, in some cases improving performance by an order of magnitude, and demonstrate how to tune these parameters for a given GPU architecture. We review our decomposition scheme for the coarse-grained thread parallelism offered by the STI Cell processor in Section 5 and use this implementation as a comparison point for our GPU schemes. Further, we also compare GPU performance with parallel implementations using OpenMP and Intel Threading Building Blocks on general-purpose multi-core CPUs in Section 6.

Generalized pairwise computations

The problem of all-pairs computation can be abstracted as follows: given two input matrices M1 and M2, of sizes n1×d and n2×d respectively, compute an output matrix D of size n1×n2 where D[i,j] = F(M1[i, 0…(d−1)], M2[j, 0…(d−1)]). Here, F is a computational kernel function, and Mk[i, 0…(d−1)] = (Mk[i,0], Mk[i,1], …, Mk[i,d−1]) denotes the i-th row of Mk, a d-dimensional vector. See Fig. 1 for an illustration. In general, n1, n2 and d can be arbitrary, and F can be any binary function.

Computing D requires n1·n2 evaluations of the kernel F, one per output entry.

Developing an efficient scheme on GPUs

We develop our schemes for all-pairs computation based on the NVIDIA GPGPU and the CUDA programming model [12]. GPUs provide fine-grained parallelism, with potentially thousands of threads running simultaneously. The basic architecture of an NVIDIA GPU consists of an array of Streaming Multiprocessors (SMs), or simply multiprocessors, where each SM consists of a number of scalar processors (SPs), such as eight in older GPU chips and 32 in the Fermi architecture. A small on-chip shared memory is

Analyzing performance on GPU

Decomposing the input vectors and output computations into slices, tiles, and subtiles raises the question of choosing optimal values for the various parameters r, c, s, and d to obtain the highest possible performance. In this section, we address this question, leveraging the architectural features and constraints of a generic NVIDIA GPU.

In the following, we conduct experiments on two different NVIDIA GPU architectures. The first platform is a 2.0 GHz quadcore Intel Xeon (Nehalem

An efficient scheme for the Cell processor

The IBM Cell processor [19], [20] is a heterogeneous multi-core processor, offering high performance through specialized vector processing cores (SPEs). In previous work, we developed an optimal scheme for scheduling all-pairs computations on the Cell processor. Here we give a brief high-level overview of the scheme for comparison with the GPU parallelization and to aid in understanding the performance comparisons provided subsequently. Further details can be found in [21], [11], [5].

Performance analysis

In this section, we present performance results of the proposed GPU-based all-pairs computation method for different problem sizes, dimensionality, and precision. We use the parameters c = 16, r = 6, s = 4 and d = 50, which our experiments identified as an optimal choice. We further compare GPU performance with the Cell, using the scheme described in Section 5, and with multi-core implementations using OpenMP and Intel Threading Building Blocks (TBB). For the latter comparison, we

Conclusions

In this paper we developed efficient and scalable architecture-aware techniques for scheduling generalized all-pairs computations on graphics processors. These techniques are based on decomposing the output matrix into tiles, thread blocks, and subtiles, and decomposing the input into dimensional slices. We focus on making the most efficient use of the available memory hierarchies and on minimizing the number of memory transfers. This is crucial for

Acknowledgments

The authors thank Baskar Ganapathysubramanian for providing input data sets from flapping-wing MAV simulations and microstructure samples. The all-pairs computations work on the Cell processor was done previously in collaboration with Jaroslaw Zola. The authors also acknowledge Georgia Institute of Technology, its STI Center of Competence, and the National Science Foundation, for the use of Cell resources that have contributed to this research.

References (25)

  • B. Hendrickson et al., Parallel many-body simulations without all-to-all communication, Journal of Parallel and Distributed Computing (1995)
  • B. Ganapathysubramanian et al., A non-linear dimension reduction methodology for generating data-driven stochastic input models, Journal of Computational Physics (2008)
  • J. Zola, M. Aluru, S. Aluru, Parallel information theory based construction of gene regulatory networks, in:...
  • P. Berkhin, A survey of clustering data mining techniques
  • M. Vikram, A. Baczewzki, B. Shanker, S. Aluru, Parallel accelerated Cartesian expansions for particle dynamics...
  • J. Zola, A. Sarje, S. Aluru, Constructing gene regulatory networks on clusters of Cell processors, in: International...
  • J. Zola et al., Parallel information-theory-based construction of genome-wide gene regulatory networks, IEEE Transactions on Parallel and Distributed Systems (2010)
  • N. Arora, A. Shringarpure, R. Vuduc, Direct N-body kernels for multicore platforms, in: Proceedings of International...
  • A. Wirawan, B. Schmidt, C.K. Kwoh, Pairwise distance matrix computation for multiple sequence alignment on the Cell...
  • S. Barrachina et al., Exploiting the capabilities of modern GPUs for dense matrix computations, Concurrency and Computation: Practice and Experience (2009)
  • D. Chang, A.H. Desoky, M. Ouyang, E.C. Rouchka, Compute pairwise Manhattan distance and Pearson correlation coefficient...
  • A. Sarje et al., Accelerating pairwise computations on Cell processors, IEEE Transactions on Parallel and Distributed Systems (2011)