Generalized multipartitioning of multi-dimensional arrays for parallelizing line-sweep computations

https://doi.org/10.1016/S0743-7315(03)00103-5

Abstract

Multipartitioning is a strategy for decomposing multi-dimensional arrays into tiles and mapping the resulting tiles onto a collection of processors. This class of partitionings enables efficient parallelization of “line-sweep” computations that solve one-dimensional recurrences along each dimension of a multi-dimensional array. Multipartitionings yield balanced parallelism for line sweeps by assigning each processor the same number of data tiles to compute at each step of a sweep along any array dimension. Also, they induce only coarse-grain communication.

This paper considers the problem of computing generalized multipartitionings, which decompose d-dimensional arrays, d⩾2, onto an arbitrary number of processors. We describe an algorithm that computes an optimal multipartitioning onto all of the processors for this general case. We use a cost model to select the dimensionality of the best partitioning and the number of cuts to make along each array dimension; then, we show how to construct a mapping that assigns the resulting data tiles to each of the processors. The assignment of tiles to processors induced by this class of multipartitionings corresponds to an instance of a latin hyper-rectangle, a natural extension of latin squares, which have been widely studied in mathematics and statistics.

Finally, we describe how we extended the Rice dHPF compiler for High Performance Fortran to generate code that employs our strategy for generalized multipartitioning and show that the compiler's generated code for the NAS SP computational fluid dynamics benchmark achieves scalable high performance.

Introduction

Line sweeps are used to solve one-dimensional recurrences along a dimension of a multi-dimensional hyper-rectangular domain. Line sweeps are at the heart of a variety of numerical methods and solution techniques [22]. One important computational method based on line sweeps is Alternating Direction Implicit (ADI) integration—a widely-used numerical technique for solving partial differential equations such as the Navier–Stokes equation [5], [6], [20], [22]. Fig. 1 illustrates an ADI integration computation over a three-dimensional data volume. In each time step, the computation alternately sweeps forward and backward over each spatial dimension of the data domain to solve a system of one-dimensional recurrences. The callout in the figure shows a code fragment that solves recurrences among elements in the first dimension of rhs, which we call x. In this fragment, the i loop iterates over the first dimension of rhs in reverse order. It computes a new value for each rhs(i,j,k,m) element from its prior value, the values of its two successors along the i dimension, and values in lhs. Since each element of rhs is computed from values of its successors along the i dimension, elements must be computed sequentially in reverse order along this dimension; accordingly, we say that this loop sweeps along the x dimension in reverse index order.
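
To make the dependence pattern concrete, the sketch below (in Python/NumPy rather than the Fortran of the benchmark) performs an analogous backward sweep along the first dimension of a three-dimensional array. The coefficient arrays a and b are hypothetical stand-ins for the lhs values used by the real kernel, and the second-order recurrence is chosen only to mirror the two-successor dependence described above.

```python
import numpy as np

def reverse_sweep_x(rhs, a, b):
    """Backward line sweep along the first (x) dimension of a 3-D array.

    Each rhs[i] plane is updated from its prior value and the values of its
    two already-computed successors rhs[i + 1] and rhs[i + 2], so the i loop
    must run sequentially in reverse index order.
    """
    nx = rhs.shape[0]
    for i in range(nx - 3, -1, -1):        # reverse order along x
        rhs[i] = rhs[i] - a[i] * rhs[i + 1] - b[i] * rhs[i + 2]
    return rhs

# Small example domain; a and b stand in for the lhs coefficients.
rhs = np.random.rand(8, 4, 4)
a = np.random.rand(8, 4, 4)
b = np.random.rand(8, 4, 4)
reverse_sweep_x(rhs, a, b)
```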

Parallelizing computations based on line sweeps is important because such computations are of practical interest and they are computationally intensive. However, parallelizing multi-dimensional line sweep computations is difficult because recurrences serialize each sweep, as we just described. Using standard block partitionings, which assign a single hyper-rectangular volume of data to each processor, there are two reasonable parallelization strategies:

  • A static block unipartitioning partitions one of the array dimensions for the entire computation. With this partitioning, a line-sweep along any unpartitioned dimension can be performed independently by each processor in parallel. Achieving parallelism for a line sweep along the partitioned dimension, however, requires a wavefront parallelization. Fig. 2 shows a line sweep computation over a three-dimensional data volume using a static block unipartitioning. The dark solid lines show the partitioning of the data volume into three vertical slabs, one for each of three processors. Sweeps along the y and z dimensions do not cross partition boundaries and thus can be performed by each processor independently in parallel. The figure also shows a reverse sweep along the partitioned x dimension, as would be required for the loop nest shown in the callout of Fig. 1. The lines in the front plane represent the tiling used for a wavefront computation along x. The short arcs represent messages between processors in the course of the reverse sweep along the x dimension. A forward sweep along x (not shown) is similar, except that the direction of the communication arcs would be reversed. In wavefront computations, there is a tension between using small tiles and messages to maximize parallelism by minimizing the length of pipeline fill and drain phases, and using larger tiles and messages to minimize communication overhead in the computation's steady state when the pipeline is full.

  • A dynamic block partitioning involves partitioning some subset of the dimensions, with each processor performing line sweeps in all unpartitioned dimensions independently in parallel, and then transposing the data as necessary so that, in turn, each of the remaining sweeps can be performed independently in parallel as well. Fig. 3 shows a line sweep computation over a three-dimensional data volume using a one-dimensional dynamic block partitioning. Initially, the data volume is decomposed into horizontal slabs, one for each of three processors. With this partitioning, each processor can perform its x and z sweeps independently. A transpose to a partitioning consisting of vertical slabs then enables independent sweeps over y. A final transpose returns the data to its original partitioning in preparation for the next sweeps over x and z. While a dynamic block partitioning achieves better efficiency during a parallel sweep over an unpartitioned dimension than a static block partitioning achieves during a wavefront sweep over a partitioned dimension, the cost of the data transposes it requires can be substantial. A sequential sketch of this transpose-based strategy appears after this list.
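
The following sequential sketch illustrates the transpose-based strategy of the second bullet for a three-dimensional domain. The sweep body is a stand-in first-order recurrence, the assumption that the domain is initially cut into y-slabs matches Fig. 3, and the NumPy transpose stands in for the all-to-all exchange a distributed (e.g., MPI) implementation would perform.

```python
import numpy as np

def sweep(u, axis):
    """Stand-in for a line sweep: a sequential first-order recurrence along
    one axis.  In a distributed code each processor would run this only on
    its local slab."""
    v = np.moveaxis(u, axis, 0)            # view with the swept axis first
    for i in range(1, v.shape[0]):
        v[i] += 0.5 * v[i - 1]
    return u

def adi_step_dynamic(u):
    """One time step under a dynamic block partitioning of a 3-D domain.

    The domain is assumed to be partitioned into horizontal slabs along y
    (axis 1), so the x and z sweeps are local; the transposes mark where a
    parallel code would repartition the data so that the y sweep becomes
    local as well.
    """
    u = sweep(u, axis=0)                   # x sweep, local under y-slabs
    u = sweep(u, axis=2)                   # z sweep, local under y-slabs
    u = u.transpose(1, 0, 2).copy()        # repartition: y becomes local
    u = sweep(u, axis=0)                   # y sweep (now the leading axis)
    return u.transpose(1, 0, 2)            # transpose back for the next step

u = np.random.rand(6, 6, 6)
u = adi_step_dynamic(u)
```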

To support better parallelization of line sweep computations over multi-dimensional arrays, a third, more sophisticated strategy for partitioning data and computation known as multipartitioning was developed [5], [6], [20], [22]. This strategy partitions arrays of d⩾2 dimensions among a set of processors so that all processors are active in every step of a line sweep computation along any array dimension, load balance is nearly perfect, and only coarse-grain communication is needed. A multipartitioning achieves full parallelism and load balance by (1) assigning each processor a balanced number of tiles in each hyper-rectangular slab defined by a pair of adjacent cuts along a partitioned data dimension and (2) ensuring that for all tiles mapped to a processor, their immediate tile neighbors in any one coordinate direction are all mapped to some other single processor. We refer to these two properties as the balance property and the neighbor property, respectively. The utility of the balance property is obvious: it enables even load balance. The neighbor property is useful because it enables a fully-vectorized, directional-shift communication to be accomplished with one message per processor. A study by van der Wijngaart [25] of hand-coded parallelizations of ADI integration found that three-dimensional multipartitionings yield better performance than static block or dynamic block partitionings.
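
Stated operationally, both properties can be checked directly on an explicit tile-to-processor map. The short sketch below does so; the function name, the tile_map array, and the final latin-square example are ours, for illustration only.

```python
import numpy as np

def check_multipartitioning(tile_map, p):
    """Check the balance and neighbor properties of a tile-to-processor map.

    tile_map is a d-dimensional integer array in which tile_map[t] names the
    processor that owns tile t; p is the number of processors.
    """
    d = tile_map.ndim
    for axis in range(d):
        # Balance: every slab of tiles orthogonal to `axis` must contain the
        # same number of tiles of every processor.
        for s in range(tile_map.shape[axis]):
            counts = np.bincount(np.take(tile_map, s, axis=axis).ravel(),
                                 minlength=p)
            if counts.min() != counts.max():
                return False
        # Neighbor: the immediate successors along `axis` of all tiles owned
        # by a given processor must belong to a single processor.
        head = [slice(None)] * d
        tail = [slice(None)] * d
        head[axis] = slice(0, -1)          # tiles that have a successor
        tail[axis] = slice(1, None)        # their immediate successors
        owners, successors = tile_map[tuple(head)], tile_map[tuple(tail)]
        for proc in range(p):
            if np.unique(successors[owners == proc]).size > 1:
                return False
    return True

# Example: the classic 2-D diagonal layout (i + j) mod p is a latin square
# and satisfies both properties.
i, j = np.indices((4, 4))
print(check_multipartitioning((i + j) % 4, 4))   # True
```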

Fig. 4 shows a multipartitioning of a three-dimensional data volume onto 16 processors; the number in each data tile indicates the processor onto which that tile is mapped. The partitioning exhibits the balance property because each xy, yz or xz plane contains one data tile for each processor. One manifestation of the neighbor property in the figure is that processor 4 owns each data tile to the right of data tiles owned by processor 0. Similar relationships hold along each coordinate direction for each processor's tiles. When performing a line sweep computation along any coordinate dimension using this partitioning, computation within each plane of tiles orthogonal to the sweep direction is fully parallel. When computation within a plane is complete, each processor sends a message to its neighbor in the next plane and then the (fully parallel) computation in the next plane can begin.

All of the multipartitionings described in the literature to date consider only one tile per processor per hyper-rectangular slab along a partitioned dimension. The most broadly applicable of the multipartitioning strategies in the literature is known as diagonal multipartitioning. In two dimensions, these partitionings can be performed on any number of processors, p; however, in three dimensions they are only useful if p is a perfect square. The multipartitioning shown in Fig. 4 is a diagonal multipartitioning on 4² = 16 processors. We consider the general problem of computing optimal multipartitionings for d-dimensional data volumes onto an arbitrary number of processors.
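
For reference, one standard formulation of the three-dimensional diagonal construction is sketched below; the processor numbering is ours and need not match the labels in Fig. 4, but the resulting layout has the same balance and neighbor properties.

```python
import numpy as np

def diagonal_multipartitioning_3d(q):
    """3-D diagonal multipartitioning for p = q**2 processors.

    The array is cut into a q x q x q grid of tiles, and tile (i, j, k) is
    assigned to processor ((i + k) mod q) * q + ((j + k) mod q).  Every slab
    of fixed i, j, or k then contains each processor exactly once (balance),
    and all tiles of a processor have their successors in a given coordinate
    direction on a single other processor (neighbor property).
    """
    i, j, k = np.indices((q, q, q))
    return ((i + k) % q) * q + ((j + k) % q)

tile_map = diagonal_multipartitioning_3d(4)   # 16 processors, as in Fig. 4
# Sanity check: each of the 4 slabs of fixed k holds all 16 processors.
print(all(np.unique(tile_map[:, :, k]).size == 16 for k in range(4)))   # True
```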

In the next section, we describe prior work in multipartitioning and define generalized multipartitioning. In the subsequent sections, we present our strategy for constructing generalized multipartitionings. This strategy has two principal parts: (1) a cost-model-driven algorithm for choosing the dimensionality and number of tiles along each dimension of an optimal multipartitioning, and (2) an algorithm for computing a mapping of data tiles to processors. We present a proof and complexity analysis of the partitioning algorithm and a constructive proof for the tile-to-processor mapping. Finally, we describe a prototype implementation of generalized multipartitioning in the Rice dHPF compiler for High Performance Fortran. We report performance results obtained using it to parallelize the NAS SP computational fluid dynamics benchmark.

Section snippets

Background

In this section, we first describe prior constructions for multipartitionings. A limitation of these constructions is that they cannot accommodate an arbitrary number of processors when more than two dimensions are partitioned. Then, we define what we call generalized multipartitionings. This class of partitionings enables a d-dimensional array to be decomposed along an arbitrary number of dimensions ρ, 2⩽ρ⩽d, onto an arbitrary number of processors. Later sections show how to constructively build such partitionings.

Finding an optimal partitioning

We now develop a strategy for determining an optimal set of cuts to decompose a multi-dimensional array onto an arbitrary number of processors. First, we define an objective function that serves as the basis for evaluating the relative cost of partitionings. Second, we cast partition selection as an optimization problem. Third, we discuss necessary properties of an optimal partitioning. Fourth, we present an algorithm for selecting an optimal partitioning by exhaustive enumeration of elementary…
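
To give the flavor of this selection step, the sketch below enumerates candidate tile grids (γ1, ..., γd) by brute force, filters them with the feasibility condition implied by the balance property (p must divide the number of tiles in every slab), and keeps the cheapest grid under a deliberately simplified cost. Both the search bound and the cost function are placeholders for the paper's cost model and its pruned, polynomial-time enumeration.

```python
from itertools import product

def feasible(gammas, p):
    """Balance requires every slab orthogonal to dimension i, which contains
    prod(gamma_j for j != i) tiles, to hold the same number of tiles of each
    of the p processors; hence p must divide that product."""
    total = 1
    for g in gammas:
        total *= g
    return all((total // g) % p == 0 for g in gammas)

def cost(gammas):
    """Placeholder objective: communication phases accumulated over one sweep
    along each dimension.  The paper's model is richer (it also accounts for
    message volume and per-message overhead)."""
    return sum(g - 1 for g in gammas)

def choose_partitioning(p, d, bound):
    """Brute-force selection of a cheapest feasible tile grid with at most
    `bound` tiles per dimension; the paper's algorithm finds the optimum
    without this exponential search."""
    candidates = (g for g in product(range(1, bound + 1), repeat=d)
                  if feasible(g, p))
    return min(candidates, key=cost)

print(choose_partitioning(p=16, d=3, bound=16))   # -> (4, 4, 4)
```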

Finding a tile-to-processor mapping

In Section 3, we described how, given a number of processors and the dimensionality of an array, we compute an array partitioning that minimizes an objective function. A partitioning specifies an array (of tiles) whose size is γ for which the objective is minimized. So far, we have assumed that a…
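
Although the snippet above is cut short, the flavor of the mapping step can be shown with a simple modular mapping. The sketch below is a hedged special case of the constructions the paper builds on, not its general algorithm, and the 6 × 3 grid on three processors is a hypothetical example of a latin rectangle in which some slabs hold two tiles per processor.

```python
import numpy as np

def modular_tile_map(gammas, p):
    """Map tile (i_1, ..., i_d) to processor (i_1 + ... + i_d) mod p.

    This modular mapping yields a valid multipartitioning whenever, for every
    dimension k, at least one of the other grid extents gamma_j (j != k) is a
    multiple of p -- e.g. the square and cubic grids of diagonal
    multipartitioning.  The paper's constructive algorithm covers every grid
    that satisfies the balance feasibility condition, not just these.
    """
    return np.indices(gammas).sum(axis=0) % p

# A latin rectangle on p = 3 processors: in the 6 x 3 tile grid, every row
# slab holds each processor once and every column slab holds each twice.
print(modular_tile_map((6, 3), 3))
```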

Experiments

We have incorporated support for generalized multipartitionings in the Rice dHPF compiler and run-time system for High Performance Fortran. In this section, we briefly describe how we extended HPF to support generalized multipartitionings, the dHPF compiler's support for multipartitioning, and performance measurements of dHPF-generated code using generalized multipartitionings.

Fig. 9 shows two four-dimensional arrays x and y multipartitioned using our extensions to HPF's data layout directives.

Conclusions

This paper describes an algorithm for computing an optimal multipartitioning of d-dimensional arrays, d⩾2, onto an arbitrary number of processors, p. Our strategy first selects how many dimensions to partition and the number of cuts along each partitioned dimension to minimize the communication cost in line sweep computations. Then, it computes a tile-to-processor mapping to complete the partitioning. In practice, partition selection is very fast (on the order of milliseconds) even for large…

References (25)

  • J. Bruno et al., Implementing the 3D alternating direction method on the hypercube, Journal of Parallel and Distributed Computing (December 1994).
  • A. Darte, Regular partitioning for synthesizing fixed-size systolic arrays, INTEGRATION, the VLSI Journal (1991).
  • V. Adve, G. Jin, J. Mellor-Crummey, Q. Yi, High performance Fortran compilation techniques for parallelizing scientific...
  • V. Adve, J. Mellor-Crummey, Using integer sets for data-parallel program analysis and optimization, in: Proceedings of...
  • G.E. Andrews, Theory of Partitions (1976).
  • D. Bailey, T. Harris, W. Saphir, R. van der Wijngaart, A. Woo, M. Yarrow, The NAS parallel benchmarks 2.0, Technical...
  • J. Bruno, P. Cappello, Implementing the beam and warming method on the hypercube, in: Proceedings of the Third...
  • D. Chavarría-Miranda, A. Darte, R. Fowler, J. Mellor-Crummey, Efficient parallelization of line-sweep computations,...
  • D. Chavarría-Miranda, J. Mellor-Crummey, Towards compiler support for scalable parallelism, in: Proceedings of the...
  • D. Chavarría-Miranda, J. Mellor-Crummey, T. Sarang, Data-parallel compiler support for multipartitioning, in: European...
  • A. Darte et al., A characterization of one-to-one modular mappings, Parallel Processing Letters (1996).
  • A. Darte, R. Schreiber, B.R. Rau, F. Vivien, A constructive solution to the juggling problem in systolic array...

Alain Darte is a researcher with the French National Council for Scientific Research (CNRS). His education includes an Agrégation de Mathématiques in 1992 and a Ph.D. degree in computer science in 1993 from the École normale supérieure de Lyon, France. His main scientific interests are in mathematical tools, automatic program transformations, and optimizations for parallelizing compilers and for compiler-based tools used to automatically synthesize hardware accelerators.

John Mellor-Crummey is a Senior Faculty Fellow in the Department of Computer Science and Deputy Director of the Center for High Performance Software Research at Rice University. His education includes a B.S.E. degree magna cum laude in electrical engineering and computer science from Princeton University in 1984, and M.S. (1986) and Ph.D. (1989) degrees in computer science from the University of Rochester. His principal research interests are in the area of software technology for parallel and high performance computing, including compilers, tools and run-time libraries.

Robert J. Fowler is a Senior Research Scientist and Associate Director of the Center for High Performance Software Research at Rice University. His education includes an A.B. in Physics from Harvard (1971) and M.S. (1981) and Ph.D. (1985) degrees from the University of Washington. His research interests are in the area of high-performance distributed and parallel computing. Specific interests include compilers and programming environments, architectures, operating systems, performance evaluation, and simulation.

Daniel Chavarría-Miranda received the B.S. degree in computer science with honors from the University of Costa Rica in 1994, and the M.S. degree in computer science with honors from the Instituto Tecnológico de Monterrey (ITESM) in 1998. Currently, he is a Ph.D. candidate at Rice University's Department of Computer Science, where he expects to receive his degree in December 2003. His main area of research is compiler and runtime support for data-parallel languages.

¹ This work was performed while a visiting scholar at Rice University.
