Elsevier

Parallel Computing

Volume 25, Issue 9, September 1999, Pages 1105-1130

Communication set generations with CSD calculus and expression-rewriting framework

https://doi.org/10.1016/S0167-8191(99)00037-X

Abstract

In this paper, we present a new framework based on expression rewriting and a calculus called common-stride descriptor (CSD) calculus to generate the local enumeration set and the communication set for High Performance Fortran (HPF) programs with “Block-Cyclic” distributions. Our framework is a practical software framework that handles the general cases: the communication set of HPF programs with block-cyclic distributions and two-level (or multiple-level) alignments, multi-dimensional arrays, array intrinsic functions (such as the Transpose operation), and affine indices and axis exchanges in the array subscripts can be calculated in a systematic way with a sound software foundation. Existing works do not report a software framework that solves the problem in such generality. In addition, our expression-rewriting framework is based on a new representation, CSD, which describes the regularity of the access patterns of HPF programs with block-cyclic distributions. We also present a calculus of CSD and show that CSD is closed under intersection and normalization, which helps in calculating the local enumeration and communication sets of HPF programs with block-cyclic distributions. We further use the characteristics of CSD calculus to provide a global-to-local mapping function for multiple-level alignments and block-cyclic distributions. Experimental results show that our software scheme is both easy to implement in practice and efficient.

Introduction

An increasing number of programming languages, such as Fortran D, High Performance Fortran (HPF) [6], and parallel C++ [2], [21], provide distributed arrays to support a global name space on distributed-memory architectures. In these languages, programmers specify how distributed arrays are distributed among processors, and the compiler uses this distribution information to generate the communication code [3], [15], [20] that emulates a shared memory space on distributed-memory architectures. The distribution patterns that programmers can specify in HPF are the “Block” pattern, in which each processor gets a consecutive block of elements; the “Cyclic” distribution, which is a round-robin rotation; the “Collapse” distribution, which means no distribution; and the “Block-Cyclic” distribution, also known as cyclic(k), which is a round-robin rotation in which each processor takes a small block of k elements at a time. The generation of the communication set for HPF programs with block, cyclic, or collapse distributions has been implemented successfully in commercial HPF compilers; however, few commercial HPF compilers have so far provided efficient support for block-cyclic distributions [10]. Nevertheless, the block-cyclic distribution is very important for expressing the “block-scatter” distribution in the design of scalable libraries for linear algebra computation [5]. The lack of commercial compiler support for HPF programs with block-cyclic distributions has recently prompted many research efforts [8], [9], [10], [12], [13], [27] aimed at efficient schemes for this problem.
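The distribution patterns described above can be made concrete with a minimal Python sketch. It assumes 0-based indices and processors numbered from 0 (HPF itself is normally 1-based), and the function names are hypothetical, chosen only for illustration.

```python
def block_owner(i, n, p):
    """BLOCK: each processor gets one consecutive chunk of ceil(n/p) elements."""
    chunk = -(-n // p)  # ceiling division
    return i // chunk


def cyclic_owner(i, p):
    """CYCLIC: elements are dealt out round-robin, one at a time."""
    return i % p


def block_cyclic_owner(i, k, p):
    """CYCLIC(k): blocks of k consecutive elements are dealt out round-robin."""
    return (i // k) % p


if __name__ == "__main__":
    n, p = 12, 3
    print([block_owner(i, n, p) for i in range(n)])         # BLOCK layout
    print([cyclic_owner(i, p) for i in range(n)])           # CYCLIC layout
    print([block_cyclic_owner(i, 2, p) for i in range(n)])  # CYCLIC(2) layout
```

In this sketch, cyclic(1) coincides with the cyclic pattern, and cyclic(k) with k = ceil(n/p) coincides with the block pattern, which is why block-cyclic is the general case.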

While we think the recent research efforts will greatly help the implementation of block-cyclic distributions, we feel that the lack of support in commercial compilers is due in part to the lack of a practical software framework that leads in a straightforward manner to a commercial-strength implementation of communication set generation for block-cyclic distributions in the general case. The general case of block-cyclic distributions involves at least multiple-level alignments, multi-dimensional arrays, array intrinsic functions (such as the Transpose operation) [1], and affine indices and axis exchanges in the array subscripts. We show a typical example of such a case below in Code Fragment 1:

[HPF Code Fragment 1 (listing not available in this extract)]

In the code above, arrays A and B are first aligned to template arrays T1 and T2, respectively. T1 and T2 are then aligned to template arrays T3 and T4, respectively. Next, T3 and T4 are distributed among processors with a block-cyclic distribution. It is therefore a program with multiple-level alignments. The functions f_i and g_i are all affine functions used in the array subscripts, and F is an array function or array intrinsic function (such as the Transpose operation). In addition, as we focus on static, compile-time solutions, we assume in this paper that the number of processors, the array bounds, and the block and cyclic parameters of the HPF program are compile-time constants. In our work, we present a software framework based on expression rewriting and a calculus called common-stride descriptor (CSD) calculus, with the aim of enabling commercial-strength implementations for HPF programs with block-cyclic distributions and multiple-level alignments, multi-dimensional arrays, array intrinsic functions (such as the Transpose operation) [1], and affine indices and axis exchanges in the array subscripts.
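To illustrate what a two-level alignment followed by a block-cyclic distribution means for ownership, the following Python sketch composes two affine alignment functions and then applies a cyclic(k) owner function. It is a one-dimensional illustration only: the class `Affine`, the coefficients, and the parameters k and p are hypothetical and are not taken from Code Fragment 1, and indices are assumed to be 0-based.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Affine:
    """An affine index map i -> a*i + b (hypothetical helper)."""
    a: int
    b: int

    def __call__(self, i):
        return self.a * i + self.b

    def then(self, outer):
        """Compose: apply self first, then outer. The result is again affine."""
        return Affine(outer.a * self.a, outer.a * self.b + outer.b)


def block_cyclic_owner(i, k, p):
    """Owner of template cell i under cyclic(k) on p processors (0-based)."""
    return (i // k) % p


# Assumed example: A(i) is aligned with T1(2*i + 1), T1(j) with T3(3*j + 2),
# and T3 is distributed cyclic(4) over 3 processors.
a_to_t1 = Affine(2, 1)
t1_to_t3 = Affine(3, 2)
a_to_t3 = a_to_t1.then(t1_to_t3)  # single affine map from A to T3


def owner_of_element(i, k=4, p=3):
    """Processor that owns A(i) after both alignment levels."""
    return block_cyclic_owner(a_to_t3(i), k, p)


if __name__ == "__main__":
    print(a_to_t3)
    print([owner_of_element(i) for i in range(8)])
```

Because affine maps compose into a single affine map, multiple alignment levels collapse to one in this one-dimensional sketch; the general multi-dimensional case with axis exchanges and intrinsic functions is what the paper's framework addresses.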

Our expression-rewriting framework is based on a new representation called CSD, which represents the set of accessed elements in HPF programs with block-cyclic distributions. Previously, the research community used the regular section descriptor (RSD) to describe the set of accessed elements in HPF programs with block or cyclic distribution patterns. RSD is not sufficient to represent the elements accessed under block-cyclic distributions. Later, researchers [24] tried to use the block-cyclic section descriptor (BSD) to represent the accessed elements of block-cyclic distributions. Unfortunately, BSD is not closed under intersection or index normalization, so it is not appropriate for calculating the communication sets of HPF programs with block-cyclic distributions. CSD, in contrast, is a new representation that we propose to describe the regularity of the access patterns of HPF programs with block-cyclic distributions. We will present a calculus of CSD and show that CSD is closed under intersection and normalization.
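The following Python sketch illustrates two of the points above for plain regular sections only; it does not implement CSD or BSD, whose definitions are not reproduced in this excerpt. It assumes a section is a triple (lower, upper, stride) with inclusive bounds and 0-based indices, and the function names are hypothetical. It shows that the intersection of two regular sections is again a regular section, and that the elements owned by one processor under cyclic(k) are not a single regular section but a union of k sections that share the stride p*k.

```python
from math import gcd


def intersect_rsd(r1, r2):
    """Intersect two sections (lower, upper, stride); return a section or None."""
    (l1, u1, s1), (l2, u2, s2) = r1, r2
    g = gcd(s1, s2)
    if (l2 - l1) % g != 0:
        return None  # the two progressions never meet
    lcm = s1 // g * s2
    m = s2 // g
    # Solve l1 + s1*x == l2 (mod s2) for x in [0, m); needs Python 3.8+ for pow(.., -1, m).
    x = ((l2 - l1) // g * pow(s1 // g, -1, m)) % m if m > 1 else 0
    first = l1 + s1 * x
    lo, hi = max(l1, l2), min(u1, u2)
    if first < lo:
        first += -(-(lo - first) // lcm) * lcm  # round up to the next common element
    if first > hi:
        return None
    last = first + (hi - first) // lcm * lcm
    return (first, last, lcm)


def owned_sections(q, k, p, n):
    """Elements of 0..n-1 owned by processor q under cyclic(k): k sections of stride p*k."""
    return [(q * k + j, n - 1, p * k) for j in range(k)]


def elements(section):
    lower, upper, stride = section
    return list(range(lower, upper + 1, stride))


if __name__ == "__main__":
    print(intersect_rsd((0, 30, 4), (2, 30, 6)))
    print(sorted(e for s in owned_sections(0, 2, 3, 14) for e in elements(s)))
```

In the second call, processor 0 owns 0, 1, 6, 7, 12, 13, which has no single stride; this is the kind of pattern for which a descriptor richer than RSD is needed.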

Note that the closure of CSD under intersection and normalization makes it easy to calculate the local enumeration set and the communication set for block-cyclic distributions. With the help of CSD calculus, we present a five-step framework in which mechanical arithmetic rewriting accompanies the CSD calculus, so that the communication set of HPF programs with block-cyclic distributions and two-level (or multiple-level) alignments and axis alignments, multi-dimensional arrays, array intrinsic functions (such as the Transpose operation), and affine indices and axis exchanges in the array subscripts can be calculated in a straightforward way with a sound software foundation. Previously, when dealing with multi-dimensional cases, existing works used a dimension-wise scheme that calculates the communication set for each dimension separately. They do not, however, address the complicated software issues that arise when multiple-level alignments, axis alignments, and data movements between different dimensions are involved. Our five-step framework performs mechanical arithmetic rewriting to calculate communication sets in such complicated cases. In addition, we use the characteristics of CSD calculus to provide a global-to-local mapping function for multiple-level alignments and block-cyclic distributions. Our software framework is currently being incorporated into an experimental HPF II compiler, which is under development at our university as a research prototype for exploring compiler technologies [4], [7], [14], [15], [22], [23] for distributed environments. Preliminary experimental results with our CSD calculus and expression-rewriting framework show that the framework is both easy to implement in practice, even for the very general cases, and efficient.

The remainder of the paper is organized as follows. Section 2 describes the CSD calculus, which lays the foundation of our framework. Section 3 presents our five-step expression-rewriting framework for generating communication sets for HPF programs with block-cyclic distributions. Section 4 presents a global-to-local mapping mechanism for HPF programs with block-cyclic distributions and multiple-level alignments; this mechanism is important for code generation. Finally, Section 5 gives the experimental results, and Section 6 discusses related work and concludes the paper.

Section snippets

CSD calculus

In this section, we give definitions for the RSD, BSD, and CSD representations. These representations serve as the basis for our proposed five-step framework (described in Section 3), which uses mechanical arithmetic rewriting to generate the communication sets of HPF programs for block-cyclic distributions with multiple-level alignments, multi-dimensional arrays, and affine indices and axis exchanges in the array subscripts.

RSD describes the set of access elements in HPF programs with Block

Expression-rewritings to generate communication sets

In this section, we present our proposed five-step framework with mechanical arithmetic-rewriting rules to accompany the CSD calculus so that the communication set of HPF programs of block-cyclic distributions with two-level alignments (or multiple-level alignments), multi-dimensional arrays, array intrinsic functions (such as Transpose operation), and affine indices in the array subscripts, can be calculated in a systematic way. The process is guided by the owner computes rule of [6]. The
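As a point of reference for what a communication set is under the owner-computes rule, the following Python sketch enumerates it by brute force for a one-dimensional statement A(i) = B(a*i + b). It is not the paper's five-step rewriting framework, which derives these sets symbolically; the function names, the 0-based indexing, and the assumption that A and B are both distributed cyclic(k) over the same p processors without alignment are all illustrative choices.

```python
def owner(i, k, p):
    """Owner of index i under cyclic(k) on p processors (0-based)."""
    return (i // k) % p


def comm_sets(n, a, b, k_a, k_b, p):
    """For A(i) = B(a*i + b), i in 0..n-1, return {(sender, receiver): [B indices]}.

    Assumes B has at least a*(n-1) + b + 1 elements.
    """
    sets = {}
    for i in range(n):
        j = a * i + b
        receiver = owner(i, k_a, p)  # the owner of A(i) performs the assignment
        sender = owner(j, k_b, p)    # the owner of B(j) holds the operand
        if sender != receiver:
            sets.setdefault((sender, receiver), []).append(j)
    return sets


if __name__ == "__main__":
    # A(i) = B(i + 1), both arrays cyclic(2) over 2 processors
    print(comm_sets(n=8, a=1, b=1, k_a=2, k_b=2, p=2))
```

An enumeration of this kind costs time proportional to the iteration space on every run, so it is mainly useful as a test oracle against which a symbolic method can be checked.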

Local storage scheme and address translation

In Section 3.3, we present the scheme to calculate global indices of the communication set. However, if we want to have the codes generated for HPF programs to actually work at runtime, we still need a local storage scheme to represent several discontinuous regions of a distributed array on a processor, and a mapping mechanism to map a global index to a local index on a processor. In this section, we utilize the characteristics of CSD calculus to develop a local storage packing scheme and a
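For a single array distributed cyclic(k) over p processors with no alignment, a commonly used packed local layout gives a simple closed-form global-to-local mapping, sketched below in Python. This is the standard single-level formula under 0-based indexing, shown only to make the problem concrete; it is not the paper's CSD-based scheme for multiple-level alignments, and the function names are hypothetical.

```python
def global_to_local(g, k, p):
    """Return (owner, local index) of global index g under cyclic(k) on p processors."""
    owner = (g // k) % p
    # Number of complete rounds before g, times block size, plus offset inside the block.
    local = (g // (k * p)) * k + g % k
    return owner, local


def local_to_global(owner, local, k, p):
    """Inverse mapping: recover the global index from (owner, local index)."""
    return (local // k) * k * p + owner * k + local % k


if __name__ == "__main__":
    k, p = 2, 3
    for g in range(12):
        print(g, global_to_local(g, k, p))
```

With this layout, each processor stores its blocks contiguously, so the discontiguous global regions it owns occupy a dense local index range.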

Experimental results

We have implemented our expression-rewriting framework and CSD calculus to generate the communication code for HPF programs. Our software framework includes (1) the definitions of the data structures for RSD, BSD, and CSD, (2) intersection and normalization of RSD, BSD, and CSD, and (3) the proposed expression-rewriting framework. The software is currently being incorporated into an experimental HPF II compiler which is under development at Tsing-Hua University as a research prototype to explore

Related work and conclusion

The problem of handling array statements distributed with block-cyclic distribution patterns was first addressed by Chatterjee et al. [8], who described a method based on a finite-state machine for enumerating local indices in increasing order. In [10], [28], linear algorithms based on an integer lattice method are given. Their work mainly focused on local set enumeration and did not discuss the generation of communication sets in any detail. Samuel et al. [26] solved the same

Acknowledgements

The software with integer lattice used as a base code for comparison in the experiments comes from Rice University [10]. We gratefully acknowledge their efforts.

References (29)

  • Z. Bozkus et al.

    Compiling Fortran 90D/HPF for distributed memory MIMD computers

    Journal of Parallel and Distributed Computing

    (1994)
  • S.K.S. Gupta et al.

    On compiling array expressions for efficient execution on distributed memory machines

    Journal of Parallel and Distributed Computing

    (1996)
  • J.K. Lee et al.

    Parallel array object I/O support on distributed environments

    Journal of Parallel and Distributed Computing

    (1997)
  • J.C. Adams, W.S. Brainerd, J.T. Martin, B.T. Smith, J.L. Wagener, Fortran 90 Handbook Complete Ansi/Iso Reference,...
  • F. Bodin, P. Beckman, D. Gannon, S. Narayana, S. Yang, Distributed pC++: Basic ideas for an object parallel language,...
  • T.-R. Chuang, R.-G. Chang, J.K. Lee, Sampling and analytical techniques for data distribution of parallel sparse...
  • J. Dongarra, R. van de Geijn, D. Walker, A look at scalable dense linear algebra libraries in: Proceedings of Scalable...
  • C. Koelbel et al.

    The High Performance Fortran Handbook

    (1994)
  • R.-G. Chang, T.-R. Chuang, J. Kuen Lee, Efficient support of parallel sparse computation for array intrinsic functions...
  • S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, S. Teng, Generating local addresses and communication sets for data...
  • S. Hiranandani, K. Kennedy, J. Mellor-Crummey, A. Sethi, Compilation techniques for block-cyclic distributions, in:...
  • K. Kennedy, N. Nedeljkovic, A. Sethi, A linear-time algorithm for computing the memory access sequence in data-parallel...
  • K. Kennedy, N. Nedeljkovic, A. Sethi, Communication generation for cyclic(k) distributions, in: Proceedings of the...
  • H.J. Sips, K. van Reeuwijk, W. Denissen, Analysis of local enumeration and storage schemes in HPF, in: Proceedings of...

    A preliminary version [19] of this work appeared in the Proceedings of the International Parallel Processing Symposium (IPPS/SPDP) Orlando, March 30–April 3, 1998. This was supported in part by NSC of Taiwan under grant No. NSC85-2213-E-007-050 and NSC85-2221-E-007-031.
