Optimal systolic array algorithms for tensor product

https://doi.org/10.1016/j.amc.2004.09.045

Abstract

In this paper we examine the computational complexity of optimal systolic array algorithms for tensor product. We provide a time-minimal schedule that meets the computed processor lower and upper bounds, including one for tensor product. Our principal contribution is an algorithm, and its implementation, that derives processor lower and upper bounds from a generating function, together with the resulting processor-time-minimal schedule.

Introduction

Systolic arrays are important for application-specific systems; they perform well because they are highly concurrent [54], [60]. Their design cost is low because, although they comprise a large number of processors, the processors are of only one type (or a small fixed number of types). Interprocessor communication is between nearest neighbors, and hence has short latency. Nearest-neighbor connections also imply that the VLSI circuit is devoted primarily to the processors—very little to their interconnection. As a consequence, systolic arrays are ideally suited to VLSI technology [22]. Minimizing the amount of time and the number of processors needed to perform an application reduces the application’s fabrication and operation costs. This can be important in applications where minimizing the size and energy of the hardware is critical, such as space applications and consumer electronics [16].

We consider computations suitable for systolic arrays, often called regular array computations or systems of uniform recurrence equations [27], [31], [33]. Parallel execution of uniform recurrence equations has been studied extensively, from at least as far back as 1966 (e.g., [26], [32], [30], [8], [38], [10], [11], [39], [40], [14]). In such computations, the tasks to be computed are viewed as the nodes of a directed acyclic graph (dag), where the data dependencies are represented as arcs. Given a dag G = (N, A), a multiprocessor schedule assigns node v for processing during step τ(v) on processor π(v). A valid multiprocessor schedule is subject to two constraints:

Causality: A node can be computed only when its children have been computed at previous steps: (u, v) ∈ A ⇒ τ(u) < τ(v).

Non-conflict: A processor cannot compute two different nodes during the same time step: τ(v) = τ(u) ⇒ π(v) ≠ π(u).
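To make the two constraints concrete, the following minimal sketch (our own illustration, with a toy dag and a toy schedule of our choosing, not an artifact of the paper) checks a candidate pair (τ, π) for validity:

```python
# A minimal validity check for a multiprocessor schedule (tau, pi) on a dag.
# The dag, tau, and pi below are toy examples chosen for illustration.

def is_valid_schedule(arcs, tau, pi):
    # Causality: (u, v) in A  =>  tau(u) < tau(v).
    causal = all(tau[u] < tau[v] for (u, v) in arcs)
    # Non-conflict: tau(u) == tau(v)  =>  pi(u) != pi(v), for distinct nodes.
    nodes = list(tau)
    conflict_free = all(
        pi[u] != pi[v]
        for i, u in enumerate(nodes) for v in nodes[i + 1:]
        if tau[u] == tau[v]
    )
    return causal and conflict_free

# Toy 2 x 2 mesh dag: node (i, j) depends on (i-1, j) and (i, j-1).
arcs = [((0, 0), (0, 1)), ((0, 0), (1, 0)), ((0, 1), (1, 1)), ((1, 0), (1, 1))]
tau = {(i, j): i + j for i in range(2) for j in range(2)}  # a linear schedule
pi = {(i, j): i for i in range(2) for j in range(2)}       # projection onto i
print(is_valid_schedule(arcs, tau, pi))  # True
```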

In what follows, we refer to a valid schedule simply as a schedule. A time-minimal schedule for an algorithm completes in S steps, where S is the number of nodes on a longest path in the dag. A time-minimal schedule exposes an algorithm’s maximum parallelism: it bounds the number of time steps the algorithm needs when infinitely many processors may be used. A processor-time-minimal schedule is a time-minimal schedule that uses as few processors as any time-minimal schedule for the algorithm. Although it is only one of many performance measures, processor-time-minimality is useful because it measures the minimum number of processors needed to extract the maximum parallelism from a dag. Being machine-independent, it is a more fundamental measure than those that depend on a particular machine or architecture. This view prompted several researchers to investigate processor-time-minimal schedules for families of dags. Processor-time-minimal systolic arrays are easy to devise in an ad hoc manner for 2D systolic algorithms; this apparently is not the case for 3D systolic algorithms, and there have been several publications regarding processor-time-minimal systolic arrays for fundamental 3D algorithms. Processor-time-minimal schedules for various fundamental problems have been proposed in the literature: Scheiman and Cappello [9], [5], [48], [45] examine the dag family for matrix product; Louka and Tchuente [34] examine the dag family for Gauss–Jordan elimination; Scheiman and Cappello [46], [47] examine the dag family for transitive closure; Benaini, Robert, Trystram, and Rote [2], [3], [4], [41], [42] examine the dag families for the algebraic path problem and Gaussian elimination. Each of the algorithms listed in Table 1 has the property that, in its dag representation, every node is on a longest path. Therefore, for each algorithm listed, its free schedule (in which every node is computed as early as its dependencies allow) is its only time-minimal schedule. A processor lower bound for achieving a time-minimal schedule is thus a processor lower bound for achieving maximum parallelism with the algorithm [6], [7].

Clauss et al. [12] developed a set of mathematical tools to help find a processor-time-minimal multiprocessor array for a given dag. Another approach to a general solution has been reported by Wong and Delosme [58], [59], and Shang and Fortes [49]: they present methods for obtaining optimal linear schedules. That is, their processor arrays may be sub-optimal, but they get the best linear schedule possible. Darte et al. [14] show that such schedules are close to optimal, even when the constraint of linearity is relaxed. Geometric/combinatorial formulations of a dag’s task domain have been used in various contexts in parallel algorithm design as well (e.g., [26], [27], [32], [38], [39], [19], [18], [40], [12], [49], [55], [59]; see Fortes et al. [17] for a survey of systolic/array algorithm formulations).

We present an algorithm for finding a lower bound on the number of processors needed to achieve a given schedule of an algorithm. The application of this technique is illustrated with a tensor product computation. Then, we apply the technique to the free schedule of algorithms for matrix product. For this, we provide a compatible processor schedule that meets these processor lower bounds (i.e., a processor-time-minimal schedule), including the one for tensor product. We finish with some general conclusions and mention some open problems.

One strength of our approach is that it centers around the whole algorithm: we are finding processor lower bounds not just for a particular dag, but for a linearly parameterized family of dags, representing the infinitely many problem sizes for which the algorithm works. Thus, the processor lower bound is not a number but a piecewise polynomial function [1], [61], [62] of the parameter used to express different problem sizes. The formula then can be used to optimize the implementation of the algorithm, not just a particular execution of it. The processor lower bounds are produced by the following steps (a concrete sketch follows the list):

  • Formulating the problem of finding the number of processors needed at a particular time step as that of counting the integer points in a convex polyhedron.

  • Representing this set of points as the set of solutions to a linearly parameterized set of linear Diophantine equations.

  • Computing a generating function for the number of such solutions.

  • Deriving a formula from the generating function.
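To make these steps concrete, here is a small illustration of ours (not the paper’s implementation), assuming the natural linear schedule τ(i, j, k, l) = i + j + k + l on the n × n × n × n mesh treated below: the number of nodes scheduled at step t is the number of integer solutions of i + j + k + l = t with 1 ≤ i, j, k, l ≤ n, i.e., the coefficient of x^t in the generating function (x + x² + ⋯ + xⁿ)⁴. The largest such coefficient is a lower bound on the number of processors any array honoring this schedule must have.

```python
# Sketch (ours, for illustration): count the nodes of the n^4 mesh scheduled
# at step t under tau(i,j,k,l) = i+j+k+l by reading off the coefficient of
# x^t in (x + x^2 + ... + x^n)^4, built by repeated polynomial multiplication.

def step_counts(n: int) -> list[int]:
    """Coefficients of (x + x^2 + ... + x^n)^4; index t holds the count for step t."""
    factor = [0] + [1] * n          # x + x^2 + ... + x^n
    poly = [1]                      # the constant polynomial 1
    for _ in range(4):              # four factors, one per mesh dimension
        prod = [0] * (len(poly) + len(factor) - 1)
        for i, a in enumerate(poly):
            for j, b in enumerate(factor):
                prod[i + j] += a * b
        poly = prod
    return poly

n = 4
counts = step_counts(n)
# The busiest time step gives a processor lower bound for this linear schedule.
print(max(counts))                  # 44 processors for n = 4
```

The paper’s contribution is to obtain such counts as a closed-form, piecewise polynomial formula in n from the generating function, rather than by expanding coefficients for one n at a time as this sketch does.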

The ability to compute formulae for the number of solutions to a linearly parameterized set of linear Diophantine equations has other applications for nested loops [13], such as finding the number of instructions executed, the number of memory locations touched, and the number of I/O operations.
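As a toy instance of this nested-loop application (our example, not the paper’s): the loop nest `for i = 1..n: for j = 1..i: body` executes its body once per lattice point of the triangle {(i, j) : 1 ≤ j ≤ i ≤ n}, so its instruction count is Σ_{i=1}^{n} i = n(n+1)/2, a closed-form polynomial in the size parameter n of exactly the kind such formulae provide.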

The strength of our algorithm—its ability to produce formulae—comes, of course, with a price: the computational complexity of the algorithm is exponential in the size of the input (the number of bits in the coefficients of the system of Diophantine equations). The algorithm’s complexity, however, is quite reasonable given the complexity of the problem: determining whether there are any integer solutions to the system of Diophantine equations is NP-complete [20]. We produce a formula for the number of such solutions.

Clauss and Loechner [13] independently developed an algorithm for the problem based on Ehrhart polynomials. In [13], they sketch an algorithm for the problem, using the “polyhedral library” of Wilde [57].


Processor lower bounds

We present a general and uniform technique for deriving lower bounds: given a parameterized dag family and a correspondingly parameterized linear schedule, we compute a formula for a lower bound on the number of processors required by the schedule. This is much more general than the analysis of an optimal schedule for a given specific dag. The lower bounds obtained are good; we know of no dag treatable by this method for which the lower bounds are not also upper bounds. We believe this to be…

Tensor product

We examine the dag family for the 4D mesh: the n × n × n × n directed mesh. This family is fundamental, representing a communication-localized version of the standard algorithm for tensor product (also known as Kronecker product). The tensor product is used in many mathematical computations, including multivariable spline blending and image processing [21], multivariable approximation algorithms (used in graphics, optics, and topography) [29], as well as many recursive algorithms [28].

The tensor…
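For readers unfamiliar with the operation, the following minimal sketch (ours) computes the Kronecker product directly; its four nested loops over (i, j, k, l) are exactly the index space of the n × n × n × n mesh above.

```python
# Minimal Kronecker (tensor) product of an n x n matrix A and an m x m matrix B;
# the four nested loops expose the 4D index space of the standard algorithm.

def kronecker(A: list[list[int]], B: list[list[int]]) -> list[list[int]]:
    n, m = len(A), len(B)           # assumes square inputs for simplicity
    C = [[0] * (n * m) for _ in range(n * m)]
    for i in range(n):
        for j in range(n):
            for k in range(m):
                for l in range(m):
                    # Block (i, j) of C is A[i][j] * B.
                    C[i * m + k][j * m + l] = A[i][j] * B[k][l]
    return C

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
for row in kronecker(A, B):
    print(row)
```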

General formulation

We now generalize this example and consider the problem of computing a lower bound for the number of processors needed to satisfy a given linear schedule. That is, we show how to automatically construct a formula for the number of lattice points inside a linearly parameterized family of convex polyhedra, by automatically constructing a formula for the number of solutions to the corresponding linearly parameterized system of linear Diophantine equations. The algorithms for doing this and its…
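As a sketch of the polyhedron-to-Diophantine step (the encoding below is our own illustration; the paper’s exact data structures are not shown in this snippet), each parameterized inequality a·x ≤ b₀ + b₁n gains a fresh nonnegative integer slack variable, turning the system of inequalities into a system of linear Diophantine equations whose nonnegative integer solutions are exactly the polyhedron’s lattice points:

```python
# Sketch (ours): convert rows (coeffs, b0, b1), meaning coeffs . x <= b0 + b1*n,
# into equations coeffs . x + s_r = b0 + b1*n with one slack variable per row.

def to_diophantine(inequalities):
    m = len(inequalities)
    equations = []
    for r, (coeffs, b0, b1) in enumerate(inequalities):
        slack = [0] * m
        slack[r] = 1                       # coefficient of slack variable s_r
        equations.append((list(coeffs) + slack, b0, b1))
    return equations

# Example: the square {1 <= i, j <= n} written as four inequalities.
rows = [([1, 0], 0, 1),     #  i <= n
        ([-1, 0], -1, 0),   # -i <= -1
        ([0, 1], 0, 1),     #  j <= n
        ([0, -1], -1, 0)]   # -j <= -1
for eq in to_diophantine(rows):
    print(eq)
```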

Complexity and remarks

The number of leaves in the generated ternary tree is exponential in n = Σᵢ |aᵢ|, where {aᵢ} is the set of coefficients describing the set of Diophantine equations. The depth of recursion can be reduced somewhat when the columns to be used are picked carefully. It is also possible to prune the tree when the input vector c determines that there can be no λ-free terms resulting from the current matrix (e.g., some row is all strictly positive or all negative with c = 0, or the row elements are…

Conclusion

Given a nested loop program whose underlying computation dag has nodes representable as lattice points in a convex polyhedron, and a multiprocessor schedule for these nodes that is linear in the loop indices, we produce a formula for the number of lattice points in the convex polyhedron that are scheduled for a particular time step (which is a lower bound on the number of processors needed to satisfy the schedule). This is done by constructing a system of parametric linear Diophantine equations…

References

  • P. Cappello, Ömer Eğecioğlu, C. Scheiman, Processor-time-optimal systolic arrays, J. Parallel Algorithms...
  • P.R. Cappello, VLSI architectures for digital signal processing, Ph.D. thesis, Princeton University, Princeton, NJ,...
  • P.R. Cappello, A spacetime-minimal systolic array for matrix product
  • P.R. Cappello, K. Steiglitz, Unifying VLSI array design with linear transformations, in: H.J. Siegel, L. Siegel,...
  • P.R. Cappello et al., Unifying VLSI array design with linear transformations of space-time
  • Ph. Clauss et al., Calculus of space-optimal mappings of systolic algorithms on processor arrays
  • P. Clauss et al., Parametric analysis of polyhedral iteration spaces, J. VLSI Signal Process. (1998)
  • A. Darte et al., Linear scheduling is close to optimal
  • R.W. Floyd, Algorithm 97: shortest path, Commun. ACM (1962)
  • J.A.B. Fortes et al., Systematic design approaches for algorithmically specified systolic arrays
  • J.A.B. Fortes et al., Optimal linear schedules for the parallel execution of algorithms, Int. Conf. Parallel Processing (1984)
  • M.R. Garey et al., Computers and Intractability: A Guide to the Theory of NP-Completeness (1979)
  • J. Granata et al., Recursive fast algorithm and the role of the tensor product, IEEE Trans. Signal Process. (1992)
  • L.J. Guibas et al., Direct VLSI implementation of combinatorial algorithms, Proc. Caltech Conf. on VLSI (1979)
  • D.S. Hirschberg, Recent results on the complexity of common-subsequence problems
  • O.H. Ibarra et al., VLSI algorithms for solving recurrence equations and applications, IEEE Trans. Acoust., Speech, Signal Process. (1987)
  • G.J. Li et al., The design of optimal systolic algorithms, IEEE Trans. Comput. (1985)
  • R.M. Karp et al., Properties of a model for parallel computations: determinacy, termination, queueing, SIAM J. Appl. Math. (1966)
  • R.M. Karp et al., The organization of computations for uniform recurrence equations, J. ACM (1967)
  • H.-T. Kung et al., Algorithms for VLSI processor arrays
  • S.-Y. Kung et al., Optimal systolic design for the transitive closure and the shortest path problem, IEEE Trans. Comput. (1987)