Optimal systolic array algorithms for tensor product
Introduction
Systolic arrays are important for application-specific systems; they perform well because they are highly concurrent [54], [60]. Their design cost is low because, although they comprise a large number of processors, the processors are of only one type (or a small fixed number of types). Interprocessor communication is between nearest neighbors, and hence has short latency. Nearest-neighbor connections also imply that the VLSI circuit is devoted primarily to the processors, with very little devoted to their interconnection. As a consequence, systolic arrays are ideally suited to VLSI technology [22]. Minimizing the time and the number of processors needed to perform an application reduces the application's fabrication and operation costs. This is important in applications where minimizing the size and energy of the hardware is critical, such as space applications and consumer electronics [16].
We consider computations suitable for systolic arrays, often called regular array computations or systems of uniform recurrence equations [27], [31], [33]. Parallel execution of uniform recurrence equations has been studied extensively, from at least as far back as 1966 (e.g., [26], [32], [30], [8], [38], [10], [11], [39], [40], [14]). In such computations, the tasks to be computed are viewed as the nodes of a directed acyclic graph (dag), where the data dependencies are represented as arcs. Given a dag G = (N, A), a multiprocessor schedule assigns node v for processing during step τ(v) on processor π(v). A valid multiprocessor schedule is subject to two constraints:
Causality: A node can be computed only when its children have been computed at previous steps.
Non-conflict: A processor cannot compute two different nodes during the same time step.
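These two constraints can be checked mechanically. The sketch below is our illustration, not part of the paper: it uses a toy dag and assumed names (`tau`, `pi` as dicts), with each arc drawn from a node to a node that depends on it.

```python
def is_valid_schedule(nodes, arcs, tau, pi):
    """Check the two validity constraints for a schedule (tau, pi)."""
    # Causality: a node is computed strictly after every node it depends on.
    for (u, v) in arcs:  # arc u -> v means v depends on u
        if tau[v] <= tau[u]:
            return False
    # Non-conflict: a processor computes at most one node per time step.
    seen = set()
    for v in nodes:
        key = (tau[v], pi[v])
        if key in seen:
            return False
        seen.add(key)
    return True

# Toy diamond dag: a -> b, a -> c, b -> d, c -> d
nodes = ["a", "b", "c", "d"]
arcs = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
tau = {"a": 0, "b": 1, "c": 1, "d": 2}
pi = {"a": 0, "b": 0, "c": 1, "d": 0}
print(is_valid_schedule(nodes, arcs, tau, pi))  # True
```

Scheduling b and c on the same processor at the same step, for example, would violate non-conflict and make the check return False.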
In what follows, we refer to a valid schedule simply as a schedule. A time-minimal schedule for an algorithm completes in S steps, where S is the number of nodes on a longest path in the dag. A time-minimal schedule exposes an algorithm's maximum parallelism: it bounds the number of time steps the algorithm needs when infinitely many processors may be used. A processor-time-minimal schedule is a time-minimal schedule that uses as few processors as any time-minimal schedule for the algorithm. Although it is only one of many performance measures, processor-time-minimality is useful because it measures the minimum number of processors needed to extract the maximum parallelism from a dag. Being machine-independent, it is a more fundamental measure than those that depend on a particular machine or architecture. This view prompted several researchers to investigate processor-time-minimal schedules for families of dags. Processor-time-minimal systolic arrays are easy to devise in an ad hoc manner for 2D systolic algorithms; this apparently is not the case for 3D systolic algorithms. There have been several publications regarding processor-time-minimal systolic arrays for fundamental 3D algorithms. Processor-time-minimal schedules for various fundamental problems have been proposed in the literature: Scheiman and Cappello [9], [5], [48], [45] examine the dag family for matrix product; Louka and Tchuente [34] examine the dag family for Gauss–Jordan elimination; Scheiman and Cappello [46], [47] examine the dag family for transitive closure; Benaini, Robert, Trystram, and Rote [2], [3], [4], [41], [42] examine the dag families for the algebraic path problem and Gaussian elimination. Each of the algorithms listed in Table 1 has the property that, in its dag representation, every node is on a longest path. Therefore, for each algorithm listed, its free schedule is its only time-minimal schedule.
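The free schedule computes each node at the earliest step its dependencies permit, and the longest-path length S is then the completion time of any time-minimal schedule. A minimal sketch, under an assumed dag representation of our own (not the paper's), makes both concrete:

```python
def free_schedule(nodes, arcs):
    """Earliest-possible step for each node (the free schedule):
    tau(v) = 1 + max over dependencies u of tau(u); sources get step 1."""
    preds = {v: [] for v in nodes}
    for (u, v) in arcs:  # arc u -> v means v depends on u
        preds[v].append(u)

    tau = {}
    def step(v):
        if v not in tau:
            tau[v] = 1 + max((step(u) for u in preds[v]), default=0)
        return tau[v]
    for v in nodes:
        step(v)
    return tau

# Toy diamond dag: a -> b, a -> c, b -> d, c -> d
nodes = ["a", "b", "c", "d"]
arcs = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
tau = free_schedule(nodes, arcs)
S = max(tau.values())  # nodes on a longest path = steps of a time-minimal schedule
print(tau, S)  # {'a': 1, 'b': 2, 'c': 2, 'd': 3} 3
```

The peak number of nodes sharing one step of the free schedule (here, b and c at step 2) is the kind of quantity the processor lower bounds of this paper capture in closed form.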
A processor lower bound for achieving a time-minimal schedule is thus a processor lower bound for achieving maximum parallelism with the algorithm [6], [7].
Clauss et al. [12] developed a set of mathematical tools to help find a processor-time-minimal multiprocessor array for a given dag. Another approach to a general solution has been reported by Wong and Delosme [58], [59], and Shang and Fortes [49]. They present methods for obtaining optimal linear schedules: their processor arrays may be sub-optimal, but they get the best linear schedule possible. Darte et al. [14] show that such schedules are close to optimal, even when the constraint of linearity is relaxed. Geometric/combinatorial formulations of a dag's task domain have been used in various contexts in parallel algorithm design as well (e.g., [26], [27], [32], [38], [39], [19], [18], [40], [12], [49], [55], [59]; see Fortes et al. [17] for a survey of systolic/array algorithm formulations).
We present an algorithm for finding a lower bound on the number of processors needed to achieve a given schedule of an algorithm. The application of this technique is illustrated with a tensor product computation. Then, we apply the technique to the free schedule of algorithms for matrix product. For this problem, we provide a time-minimal processor schedule that meets these processor lower bounds (i.e., a processor-time-minimal schedule), including the one for tensor product. We finish with some general conclusions and mention some open problems.
One strength of our approach is that it applies to the whole algorithm: we find processor lower bounds not just for a particular dag, but for a linearly parameterized family of dags, representing the infinitely many problem sizes for which the algorithm works. Thus, the processor lower bound is not a number but a piecewise polynomial function [1], [61], [62] of the parameter used to express different problem sizes. The formula can then be used to optimize the implementation of the algorithm, not just a particular execution of it. The processor lower bounds are produced by:
- Formulating the problem as finding, for a particular time step, the number of processors needed by that time step as a formula for the number of integer points in a convex polyhedron.
- Representing this set of points as the set of solutions to a linearly parameterized set of linear Diophantine equations.
- Computing a generating function for the number of such solutions.
- Deriving a formula from the generating function.
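For small problem sizes, the bound produced by these steps can be cross-checked by exhaustive enumeration. The sketch below is our own illustration, not the paper's generating-function algorithm: it counts, for the n × n × n × n mesh under the linear schedule τ(i, j, k, l) = i + j + k + l (0-based indices), the nodes scheduled at each step, and takes the peak, which is a lower bound on processors for that schedule.

```python
from itertools import product

def points_at_step(n, t):
    """Nodes of the n x n x n x n mesh scheduled at step t under
    the linear schedule t(i, j, k, l) = i + j + k + l."""
    return sum(1 for p in product(range(n), repeat=4) if sum(p) == t)

n = 4
# Steps range over 0 .. 4*(n-1); the busiest step bounds the processor count.
peak = max(points_at_step(n, t) for t in range(4 * (n - 1) + 1))
print(peak)  # 44 for n = 4 (attained at the middle step t = 6)
```

The paper's contribution is to produce such counts as a closed-form function of n, rather than by enumeration, which becomes infeasible as n grows.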
The ability to compute formulae for the number of solutions to a linearly parameterized set of linear Diophantine equations has other applications for nested loops [13], such as finding the number of instructions executed, the number of memory locations touched, and the number of I/O operations.
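As a small instance of this use (a hypothetical triangular loop nest, not one taken from [13]), the number of instructions executed is the number of lattice points satisfying the loop bounds, and it has a closed form in the size parameter:

```python
def trip_count(n):
    """Iterations of the nest: for i in 0..n-1: for j in 0..i: body().
    Counted directly as lattice points (i, j) with 0 <= j <= i < n."""
    return sum(1 for i in range(n) for j in range(i + 1))

# The count is a polynomial in the size parameter n: n(n+1)/2.
for n in range(1, 8):
    assert trip_count(n) == n * (n + 1) // 2
print(trip_count(10))  # 55
```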
The strength of our algorithm, its ability to produce formulae, comes, of course, with a price: the computational complexity of the algorithm is exponential in the size of the input (the number of bits in the coefficients of the system of Diophantine equations). The algorithm's complexity, however, is quite reasonable given the complexity of the problem: determining whether there are any integer solutions to such a system of Diophantine equations is NP-complete [20]; we produce a formula for the number of such solutions.
Clauss and Loechner [13] independently developed an algorithm for the problem based on Ehrhart polynomials. In [13], they sketch an algorithm for the problem, using the “polyhedral library” of Wilde [57].
Processor lower bounds
We present a general and uniform technique for deriving lower bounds: given a parameterized dag family and a correspondingly parameterized linear schedule, we compute a formula for a lower bound on the number of processors required by the schedule. This is much more general than the analysis of an optimal schedule for a given specific dag. The lower bounds obtained are good; we know of no dag treatable by this method for which the lower bounds are not also upper bounds. We believe this to be
Tensor product
We examine the dag family for the 4D mesh: the n × n × n × n directed mesh. This family is fundamental, representing a communication-localized version of the standard algorithm for tensor product (also known as Kronecker product). The tensor product is used in many mathematical computations, including multivariable spline blending and image processing [21], multivariable approximation algorithms (used in graphics, optics, and topography) [29], as well as many recursive algorithms [28].
The tensor
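To ground the 4D structure: every output entry of the Kronecker product is indexed by four loop indices, which is why the standard algorithm's dag is the n × n × n × n mesh when both operands are n × n. The sketch below is a direct rendering of that standard algorithm (not the systolic array itself):

```python
def kronecker(A, B):
    """C[i*p + k][j*q + l] = A[i][j] * B[k][l], for A of size m x n
    and B of size p x q; the index tuple (i, j, k, l) ranges over a
    4D mesh of tasks."""
    m, n = len(A), len(A[0])
    p, q = len(B), len(B[0])
    C = [[0] * (n * q) for _ in range(m * p)]
    for i in range(m):
        for j in range(n):
            for k in range(p):
                for l in range(q):
                    C[i * p + k][j * q + l] = A[i][j] * B[k][l]
    return C

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
print(kronecker(A, B))
```

For these inputs the result is the 4 × 4 block matrix whose (i, j) block is A[i][j]·B.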
General formulation
We now generalize this example and consider the problem of computing a lower bound for the number of processors needed to satisfy a given linear schedule. That is, we show how to automatically construct a formula for the number of lattice points inside a linearly parameterized family of convex polyhedra, by automatically constructing a formula for the number of solutions to the corresponding linearly parameterized system of linear Diophantine equations. The algorithms for doing this and its
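For small parameter values, such a construction can be cross-checked by brute force. The toy system below is our assumption, not the paper's construction: one linearly parameterized Diophantine equation whose solution count is a polynomial in the parameter.

```python
from itertools import product

def count_solutions(n):
    """Nonnegative integer solutions (x, y, z) of x + y + z = n,
    a linearly parameterized linear Diophantine equation."""
    return sum(1 for x, y, z in product(range(n + 1), repeat=3)
               if x + y + z == n)

# The count is a polynomial in the parameter n: (n + 1)(n + 2) / 2.
for n in range(10):
    assert count_solutions(n) == (n + 1) * (n + 2) // 2
print(count_solutions(5))  # 21
```

The paper's method derives such polynomial (in general, piecewise polynomial) formulas symbolically, via a generating function, rather than by enumeration.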
Complexity and remarks
The number of leaves in the generated ternary tree is exponential in the size of the coefficients {ai} describing the set of Diophantine equations. The depth of recursion can be reduced somewhat when the columns to be used are picked carefully. It is also possible to prune the tree when the input vector c determines that there can be no λ-free terms resulting from the current matrix (e.g., some row is all strictly positive or all negative with c = 0, or the row elements are
Conclusion
Given a nested loop program whose underlying computation dag has nodes representable as lattice points in a convex polyhedron, and a multiprocessor schedule for these nodes that is linear in the loop indices, we produce a formula for the number of lattice points in the convex polyhedron that are scheduled for a particular time step (a lower bound on the number of processors needed to satisfy the schedule). This is done by constructing a system of parametric linear Diophantine equations
References (62)
- et al., Space-time-minimal systolic arrays for Gaussian elimination and the algebraic path problem, Parallel Comput. (1990)
- et al., The SDEF programming system, J. Parallel Distributed Comput. (1989)
- et al., Parallelism detection and algorithm transformation techniques useful for VLSI architecture design, J. Parallel Distributed Comput. (1985)
- et al., Systolic algorithm for tensor products of matrices: implementation and applications, Parallel Comput. (1990)
- et al., Systolic algorithm for multivariable approximation using tensor products of basis functions, Parallel Comput. (1991)
- et al., The Design and Analysis of Computer Algorithms (1974)
- et al., Space-time-minimal systolic arrays for Gaussian elimination and the algebraic path problem
- et al., A new systolic architecture for the algebraic path problem
- et al., A processor-time-minimal systolic array for cubical mesh algorithms, IEEE Trans. Parallel Distributed Syst. (1992)
- et al., Processor lower bound formulas for array computations and parametric Diophantine systems, Int. J. Found. Comput. Sci. (1998)