Fast problem-size-independent parallel prefix circuits

https://doi.org/10.1016/j.jpdc.2008.12.003

Abstract

A family of parallel algorithms solving the prefix problem on the combinational circuit model is presented. These prefix circuits are waist-size optimal with waist 1 (WSO-1). They are not only building blocks for constructing fast depth-size optimal prefix circuits, but also fast problem-size-independent prefix circuits in their own right. When the problem size is greater than the circuit width, the presented prefix circuits may be much faster than any other prefix circuits of the same width, especially when the problem size is greater than or equal to twice the circuit width. The new prefix circuits are compared analytically with other representative prefix circuits to show how fast they are. They have the minimum depth and are the fastest among all WSO-1 prefix circuits of the same width and fan-out. Thus, they are better building blocks than other WSO-1 circuits for constructing fast depth-size optimal prefix circuits with the same fan-out.

Introduction

Given n inputs $x_1, x_2, \ldots, x_n$ and an associative binary operator $\otimes$, the prefix operation, also called prefix computation, is to compute $y_i = x_1 \otimes x_2 \otimes \cdots \otimes x_i$, for $1 \le i \le n$. This operation has been extensively studied for its numerous applications, such as biological sequence comparison, cryptography, design of silicon compilers, and loop parallelization [1], [2], [4], [11], [20], [21], [22], [24], [38], [44], [48]. Because of its usefulness, prefix computation is considered as a primitive operation [5]. For ease of presentation, unless otherwise stated, $x_i$'s and $y_i$'s represent inputs and outputs, respectively, and i:j represents the result of computing $x_i \otimes x_{i+1} \otimes \cdots \otimes x_j$, where $1 \le i \le j \le n$. Particularly, $y_j$ = 1:j.
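As a point of reference for the definition above, the following is a minimal sequential sketch of the prefix operation. The helper name `prefix` is hypothetical and not from the paper; it only illustrates what the outputs are, not how a parallel circuit computes them.

```python
from itertools import accumulate
import operator


def prefix(xs, op):
    """Return [y_1, ..., y_n], where y_i combines x_1 through x_i with op.

    op is assumed to be associative; it does not have to be commutative.
    """
    return list(accumulate(xs, op))


# Addition is associative, so this yields the running sums.
sums = prefix([3, 1, 4, 1, 5], operator.add)

# String concatenation is associative but not commutative.
joined = prefix(["a", "b", "c"], operator.add)
```

Because only associativity is assumed, the same routine covers sums, products, maxima, string concatenation, and the carry operators used in adders.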

The widest application of parallel prefix is fast adders. Some representative works are briefly described in the following. Knowles presents adders that take into account speed, area, and power trade-offs [19]. Beaumont-Smith and Lim introduce novel designs of prefix adder carry trees [3]. Dimitrakopoulos and Nikolos merge Ling carry computation and parallel prefix techniques to obtain fast adders that require less fan-out [10]. Patel et al. present new algorithms for modulo $2^n-1$ addition with single zero representation [38]. Reviews of numerous other adder designs can also be found in the above four papers.

To speed up the prefix operation, many parallel prefix algorithms for various parallel computing models have been proposed; the computing models include binary tree [1], [24], hypercube [16], [47], mesh [1], fully connected multicomputer [12], [17], [26], [31], LogP multicomputer [39], parallel random-access machine [8], [14], [15], [18], [20], [21], [42], linear array with a reconfigurable bus system [9], hardware algorithms [25], [37], and others [1], [16], [20], [22], [35], [36], [43]. In addition, many prefix circuits, which are parallel prefix algorithms on the combinational circuit model, have been designed and studied [4], [6], [7], [13], [21], [22], [23], [27], [28], [29], [30], [32], [34], [40], [41], [46]. A prefix circuit of width m is represented as a directed acyclic graph containing m inputs, m outputs, at least m–1 operation nodes, and at least one duplication node. As shown in Fig. 1, an operation node, represented by a black dot, performs the operation on its two inputs, having indegree 2 and outdegree 1 or more. A duplication node has indegree 1 and outdegree 2 or more, denoted by a small circle also in Fig. 1. Because only the duplication node has indegree 1 and outdegree 2 or more, it need not and will not be explicitly represented by a small circle.

An example prefix circuit is shown in Fig. 2. Its vertical edges from left to right are named line 1, line 2, …, line 5, respectively. Input nodes are at the top of a circuit, representing input items, and output nodes are at the bottom, representing outputs. Output $y_j$ is generated on line j, for $1 \le j \le 5$. The numbers at the left side of a prefix circuit denote the depth levels of the nodes to the right. Fig. 2 also illustrates the outputs of operation nodes on line 5 by giving the outputs at the right side of these nodes. If a line has no operation or duplication node at some level, we assume that there is a repeater node or latch; thus, if all inputs arrive simultaneously at all input nodes, respectively, all outputs will be at the output nodes at the same time. For any operation node on line i at level j, its two inputs are from nodes at level j–1; the left input is from a node on line k, where k<i, and the right input is from a node on line i. The fan-in of a node is its indegree, and the fan-out is its outdegree. A node having a smaller fan-in and fan-out is faster and smaller in VLSI implementation [45]. The fan-in (respectively fan-out) of a prefix circuit is the maximum fan-in (respectively fan-out) of all nodes in the circuit. This paper considers prefix circuits with fan-in 2 only, and, unless otherwise stated, fan-out also 2.

The size of a prefix circuit named A, denoted s(A), is the number of operation nodes in A, and the depth, denoted d(A), is the maximum level of operation nodes in A. Smaller size implies less power consumption and less area in VLSI implementation and thus less cost. Smaller depth implies faster computation. For any prefix circuit A of width m, it has been shown that $d(A)+s(A) \ge 2m-2$ [41]; A is depth-size optimal if $d(A)+s(A) = 2m-2$.

For prefix circuit A of width m, we use i(A) to denote the smallest level that has a duplication node on line 1, and use l(A) to denote the level of an operation node on line m that computes 1:m. Let $w(A) = l(A) - i(A)$, which is called the waist of A. It has been proved that $w(A)+s(A) \ge 2m-2$ [30]. Therefore, if $w(A)+s(A) = 2m-2$, then A is waist-size optimal (WSO); moreover, if $w(A) = 1$, A is said to be WSO-1. Note that we may use the notation A(m) to mean that a circuit named A is of width m.

In this paper, we assume that prefix circuits are of width m, and the prefix operation has n inputs, where n>m, unless otherwise stated. All the previous prefix circuits in the literature are designed under the assumption that the circuit width is equal to the number of inputs, or m=n. Most of them aim to achieve fast computation under constraints such as depth-size optimality; however, they are generally slow when n>m. Constructing problem-size-independent prefix circuits that are fast when n>m is as significant as constructing ones that are fast only when n=m.

This paper is the first to focus on fast problem-size-independent prefix circuits. We present a family of parallel WSO-1 circuits H(m), for $m \ge 3$, prove that they have the minimum depth and are the fastest among all WSO-1 circuits of the same width and fan-out, and show that they can be faster than other prefix circuits of the same width when n>m. In addition, the greater n is, the faster H is than the others. For example, when $n \ge 2m$, H can be faster than a prefix circuit that has unbounded fan-out and has the minimum depth among all prefix circuits of the same width. In fact, H is the fastest prefix circuit with fan-out 2 when n>m.

Moreover, H is also a building block for constructing depth-size optimal prefix circuits that are fast when n=m. Of all the proposed prefix circuits in the literature, many are depth-size optimal [22], [23], [28], [29], [30], [32], [34], [40], [46]. Most of the recent depth-size optimal prefix circuits are constructed with WSO-1 circuits as building blocks [28], [29], [30], [34], [40], [46]; thus, it is useful to have H circuits of any width to support construction of depth-size optimal prefix circuits of any larger width. Although some algorithms have been presented to construct WSO-1 circuits [29], [34], [40], they can obtain WSO-1 circuits of only certain widths; for example, two algorithms each use a WSO-1 circuit of width m and depth d to derive a WSO-1 circuit of width $2m-1$ and depth $d+2$ [34], [40]. In addition to the unlimited width range, H has the minimum depth and is the fastest among all WSO-1 circuits of the same width and fan-out; thus, it is a better building block than other WSO-1 circuits for constructing depth-size optimal prefix circuits with a depth as small as possible.

The remainder of this paper is organized as follows: Section 2 first uses an example to show that when n>m, an H circuit is faster than a depth-size optimal prefix circuit that is faster than H when n=m, and then gives the number of computation time steps required by any prefix circuit when n>m. Section 3 defines H, a family of parallel WSO-1 circuits of any width $m \ge 3$. This section also shows that H has the minimum depth and is the fastest among all WSO-1 circuits of the same width and fan-out, and gives some other properties of H. To see how fast H is when n>m, Section 4 compares the computation time of H with those of other representative prefix circuits. Section 5 concludes this paper.

A WSO-1 circuit compared with a depth-size optimal circuit

This section begins with a simple example to show that a WSO-1 circuit is faster than a depth-size optimal prefix circuit when n>m. A general formula for the number of time steps a prefix circuit requires is then given. The formula motivates the pursuit of WSO-1 circuits with a small depth in the next section.

Fig. 3 shows parallel prefix circuit H(9) and its input sequence when n=25. Clearly, d(H(9))=7, i(H(9))=3, l(H(9))=4, and s(H(9))=15. Therefore, by definition, H(9) is WSO-1. In contrast,

A family of parallel WSO-1 prefix circuits

Before presenting H, a family of WSO-1 circuits with the minimum depth, we first review an approach to defining prefix circuits. A prefix circuit A can be defined with sets of operation nodes at level i, for $i = 1, 2, \ldots, d(A)$: $G_i = \{(x, y) \mid \text{there is an operation node at level } i \text{ on line } y \text{ whose left input is from line } x\}$.

For example, the prefix circuit H(5) already shown in Fig. 2 can be defined with $G_1 = \{(2,3), (4,5)\}$, $G_2 = \{(3,5)\}$, $G_3 = \{(1,5)\}$, $G_4 = \{(1,3)\}$, $G_5 = \{(1,2), (3,4)\}$.
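The level sets above are enough to check a circuit mechanically. The sketch below is an illustration, not the paper's procedure: the function `analyze` and its return format are hypothetical. It tracks the range lo:hi carried on each line, takes both inputs of an operation node from the previous level, and derives size, depth, and waist. The waist computation assumes one reading of the definitions: the node on line 1 at level j–1 is a duplication node whenever some operation node at level j reads line 1.

```python
def analyze(width, levels):
    """Simulate a prefix circuit given as a list [G_1, ..., G_d] of sets of (x, y) pairs."""
    # vals[j] = (lo, hi) means line j currently carries lo:hi.
    vals = {j: (j, j) for j in range(1, width + 1)}
    i_a = l_a = None
    for lvl, g in enumerate(levels, start=1):
        new = dict(vals)  # lines without an operation node just pass their value on
        for x, y in sorted(g):
            (a, b), (c, d) = vals[x], vals[y]  # inputs come from level lvl - 1
            assert x < y and b + 1 == c, "operands must be adjacent ranges"
            new[y] = (a, d)
            if x == 1 and i_a is None:
                i_a = lvl - 1  # assumed: first duplication node on line 1
            if y == width and (a, d) == (1, width):
                l_a = lvl      # level at which line `width` computes 1:width
        vals = new
    return {
        "correct": all(vals[j] == (1, j) for j in vals),
        "size": sum(len(g) for g in levels),
        "depth": len(levels),
        "waist": None if i_a is None or l_a is None else l_a - i_a,
    }


H5 = [{(2, 3), (4, 5)}, {(3, 5)}, {(1, 5)}, {(1, 3)}, {(1, 2), (3, 4)}]
result = analyze(5, H5)
```

For H(5) this reports size 7, depth 5, and waist 1, so waist plus size is 8 = 2·5 − 2, which matches the WSO-1 condition, while depth plus size is 12, so H(5) is not depth-size optimal.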

Let $r = \lceil \lg(m-1) \rceil$. Prefix circuit H(m),

Comparisons and discussions

To see whether H is faster than other prefix circuits, in this section we first compare H with other prefix circuits in number of time steps. The characteristics of prefix circuits that can be faster than H are then considered in order to identify such circuits. For ease of presentation, let $k = \lceil (n-m)/(m-1) \rceil \ge 1$.

By Theorem 1, any prefix circuit A requires $t_A = d(A) + k(w(A)+1)$ time steps to complete the prefix operation. Thus, H takes $t_H = d(H) + 2k$ time steps. Let $\mathrm{diff}(A) = t_A - t_H = d(A) - d(H) + k(w(A)-1)$. Clearly, A
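The time-step formula can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, and k is taken to be the ceiling of (n−m)/(m−1), which is an assumption about the rounding. The values for H(9) (depth 7, waist 1, n=25) come from the example in Section 2; the second circuit's depth 4 and waist 3 are made-up numbers chosen purely to exercise the formula and do not describe any circuit from the paper.

```python
def extra_passes(n, m):
    """k = ceil((n - m) / (m - 1)), computed with integer arithmetic (assumed rounding)."""
    if not n > m >= 2:
        raise ValueError("expected n > m >= 2")
    return -(-(n - m) // (m - 1))


def time_steps(depth, waist, n, m):
    """t_A = d(A) + k * (w(A) + 1)."""
    return depth + extra_passes(n, m) * (waist + 1)


def diff(depth_a, waist_a, depth_h, n, m):
    """diff(A) = t_A - t_H, where H has waist 1."""
    return time_steps(depth_a, waist_a, n, m) - time_steps(depth_h, 1, n, m)


# H(9) on n = 25 inputs: d = 7, w = 1, k = 2.
t_h = time_steps(7, 1, 25, 9)

# Hypothetical width-9 circuit with depth 4 and waist 3 (illustrative values only).
t_other = time_steps(4, 3, 25, 9)
gap = diff(4, 3, 7, 25, 9)
```

With these numbers H(9) needs 11 time steps and the hypothetical shallower circuit needs 12: a depth advantage of 3 is outweighed once k(w−1) reaches 4, which is the trade-off the diff expression captures.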

Conclusions

This paper is the first to focus on fast problem-size-independent prefix circuits. We have presented a family of parallel WSO-1 circuits H(m) with fan-out 2, for any width $m \ge 3$. H is the fastest among all WSO-1 circuits of the same width and fan-out when n>m. A prefix circuit cannot be faster than H unless it has a larger fan-out or is not WSO-1. Although H has a smaller fan-out and larger depth, it is so fast that when $n \ge 2m$, it almost always takes fewer time steps than LF, which has unbounded

Acknowledgment

This research was supported in part by the National Science Council of Taiwan under contract NSC 91-2218-E-011-002.

References (48)

  • M. Snir, Depth-size trade-offs for parallel prefix computation, J. Algorithms (1986)
  • S.G. Ziavras et al., Data broadcasting and reduction, prefix computation, and sorting on reduced hypercube parallel computers, Parallel Comput. (1996)
  • S.G. Akl, Parallel Computation: Models and Methods (1997)
  • A. Beaumont-Smith, C.-C. Lim, Parallel prefix adder design, in: Proc. 15th IEEE Symposium on Computer Arithmetic, Vail,...
  • A. Bilgory et al., A heuristic for suffix solutions, IEEE Trans. Comput. (1986)
  • G.E. Blelloch, Scans as primitive operations, IEEE Trans. Comput. (1989)
  • R.P. Brent et al., A regular layout for parallel adders, IEEE Trans. Comput. (1982)
  • D.A. Carlson et al., Limited width parallel prefix circuits, J. Supercomput. (1990)
  • A. Datta, Multiple addition and prefix sum on a linear array with a reconfigurable pipelined bus system, J. Supercomput. (2004)
  • G. Dimitrakopoulos et al., High-speed parallel-prefix VLSI Ling adders, IEEE Trans. Comput. (2005)
  • C. Efstathiou et al., Fast parallel-prefix modulo $2^n+1$ adders, IEEE Trans. Comput. (2004)
  • F.E. Fich, New bounds for parallel prefix circuits, in: Proc. 15th Symposium on the Theory of Computing, 1983, PP....
  • W.J. Hsu et al., Parallel tree contraction and prefix computations on a large family of interconnection topologies, Acta Inform. (1995)
  • L.-L. Hung et al., Parallel prefix algorithms on the multicomputer, WSEAS Trans. Comput. Res. (2008)

    Yen-Chun Lin received his BS degree in electrical engineering from National Taiwan University in 1977, MS degree in computer engineering from National Chiao Tung University in 1983, and PhD degree in electrical engineering from National Taiwan University in 1988. Since 1988 he has been on the faculty at National Taiwan University of Science and Technology. Dr. Lin has been a full professor since February 1993, in Department of Electronic Engineering and subsequently in Department of Computer Science and Information Engineering. He served as Program Chair of the 2001 International Conference on Parallel and Distributed Computing, Applications, and Technologies and as Guest Editor of The Journal of Supercomputing, March 2003. He was a Visiting Scientist at the IBM Almaden Research Center, San Jose, California, from 1993 to 1994. His research interests include parallel computing and multimedia systems.

    Li-Ling Hung received her BS degree in computer science from Tunghai University, Taichung, Taiwan in 1993. She received her MS degree in computer science and engineering from Yuan-Ze University, Chungli, Taiwan in 1995. She received her PhD degree in computer science and information engineering at National Taiwan University of Science and Technology, Taipei in 2008. Since February 2009 she has been an assistant professor in Department of Computer Science and Information Engineering, Aletheia University, Tamsui, Taiwan. Her research interests include parallel computing.
