Fast problem-size-independent parallel prefix circuits
Introduction
Given n inputs x_1, x_2, …, x_n and an associative binary operator ∘, the prefix operation, also called prefix computation, is to compute y_i = x_1 ∘ x_2 ∘ ⋯ ∘ x_i, for 1 ≤ i ≤ n. This operation has been extensively studied for its numerous applications, such as biological sequence comparison, cryptography, design of silicon compilers, and loop parallelization [1], [2], [4], [11], [20], [21], [22], [24], [38], [44], [48]. Because of its usefulness, prefix computation is considered a primitive operation [5]. For ease of presentation, unless otherwise stated, x's and y's represent inputs and outputs, respectively, and [i, j] represents the result of computing x_i ∘ x_{i+1} ∘ ⋯ ∘ x_j, where i ≤ j. In particular, [i, i] = x_i.
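As a concrete illustration (ours, not the paper's), the prefix operation can be sketched sequentially; `prefix` is an illustrative name, and any associative operator works:

```python
from operator import add

def prefix(xs, op):
    """Return [x1, x1∘x2, ..., x1∘x2∘...∘xn] for an associative operator op."""
    ys, acc = [], None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        ys.append(acc)
    return ys

# With ∘ = +, the prefixes are running sums.
print(prefix([1, 2, 3, 4, 5], add))   # [1, 3, 6, 10, 15]
# Associativity is the only requirement; max works just as well.
print(prefix([3, 1, 4, 1, 5], max))   # [3, 3, 4, 4, 5]
```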
The widest application of parallel prefix is fast adders. Some representative works are briefly described in the following. Knowles presents adders that take into account speed, area, and power trade-offs [19]. Beaumont-Smith and Lim introduce novel designs of prefix adder carry trees [3]. Dimitrakopoulos and Nikolos merge Ling carry computation and parallel prefix techniques to obtain fast adders that require less fan-out [10]. Patel et al. present new algorithms for modulo 2^n − 1 addition with a single zero representation [38]. Reviews of numerous other adder designs can also be found in the above four papers.
To speed up the prefix operation, many parallel prefix algorithms for various parallel computing models have been proposed; the computing models include the binary tree [1], [24], hypercube [16], [47], mesh [1], fully connected multicomputer [12], [17], [26], [31], LogP multicomputer [39], parallel random-access machine [8], [14], [15], [18], [20], [21], [42], linear array with a reconfigurable bus system [9], hardware algorithms [25], [37], and others [1], [16], [20], [22], [35], [36], [43]. In addition, many prefix circuits, which are parallel prefix algorithms on the combinational circuit model, have been designed and studied [4], [6], [7], [13], [21], [22], [23], [27], [28], [29], [30], [32], [34], [40], [41], [46]. A prefix circuit of width w is represented as a directed acyclic graph containing w input nodes, w output nodes, at least w − 1 operation nodes, and at least one duplication node. As shown in Fig. 1, an operation node, represented by a black dot, performs the operation ∘ on its two inputs; it has indegree 2 and outdegree 1 or more. A duplication node, denoted by a small circle also in Fig. 1, has indegree 1 and outdegree 2 or more. Because only the duplication node has indegree 1 and outdegree 2 or more, it need not, and will not, be explicitly represented by a small circle.
An example prefix circuit is shown in Fig. 2. Its vertical edges from left to right are named line 1, line 2, …, line 5, respectively. Input nodes are at the top of a circuit, representing the input items, and output nodes are at the bottom, representing the outputs. Output y_i is generated on line i, for 1 ≤ i ≤ 5. The numbers at the left side of a prefix circuit denote the depth levels of the nodes to their right. Fig. 2 also illustrates the outputs of the operation nodes on line 5 by giving the outputs at the right side of these nodes. If a line has no operation or duplication node at some level, we assume that there is a repeater node or latch; thus, if all inputs arrive simultaneously at all input nodes, all outputs will appear at the output nodes at the same time. For any operation node on line i at level t, its two inputs are from nodes at level t − 1; the left input is from a node on line j, where j < i, and the right input is from a node on line i. The fan-in of a node is its indegree, and the fan-out is its outdegree. A node with a smaller fan-in and fan-out is faster and smaller in VLSI implementation [45]. The fan-in (respectively, fan-out) of a prefix circuit is the maximum fan-in (respectively, fan-out) over all nodes in the circuit. This paper considers prefix circuits with fan-in 2 only and, unless otherwise stated, fan-out 2 as well.
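To make the circuit model concrete, the following sketch (our illustration, not code from the paper) simulates a levelled prefix circuit. Each level lists its operation nodes as pairs (i, j): the node on line i combines the previous-level value of line j (left input, j < i) with that of line i; lines with no node act as repeaters:

```python
def simulate(levels, xs, op):
    """Simulate a prefix circuit of width len(xs), given per-level operation nodes."""
    vals = list(xs)                # value on each line at the current level
    for nodes in levels:
        nxt = list(vals)           # repeaters: values pass down unchanged
        for i, j in nodes:         # operation node on line i, left input from line j
            assert j < i
            nxt[i - 1] = op(vals[j - 1], vals[i - 1])
        vals = nxt
    return vals                    # outputs y_1, ..., y_w

# A width-5 serial circuit in the style of Fig. 2: one operation node per level.
serial5 = [[(2, 1)], [(3, 2)], [(4, 3)], [(5, 4)]]
print(simulate(serial5, [1, 2, 3, 4, 5], lambda a, b: a + b))  # [1, 3, 6, 10, 15]
```

A parallel design packs independent nodes into the same level; for instance, `[[(2, 1), (4, 3)], [(3, 2), (4, 2)]]` computes all four prefixes of a width-4 circuit in depth 2.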
The size of a prefix circuit C, denoted s(C), is the number of operation nodes in C, and the depth, denoted d(C), is the maximum level of the operation nodes in C. A smaller size implies less power consumption and less area in VLSI implementation, and thus lower cost; a smaller depth implies faster computation. For any prefix circuit C of width w, it has been shown that d(C) + s(C) ≥ 2w − 2 [41]; C is depth-size optimal if d(C) + s(C) = 2w − 2.
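For instance, the serial circuit of width w has depth w − 1 and size w − 1, so it meets the bound with equality and is depth-size optimal. A quick sanity check of this arithmetic (our illustration; function names are ours):

```python
def snir_bound(w):
    # Lower bound on depth + size for any width-w prefix circuit [41].
    return 2 * w - 2

def serial_stats(w):
    # Serial circuit: y_i is computed at level i - 1 on line i,
    # giving one operation node per level.
    return {"depth": w - 1, "size": w - 1}

for w in (2, 5, 64):
    s = serial_stats(w)
    assert s["depth"] + s["size"] == snir_bound(w)  # depth-size optimal
print("serial circuits are depth-size optimal")
```

The serial circuit minimizes size at the cost of maximal depth; the designs surveyed in this paper trade along this bound toward smaller depth.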
For a prefix circuit C of width w, we use z(C) to denote the smallest level that has a duplication node on line 1, and use v(C) to denote the level of the operation node on line w that computes [1, w]. Let waist(C) = v(C) − z(C) + 1, which is called the waist of C. It has been proved that waist(C) + s(C) ≥ 2w − 2 [30]. Therefore, if waist(C) + s(C) = 2w − 2, then C is waist-size optimal (WSO); moreover, if waist(C) = 1, C is said to be WSO-1. Note that we may use the notation C(w) to mean that a circuit named C is of width w.
In this paper, we assume that prefix circuits are of width w and the prefix operation has n inputs, where n ≥ w, unless otherwise stated. All previous prefix circuits in the literature are designed under the assumption that the circuit width equals the number of inputs, that is, n = w. Most of them aim at fast computation under constraints, such as depth-size optimality; however, they are in general slow when n > w. Constructing problem-size-independent prefix circuits that are fast when n > w is as significant as constructing ones that are fast only when n = w.
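One way to picture the n > w setting (an illustrative sketch, not the paper's construction): feed the n inputs through a width-w prefix computation in ⌈n/w⌉ batches, combining each batch with the carried result of the previous one. The helper below models one width-w pass sequentially:

```python
def prefix_blockwise(xs, op, w):
    """Prefix of n = len(xs) inputs using one width-w prefix pass per batch."""
    ys, carry = [], None
    for start in range(0, len(xs), w):
        acc = None
        for x in xs[start:start + w]:          # one pass of the width-w circuit
            acc = x if acc is None else op(acc, x)
            ys.append(acc if carry is None else op(carry, acc))
        carry = ys[-1]                          # running result of all inputs so far
    return ys

print(prefix_blockwise(list(range(1, 10)), lambda a, b: a + b, 4))
# [1, 3, 6, 10, 15, 21, 28, 36, 45]
```

Since each batch still costs the circuit's full depth in time, circuits whose depth stays small remain fast as n grows, which is the point of the problem-size-independent designs studied here.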
This paper is the first to focus on fast problem-size-independent prefix circuits. We present a family of parallel WSO-1 circuits of any width w, prove that they have the minimum depth and are the fastest among all WSO-1 circuits of the same width and fan-out, and show that they can be faster than other prefix circuits of the same width when n > w. In addition, the greater n is, the greater the speed advantage. For example, for sufficiently large n, the new circuit can be faster than a prefix circuit that has unbounded fan-out and the minimum depth among all prefix circuits of the same width. In fact, the new circuit is the fastest prefix circuit with fan-out 2 when n is large enough.
Moreover, the new circuit is also a building block for constructing depth-size optimal prefix circuits that are fast when n > w. Of all the prefix circuits proposed in the literature, many are depth-size optimal [22], [23], [28], [29], [30], [32], [34], [40], [46]. Most of the recent depth-size optimal prefix circuits are constructed with WSO-1 circuits as building blocks [28], [29], [30], [34], [40], [46]; thus, it is useful to have WSO-1 circuits of any width to support the construction of depth-size optimal prefix circuits of any larger width. Although some algorithms have been presented to construct WSO-1 circuits [29], [34], [40], they can obtain WSO-1 circuits of only certain widths; for example, two algorithms each use a WSO-1 circuit of a given width and depth to derive a WSO-1 circuit of a larger width and depth [34], [40]. In addition to its unlimited width range, the new circuit has the minimum depth and is the fastest among all WSO-1 circuits of the same width and fan-out; thus, it is a better building block than other WSO-1 circuits for constructing depth-size optimal prefix circuits with a depth as small as possible.
The remainder of this paper is organized as follows: Section 2 first uses an example to show that when n > w, a WSO-1 circuit can be faster than a depth-size optimal prefix circuit that is faster when n = w, and then gives the number of computation time steps required by any prefix circuit when n ≥ w. Section 3 defines the new family of parallel WSO-1 circuits of any width. This section also shows that the new circuit has the minimum depth and is the fastest among all WSO-1 circuits of the same width and fan-out, and gives some of its other properties. To see how fast the new circuit is when n > w, Section 4 compares its computation time with those of other representative prefix circuits. Section 5 concludes this paper.
A WSO-1 circuit compared with a depth-size optimal circuit
This section begins with a simple example to show that a WSO-1 circuit is faster than a depth-size optimal prefix circuit when n > w. A general formula for the number of time steps a prefix circuit requires is then given; the formula motivates the pursuit of WSO-1 circuits with a small depth in the next section.
Fig. 3 shows a parallel prefix circuit and its input sequence. Its waist and size satisfy the defining equalities; therefore, by definition, it is WSO-1. In contrast,
A family of parallel WSO-1 prefix circuits
Before presenting the new family of WSO-1 circuits with the minimum depth, we first review an approach to defining prefix circuits. A prefix circuit can be defined by giving the set of operation nodes at each level t, for 1 ≤ t ≤ d(C):
For example, the prefix circuit already shown in Fig. 2 can be defined with
Let . Prefix circuit ,
Comparisons and discussions
To see whether the new circuit is faster than other prefix circuits, in this section we first compare it with other prefix circuits in the number of time steps. We then consider the characteristics a prefix circuit must have to be faster, in order to identify such circuits. For ease of presentation, let .
By Theorem 1, any prefix circuit requires time steps to complete the prefix operation. Thus, takes time steps. Let Clearly,
Conclusions
This paper is the first to focus on fast problem-size-independent prefix circuits. We have presented a family of parallel WSO-1 circuits with fan-out 2, for any width w. The new circuit is the fastest among all WSO-1 circuits of the same width and fan-out when n ≥ w. A prefix circuit cannot be faster unless it has a larger fan-out or is not WSO-1. Although the new circuit has a smaller fan-out and a larger depth, it is so fast that when n > w, it almost always takes fewer time steps than LF, which has unbounded fan-out.
Acknowledgment
This research was supported in part by the National Science Council of Taiwan under contract NSC 91-2218-E-011-002.
References (48)
- Parallel biological sequence comparison using prefix computations, J. Parallel Distrib. Comput. (2003)
- Faster optimal parallel prefix sums and list ranking, Inform. Control (1989)
- Parallel prefix computation with few processors, Comput. Math. Appl. (1992)
- The parallel complexity of integer prefix summation, Inform. Process. Lett. (1995)
- Prefix computations on symmetric multiprocessors, J. Parallel Distrib. Comput. (2001)
- A new approach to constructing optimal parallel prefix circuits with small depth, J. Parallel Distrib. Comput. (2004)
- Finding optimal parallel prefix circuits with fan-out 2 in constant time, Inform. Process. Lett. (1999)
- Faster optimal parallel prefix circuits: new algorithmic construction, J. Parallel Distrib. Comput. (2005)
- Efficient parallel prefix algorithms on multiport message-passing systems, Inform. Process. Lett. (1999)
- Optimal and efficient algorithms for summing and prefix summing on parallel machines, J. Parallel Distrib. Comput. (2002)
- Depth-size trade-offs for parallel prefix computation, J. Algorithms
- Data broadcasting and reduction, prefix computation, and sorting on reduced hypercube parallel computers, Parallel Comput.
- Parallel Computation: Models and Methods
- A heuristic for suffix solutions, IEEE Trans. Comput.
- Scans as primitive operations, IEEE Trans. Comput.
- A regular layout for parallel adders, IEEE Trans. Comput.
- Limited width parallel prefix circuits, J. Supercomput.
- Multiple addition and prefix sum on a linear array with a reconfigurable pipelined bus system, J. Supercomput.
- High-speed parallel-prefix VLSI Ling adders, IEEE Trans. Comput.
- Fast parallel-prefix modulo 2^n + 1 adders, IEEE Trans. Comput.
- Parallel tree contraction and prefix computations on a large family of interconnection topologies, Acta Inform.
- Parallel prefix algorithms on the multicomputer, WSEAS Trans. Comput. Res.
Yen-Chun Lin received his BS degree in electrical engineering from National Taiwan University in 1977, MS degree in computer engineering from National Chiao Tung University in 1983, and PhD degree in electrical engineering from National Taiwan University in 1988. Since 1988 he has been on the faculty at National Taiwan University of Science and Technology. Dr. Lin has been a full professor since February 1993, first in the Department of Electronic Engineering and subsequently in the Department of Computer Science and Information Engineering. He served as Program Chair of the 2001 International Conference on Parallel and Distributed Computing, Applications, and Technologies and as a Guest Editor of The Journal of Supercomputing, March 2003. He was a Visiting Scientist at the IBM Almaden Research Center, San Jose, California, from 1993 to 1994. His research interests include parallel computing and multimedia systems.
Li-Ling Hung received her BS degree in computer science from Tunghai University, Taichung, Taiwan in 1993, her MS degree in computer science and engineering from Yuan-Ze University, Chungli, Taiwan in 1995, and her PhD degree in computer science and information engineering from National Taiwan University of Science and Technology, Taipei in 2008. Since February 2009 she has been an assistant professor in the Department of Computer Science and Information Engineering, Aletheia University, Tamsui, Taiwan. Her research interests include parallel computing.