# THE UNIVERSITY OF OKLAHOMA 

GRADUATE COLLEGE

# POWER-SPEED TRADE-OFF <br> IN <br> PARALLEL PREFIX CIRCUITS 

# A Dissertation <br> SUBMITTED TO THE GRADUATE FACULTY in partial fulfilment of the requirements for the degree of <br> Doctor of Philosophy 

## By

SIRIRUT VANICHAYOBON
Norman, Oklahoma 2002

UMI Number: 3038980

Copyright 2002 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

## A Dissertation APPROVED FOR THE SCHOOL OF COMPUTER SCIENCE

## BY

Sudarshan K. Dhall, Committee Chair

S. Lakshmivarahan

John K. Antonio


## Acknowledgements

"No sweat, no gain" were the first words my professor said in the first class I attended at OU. I always remember them whenever I am down, and keep them as my encouragement.

I would like to express my deepest thanks and appreciation to my advisor, Prof. Sudarshan K. Dhall, and my co-advisor, Prof. S. Lakshmivarahan, for their confidence in my work and me. They both have guided me through the dissertation and provided support and encouragement. Without them, I would never have finished my work on this project. They are the wind beneath my wings. I have learned a lot from them both personally and professionally.

I am greatly indebted to Prof. John K. Antonio for his financial support, his kindness, and his guidance during the work as well as for his comments, suggestions, and support on my dissertation. I am very grateful to have had the opportunity to work on his project.

I extend my gratitude, appreciation and sincere thanks to Dr. Le Gruenwald for her valuable guidance, and support through my study, and for serving on my committee. I also would like to thank Dr. Albert Schwarzkopf for his time and support while serving on my committee.

I would like to thank the Thai government for bringing me to the USA, and for supporting me financially and motivating me. I would not dare to be in the States all by myself.

I would like to thank the Oklahoma Climatological Survey for giving me an opportunity to work with them.

All my life, I have met many people and have many good friends. I would like to thank all of them for touching me and giving me their love and support. In particular, I would like to thank Supawadee Poompuang for guiding me the way to live in the States. I would like to thank Sridhar Kulasekharan for his encouragement and helping me with everything I have asked for. I would like to thank Nathan Phillips for being my good friend, sending me encouragement and helping me with English. I would like to thank Vinod Choyi for being my good friend. I also would like to thank my officemates and my colleagues, Hongping Li for sharing invaluable discussion, Wang Jun, Manoj Suresh Kumar, Ming-Shan Su, Leonard Brown, and Brian Veale for giving me many laughs. I would like to thank Anurat Wisitsora, who destroys all my electronic doubts. I would like to thank Kemming Zhang for taking care of my sister when I am busy with my studies. Friends in Thailand and the USA, Sakaowrat Modthongkum, Nutharin Phusitphoykai, Chantarin Titawiriya, Aurawan Wiratanapokin, Maneerat Maneewong, Wimon Wongcharoen, Kitsiri Kaewpipat, Siriporn Laopiriyawong, Rossukon Laopaiboon, Kulwadee L. Pigott, Charnnarong Saikaew and everyone, I would like to thank you all for being good listeners and for your tireless encouragement. I also want to thank my friends from the Internet world who answer my many questions.

I would like to thank the School of Computer Science department's administrative staff and The University of Oklahoma for providing such a positive study environment.

I am grateful to learn Vipassana meditation technique from Teacher S. N. Goenka and to read the book, "Living, Loving and Learning" by Dr. Leo Buscaglia. They teach me to see through life and to live my life with joy.

I would like to thank my younger sister, Phanarat Vanitchyobol, for supporting and being with me whenever I need her.

I truly thank my brothers and sister in Thailand for their love and support, and for helping and taking care of my parents while I am away.

Finally, I would like to thank my very beloved mother and father, to whom this work is dedicated. They both work tirelessly to give their children an opportunity for education that they didn't have. I can't wait to be with them and make them proud.

May everyone be happy.

## TABLE OF CONTENTS

CHAPTER 1. INTRODUCTION
CHAPTER 2. PREFIX COMPUTATION
2.1 Prefix Computation Model
2.2 Prefix Circuits: An Overview
2.2.1 The Serial Prefix Circuit
2.2.2 The Divide-and-Conquer Parallel Prefix Circuit
2.2.3 The Ladner-Fischer Parallel Prefix Circuit
2.2.4 The Brent-Kung Parallel Prefix Circuit
2.2.5 The Snir Parallel Prefix Circuit
2.2.6 The LYD Parallel Prefix Circuit
2.2.7 The Shih-Lin Parallel Prefix Circuit
2.3 Comparison
CHAPTER 3. SOURCES OF POWER CONSUMPTION
3.1 CMOS
3.2 Power Consumption
3.2.1 Sources of Power Consumption
3.2.2 Power Consumption and Fan-out
3.3 The Circuit-level Simulation: PSpice
CHAPTER 4. POWER MODELING OF PREFIX CIRCUITS
4.1 Step 1 - The Constant Output Capacitance
4.1.1 The Serial Prefix Circuit
4.1.2 The Divide-and-Conquer Parallel Prefix Circuit
4.1.3 The Brent-Kung Parallel Prefix Circuit
4.1.4 The Ladner-Fischer Parallel Prefix Circuit
4.1.5 The Snir Parallel Prefix Circuit
4.1.6 The Shih-Lin Parallel Prefix Circuit
4.1.7 The LYD Parallel Prefix Circuit
4.2 Step 2 - Capacitance of Residual Circuit
4.2.1 The Serial Prefix Circuit
4.2.2 The Divide-and-Conquer Parallel Prefix Circuit
4.2.3 The Brent-Kung Parallel Prefix Circuit
4.2.4 The Ladner-Fischer Parallel Prefix Circuit
4.2.5 The Snir Parallel Prefix Circuit and Shih-Lin Parallel Prefix Circuit
4.2.6 The LYD Parallel Prefix Circuit
4.3 Comparison
CHAPTER 5. POWER-SPEED TRADE-OFF IN PREFIX CIRCUITS
5.1 Prefix Circuits at Fixed Voltage
5.2 Effects of Voltage Scaling on Prefix Circuits
5.3 Summary
CHAPTER 6. ADDITION CIRCUITS
6.1 Adder: Theory
6.2 Parallel Addition
6.2.1 Binary Addition as a Prefix Problem
CHAPTER 7. SIMULATION RESULT
7.1 Effect of Block Size on Adder Implementation
7.2 Effect of Prefix Circuit on Adder Implementation
7.3 Summary
CHAPTER 8. CONCLUSION
BIBLIOGRAPHY
APPENDICES
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E

## LIST OF TABLES

2.1 A comparison of the seven prefix circuits illustrated in this chapter
4.1 The constant output capacitance table for $BK(N)$
4.2 Expression of the constant output capacitance
4.3 The residual circuit table for $BK(N)$
4.4 The effective circuit capacitance of prefix circuits
5.1 Estimated power consumption based on Eq. 4.1 for various prefix circuits for $N=64$
6.1 Adder truth table
6.2 Gate count of an $N$-bit adder
7.1 Gate count, delay time, power consumption, and power-delay product of different designs of 8-bit, 16-bit, 32-bit, and 64-bit adders using the divide-and-conquer prefix circuit
E.1 A comparison of the exact capacitance values and the estimated capacitance values of the Brent-Kung prefix circuit

## LIST OF FIGURES

2.1 An illustration of the prefix computation model
2.2 An illustration of an operation node and a repeater node
2.3 An illustration of the prefix circuit's layout
2.4 The prefix circuit with 4 inputs, size $=4$, depth $=2$
2.5 The prefix circuit $A$ with 4 inputs, size $=3$, depth $=3$
2.6 The prefix circuit $B$ with 4 inputs, size $=4$, depth $=2$
2.7 An illustration of the serial prefix circuit, $S(N)$, derived from [LD94]
2.8 The serial circuit with 10 inputs, $S(10)$, size $=9$, depth $=9$
2.9 An illustration of the divide-and-conquer prefix circuit, $DC(N)$, derived from [LD94]
2.10 The divide-and-conquer parallel prefix circuit with 8 inputs, $DC(8)$, size $=12$, depth $=3$
2.11 An illustration of the Ladner-Fischer parallel prefix circuit when $k=0$, $LF_{0}(N)$, derived from [LF80]
2.12 An illustration of the Ladner-Fischer parallel prefix circuit when $k \neq 0$, $LF_{k}(N)$, derived from [LF80]
2.13 Examples of Ladner-Fischer parallel prefix circuits with 8 inputs
2.14 A Brent-Kung parallel prefix circuit, $BK(N)$, based on divide-and-conquer strategy ($o=$ odd, $e=$ even), derived from [LD94]
2.15 An illustration of the Brent-Kung parallel prefix circuit, $BK(8)$, size $=11$, depth $=4$
2.16 The Snir prefix circuit, $SN(N)=CR\left(N_{1}\right) \cdot S\left(N_{2}\right)$
2.17 An illustration of the uncompressed layered prefix circuit, size $=11$, depth $=5$
2.18 An illustration of the compressed layered prefix circuit, size $=1$, depth $=4$
2.19 The Snir prefix circuit, $SN(19)$, size $=28$, depth $=8$
2.20 The $Q(7)$ prefix circuit
2.21 The structure of $LYD(N)$, derived from [LD94]
2.22 The $LYD(19)$ prefix circuit with size 31 and depth 5
2.23 The $SL(N)$ prefix circuit, $SL(N)=CR\left(N_{1}\right) \cdot S\left(N_{2}\right)$
2.24 The $SL(19)$ prefix circuit, size $=30$ and depth $=6$
3.1 $P$-type and $N$-type transistors and their characteristics
3.2 CMOS inverter
3.3 CMOS NAND gate
3.4 CMOS NOR gate
3.5 The leakage current from the gate to the drain of a transistor
3.6 An illustration of a short circuit when both the $P$-type and $N$-type transistors are in the on state at the same time
3.7 An illustration of capacitance charging
3.8 Plots of normalized delay vs. supply voltage for a variety of different logic circuits, derived from [CB95]
3.9 An illustration of the glitching behavior of a chain of eight NAND gates [RCN01]
3.10 An illustration of extra transaction activity, derived from [CB95]
3.11 Effect of fan-out on power consumption of a 2-input XOR gate
4.1 An illustration of the serial prefix circuit, $S(N)$
4.2 An illustration of the serial prefix circuit, $S(N)$, built from $S(N-1)$
4.3 An illustration of the divide-and-conquer prefix circuit, $DC(N)$, built from $DC(N/2)$, derived from [LD94]
4.4 A Brent-Kung parallel prefix circuit, $BK(N)$, divided into three parts ($o=$ odd, $e=$ even), derived from [LD94]
4.5 Part $C$, the distribution of $N/2-1$ nodes
4.6 The $SN(N)$ circuit, $SN(N)=CR\left(N_{1}\right) \cdot S\left(N_{2}\right)$
4.7 The $SL(N)$ circuit, $SL(N)=CR\left(N_{1}\right) \cdot S\left(N_{2}\right)$
4.8 The structure of $LYD(N)$, derived from [LD94]
4.9 The serial prefix circuit for $N$ inputs with fan-out shown in solid lines
4.10 The residual circuit of the serial prefix circuit, shown in solid lines
4.11 An illustration of the residual circuit of $S(N)$, built from $S(N-1)$
4.12 The divide-and-conquer prefix circuit, $DC(N)$, with fan-outs shown in solid lines, derived from [LD94]
4.13 The residual circuit of the divide-and-conquer prefix circuit, $DC(N)$, shown in solid lines
4.14 The residual network of the Brent-Kung parallel prefix circuit, $BK(N)$, divided into 3 parts
4.15 Part $C$, the distribution of $N/2-1$ nodes
4.16 The $SN(N)$ and $SL(N)$ circuits
4.17 The structure of $LYD(N)$, derived from [LD94]
5.1 Power consumption of the 32-bit XOR parallel prefix circuits, obtained through PSpice simulation
5.2 Estimated power consumption of prefix circuits when $N=32$ bits
5.3 Comparison between simulation results and modified estimation results for $N=32$ bits. The modified estimation enhances the original estimation by including a component of power proportional to circuit size
5.4 Power consumption of the XOR parallel prefix circuits at fixed voltage, obtained through PSpice simulation
5.5 Estimated power consumption of prefix circuits with fixed voltage
5.6 Plot of supply voltage vs. normalized delay from [10]
5.7 Estimated delay of parallel prefix circuits when $N=64$
5.8 Estimated power consumption of parallel prefix circuits when $N=64$
5.9 Estimated power-delay product of parallel prefix circuits when $N=64$
5.10 Delay of the 64-bit XOR parallel prefix circuits, obtained through PSpice simulation
5.11 Power consumption of the 64-bit XOR parallel prefix circuits, obtained through PSpice simulation
5.12 Power-delay product of the 64-bit XOR parallel prefix circuits, obtained through PSpice simulation
6.1 The block diagram of the full-adder circuit
6.2 The K-maps for the full-adder circuit
6.3 A chain of $N$ full-adders
6.4 A full-adder circuit
6.5 Three stages of the implementation of the fast adder
6.6 The partition of all $p_{i}$ and $g_{i}$ into $r$ groups with $q$ members each
6.7 A parallel scheme for computing the carry, derived from [LD90]
7.1 The plots of delay, power consumption, and power-delay product of different designs of 8-, 16-, 32-, and 64-bit adders using the divide-and-conquer prefix circuit
7.2 An illustration of the effect of the block size on other factors
7.3 The plot of the power-delay product of the divide-and-conquer, the LYD, and the Shih-Lin prefix circuits
7.4 The plot of power consumption of four prefix circuits used for carry calculation in a 64-bit adder implemented with a block size of sixteen
A.1 Inverter
A.2 Switch model of CMOS transistor
A.3 The equivalent action of an inverting gate when a step input charges and discharges the capacitor
A.4 Charging and discharging exponential curves for an $RC$ network
A.5 The plot of the delay vs. $V_{DD}$
A.6 CMOS inverter's input and output waveforms
A.7 CMOS inverter's power and energy waveforms
B.1 A CMOS inverter
B.2 A CMOS AND gate
B.3 A CMOS OR gate
B.4 A CMOS XOR gate
C.1 The worst-case input of the XOR gate (i.e., the first input is equal to 0 and the other inputs are $0 \rightarrow 1$), causing the output to ripple the most
C.2 The 8-bit XOR gate implemented with the serial prefix circuit
C.3 The outputs of 8-bit XOR gates implemented with the serial prefix circuit, showing the longest ripple (the maximum number of switching)
C.4 Delay of 8-bit XOR gates implemented with the serial prefix circuit from PSpice simulation
C.5 Energy of 8-bit XOR gates implemented with the serial prefix circuit from PSpice simulation
C.6 The 8-bit XOR gate implemented with the divide-and-conquer prefix circuit
C.7 The outputs of 8-bit XOR gates implemented with the divide-and-conquer prefix circuit, showing the longest ripple (the maximum number of switching)
C.8 Delay of 8-bit XOR gates implemented with the divide-and-conquer prefix circuit from PSpice simulation
C.9 Energy of 8-bit XOR gates implemented with the divide-and-conquer prefix circuit from PSpice simulation
D.1 Preprocessing: carry propagate bits and carry generate bits
D.2 Postprocessing: $s_{i}=a_{i} \oplus b_{i} \oplus c_{i-1}$
D.3 The implementation of $E_{i}, i=1$
D.4 The implementation of carry bits
D.5 The implementation of $E_{i}, 1 \leq i \leq 2$
D.6 The implementation of prefix circuit with 4 inputs
D.7 The implementation of carry bits
D.8 The implementation of $E_{i}, 1 \leq i \leq 4$
D.9 The implementation of prefix circuit
D.10 The implementation of carry bits
D.11 The implementation of $E_{i}, 1 \leq i \leq 8$
D.12 The implementation of prefix circuit with 4 inputs
D.13 The implementation of carry bits


#### Abstract

Optimizing area and speed in parallel prefix circuits has long been considered important. The issue of power consumption in these circuits, however, has not been addressed. This dissertation presents a comparative study of different parallel prefix circuits from the point of view of the power-speed trade-off. The power consumption and the power-delay product of seven parallel prefix circuits are compared. A linear output capacitance assumption, combined with PSpice simulations, is used to investigate the power consumption of the circuits. The degrees of freedom studied include the choice of parallel prefix algorithm and voltage scaling. The results show that the linear output capacitance assumption yields estimates that are consistent with those obtained using PSpice simulations. Because of the size-depth trade-off characteristic of prefix circuits, the results also show that parallelism of prefix circuits at a certain level, coupled with the use of a low supply voltage, can be used to reduce the power-delay product to attain a desired throughput beyond the minimum possible. The study enables us to understand the power consumption behavior of prefix circuits and to pick a suitable prefix circuit whose power consumption is acceptable at a given throughput. Circuit designers can then choose the best prefix circuit for a particular application.


## CHAPTER 1

## INTRODUCTION

The three most widely accepted metrics for measuring the quality of an integrated circuit are its area, speed, and power consumption. Optimizing area and speed has long been considered important, but minimizing power consumption has gained prominence only recently [Bel01, BM00, CB95, GNHF01, Hub00, Mil00, RP00, RP96]. One important reason for minimizing the power consumption of a circuit is the proliferation of portable electronic systems, such as laptops, mobile phones, and wireless devices, where maximizing battery life is important. Since it is desirable to minimize the size and weight of batteries in such devices while increasing the time between battery recharges, finding methods of reducing power consumption has assumed considerable importance.

In this dissertation, we study the power-speed trade-off for prefix circuits. Prefix circuits play an important role in many applications. They appear in a number of areas such as the carry-look-ahead adder, ranking, packing, and radix sort [LD94]. Many new approaches for prefix circuits with the goal of optimizing depth (i.e., speed) and size (i.e., area) have been proposed [BK82, LF80, LD94, LS99, Sni86]. As a result, performance in terms of speed and area has improved. The issue of power consumption in these circuits, however, has not been addressed. Therefore, our goal is to make a comparative study of different prefix circuits from the point of view of the power-speed trade-off in order to facilitate design choices under given specifications and resource limitations. In this study, we use the power-delay product as a quality measure for the prefix circuits. The power-delay product is the product of the circuit's power consumption and propagation delay, which represents the energy consumed by the circuit per operation.

Two issues are addressed in this dissertation. The first deals with our proposed power modeling of prefix circuits. The model, combined with PSpice simulations, is used to investigate the power consumption of the circuits considered. The simulations were carried out at both fixed and scaled supply voltages. It is found that, amongst the parallel prefix circuits, the circuit having the shortest depth (the divide-and-conquer prefix circuit) consumes the most power. Also according to PSpice simulations, the power-delay product of the LYD prefix circuit seems to be the best (lowest) amongst the circuits considered, while the power-delay product of the divide-and-conquer prefix circuit is the highest. The second issue deals with an investigation of binary adders using selected prefix algorithms. A parameter in the implementation of these circuits is the choice of block size for computing carries in parallel. The 8-, 16-, 32-, and 64-bit binary adders were implemented and simulated in PSpice. The performance was measured and compared. With regard to the power-delay product, we have found that the optimum block size falls somewhere around the middle of the range of possible block sizes.

The rest of this dissertation is divided into seven chapters. Chapter II presents a literature survey on the various prefix circuits and discusses the current state of the art in this field. Chapter III reviews the sources of power consumption in CMOS circuits and presents strategies to estimate power consumption of the circuits. In addition, Chapter III briefly introduces the circuit simulation tool called PSpice. Chapter IV focuses on modeling the power consumption of the prefix circuits. The analysis of the power-speed trade-off of various prefix circuits is described in Chapter V. Chapter VI introduces the basic addition principle and structure as well as the formulation of carry propagation as a prefix problem. The simulation studies of adders are given in Chapter VII. Finally, the main results of the dissertation are summarized in Chapter VIII.

## CHAPTER 2

## PREFIX COMPUTATION

As parallel-processing computers have proliferated, the notion of prefix computation has gained considerable attention in the literature, and it has played an important role in parallel algorithms. In 1963, Ofman, a Russian mathematician, pioneered the use of prefix computation for fast binary adder circuits. Prefix computation appears in a number of areas such as the carry-look-ahead adder, ranking, packing, radix sort, finite state transducers, and the solution of linear recurrences [LD94]. In this chapter, the prefix computation model is introduced. Then a survey of seven well-known prefix circuits is presented.

### 2.1 Prefix Computation Model

A prefix computation [LD94], or simply the prefix circuit, is the process of taking $N$ input values $x_{1}, x_{2}, \ldots, x_{N-1}, x_{N}$ and producing $N$ output values $y_{1}, y_{2}, \ldots, y_{N-1}, y_{N}$ such that

$$
\begin{aligned}
& y_{1}=x_{1}, \\
& y_{i}=y_{i-1} \bullet x_{i}=x_{1} \bullet x_{2} \bullet \ldots \bullet x_{i-1} \bullet x_{i} \quad \text { for } 2 \leq i \leq N,
\end{aligned}
$$

where $\bullet$ is an associative binary operation, as shown in Figure 2.1. In other words, each $y_{i}$ is obtained by "operating" together the first $i$ elements of the sequence of $x_{i}$, hence the term "prefix." As an example, suppose that $x_{i}=1$ for $1 \leq i \leq N$, and let $\bullet$ be ordinary addition. Then, $y_{1}=x_{1}=1, y_{2}=y_{1}+x_{2}=2$, and so on. Therefore, the prefix circuit produces $y_{i}=i$ for $1 \leq i \leq N$.


Figure 2.1: An illustration of the prefix computation model.

The inputs of the prefix circuit, the $x_{i}$'s, can be of any type, depending on the application. If each input is an integer, a real number, or a complex number and the operation is one of the two arithmetic operators (i.e., $+$ or $\times$), we call the circuit an arithmetic circuit. If each input is a Boolean element (for example, from $\{0,1\}$ or $\{\text{true}, \text{false}\}$) and the operation is a Boolean operator, we call it a Boolean circuit.
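The definition above can be sketched in a few lines of Python. This is only an illustration of the computation model, not of any particular circuit; the function name `prefix` is an assumption made for this example, and the operator is assumed to be associative.

```python
from operator import add, xor

def prefix(xs, op):
    """Return [y_1, ..., y_N] with y_1 = x_1 and y_i = y_{i-1} op x_i.

    `op` is assumed to be an associative binary operation.
    """
    ys = []
    for x in xs:
        ys.append(x if not ys else op(ys[-1], x))
    return ys

# Arithmetic circuit: all-ones input under ordinary addition gives y_i = i.
print(prefix([1] * 5, add))       # [1, 2, 3, 4, 5]

# Boolean circuit: XOR gives the running parity of the input bits.
print(prefix([1, 0, 1, 1], xor))  # [1, 1, 0, 1]
```

The loop evaluates the outputs one after another, which corresponds to the serial order of evaluation; the parallel circuits surveyed later compute the same outputs with a different arrangement of operation nodes.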

A prefix circuit with $N$ inputs can also be viewed as a directed acyclic graph $G=(V, E)$ with $N$ input vertices, $N$ output vertices, and at least $N-1$ internal vertices. These vertices will be referred to as input nodes, output nodes, and internal nodes, respectively. An internal node is neither an input nor an output node. There are two types of internal nodes: operation nodes and repeater nodes. An $N$-input prefix circuit has at least $N-1$ operation nodes and has zero or more repeater nodes. An illustration of an operation node and a repeater node is shown in Figure 2.2. An operation node, shown as a black dot, $\bullet$, takes two inputs and produces one output. A repeater node, shown as a small square, $\square$, takes one input and produces as output one or more copies of its input.


Figure 2.2: An illustration of an operation node and a repeater node.

In the prefix circuit's layout, vertical lines identify the inputs and outputs. The inputs are the lines leading from the top while the outputs are the lines leading to the bottom. As an example, Figure 2.3 illustrates the layout and the components of a prefix circuit. The numbers along the left-hand side of the layout give the depth (level) of the operation nodes on the right. Note that the first output node in the prefix circuit is from the first input node and the other outputs are from the internal nodes.


Figure 2.3: An illustration of the prefix circuit's layout.

The metrics for measuring the performance of a prefix circuit are the circuit size, depth, fan-in, and fan-out. These are explained in detail in the following.

#### Circuit Size

The size of a prefix circuit, size $(N)$, is the total number of operation nodes in the circuit. The size represents the amount of space required for the circuit. The circuit with smaller size occupies less chip area in VLSI implementation [WE93]. One of the design aims may be minimizing the size of the circuit.

#### Circuit Depth

The depth of a prefix circuit, $\operatorname{depth}(N)$, is the length of the longest path measured in terms of the number of operations along the path in the circuit from its input nodes to its output nodes. If a prefix circuit produces its outputs at depths $d_{1}, d_{2}, \ldots, d_{k}$, the depth of a circuit is the maximum of $\left\{d_{1}, d_{2}, \ldots, d_{k}\right\}$. In other words, the depth of a prefix circuit is the maximum depth of its outputs. The circuit depth is related to its computation time. In VLSI implementation, a circuit with smaller depth is generally faster than one with greater depth when the fan-out of most nodes in the two circuits is similar [WE93]. A prefix circuit is depth-optimal if the circuit has the smallest depth among all possible circuits.

#### Circuit Fan-in and Fan-out

The fan-in of a prefix circuit is the maximum fan-in of all nodes in the circuit. The fan-in of a node is the number of inputs the node has in the path being exercised. Thus, the fan-in of a node is defined as the node's indegree. The fan-in of a node except the input nodes can be either bounded or unbounded. A node has unbounded fan-in if the fan-in is not fixed. In this study, unless otherwise stated, we are interested in the prefix circuit with the fan-in of two, which represents a binary operation.

The fan-out of a prefix circuit is the maximum fan-out of all nodes in the circuit. The fan-out of a node is the number of outputs the node produces to drive the other nodes. The fan-out of a node is defined as the node's outdegree. A node has unbounded fan-out if the fan-out is not fixed. In the circuit shown in Figure 2.4, the nodes have fan-out of three, and one, respectively. In the following, unless otherwise stated, we assume that the fan-out of the prefix circuit is a function of $N$.
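To make the three metrics concrete, the following sketch computes size, depth, and fan-out for a circuit given as a list of operation nodes. The representation (node names such as `x1` and `a`, and the function `circuit_metrics`) is hypothetical and chosen only for this illustration. It assumes that repeater nodes are omitted and that an edge to an output node counts toward the fan-out of the node driving it.

```python
from collections import Counter

def circuit_metrics(n_inputs, ops, outputs):
    """Return (size, depth, fan_out) of a prefix circuit.

    ops     : list of (node, left, right) operation nodes in topological order
    outputs : for each output y_i, the name of the node that drives it
    Inputs are named x1..xN and sit at level 0.
    """
    level = {f"x{i}": 0 for i in range(1, n_inputs + 1)}
    fan_out = Counter()
    for node, left, right in ops:
        level[node] = 1 + max(level[left], level[right])
        fan_out[left] += 1
        fan_out[right] += 1
    for src in outputs:          # edges into the output nodes
        fan_out[src] += 1
    size = len(ops)              # number of operation nodes
    depth = max(level[src] for src in outputs)
    return size, depth, max(fan_out.values())

# A serial arrangement on 4 inputs: each output feeds the next operation.
serial4 = [("a", "x1", "x2"), ("b", "a", "x3"), ("c", "b", "x4")]
print(circuit_metrics(4, serial4, ["x1", "a", "b", "c"]))      # (3, 3, 2)

# A two-level arrangement on 4 inputs: node a is reused by two operations.
two_level = [("a", "x1", "x2"), ("b", "x3", "x4"),
             ("c", "a", "x3"), ("d", "a", "b")]
print(circuit_metrics(4, two_level, ["x1", "a", "c", "d"]))    # (4, 2, 3)
```

In the second circuit the node `a` drives one output and two operation nodes, which is where the fan-out of three comes from; reducing depth here is paid for with both an extra operation node and a higher fan-out.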


Figure 2.4: The prefix circuit with 4 inputs, size=4, depth=2.

#### Size-depth Trade-off

Ladner and Fischer [LF80] were the first to introduce an important property of the prefix circuit, the size-depth trade-off. They showed that a decrease in the circuit depth can be achieved by an increase in the circuit size and vice versa. Snir [Sni86] further strengthened this notion by proving the following result:

Theorem [Sni86] The sum of the size and depth of a prefix circuit, $G(N)$, is lower bounded by $2 N-2$, i.e., $\operatorname{size}(G(N))+\operatorname{depth}(G(N)) \geq 2 N-2$.

This bound is tight in the sense that there are prefix circuits which actually achieve this bound. Figure 2.5 and Figure 2.6 show the size-depth trade-off of prefix circuits. The circuit $A$ and circuit $B$ produce the same outputs, $y_{i}$ where $1 \leq i \leq 4$.


Figure 2.5: The prefix circuit $A$ with 4 inputs, size $=3$, depth $=3$.


Figure 2.6: The prefix circuit $B$ with 4 inputs, size $=4$, depth $=2$.

The circuit $A$ in Figure 2.5 has size 3 and depth 3 while the circuit $B$ in Figure 2.6 has larger size but smaller depth (i.e., size is 4 and depth is 2 ). Hence the circuit $B$ is faster but has to do more work than the circuit $A$. Both circuits are (size, depth)-optimal.

The deficiency of a prefix circuit [Sni86] is defined as

$$
\text { deficiency }=\text { size }+ \text { depth }-(2 N-2) .
$$

Since $2 N-2$ is the lower bound on the sum of size and depth, clearly, if deficiency $=0$, then the prefix circuit is said to be (size, depth)-optimal.
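As a quick check of the definition, the sketch below evaluates the deficiency for circuits $A$ and $B$ and for the example circuits whose size and depth are stated in the figure captions of this chapter. The function name `deficiency` and the dictionary layout are assumptions made for this illustration; the numbers are taken from the captions, not recomputed from the circuits.

```python
def deficiency(n, size, depth):
    """deficiency = size + depth - (2N - 2); zero means (size, depth)-optimal."""
    return size + depth - (2 * n - 2)

# (N, size, depth) as stated in the captions of Figures 2.5-2.24.
examples = {
    "A":       (4, 3, 3),
    "B":       (4, 4, 2),
    "S(10)":   (10, 9, 9),
    "DC(8)":   (8, 12, 3),
    "BK(8)":   (8, 11, 4),
    "SN(19)":  (19, 28, 8),
    "LYD(19)": (19, 31, 5),
    "SL(19)":  (19, 30, 6),
}

for name, (n, size, depth) in examples.items():
    print(f"{name:8s} deficiency = {deficiency(n, size, depth)}")
```

With these values, only $DC(8)$ and $BK(8)$ have a non-zero deficiency (one each); the other examples meet the $2N-2$ bound exactly.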

In this study, all inputs are assumed to be at level zero. Unless otherwise stated, we assume the number of inputs is $N$, which need not be a power of two. The input nodes will be denoted as $x_{1}, x_{2}, \ldots, x_{N-1}, x_{N}$. For integers $i$ and $j$ in the range $1 \leq i \leq j \leq N$, we define

$$
i: j=x_{i} \bullet x_{i+1} \bullet \ldots \bullet x_{j}
$$

Thus, for $i=1,2, \ldots, N$, we have $i: i=x_{i}$, since the composition of just one input $x_{i}$ is itself. For $i, j$, and $k$ satisfying $1 \leq i<j \leq k \leq N$, we also have the identity

$$
i: k=(i: j-1) \bullet(j: k)
$$

since the $\bullet$ operator is associative [LD94]. For notational convenience, the input values $x_{i}$ are labeled with the integer $i$, and the output values $y_{i}$ are labeled with $1: i$, where $1: i=x_{1} \bullet x_{2} \bullet \ldots \bullet x_{i-1} \bullet x_{i}$ for $1 \leq i \leq N$. All input nodes have zero fan-in and a fan-out of at most two. The output nodes have a fan-in of at most one and zero fan-out. Among the internal nodes, the operation nodes have a fan-in of two while the repeater nodes have a fan-in of one; both have unbounded fan-out. We will use this structure to represent a prefix circuit for the rest of the study. This type of prefix circuit is termed a conservative circuit [Sni86]. If a prefix circuit produces its last output (i.e., $1: N$) at level $\lceil\lg N\rceil$, we call it a restricted prefix circuit. We will see in the next section that the restricted prefix circuit plays a major role in many parallel prefix circuits.
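The interval notation and the splitting identity can be checked with a small sketch. String concatenation is used here as the operator because it is associative but not commutative, so the order of the operands matters; the helper name `interval` is an assumption made for this example. The check takes $i<j$ so that the left part $i: j-1$ is non-empty.

```python
from functools import reduce

def interval(xs, i, j, op):
    """Compute i:j = x_i op x_{i+1} op ... op x_j for 1 <= i <= j <= N."""
    return reduce(op, xs[i - 1:j])

def concat(a, b):
    return a + b

xs = list("abcdef")
n = len(xs)

# i:i is just x_i.
assert all(interval(xs, i, i, concat) == xs[i - 1] for i in range(1, n + 1))

# i:k = (i:j-1) op (j:k) for every split point with i < j <= k.
for i in range(1, n + 1):
    for j in range(i + 1, n + 1):
        for k in range(j, n + 1):
            left = interval(xs, i, j - 1, concat)
            right = interval(xs, j, k, concat)
            assert interval(xs, i, k, concat) == concat(left, right)

print(interval(xs, 2, 5, concat))  # bcde
```

Every parallel prefix circuit relies on this identity: an output $1: i$ may be assembled from any two adjacent intervals, which is what allows independent intervals to be computed at the same level.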

### 2.2 Prefix Circuits: An Overview

In this section, we review the designs of prefix circuits commonly found in the literature. We first introduce the serial prefix circuit, whose size and depth are both $O(N)$. Then the parallel prefix circuits based on the divide-and-conquer approach are presented. These circuits are known as the divide-and-conquer prefix circuit, the Ladner-Fischer prefix circuit [LF80], and the Brent-Kung prefix circuit [BK82]. By way of comparison with the serial prefix circuit, the size of the divide-and-conquer prefix circuit increases to $\mathrm{O}(N \lg N)$ whereas the size of the Brent-Kung prefix circuit is $\mathrm{O}(N)$. However, the computation time of all three circuits is improved to $\mathrm{O}(\lg N)$. The Ladner-Fischer prefix circuit is the first circuit that shows the trade-off between circuit size and circuit depth. Finally, the prefix circuits that are (size, depth)-optimal and are based on the combination of two or more prefix circuits are presented. Each circuit has its own methodology for dividing the inputs into two or more parts, with the aim of reducing the circuit depth. For example, the Snir prefix circuit [Sni86] and the Shih-Lin prefix circuit [LS99] are composed of two parts. The first part is a non-optimal prefix circuit called the compressed layered prefix circuit and the second part is the serial prefix circuit. The prefix circuit of Lakshmivarahan, Yang, and Dhall (LYD prefix circuit) is composed of four parts and has the shortest circuit depth among all (size, depth)-optimal prefix circuits. Note that all of the circuits, except the serial prefix circuit, have unbounded fan-out and operate in parallel: more than one operation is performed at a time. Instead of producing the outputs one at a time as in the serial prefix circuit, they produce outputs $y_{1}, y_{2}, \ldots, y_{N-1}, y_{N}$ more quickly. In the following, unless specified otherwise, all logarithms are to the base 2.

### 2.2.1 The Serial Prefix Circuit

The serial prefix circuit, $S(N)$, produces the outputs one at a time. It is straightforward to construct. The layout of the circuit for $N$ inputs is illustrated in Figure 2.7. The $S(N)$ circuit is formed by cascading $N-1$ operation nodes, feeding the output of the previous level directly into the input of the current level. Each operation node has a fan-in of exactly two and a fan-out of two, except the last operation node, which has a fan-out of one.


Figure 2.7: An illustration of the serial prefix circuit, $S(N)$, derived from [LD94].

The last output is produced at depth $N-1$. There is one operation node at each level, so the size of the circuit is $N-1$, which is the smallest size among all prefix circuits. Moreover, the serial prefix circuit is a (size, depth)-optimal prefix circuit since the sum of its size and depth is $2N-2$. However, the circuit is neither depth-optimal nor a restricted circuit. Because of the size-depth trade-off, the serial prefix circuit has the largest depth among all the circuits (i.e., it is the slowest circuit), and every faster circuit must have a size larger than $N-1$. Figure 2.8 shows the serial prefix circuit for $N=10$. The output from the $i^{\text {th }}$ level is the input of the $(i+1)^{\text {th }}$ level. For example, the output of the node labeled 1:2 is the input of the node labeled 1:3. The circuit size and depth are both 9. Even though the serial prefix circuit uses only $N-1$ operations, the time taken is also $N-1$. Hence we have to look at other alternatives for better performance.


Figure 2.8: The serial circuit with 10 inputs, $S(10)$, size $=9$, depth $=9$.
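The cascade described above can be sketched in a few lines of Python. This is a minimal illustration rather than a circuit model: the function name `serial_prefix` is hypothetical, and integer addition merely stands in for an arbitrary associative operator $\bullet$.

```python
import operator
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def serial_prefix(xs: Sequence[T], op: Callable[[T, T], T]) -> Tuple[List[T], int, int]:
    """Compute all prefixes 1:i serially; return (outputs, size, depth)."""
    if not xs:
        raise ValueError("at least one input is required")
    outputs = [xs[0]]                 # y_1 = x_1 needs no operation node
    for x in xs[1:]:
        # Each level applies one operation node to the previous output.
        outputs.append(op(outputs[-1], x))
    size = len(xs) - 1                # one operation node per level
    depth = size                      # the nodes form a single chain
    return outputs, size, depth


ys, size, depth = serial_prefix(list(range(1, 11)), operator.add)
print(ys, size, depth)
```

For the ten inputs $1, 2, \ldots, 10$ the sketch reports size 9 and depth 9, the values quoted for $S(10)$.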

### 2.2.2 The Divide-and-Conquer Parallel Prefix Circuit

The divide-and-conquer prefix circuit reduces the depth to $\lg N$, as opposed to $N-1$ needed by the serial prefix circuit, by using parallel operations and the well-known divide-and-conquer strategy. The construction of the divide-and-conquer parallel prefix circuit with $N$ inputs, denoted as $D C(N)$, is illustrated in Figure 2.9. The $D C(N)$ circuit can be built from two $D C(N / 2)$ circuits, recursively. Thus the size $(D C(N))$ is the size of two $D C(N / 2)$ circuits plus additional connection nodes, which are $N / 2$ in number.

Similarly, the depth of $D C(N)$ is one more than the depth of $D C(N / 2)$. Therefore, the following recurrences for the size and depth of this circuit are immediate:

$$
\begin{array}{lll}
\operatorname{size}(D C(N))=2 \operatorname{size}\left(D C\left(\frac{N}{2}\right)\right)+\frac{N}{2}, & \text { with } & \operatorname{size}(D C(2))=1 \\
\operatorname{depth}(D C(N))=\operatorname{depth}\left(D C\left(\frac{N}{2}\right)\right)+1, & \text { with } & \operatorname{depth}(D C(2))=1
\end{array}
$$



Figure 2.9: An illustration of the divide-and-conquer prefix circuit, $D C(N)$, derived from [LD94].

Solving, for $\operatorname{size}(D C(N))$, we get

$$
\operatorname{size}(D C(N))=\frac{N}{2} \lg N=\mathrm{O}(N \lg N)
$$

Solving, for depth $(D C(N))$, we get

$$
\operatorname{depth}(D C(N))=\lg N=\mathrm{O}(\lg N)
$$

Thus, the $D C(N)$ circuit takes only $\lg N$ time and is therefore depth-optimal. However, its size, $O(N \lg N)$, is much larger than that of the serial prefix circuit. Moreover, the circuit is not (size, depth)-optimal because the sum of its size and depth exceeds the lower bound $2 N-2$ for $N>4$. Figure 2.10 shows the circuit $D C(N)$ for $N=8$. The circuit size and depth are 12 and 3, respectively. In this case we have reduced the depth to $\lg N$, but the number of operations increases to $(N / 2) \lg N$.


Figure 2.10: The divide-and-conquer parallel prefix circuit with 8 inputs, $D C(8)$, size $=12$, depth $=3$.
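The two recurrences and their closed forms can be cross-checked numerically. The sketch below is an illustration under the same assumption as the derivation, namely that $N$ is a power of two; the helper names are hypothetical.

```python
def _lg(n: int) -> int:
    """Exact base-2 logarithm of a power of two n >= 2."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two with n >= 2")
    return n.bit_length() - 1


def dc_size(n: int) -> int:
    """size(DC(N)) = 2 size(DC(N/2)) + N/2, with size(DC(2)) = 1."""
    _lg(n)  # validates n
    return 1 if n == 2 else 2 * dc_size(n // 2) + n // 2


def dc_depth(n: int) -> int:
    """depth(DC(N)) = depth(DC(N/2)) + 1, with depth(DC(2)) = 1."""
    _lg(n)  # validates n
    return 1 if n == 2 else dc_depth(n // 2) + 1


for n in (2, 4, 8, 16, 32, 64):
    # The recurrences agree with the closed forms (N/2) lg N and lg N.
    assert dc_size(n) == (n // 2) * _lg(n)
    assert dc_depth(n) == _lg(n)

print(dc_size(8), dc_depth(8))
```

For $N=8$ this reproduces the size 12 and depth 3 of Figure 2.10, and $12+3=15$ exceeds $2N-2=14$ by one.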

### 2.2.3 The Ladner-Fischer Parallel Prefix Circuit

From the above description, we see that the serial circuit has larger depth but smaller size, whereas the divide-and-conquer parallel prefix circuit has smaller depth but larger size. Ladner and Fischer [LF80] were the first to discuss the size-depth trade-off in prefix circuits: a reduction of the circuit depth is achieved at the cost of an increase in the number of operations. They introduced a family of circuits, $L F_{k}(N)$, where $k$ denotes the depth above $\lceil\lg N\rceil$, with $0 \leq k \leq\lceil\lg N\rceil$. Based on the divide-and-conquer strategy, $L F_{0}(N)$ and $L F_{k}(N)$ (when $k \neq 0$) are defined recursively as shown in Figures 2.11 and 2.12, respectively.

The last output, $1: N$, of the $L F_{k}(N)$ circuit for all $N$ and $k$ is available in $\lceil\lg N\rceil$ units of time, so the circuit is a restricted parallel prefix circuit. The circuit size depends on the value of $k$ such that

$$
\begin{aligned}
& \operatorname{size}_{0}(N)=4 N-F(5+\lg N)+1 \\
& \operatorname{size}_{k}(N)=2 N\left(1+\frac{1}{2^{k}}\right)-F(5+\lg N-k)-k+1
\end{aligned}
$$

where $F(m)$ denotes the $m^{\text {th }}$ Fibonacci number, and the second formula applies for $k \geq 1$. Note that when $N$ is not a power of 2, this solution is not exact. By construction, the circuit depth satisfies $\lceil\lg N\rceil \leq \operatorname{depth}\left(L F_{k}(N)\right) \leq 2\lceil\lg N\rceil-2$.
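The size formulas are easy to evaluate for small cases. The following sketch assumes that $N$ is a power of two and that the Fibonacci numbers are indexed with $F(1)=F(2)=1$; that indexing is an assumption, chosen because it reproduces the sizes 12 and 11 quoted for $L F_{0}(8)$ and $L F_{1}(8)$ in Figure 2.13. The function names are hypothetical.

```python
def fib(m: int) -> int:
    """m-th Fibonacci number, assuming the indexing F(1) = F(2) = 1."""
    if m < 0:
        raise ValueError("m must be non-negative")
    a, b = 0, 1
    for _ in range(m):
        a, b = b, a + b
    return a


def lf_size(n: int, k: int) -> int:
    """Size of LF_k(N) for N a power of two and 0 <= k <= lg N."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two with n >= 2")
    lg_n = n.bit_length() - 1
    if not 0 <= k <= lg_n:
        raise ValueError("k must satisfy 0 <= k <= lg N")
    # 2N(1 + 1/2^k) = 2N + 2N/2^k, an exact integer because k <= lg N.
    # For k = 0 this reduces to 4N - F(5 + lg N) + 1, the formula for size_0.
    return 2 * n + (2 * n >> k) - fib(5 + lg_n - k) - k + 1


print(lf_size(8, 0), lf_size(8, 1))
```

Increasing $k$ from 0 to 1 lowers the size by one node while the depth grows from 3 to 4, which is the trade-off the family is designed to expose.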


Figure 2.11: An illustration of the Ladner-Fischer parallel prefix circuit when $k=0$, $L F_{0}(N)$, derived from [LF80].


Figure 2.12: An illustration of the Ladner-Fischer parallel prefix circuit when $k \neq 0$, $L F_{k}(N)$, derived from [LF80].

The Ladner-Fischer circuit is depth-optimal when $k=0$. The circuit is not (size, depth)-optimal because $\operatorname{size}\left(L F_{k}(N)\right)+\operatorname{depth}\left(L F_{k}(N)\right)>2 N-2$, for $k \geq 0$ and $N>4$. Overall, the $L F_{k}(N)$ circuit has $O(N)$ size and $O(\lg N)$ depth. Figure 2.13 illustrates the $L F_{k}(N)$ circuits with $N=8$ for $0 \leq k \leq 1$. The circuit size and depth vary with the value of $k$. As the value of $k$ increases, the circuit size decreases but the circuit depth increases. This algorithm allows us to trade size for depth and vice versa.

(a) $L F_{0}(8)$, size $=12$ and depth $=3$.

(b) $L F_{1}(8)$, size $=11$ and depth $=4$.

Figure 2.13: Examples of Ladner-Fischer parallel prefix circuits with 8 inputs.

### 2.2.4 The Brent-Kung Parallel Prefix Circuit

The Brent-Kung prefix circuit [BK82], $B K(N)$, is another circuit based on the divide-and-conquer strategy. This circuit has smaller size than the $L F_{k}$ circuits with $k<\lceil\lg N\rceil-2$, but its depth is greater than theirs. The circuit can be described as follows. Let $N=2^{n}$. The $B K(N)$ circuit is divided into three levels: the first level with $N / 2$ operation nodes, the second level with $B K(N / 2)$, and the last level with $(N / 2-1)$ operation nodes. According to Figure 2.14, we can build $B K(N)$ from $B K(N / 2)$ recursively. The following recurrences for the size and depth of this circuit are immediate:

$$
\begin{array}{lll}
\operatorname{size}(B K(N))=\operatorname{size}\left(B K\left(\frac{N}{2}\right)\right)+N-1, & \text { with } & \operatorname{size}(B K(4))=4 \\
\operatorname{depth}(B K(N))=\operatorname{depth}\left(B K\left(\frac{N}{2}\right)\right)+2, & \text { with } & \operatorname{depth}(B K(4))=2
\end{array}
$$

When $N=2^{n}$, we can solve these recurrences easily, as follows.

$$
\begin{aligned}
\operatorname{size}(B K(N)) & =\operatorname{size}\left(B K\left(\frac{N}{2}\right)\right)+N-1 \\
& =2 N-\lg N-2=\mathrm{O}(N)
\end{aligned}
$$

Figure 2.14: A Brent-Kung parallel prefix circuit, $B K(N)$, based on the divide-and-conquer strategy ($o=$ odd, $e=$ even), derived from [LD94].

Similarly,

$$
\begin{aligned}
\operatorname{depth}(B K(N)) & =\operatorname{depth}\left(B K\left(\frac{N}{2}\right)\right)+2 \\
& =2 \lg N-2=\mathrm{O}(\lg N)
\end{aligned}
$$
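Both closed forms follow by unrolling the recurrences down to the base case $N=4$. A sketch of the omitted steps, writing $n=\lg N$ (so that $N=2^{n}$ with $n \geq 2$), is:

$$
\begin{aligned}
\operatorname{size}\left(B K\left(2^{n}\right)\right) & =\operatorname{size}(B K(4))+\sum_{i=3}^{n}\left(2^{i}-1\right) \\
& =4+\left(2^{n+1}-8\right)-(n-2) \\
& =2^{n+1}-n-2=2 N-\lg N-2, \\
\operatorname{depth}\left(B K\left(2^{n}\right)\right) & =\operatorname{depth}(B K(4))+2(n-2) \\
& =2+2 n-4=2 \lg N-2 .
\end{aligned}
$$

For $N=8$ these expressions give size $16-3-2=11$ and depth $6-2=4$.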

The $B K(N)$ circuit takes $\mathrm{O}(\lg N)$ time like the $D C(N)$ circuit. However, the circuit size, which is $\mathrm{O}(N)$, is smaller than that of the $D C(N)$ circuit. The circuit is not depth-optimal, and because $\operatorname{size}(B K(N))+\operatorname{depth}(B K(N))>2 N-2$ for $N>4$, $B K(N)$ is not (size, depth)-optimal either. Figure 2.15 shows the $B K(N)$ circuit for $N=8$. The circuit size and depth are 11 and 4, respectively. This circuit is a compromise between the serial prefix circuit and the divide-and-conquer prefix circuit. In this case the number of operations is $2 N-\lg N-2$ and the depth is $2 \lg N-2$.


Figure 2.15: An illustration of the Brent-Kung parallel prefix circuit, $B K(8)$, size $=11$, depth $=4$.

### 2.2.5 The Snir Parallel Prefix Circuit

Snir [Sni86] showed that the sum of the circuit depth and circuit size of any prefix circuit with $N$ inputs is bounded below by $2 N-2$ (that is, $\operatorname{depth}(N)+\operatorname{size}(N) \geq 2 N-2$). He also introduced an algorithm to construct (size, depth)-optimal prefix circuits for any $N$ with the depth in the range $\max (\lceil\lg N\rceil, 2\lceil\lg N\rceil-2) \leq \operatorname{depth}(S N(N)) \leq N-1$. Recall that the deficiency of a prefix circuit is defined as

$$
\text { deficiency }=\text { size }+ \text { depth }-(2 N-2)
$$

A circuit with zero deficiency is said to be (size, depth)-optimal. The Snir prefix circuit, $S N(N)$, is the combination of two prefix circuits: the compressed layered prefix circuit, $C R\left(N_{1}\right)$, and the serial prefix circuit, $S\left(N_{2}\right)$, where $N=N_{1}+N_{2}-1$. The circuit's layout is shown in Figure 2.16. $S N(N)$ is constructed by feeding the last output of $C R\left(N_{1}\right)$ as the first input of $S\left(N_{2}\right)$.


Figure 2.16: The Snir prefix circuit, $S N(N)=C R\left(N_{1}\right) \cdot S\left(N_{2}\right)$.

## Compressed Layered Prefix Circuits [LD94, Sni86]

The compressed layered prefix circuit, $C R(N)$, is obtained by compressing the layered parallel prefix circuit. The compression involves moving some nodes to their actual level as determined by the paths from the input nodes. The design of the layered parallel prefix circuit [Sni86] is based on the divide-and-conquer strategy. The design specifies the operations level by level as follows. Let $g_{\alpha}$ be the set of input pairs such that

$$
g_{\alpha}=\{(i, j) \mid \text { a node at level } \alpha \text { is fed by lines } i \text { and } j\} .
$$

Now, given $N$, let $m=\lceil\lg N\rceil$. For each level, let

$$
g_{t}=\left\{\left(k 2^{t}-2^{t-1}, \min \left(N, k 2^{t}\right)\right) \,\middle|\, k=\left\lfloor\frac{N-1}{2^{t}}+\frac{1}{2}\right\rfloor, \ldots, 2,1\right\}
$$

be the set of operations at level $t$, for $t=1, \ldots, m$, and

$$
g_{m+t}=\left\{\left(k 2^{m-t}, k 2^{m-t}+2^{m-t-1}\right) \,\middle|\, k=\left\lfloor\frac{N-1}{2^{m-t}}-\frac{1}{2}\right\rfloor, \ldots, 2,1\right\}
$$

be the set of operations at level $m+t$, for $t=1, \ldots, m-1$.

The depth of the circuit as defined above is $2\lceil\lg N\rceil-1$ (i.e., $m+m-1$). The first $\lceil\lg N\rceil$ levels construct a complete binary tree rooted at $1: 2^{\lceil\lg N\rceil}$. Including the leaves of the binary tree, which are all inputs, the tree depth is $\lceil\lg N\rceil+1$. Therefore, the tree clearly has $(N-1)$ internal nodes, $\lceil\lg N\rceil+1$ of which are output nodes (i.e., nodes labeled $1: 2^{x}$ for $x=0,1, \ldots, \lg N$). A prefix circuit with $N$ inputs must have $N$ outputs. Therefore, the remaining $\lceil\lg N\rceil-1$ levels contain $N-\lceil\lg N\rceil-1$ nodes. Thus, the total size of the circuit is $2 N-\lceil\lg N\rceil-2$. Also, the last output $y_{N}$ is available at depth $\lceil\lg N\rceil$. Hence, the circuit is a restricted prefix circuit. In this layered design, there are operation nodes at levels greater than $m$ whose inputs do not all come from the immediately preceding level. Such nodes are moved to the appropriate level. After all such nodes have been moved, the layered circuit is compressed to yield a circuit with depth as follows:

$$
\operatorname{depth}(C R(N))= \begin{cases}\lceil\lg N\rceil & \text { if } N \leq 5, \\ 2 r-3 & \text { if } 3 \times 2^{r-2} \leq N<2^{r} \text { for } r \geq 3, \\ 2 r-2 & \text { if } 2^{r} \leq N \leq 3 \times 2^{r-1} \text { for } r \geq 3\end{cases}
$$

As an example, let $N=8$. Then $m=\lceil\lg 8\rceil=3$ and we obtain $g_{t}$, where $t=1,2, \ldots, 5$, as follows.

$$
\begin{aligned}
& g_{1}=\{(1,2),(3,4),(5,6),(7,8)\} \\
& g_{2}=\{(2,4),(6,8)\} \\
& g_{3}=\{(4,8)\} \\
& g_{4}=\{(4,6)\} \\
& g_{5}=\{(2,3),(4,5),(6,7)\}
\end{aligned}
$$
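The level-by-level definition can be turned into a short generator. The Python sketch below is an illustration only: the function names are hypothetical, the floors are evaluated in integer arithmetic, and the index $m-t$ is used in the second family of levels, which is what reproduces $g_{4}$ and $g_{5}$ of the $N=8$ example. The compression step is modelled under the assumption that a node can be placed one level below the later of its two inputs.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def ceil_lg(n: int) -> int:
    """Ceiling of lg n for n >= 1."""
    return (n - 1).bit_length()


def layered_levels(n: int) -> Dict[int, List[Pair]]:
    """Operation pairs (i, j) of the layered prefix circuit, keyed by level."""
    if n < 2:
        raise ValueError("n must be at least 2")
    m = ceil_lg(n)
    levels: Dict[int, List[Pair]] = {}
    for t in range(1, m + 1):
        # k_max = floor((n - 1) / 2^t + 1/2)
        k_max = (2 * (n - 1) + 2**t) // 2 ** (t + 1)
        levels[t] = [(k * 2**t - 2 ** (t - 1), min(n, k * 2**t))
                     for k in range(1, k_max + 1)]
    for t in range(1, m):
        step = 2 ** (m - t)
        # k_max = floor((n - 1) / 2^(m - t) - 1/2)
        k_max = (2 * (n - 1) - step) // (2 * step)
        levels[m + t] = [(k * step, k * step + step // 2)
                         for k in range(1, k_max + 1)]
    return levels


def compressed_depth(levels: Dict[int, List[Pair]]) -> int:
    """Depth after moving each node to one level below its latest input."""
    ready: Dict[int, int] = {}   # line -> level at which its value is available
    depth = 0
    for t in sorted(levels):
        for i, j in levels[t]:
            level = max(ready.get(i, 0), ready.get(j, 0)) + 1
            ready[j] = level     # the result is carried on line j
            depth = max(depth, level)
    return depth


levels = layered_levels(8)
size = sum(len(pairs) for pairs in levels.values())
print(levels)
print(size, max(levels), compressed_depth(levels))
```

For $N=8$ the generator returns the five sets listed above, with 11 operation nodes on 5 levels, and the compression model places the deepest node at level 4.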

The layout of the layered prefix circuit before compression is shown in Figure 2.17. Its depth is 5 and its size is 11. However, the circuit can be compressed by moving the node labeled 1:6 at level 4 to level 3 and the nodes labeled 1:3, 1:5, and 1:7 at level 5 to levels 2, 3, and 4, respectively. As shown in Figure 2.18, the depth of the compressed circuit is reduced by one level. Thus, the new circuit depth is 4. The compressed circuit is not a (size, depth)-optimal circuit since its deficiency is $11+4-14=1>0$. In general, the deficiency of $C R(N)$ is bounded as follows.


Figure 2.17: An illustration of the layered parallel prefix circuit, size $=11$, depth $=5$.


Figure 2.18: An illustration of the compressed layered prefix circuit, size $=11$, depth $=4$.

$$
\begin{aligned}
\operatorname{deficiency} & =\operatorname{size}(C R(N))+\operatorname{depth}(C R(N))-(2 N-2) \\
& \geq(2 N-2-\lceil\lg N\rceil)+(2\lceil\lg N\rceil-3)-(2 N-2) \\
& \geq\lceil\lg N\rceil-3 \\
& \geq 0
\end{aligned}
$$

As discussed above, Snir's circuit, $S N(N)$, is composed of two prefix circuits: the compressed layered prefix circuit and the serial prefix circuit. Therefore, the circuit size and depth are given by

$$
\begin{gathered}
\operatorname{size}(S N(N))=\operatorname{size}\left(C R\left(N_{1}\right)\right)+\operatorname{size}\left(S\left(N_{2}\right)\right) \\
\operatorname{depth}(S N(N))=\max \left\{\operatorname{depth}\left(C R\left(N_{1}\right)\right),\left\lceil\lg N_{1}\right\rceil+\operatorname{depth}\left(S\left(N_{2}\right)\right)\right\}
\end{gathered}
$$

Although the $S N(N)$ circuit combines a non-(size, depth)-optimal prefix circuit (that is, $C R\left(N_{1}\right)$) with a (size, depth)-optimal prefix circuit (that is, $S\left(N_{2}\right)$), it is itself a (size, depth)-optimal prefix circuit [Sni86]. If the given input value $N$ satisfies the inequality

$$
2\lceil\lg N\rceil-2 \leq \operatorname{depth}(S N(N))<2 \lg (N-1)-1,
$$

then the design recursively defines a (size, depth)-optimal circuit with depth $\operatorname{depth}(S N(N))$. Otherwise, a circuit with

$$
N-2 \geq \operatorname{depth}(S N(N)) \geq 2 \lg (N-1)-1
$$

is given.

As an example of the $S N(N)$ circuit, let $N=19$. Then $r=4, N_{2}=r+1=5$ and $N_{1}=N-N_{2}+1=15$. The $S N(19)$ circuit is given in Figure 2.19, which is composed of $C R(15)$ and $S(5)$. Clearly, the circuit depth is 8, the circuit size is 28, and their sum is 36, which is equal to $2 \times 19-2$. Hence, $S N(19)$ is (size, depth)-optimal. However, the Snir parallel prefix circuit is neither depth-optimal nor a restricted prefix circuit.


Figure 2.19: The Snir prefix circuit, $S N(19)$, size $=28$ and depth $=8$.
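The size and depth of $S N(19)$ can be re-derived from the composition formulas. The sketch below is a hedged illustration with hypothetical function names; the depth of $C R(15)$ is passed in as a parameter and set to 5, the value given by the case $2 r-3$ of the depth formula for $C R(N)$ with $r=4$ (since $3 \times 2^{2} \leq 15<2^{4}$).

```python
def ceil_lg(n: int) -> int:
    """Ceiling of lg n for n >= 1."""
    return (n - 1).bit_length()


def cr_size(n1: int) -> int:
    """Size of the compressed layered circuit CR(N1): 2 N1 - ceil(lg N1) - 2."""
    return 2 * n1 - ceil_lg(n1) - 2


def snir_size_depth(n1: int, n2: int, cr_depth: int) -> tuple:
    """Size and depth of CR(N1) followed by S(N2), per the formulas above."""
    serial_size = n2 - 1     # S(N2) has N2 - 1 operation nodes
    serial_depth = n2 - 1    # and the same depth
    size = cr_size(n1) + serial_size
    # S(N2) starts once the last output of CR(N1) is ready, at level ceil(lg N1).
    depth = max(cr_depth, ceil_lg(n1) + serial_depth)
    return size, depth


n, n2 = 19, 5
n1 = n - n2 + 1                       # N = N1 + N2 - 1
size, depth = snir_size_depth(n1, n2, cr_depth=5)
print(n1, size, depth, size + depth - (2 * n - 2))
```

The sketch reports $N_{1}=15$, size 28, depth 8, and deficiency 0, in agreement with Figure 2.19.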

### 2.2.6 The LYD Parallel Prefix Circuit

Lakshmivarahan, Yang, and Dhall [LYD87] were the first to introduce an algorithm to design a (size, depth)-optimal prefix circuit having the smallest depth among all such circuits, for $N=9$ to $12$, $N=17$ to $20$, and $N=33$. Their discovery proves that there is a (size, depth)-optimal prefix circuit with depth in the range $\lceil\lg N\rceil \leq d(N) \leq \max (\lceil\lg N\rceil, 2\lceil\lg N\rceil-3)$. Moreover, their algorithm gives depth-optimal prefix circuits for some values of $N$. The algorithm distributes the $N$ inputs into four parts. As in the $S N(N)$ prefix circuit, Part 1 is a compressed layered prefix circuit. Part 2 is a new optimal prefix circuit, $Q(N)$, proposed by the group [LYD87, LD94]. Parts 3 and 4 are serial prefix circuits.

## New Optimal Prefix Circuit, $Q(N)$

$Q(N)$ is a new class of (size, depth)-optimal prefix circuits defined for

$$
N=\frac{t(t+1)}{2}+1, \quad \text { for } t>0 .
$$

Let $g_{i, j}$ denote the $j^{\text {th }}$ node at level $i$ and be represented with an ordered pair $(a, b)$; where $a=\operatorname{left}\left(g_{i, j}\right)$ and $b=\operatorname{right}\left(g_{i, j}\right)$, refer to the left and right inputs of the node $g_{i, j}$, respectively.

The $Q(N)$ circuit is constructed as follows.

1. At level 1, $g_{1,1}=(1,2)$, $g_{1,2}=(3,4)$, and

$$
g_{1, j}=\left(\operatorname{left}\left(g_{1, j-1}\right)+(j-1), \operatorname{right}\left(g_{1, j-1}\right)+(j-1)\right), \text { for } j=3,4, \ldots, t
$$

2. For levels $i=2$ to $t$,

$$
\begin{aligned}
& g_{i, 1}=\left(\operatorname{right}\left(g_{i-1,1}\right), \operatorname{right}\left(g_{i-1,2}\right)\right), \text { and } \\
& g_{i, j}=\left(\operatorname{right}\left(g_{i-1, j+1}\right), \operatorname{right}\left(g_{i-1, j+1}\right)+1\right), \text { for } j=2,3, \ldots, t+1-i .
\end{aligned}
$$

3. The nodes at level $(t+1)$ are given by

$$
g_{t+1}=\left\{\left(\operatorname{right}\left(g_{i, 1}\right), \operatorname{right}\left(g_{i, 1}\right)+j\right) \mid i=1,2, \ldots, t-1, j=1,2, \ldots i\right\}
$$

The $Q(N)$ circuit has unique properties: The circuit depth is equal to the circuit width and the circuit size is equal to the square of the circuit depth.

Let $N=7$. Thus $t=3$. We obtain $g_{i, j}$ as follows.

$$
\begin{aligned}
& g_{1,1}=(1,2) \\
& g_{1,2}=(3,4) \\
& g_{1,3}=(5,6) \\
& g_{2,1}=(2,4) \\
& g_{2,2}=(6,7) \\
& g_{3,1}=(4,7) \\
& g_{4}=\{(2,3),(4,5),(4,6)\}
\end{aligned}
$$
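The three construction steps can be followed mechanically. The Python sketch below is one possible reading of those steps, not the authors' implementation; the function name `q_circuit` and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def q_circuit(t: int) -> Tuple[int, Dict[Tuple[int, int], Pair], List[Pair]]:
    """Return (N, g, g_last), where g[(i, j)] is node j at level i
    and g_last holds the nodes assigned to level t + 1."""
    if t < 1:
        raise ValueError("t must be positive")
    n = t * (t + 1) // 2 + 1
    g: Dict[Tuple[int, int], Pair] = {(1, 1): (1, 2)}
    # Step 1: level 1.
    if t >= 2:
        g[(1, 2)] = (3, 4)
    for j in range(3, t + 1):
        left, right = g[(1, j - 1)]
        g[(1, j)] = (left + (j - 1), right + (j - 1))
    # Step 2: levels 2 to t.
    for i in range(2, t + 1):
        g[(i, 1)] = (g[(i - 1, 1)][1], g[(i - 1, 2)][1])
        for j in range(2, t + 2 - i):
            right = g[(i - 1, j + 1)][1]
            g[(i, j)] = (right, right + 1)
    # Step 3: the nodes at level t + 1.
    g_last = [(g[(i, 1)][1], g[(i, 1)][1] + j)
              for i in range(1, t) for j in range(1, i + 1)]
    return n, g, g_last


n, g, g_last = q_circuit(3)
print(n, g, g_last, len(g) + len(g_last))
```

For $t=3$ the sketch reproduces the node list of the $N=7$ example and counts $9=t^{2}$ operation nodes.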

The $Q(7)$ circuit is illustrated in Figure 2.20. As seen, the $Q(N)$ circuit consists of blocks of the serial prefix circuits with block sizes increasing in an arithmetic sequence.


Figure 2.20: The $Q(7)$ prefix circuit.

The LYD prefix circuit, $L Y D(N)$, is composed of four parts (see Figure 2.21) as follows.

Part 1: the compressed layered prefix circuit, $C R\left(N_{1}\right)$,

$$
\begin{aligned}
& \operatorname{depth}(\operatorname{Part1}) \leq \operatorname{depth}(N) \\
& \operatorname{size}(\operatorname{Part1})=2 N_{1}-\left\lceil\lg N_{1}\right\rceil-2
\end{aligned}
$$

The last output, $1: N_{1}$, is available at level $t=\left\lceil\lg N_{1}\right\rceil \leq \operatorname{depth}(N)-2$.

Part 2: the new (size, depth)-optimal prefix circuit, $Q\left(N_{2}\right)$,

$$
N_{2}=\frac{\left\lceil\lg N_{1}\right\rceil\left(\left\lceil\lg N_{1}\right\rceil+1\right)}{2}+1
$$

$$
\operatorname{depth}(\operatorname{Part2})=\left\lceil\lg N_{1}\right\rceil+2 \leq \operatorname{depth}(N)
$$

The size after combining with Part 1 is $\operatorname{size}(\operatorname{Part} 2)=2 N_{2}-1$. The last output, $1: N_{1}+N_{2}$, is available at level $\left\lceil\lg N_{1}\right\rceil+1$.

Part 3: the serial prefix circuit, $S\left(N_{3}\right)$,

$$
\begin{aligned}
\operatorname{depth}(\operatorname{Part} 3) & =\left\lceil\lg N_{1}\right\rceil+1+N_{3} \leq \operatorname{depth}(N) \\
\operatorname{size}(\operatorname{Part} 3) & =N_{3}
\end{aligned}
$$

Part 4: the serial prefix circuit, $S\left(N_{4}\right)$,

$$
\begin{aligned}
\operatorname{depth}(\text { Part } 4) & =\left\lceil\lg N_{1}\right\rceil+2+N_{3}=\operatorname{depth}(N) \\
\operatorname{size}(\operatorname{Part} 4) & =2 N_{4}-1
\end{aligned}
$$

where $N=N_{1}+N_{2}+N_{3}+N_{4} ; N_{1}, N_{2}$, and $N_{4} \geq 1 ;$ and $N_{3} \geq 0$.


Figure 2.21: The structure of $L Y D(N)$, derived from [LD94].

Thus, the circuit depth is $\left\lceil\lg N_{1}\right\rceil+2+N_{3}$ and the circuit size is $\operatorname{size}(\text{Part } 1)+\operatorname{size}(\text{Part } 2)+\operatorname{size}(\text{Part } 3)+\operatorname{size}(\text{Part } 4)$, which is $(2 N-2)-\operatorname{depth}(N)$. Hence the circuit $L Y D(N)$ is (size, depth)-optimal. For any integer $N$, there exists a (size, depth)-optimal prefix circuit, $L Y D(N)$, such that $2\lceil\lg N\rceil-6 \leq \operatorname{depth}(L Y D(N)) \leq 2\lceil\lg N\rceil-3$. However, the circuit is neither a restricted prefix circuit nor size-optimal, although for many values of $N$ it achieves the optimal depth. As an example, Figure 2.22 shows the circuit $L Y D(19)$, which is a combination of $C R(8) \cdot Q(7) \cdot S(0) \cdot S(4)$.


Figure 2.22: The $L Y D(19)$ prefix circuit with size 31 and depth 5.
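The per-part formulas can be summed to confirm the figures quoted for $L Y D(19)$. The following sketch simply evaluates those formulas; the function name is hypothetical, and the check on $N_{2}$ encodes the relation between $N_{2}$ and $\left\lceil\lg N_{1}\right\rceil$ given for Part 2.

```python
def ceil_lg(n: int) -> int:
    """Ceiling of lg n for n >= 1."""
    return (n - 1).bit_length()


def lyd_size_depth(n1: int, n2: int, n3: int, n4: int) -> tuple:
    """Size and depth of CR(N1) . Q(N2) . S(N3) . S(N4) from the part formulas."""
    if min(n1, n2, n4) < 1 or n3 < 0:
        raise ValueError("need N1, N2, N4 >= 1 and N3 >= 0")
    t = ceil_lg(n1)
    if n2 != t * (t + 1) // 2 + 1:
        raise ValueError("N2 must equal t(t + 1)/2 + 1 with t = ceil(lg N1)")
    size = (2 * n1 - t - 2) + (2 * n2 - 1) + n3 + (2 * n4 - 1)
    depth = t + 2 + n3
    return size, depth


n = 8 + 7 + 0 + 4
size, depth = lyd_size_depth(8, 7, 0, 4)
print(n, size, depth, size + depth - (2 * n - 2))
```

Summing the four sizes gives $2N - N_{3} - \lceil\lg N_{1}\rceil - 4 = (2N-2) - \operatorname{depth}(N)$ in general; for $L Y D(19)$ the sketch reports size 31, depth 5, and deficiency 0.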

### 2.2.7 The Shih-Lin Parallel Prefix Circuit

Recently, Lin and Shih [LS99] have proposed a new (size, depth)-optimal prefix circuit, $S L(N)$, with the depth in the range

$$
2\lceil\lg N\rceil-5 \leq \operatorname{depth}(S L(N)) \leq N-1,
$$

for $N \geq 12$.

The structure of the $S L(N)$ circuit is similar to that of the $S N(N)$ circuit but differs in how the inputs are partitioned. The $S L(N)$ circuit is also composed of two parts, the compressed layered prefix circuit and the serial prefix circuit, as shown in Figure 2.23. In other words, $S L(N)=C R\left(N_{1}\right) \cdot S\left(N_{2}\right)$, where $N=N_{1}+N_{2}-1$. This algorithm offers the same or equivalent performance as that of LYD, but it is easier to implement.


Figure 2.23: The $S L(N)$ circuit, $S L(N)=C R\left(N_{1}\right) \cdot S\left(N_{2}\right)$.

Let $\operatorname{depth}(S L(N))$ be the depth of the $S L(N)$ circuit defined above. Then

$$
\operatorname{depth}(S L(N))=\left\{\begin{array}{lll}
2\lceil\lg N\rceil-5 & \text { if } 2^{r-1}<N<2^{r-1}+r-4 & \text { for } r \geq 6 \\
2\lceil\lg N\rceil-4 & \text { if } 2^{r-1}+r-4 \leq N<3 \times 2^{r-2} & \text { for } r \geq 5 \\
2\lceil\lg N\rceil-3 & \text { if } 3 \times 2^{r-2} \leq N \leq 2^{r} & \text { for } r \geq 4
\end{array}\right.
$$

The following are the conditions to choose $N_{2}$ [LS99].

$$
\begin{array}{ll}
\text { If } r \geq 4 \text { and } 3 \times 2^{r-2} \leq N \leq 2^{r}, & \text { then } N_{2}=r-2 . \\
\text { If } r \geq 6 \text { and } 2^{r-1}<N<2^{r-1}+r-4, & \text { then } N_{2}=r-3 . \\
\text { If } r \geq 5 \text { and } N=2^{r-1}+r-4, & \text { then } N_{2}=r-2 . \\
\text { If } r \geq 5 \text { and } 2^{r-1}+r-4<N<3 \times 2^{r-2}, & \text { then } N_{2}=r-3 .
\end{array}
$$

Since $\operatorname{depth}(S L(N))+\operatorname{size}(S L(N))=2 N-2$, the $S L(N)$ circuit is a (size, depth)-optimal prefix circuit [LS99]. Like the Snir prefix circuit, the $S L(N)$ circuit is neither depth-optimal nor a restricted prefix circuit. As an example of $S L(N)$, let $N=19$. Then $r=5$ and $2^{r-1}+r-4<N<3 \times 2^{r-2}$, so $N_{2}=r-3=2$. The layout of the $S L(19)$ circuit is given in Figure 2.24, which is composed of $C R(18)$ and $S(2)$. Clearly, the circuit depth is 6, the circuit size is 30, and $\operatorname{size}(S L(19))+\operatorname{depth}(S L(19))=2 N-2=36$. Comparing the $S L(19)$ circuit with the $L Y D(19)$ circuit, the $S L(19)$ circuit has a larger depth but a smaller size.


Figure 2.24: The $S L(19)$ prefix circuit, size $=30$ and depth $=6$.
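The selection rules for $N_{2}$ and the piecewise depth formula can be combined into one lookup. The sketch below is an illustration with a hypothetical function name; it assumes $r=\lceil\lg N\rceil$, which is consistent with the $N=19$ example ($r=5$) and with the form of the depth formula.

```python
def ceil_lg(n: int) -> int:
    """Ceiling of lg n for n >= 1."""
    return (n - 1).bit_length()


def shih_lin_parameters(n: int) -> tuple:
    """Return (N1, N2, depth) for SL(N) = CR(N1) . S(N2), N >= 12."""
    if n < 12:
        raise ValueError("the construction is stated for N >= 12")
    r = ceil_lg(n)                    # assumed: 2^(r-1) < N <= 2^r
    low = 2 ** (r - 1) + r - 4
    three_quarters = 3 * 2 ** (r - 2)
    if n >= three_quarters:           # 3 * 2^(r-2) <= N <= 2^r
        n2, depth = r - 2, 2 * r - 3
    elif n < low:                     # 2^(r-1) < N < 2^(r-1) + r - 4
        n2, depth = r - 3, 2 * r - 5
    elif n == low:                    # N = 2^(r-1) + r - 4
        n2, depth = r - 2, 2 * r - 4
    else:                             # 2^(r-1) + r - 4 < N < 3 * 2^(r-2)
        n2, depth = r - 3, 2 * r - 4
    return n - n2 + 1, n2, depth      # N = N1 + N2 - 1


n1, n2, depth = shih_lin_parameters(19)
size = (2 * n1 - ceil_lg(n1) - 2) + (n2 - 1)   # size(CR(N1)) + size(S(N2))
print(n1, n2, depth, size)
```

For $N=19$ the lookup selects $C R(18)$ and $S(2)$ with depth 6, and the resulting size of 30 gives $30+6=36=2N-2$.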

### 2.3 Comparison

Table 2.1 provides a comparison of the prefix circuits illustrated in this chapter. While the parallel prefix circuits have desirable depths, which are $\mathrm{O}(\lg N)$, they differ widely in the number of operations performed. Only four prefix circuits (i.e., serial, Snir, Shih-Lin, and LYD prefix circuits) are (size, depth)-optimal. The divide-and-conquer prefix circuit and the $L F_{0}$ prefix circuit have the shortest depth and the serial prefix circuit has the smallest size.

The size-depth trade-off does apply to any prefix circuit. For example, the serial prefix circuit performs fewest operations (i.e., smallest size) compared to the others, but has the longest depth while the divide-and-conquer prefix circuit has the largest size, but has the smallest depth. Although the Shih-Lin prefix circuit and the Snir prefix circuit have similar circuit layouts, the Shih-Lin prefix circuit has a smaller depth than the Snir prefix circuit. All circuits have unbounded fan-out except the serial prefix circuit that has a constant fan-out of two. The divide-and-conquer prefix circuit and the $L F_{0}$ prefix circuit have the largest fan-out $((N / 2)+1)$. The Brent-Kung, Shih-Lin and Snir prefix circuits have the same fan-out $(\lceil\lg N\rceil+1)$, which is smaller than that of the LYD prefix circuit $(2\lceil\lg N\rceil-2)$.

Table 2.1: A Comparison of the seven prefix circuits illustrated in this chapter.

| Prefix Circuit | Size | Depth | Fan-out | (size, depth)-optimal |
| :---: | :---: | :---: | :---: | :---: |
| Serial | $N-1$ | $N-1$ | 2 | Yes |
| Divide-and-Conquer | $(N / 2) \lg N$ | $\lg N$ | $(N / 2)+1$ | No |
| $L F_{0}$ | $4 N-F(5+\lg N)+1$ | $\lg N$ | $(N / 2)+1$ | No |
| $L F_{k}$ when $0<k<\lg N-2$ | $2 N\left(1+1 / 2^{k}\right)-F(5+\lg N-k)-k+1$ | $\lg N+k$ | $\left(N / 2^{k+1}\right)+k$ | No |
| $L F_{k}$ when $k \geq \lg N-2$ | $2 N-\lg N-2$ | $2 \lg N-2$ | $\lg N+1$ | No |
| Brent-Kung | $2 N-\lg N-2$ | $2 \lg N-2$ | $\lg N+1$ | No |
| Snir | $2 N-2-\text{depth}$ | $\max (\lg N, 2 \lg N-2) \leq \text{depth} \leq N-1$ | $\lg N+1$ | Yes |
| LYD | $2 N-2-\text{depth}$ | $2 \lg N-6 \leq \text{depth} \leq 2 \lg N-3$ | $2 \lg N-2$ | Yes |
| Shih-Lin | $2 N-2-\text{depth}$ | $2 \lg N-5 \leq \text{depth} \leq 2 \lg N-3$ | $\lg N+1$ | Yes |

## CHAPTER 3

## SOURCES OF POWER CONSUMPTION

In the previous chapter we examined size and depth trade-offs of various prefix circuit designs. We want to examine the power consumption characteristics of these circuits. In this chapter, the sources of power consumption in circuits are reviewed and the strategies to estimate the power consumption of the various prefix circuits are presented. This should help us to better understand the power consumption characteristics of the circuits. We also introduce the circuit simulation tool called PSpice in brief.

### 3.1 CMOS

Presently, CMOS (Complementary-symmetry Metal-Oxide Semiconductor) technology is the most popular technology used by the digital IC (Integrated Circuit) industry because of its low power consumption, its good scalability, and its speed [CB95, RCN01, WE93]. CMOS technology uses two types of transistors, a $P$-type transistor and an $N$-type transistor, to realize logic functions. Figure 3.1 shows the $P$-type and $N$-type transistors and their characteristics. The $P$-type transistor has a bubble on its symbol, indicating that the transistor is conducting when its input is 0. The $N$-type transistor is conducting when its input is 1. The input has been labeled with the signal $s$.

## Examples of CMOS Logic

The CMOS inverter is the heart of all digital designs. More complex designs (for example, a NAND gate) can be clearly explained once the inverter's characteristics are understood. The inverter consists of two transistors, one $P$-type and one $N$-type. Figure 3.2 shows the CMOS inverter and its truth table.


Figure 3.1: $P$-type and $N$-type transistor and their characteristics.


Figure 3.2: CMOS inverter.

The CMOS NAND gate, CMOS NOR gate and their truth tables are illustrated in Figures 3.3 and 3.4, respectively. Both gates consist of four transistors, two $P$-type and two $N$-type transistors.


Figure 3.3: CMOS NAND gate.


Figure 3.4: CMOS NOR gate.

### 3.2 Power Consumption

### 3.2.1 Sources of Power Consumption

In CMOS circuits, power consumption is due to the following three types of current flow [WE93]:

1. Static power consumption due to leakage currents. Static power consumption occurs when some current leaks through to other parts of the transistor (i.e., the leakage current from the gate to the drain as shown in Figure 3.5), resulting in power loss. The power loss due to leakage current in CMOS is usually insignificant compared to the dynamic power consumption [CB95, RCN01].


Figure 3.5: The leakage current from the gate to the drain of a transistor.

2. Dynamic power consumption due to short-circuit currents. A short circuit occurs when both the $P$-type and $N$-type transistors are momentarily on at the same time (see Figure 3.6). Although there is some dynamic power consumption from the short circuit, this power loss is usually insignificant compared to the power dissipated by switching [CB95, RCN01].


Figure 3.6: An illustration of a short circuit, when both the $P$-type and $N$-type transistors are in the on state at the same time.

3. Dynamic power consumption due to switching currents from repetitively charging and discharging the parasitic capacitances at the transistor's gate (see Figure 3.7). The currents must flow through the transistor's gate to reach the capacitances (i.e., charging the capacitance). During the switching transient, the power is dissipated (i.e., discharging the capacitance). The charging and discharging of the parasitic capacitances are the dominant form of power consumption in CMOS circuits [WE93].


Figure 3.7: An illustration of capacitance charging.

Therefore, two components establish the amount of power consumption in a CMOS circuit. They are static and dynamic. Static power consumption is due to imperfect transistors while dynamic power consumption is due to the process of switching transistors on and off. However, in properly designed CMOS circuits, the major portion of the power consumption is from dynamic switching. As a result, in this study, we focus on the dynamic component due to the repetitive charging and discharging of the capacitive loads.

The average power consumption in a CMOS gate or module (e.g., an adder) due to switching can be written as [CB95, WE93]:

$$
\begin{equation*}
P_{\text {switching }}=C_{\text {eff }} V_{D D}^{2} f, \tag{3.1}
\end{equation*}
$$

where $C_{e f f}$ is the effective capacitance switched, $V_{D D}$ is the supply voltage, and $f$ is the clock frequency. $C_{\text {eff }}$ has two components, the switching activity (signal transition activity) per clock cycle, $p_{f}$, and the load capacitance, $C_{L}$. Thus, for a given circuit running at a given speed (i.e., $C_{L}$ and $f$ constant), power consumption is a function of the supply voltage and switching activity. Therefore, power reduction can be achieved by either operating the circuit at a lower voltage or by choosing an architecture that reduces the switching activity of the circuit's signals.
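As a rough numerical illustration of Eq. (3.1), the sketch below assumes that the two components of $C_{eff}$ combine as a product, $C_{eff}=p_{f} C_{L}$; the text only names the two components, so the product form is an assumption here. The function name and the operating point are hypothetical values chosen for illustration.

```python
def switching_power(p_f, c_load, v_dd, f_clk):
    """Average switching power per Eq. (3.1).

    Assumption (not stated explicitly in the text): C_eff = p_f * C_L.
    Units: c_load in farads, v_dd in volts, f_clk in hertz, result in watts.
    """
    c_eff = p_f * c_load
    return c_eff * v_dd ** 2 * f_clk


# Made-up operating point: 50% switching activity, 1 pF load, 100 MHz clock.
p_full = switching_power(0.5, 1e-12, 3.3, 100e6)
p_half = switching_power(0.5, 1e-12, 1.65, 100e6)

# With C_L and f held constant, halving V_DD quarters the switching power.
ratio = p_half / p_full
```

With the load and frequency fixed, only the supply voltage and the switching activity remain as levers, which is the point made in the paragraph above.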

## Effect of Voltage Scaling

Due to the quadratic relationship between the supply voltage and the power consumption, lowering the supply voltage can be an effective way to achieve dramatic power savings. However, as the supply voltage is decreased, the circuit delay generally increases, relatively independently of the logic function and style; see Figure 3.8. Thus, reducing the supply voltage unfortunately reduces the system throughput. This loss in throughput can be recovered in some cases by applying architectural techniques to compensate for the additional delay (e.g., parallelism and pipelining). Reference [CB95] shows that by changing the circuit architecture (i.e., using parallelism and pipelining) it is possible to gain significant speed improvements with only a slight increase in power, hence enabling some voltage down-scaling while maintaining the throughput.

Figure 3.8: Plots of normalized delay vs. supply voltage for a variety of different logic circuits, derived from [CB95].

## Effect of Switching Activity

The power in CMOS circuits is dissipated when the signals in the circuit switch (i.e., change values). As a result, the amount of switching activity is an indicator of the power consumption. The manner in which the nodes in a circuit are interconnected can have a strong influence on the overall switching activity [CB95]. Some architectures induce extra transition activity at the operation nodes called glitching transitions or dynamic hazards, which consume extra power. Glitching is a major problem that increases the effective switching activity, causing a circuit node to undergo several rapid transitions in a single clock cycle [CB95, RCN01].

Figure 3.9 illustrates an example of the glitching behavior for a chain of eight NAND gates [RCN01], obtained with a PSpice® simulation [Cad00]. In the simulation, all bits of the first input were set to logic 'one' and all bits of the second input transition from logic 'zero' to 'one'. For an ideal circuit without propagation delays, the outputs VOUT2, 4, 6, and 8 would stay at logic 'one' all the time. However, due to the presence of delays, these outputs switch to low temporarily. This glitching causes extra power to be consumed. Outputs VOUT1, 3, 5, and 7 do not glitch; they just have some propagation delay. Note that the degree of glitching depends on the switching pattern of the input signals [RCN01].


Figure 3.9: An illustration of the glitching behavior of a chain of eight NAND gates [RCN01].

To reduce glitching activity, the depth of the signal paths in the circuit should be balanced. The following is an illustration of two different circuit architectures of a 4-input adder. In Figure 3.10(a), assume that all primary inputs (A, B, C, and D) arrive at time $t_{0}$ and the implementation is non-pipelined. While the first adder makes one transition by computing $A+B$, the second adder also makes one transition based on C and the previous (i.e., initial) value of $A+B$. After the correct value of $A+B$ has propagated through the first adder, at time $t_{0}+t_{p}$ say, the second adder re-evaluates $(A+B)+C$, which is complete at time $t_{0}+2 t_{p}$. Thus, there is a second transition at the second adder.

Similarly, there will be three transitions at the third adder. With the path-balancing approach of Figure 3.10(b), the first and second adders make one transition each, and the third adder makes only two transitions to produce the same output as in Figure 3.10(a). In [CB95], the "total switched capacitance" of the circuit layouts in Figures 3.10(a) and 3.10(b) has been simulated by using a switch-level simulator over random input patterns. The results show that the switched capacitance of the circuit layout in Figure 3.10(a) is larger than that of the circuit layout in Figure 3.10(b) by a factor of 1.5 for a four-input addition, and 2.5 for an eight-input addition. Hence, increasing circuit depth generally increases the total switched capacitance due to glitching and thus increases power consumption [CB95]. As a consequence, the amount of transition activity (switching activity) for a layered and non-pipelined circuit can be expressed as a function of the depth $d$ and the number of nodes $w_{i}$ at each level $i$, as [CB95]

$$
\begin{equation*}
\sum_{i=1}^{d} i w_{i} . \tag{3.2}
\end{equation*}
$$


(a) Chain Model

(b) Tree Model

Figure 3.10: An illustration of extra transition activity, derived from [CB95].

From this, it follows that the worst-case estimate of the switching activity of such a circuit can grow as $\mathrm{O}\left(d^{2}\right)$, assuming a constant number of nodes at each level.
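The counting model of Eq. (3.2) can be tried on the chain and tree adders of Figure 3.10. The sketch below is only the level-weighted count, not a switch-level simulation; the function name and the level-width lists are illustrative assumptions read off the description of the two architectures.

```python
def switching_activity(widths):
    """Eq. (3.2): sum of i * w_i over levels i = 1..d, where widths[i-1] = w_i."""
    return sum(i * w for i, w in enumerate(widths, start=1))


# Four-input addition: chain of three adders vs. balanced tree (2 adders, then 1).
chain4 = switching_activity([1, 1, 1])
tree4 = switching_activity([2, 1])

# Eight-input addition: chain of seven adders vs. balanced tree (4, 2, 1).
chain8 = switching_activity([1] * 7)
tree8 = switching_activity([4, 2, 1])

ratio4 = chain4 / tree4
ratio8 = chain8 / tree8

# Constant width w at every level gives w * d * (d + 1) / 2, i.e. O(d^2).
const_width = switching_activity([3] * 10)
```

Under this simple count the chain-to-tree ratios come out at 1.5 for four inputs and about 2.5 for eight inputs, which is in line with the simulated factors quoted from [CB95].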

From the previous discussion and the example of Figure 3.10, we have seen that different circuit architectures for performing the same function can consume different amounts of power. Therefore, the implementation of the various prefix circuits in an application will have different power consumption as well. However, in the prefix circuits, we cannot say with certainty that the circuit with the longer depth will consume more power than one with shorter depth. The reason is that both depth and the number of operation nodes among the candidate prefix circuits differ. In prefix circuits, when the depth decreases, the number of operation nodes (i.e., size) generally increases and vice versa. This is known as the size-depth trade-off [LF80, LD94]. As a result, the switching activity in a prefix circuit not only depends on its logic depth but also on the number of operation nodes at each level. The circuit with shorter depth and more nodes might have more switching activity than the one with longer depth and fewer nodes.

### 3.2.2 Power Consumption and Fan-out

Besides the switching activity at an operation node, the node's fan-out also has an effect on power consumption in a VLSI circuit design [Cal96, WE93]: the larger the fan-out, the more power the circuit consumes because there are more signals. For example, a PSpice simulation over random input patterns shows that the power consumed by a 2-input XOR gate grows linearly with its fan-out (Figure 3.11). Hence, fan-out should be taken into account when a power consumption estimate is made for the prefix circuit.


Figure 3.11: Effect of fan-out on power consumption of a 2-input XOR gate.

### 3.3 The Circuit-level Simulation: PSpice

The circuit-level simulator SPICE (Simulation Program with Integrated Circuit Emphasis) is a powerful general-purpose analog and digital circuit simulator that is used to verify circuit designs and to predict circuit behavior under a variety of different circumstances. SPICE was originally developed at the Electronics Research Laboratory of the University of California at Berkeley in the early 1970s and has become a de facto standard in the area of analog and digital simulation. SPICE is often used to characterize logic cells. The software performs a simulation of the design and monitors the power supply current waveform. This technique gives accurate power consumption estimates, but it is very time-consuming.

In this study, we use a PC version of SPICE called PSpice [Cad00]. PSpice is a registered trademark of Orcad Corporation and it is the most popular circuit simulation software on the market today. PSpice offers a large library of models obtained from files of standard components, semiconductor manufacturers, and user inputs so that users can run simulations with confidence and get accurate results. Circuits are entered using a schematic capture editor, which can access the component and symbol libraries. The simulation results take the form of textual, tabular, and graphical output, depending on the analysis performed, and the Probe post-processor displays output data in the form of graphs.

In the next chapter, we will analyze switching activity and fan-out for each prefix circuit considered. We then use this to further estimate and investigate the power-speed trade-off between various types of prefix circuits.

## CHAPTER 4

## POWER MODELING OF PREFIX CIRCUITS

Having seen the various sources of power consumption in general circuits, we now focus on an analytical model for predicting the average power consumption of a prefix circuit. As mentioned previously, the signal switching activity has a major influence on the power consumption. Therefore, the switching activity will be used as a basis to determine the power consumption of prefix circuits. Further, as mentioned in Section 3.2.2, the power consumption of an operation node is a linear function of fan-out [Cal96]. Therefore, to take into account the effect of fan-out on the output load capacitance of an operation node, we assume that the load capacitance of a node with fan-out $k$ is equal to $C_{0}+C^{\prime}(k-1)$, where $C_{0}$ is the load capacitance of a node with fan-out 1, and $C^{\prime}$ is the load capacitance for each additional fan-out [Smi97].
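The fan-out assumption above amounts to a one-line linear model. The sketch below uses a hypothetical function name and placeholder default values for $C_{0}$ and $C^{\prime}$; the text does not give numerical values for either constant.

```python
def node_load_capacitance(fan_out, c0=1.0, c_extra=0.5):
    """Load capacitance C_0 + C'(k - 1) of a node with fan-out k.

    c0 and c_extra stand for C_0 and C'; the defaults are placeholders,
    not values taken from the text.
    """
    if fan_out < 1:
        raise ValueError("fan-out must be at least 1")
    return c0 + c_extra * (fan_out - 1)


single = node_load_capacitance(1)   # fan-out 1 costs exactly C_0
triple = node_load_capacitance(3)   # two additional fan-outs add 2 * C'
```

Each additional fan-out adds the same increment $C^{\prime}$, which mirrors the linear relationship reported for the XOR gate in Section 3.2.2.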

The effective circuit capacitance of a prefix circuit, $\operatorname{cap}_{eff}(N)$, is the effective load capacitance of all nodes in the circuit. As defined here, the effective circuit capacitance depends on input signal patterns and the effects of signal glitching. Thus, if a node output experiences two transitions due to glitching, its effective capacitance is twice its physical capacitance. Because the degree of glitching depends on input signal patterns, we consider the worst-case scenario, in which glitching at the nodes is assumed to be the maximum possible. By scaling the effective circuit capacitance by the circuit clock frequency and $V_{DD}^{2}$, we arrive at our power estimate.

$$
\begin{equation*}
P=\operatorname{cap}_{eff}(N) V_{DD}^{2} f \tag{4.1}
\end{equation*}
$$

The capacitance evaluation for the various circuits according to our model is made in two steps. As a first step, in Section 4.1, we assume that the load capacitance of each operation node is independent of the fan-out, i.e., the load capacitance is the constant $C_{0}$. In the second step, we first compute the residual network by deleting one output of each operation node with fan-out $\geq 1$. We then compute the load capacitance of the residual circuit assuming that the load capacitance of each node is $C^{\prime}$, independent of the fan-out. This step is repeated $k-1$ times, where $k$ is the fan-out of the given circuit. This step is performed in Section 4.2. The total capacitance is the sum of the values obtained in step 1 and step 2.

### 4.1 Step 1 - The Constant Output Capacitance

In this step, we assume that the physical output capacitance of each operation node is constant. Let $\operatorname{Kcap}_{eff}(N)$ be the effective circuit capacitance under the constant output capacitance assumption, $\operatorname{depth}(N)$ be the depth of the circuit, $w_{i}$ be the number of operation nodes in the circuit at level $i$, and $C_{0}$ be the assumed constant load capacitance of one node. Then from Eq. 3.2,

$$
\begin{equation*}
\operatorname{Kcap}_{eff}(N)=\left(\sum_{i=1}^{\operatorname{depth}(N)} i w_{i}\right) C_{0} \tag{4.2}
\end{equation*}
$$

In the following, we use this equation to derive $\operatorname{Kcap}_{eff}(N)$ for the various prefix circuits.

### 4.1.1. The Serial Prefix Circuit

From the layout of the serial prefix circuit in Figure 4.1, we see that each level contains one operation node and each operation node has exactly two fan-ins and two fan-outs. The size and depth of this circuit are both $(N-1)$. As shown in Figure 4.2, the $S(N)$ circuit can be built from the $S(N-1)$ circuit by adding a new input to the $S(N-1)$ circuit at the $\operatorname{depth}(S(N))^{\text{th}}$ level (i.e., at level $N-1$). Thus, we can determine the recurrence for the constant output capacitance of the serial prefix circuit for $N$ inputs as the sum of the capacitance for $N-1$ inputs and the capacitance of the new input at the $\operatorname{depth}(S(N))^{\text{th}}$ level, as follows

$$
K \operatorname{cap}_{e f f}(N)=\operatorname{Kcap}_{e f f}(N-1)+\operatorname{depth}(S(N)) \cdot 1, \quad \text { with } \quad K \operatorname{cap}_{e f f}(2)=1
$$



Figure 4.1: An illustration of the serial prefix circuit, $S(N)$.


Figure 4.2: An illustration of the serial prefix circuit, $S(N)$, built from $S(N-1)$.

Therefore, the recurrence can be solved as

$$
\begin{aligned}
\operatorname{Kcap}_{eff}(N) & =\operatorname{Kcap}_{eff}(N-1)+\operatorname{depth}(S(N)) \cdot 1 \\
& =\operatorname{Kcap}_{eff}(N-1)+(N-1) \\
& =\operatorname{Kcap}_{eff}(N-2)+(N-2)+(N-1) \\
& \;\; \vdots \\
& =1+\sum_{i=2}^{N-1} i \\
& =\frac{N(N-1)}{2} \\
& =\mathrm{O}\left(N^{2}\right)
\end{aligned}
$$

The size and depth of the serial prefix circuit are both $(N-1)$. Therefore, $\operatorname{Kcap}_{eff}(N)$ can be written as a function of the circuit's size $(s)$ and depth $(d)$ as follows

$$
\begin{aligned}
\operatorname{Kcap}_{\text {eff }}(N) & =\frac{N(N-1)}{2} \\
& =\frac{1}{2} s(d+1) \\
& =\frac{1}{2}(s d+s) \\
& =\frac{s d+s}{2}
\end{aligned}
$$

Obviously, the serial prefix circuit has $\mathrm{O}(N)$ size, $\mathrm{O}(N)$ depth and $\mathrm{O}\left(N^{2}\right)$ effective circuit capacitance under the constant output capacitance.
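The serial-circuit result can be checked numerically. The sketch below unrolls the recurrence, in units of $C_{0}$, and compares it with the closed form $N(N-1)/2$ and with the size-depth form $(sd+s)/2$; the function names are hypothetical.

```python
def kcap_serial_recurrence(n):
    """Kcap_eff(N) = Kcap_eff(N-1) + depth(S(N)), with Kcap_eff(2) = 1 (units of C_0)."""
    total = 1
    for m in range(3, n + 1):
        total += m - 1          # depth(S(m)) = m - 1
    return total


def kcap_serial_closed(n):
    """Closed form N(N-1)/2."""
    return n * (n - 1) // 2


def kcap_from_size_depth(s, d):
    """Size-depth form (s*d + s)/2; for the serial circuit s = d = N - 1."""
    return (s * d + s) // 2


values = [(n, kcap_serial_recurrence(n)) for n in (2, 4, 8, 16)]
```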

### 4.1.2. The Divide-and-Conquer Parallel Prefix Circuit

Let $N=2^{n}$. Using the well-known divide-and-conquer strategy, the divide-and-conquer parallel prefix circuit, $DC(N)$, is designed according to the principle illustrated in Figure 4.3. That is, $DC(N)$ is built from two $DC(N/2)$ circuits by connecting output $1:N/2$ of the first $DC(N/2)$ to each of the outputs of the second $DC(N/2)$ at level $\operatorname{depth}(DC(N/2))+1$. Therefore, the circuit's $\operatorname{Kcap}_{eff}(N)$ can be derived from that of $DC(N/2)$, according to the following recurrence relation,


Figure 4.3: An illustration of the divide-and-conquer prefix circuit, $DC(N)$, built from $DC(N/2)$, derived from [LD94].

$$
\operatorname{Kcap}_{eff}(N)=2 \operatorname{Kcap}_{eff}\left(\frac{N}{2}\right)+\left(\operatorname{depth}\left(DC\left(\frac{N}{2}\right)\right)+1\right) \frac{N}{2}, \quad \text{with} \quad \operatorname{Kcap}_{eff}(2)=1 .
$$

The first part of $\operatorname{Kcap}_{eff}(N)$ is the constant output capacitance of the two circuits with $N/2$ inputs, while the second part is the capacitance of the last level of $DC(N)$: there are $N/2$ operation nodes at the last level, and that level is $\operatorname{depth}(DC(N/2))+1$.

Recall that the depth of $DC(N)$ is $\lg N$. Therefore, we have

$$
\begin{aligned}
\operatorname{Kcap}_{eff}(N) & =2 \operatorname{Kcap}_{eff}\left(\frac{N}{2}\right)+\left(\lg \frac{N}{2}+1\right) \frac{N}{2} \\
& =2^{2} \operatorname{Kcap}_{eff}\left(\frac{N}{2^{2}}\right)+2^{1}\left(\lg \frac{N}{2^{2}}+1\right) \frac{N}{2^{2}}+2^{0}\left(\lg \frac{N}{2^{1}}+1\right) \frac{N}{2^{1}} \\
& \;\; \vdots \\
& =2^{n-1} \operatorname{Kcap}_{eff}\left(\frac{N}{2^{n-1}}\right)+2^{n-2}\left(\lg \frac{N}{2^{n-1}}+1\right) \frac{N}{2^{n-1}}+\ldots+2^{1}\left(\lg \frac{N}{2^{2}}+1\right) \frac{N}{2^{2}}+2^{0}\left(\lg \frac{N}{2^{1}}+1\right) \frac{N}{2^{1}} \\
& =2^{n-1} \operatorname{Kcap}_{eff}(2)+2^{n-2}\left(\lg 2^{1}+1\right) 2^{1}+\ldots+2^{1}\left(\lg 2^{n-2}+1\right) 2^{n-2}+2^{0}\left(\lg 2^{n-1}+1\right) 2^{n-1} \\
& =2^{n-1}(1)+2^{n-1}(2)+\ldots+2^{n-1}(n-1)+2^{n-1}(n-1+1) \\
& =2^{n-1} \sum_{i=1}^{n} i \\
& =2^{n-1} \frac{n(n+1)}{2} \\
& =\frac{N}{4}\left((\lg N)^{2}+\lg N\right) \\
& =\mathrm{O}\left(N(\lg N)^{2}\right)
\end{aligned}
$$

We can also write $\operatorname{Kcap}_{eff}(N)$ in terms of circuit size (i.e., $s=\frac{N}{2} \lg N$) and circuit depth (i.e., $d=\lg N$), as follows.

$$
\begin{aligned}
\operatorname{Kcap}_{e f f}(N) & =\frac{N}{4}\left((\lg N)^{2}+\lg N\right) \\
& =\frac{N}{4}(\lg N)^{2}+\frac{N}{4} \lg N+\frac{N}{4}(\lg N)^{2}-\frac{N}{4}(\lg N)^{2} \\
& =\frac{N}{2}(\lg N)^{2}+\frac{1}{2} \cdot \frac{N}{2} \lg N-\frac{1}{2} \cdot \frac{N}{2}(\lg N)^{2} \\
& =s d+\frac{1}{2} s-\frac{1}{2} s d \\
& =\frac{s d+s}{2}
\end{aligned}
$$

Thus, the divide-and-conquer prefix circuit has $\mathrm{O}(N \lg N)$ size, $\mathrm{O}(\lg N)$ depth, and $\mathrm{O}\left(N(\lg N)^{2}\right)$ effective circuit capacitance under the constant output capacitance assumption.
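A quick numerical check of the divide-and-conquer result: the sketch below evaluates the recurrence directly for powers of two, in units of $C_{0}$, and compares it with the closed form and with the size-depth form. Function names are hypothetical; `int.bit_length` is used to get $\lg N$ exactly for powers of two.

```python
def lg(n):
    """Exact base-2 logarithm of a power of two."""
    return n.bit_length() - 1


def kcap_dc_recurrence(n):
    """Kcap_eff(N) = 2 Kcap_eff(N/2) + (depth(DC(N/2)) + 1) * N/2, Kcap_eff(2) = 1."""
    if n == 2:
        return 1
    half = n // 2
    return 2 * kcap_dc_recurrence(half) + (lg(half) + 1) * half


def kcap_dc_closed(n):
    """Closed form (N/4) * ((lg N)^2 + lg N)."""
    return n * (lg(n) ** 2 + lg(n)) // 4


def kcap_dc_size_depth(n):
    """Size-depth form (s*d + s)/2 with s = (N/2) lg N and d = lg N."""
    s, d = (n // 2) * lg(n), lg(n)
    return (s * d + s) // 2


table = [(n, kcap_dc_closed(n)) for n in (2, 4, 8, 16)]
```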

### 4.1.3 The Brent-Kung Parallel Prefix Circuit

Let $N=2^{n}$. The Brent-Kung prefix circuit for $N$ inputs, $BK(N)$, is also built from the Brent-Kung circuit for $N/2$ inputs, as shown in Figure 4.4. The recurrence relation for this circuit is, however, not as straightforward as for the previous two circuits. Part of the problem arises because $BK(N/2)$ occupies the middle level, which causes the level of all nodes in $BK(N/2)$ to increase. This requires taking into account the number of nodes at each level of $BK(N/2)$, and is not at all difficult to overcome. However, the major problem is the last step, where the output of $BK(N/2)$ is combined with half of the inputs, as illustrated in part $C$ of Figure 4.4. Although all these nodes appear at the last level of the circuit, in fact some of them are at lower levels.

Figure 4.4: A Brent-Kung parallel prefix circuit, $BK(N)$, divided into three parts ($o$ = odd, $e$ = even), derived from [LD94].

To determine the level of each node, we construct a table (Table 4.1) for $BK(N)$ corresponding to the circuit layout. The table is divided into three parts, $A$, $B$, and $C$, corresponding to the circuit layout in Figure 4.4. The entries of the form $\left(x_{i}, i\right)$ in the table represent the fact that level $i$ has $x_{i}$ nodes. The first row is divided into two parts, column 1 corresponding to part $B$ and column 2 corresponding to part $C$, while the second row represents part $A$. The computation of the capacitance corresponding to part $B$ is simple. In this part there are $N/2$ operation nodes, all at level 1. Hence, the capacitance of part $B$ is equal to $N/2$. The computation for part $A$ can also be achieved easily by observing that all inputs to $BK(N/2)$ are at level 1, which causes the level of each node in $BK(N/2)$ to increase by 1. Let $w_{i}$ be the number of nodes at level $i$ in $BK(N/2)$.

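Then the capacitance of part $A$, in units of $C_{0}$, is

$$
\sum_{i}(i+1) w_{i}=\operatorname{Kcap}_{eff}\left(\frac{N}{2}\right)+\sum_{i} w_{i},
$$

where $\sum_{i} w_{i}$ is the size of $BK(N/2)$.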

Note that part $C$ has ( $N / 2-1$ ) operation nodes. Though in the circuit diagram they appear to be at the last level, in fact they are distributed at different levels of the circuit. To compute the capacitance for part $C$, let capacitance of part $C$ be denoted as $K(N)$.

A row in Table 4.1 represents the first level (i.e., column 1) and the last level (i.e., column 2) of the $BK\left(N / 2^{k}\right)$ circuit, for $0 \leq k \leq \lg N-1$, after distributing all nodes of the last level to the lower levels. Let row $i$ of column 2 in Table 4.1 be $K\left(N / 2^{k}\right)$, where $1 \leq i \leq \lg N$ and $0 \leq k \leq \lg N-1$. For example, the first row (i.e., part $C$ in Table 4.1) represents the nodes that used to be at the last level of the $BK(N)$ circuit (i.e., part $C$ in Figure 4.4). The second row represents the nodes that used to be at the last level of the $BK(N/2)$ circuit (i.e., the subpart $C$ of part $A$ in Figure 4.4). The relationship of each row in the table is as follows:

Table 4.1: The constant output capacitance table for $B K(N)$.

| $\left(\frac{N}{2}, 1\right)$ | $(1,2),\ (1,3),\ (1,4),\ (1,4),\ (1,5),\ \ldots$ |
| :---: | :---: |
| $\begin{aligned} & \left(\frac{N}{2^{2}}, 2\right) \\ & \vdots \\ & \left(\frac{N}{2^{r}}, r\right) \\ & \vdots \\ & \left(\frac{N}{2^{n-2}}, n-2\right) \\ & \left(\frac{N}{2^{n-1}}, n-1\right) \\ & \left(\frac{N}{2^{n}}, n\right) \end{aligned}$ | $(1,3)$ |

- The first entry of column 2 at row $i$ is generated from the entry at row $(i+1)$ located on row $i$'s diagonal in column 1, as one operation node having the same circuit depth as row $(i+1)$'s entry (see the line in Table 4.1). For example, the entry $(1, n)$ at row $\lg N-1$ is generated from the entry $\left(N / 2^{n}, n\right)$ at row $\lg N$. This new entry at row $i$ then produces two entries at row $(i-1)$: one operation node having the same circuit depth as row $i$'s entry and one operation node one level deeper than row $i$'s entry (see the arrow in Table 4.1). For example, the entry $(1, n)$ at row $\lg N-1$ produces the entry $(1, n)$ and the entry $(1, n+1)$ at row $\lg N-2$.

Therefore, in column 2, the first row, $K(N)$, (i.e., part C in Table 4.1) is derived from the second row, $K(N / 2)$ (see Figure 4.5). The $K(N)$ is written as follows.

$$
\begin{aligned}
K(N) & =2+\left(k_{1}+1\right)+\left(k_{1}+2\right)+\left(k_{2}+1\right)+\left(k_{2}+2\right)+\ldots+\left(k_{\frac{N}{4}-1}+1\right)+\left(k_{\frac{N}{4}-1}+2\right) \\
& =2+2\left(k_{1}+k_{2}+\ldots+k_{\frac{N}{4}-1}\right)+\left(\frac{N}{4}-1\right)+2\left(\frac{N}{4}-1\right) \\
& =2 K\left(\frac{N}{2}\right)+\frac{3 N}{4}-1
\end{aligned}
$$



Figure 4.5: Part C, the distribution of $N / 2-1$ nodes.
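The distribution of the part $C$ nodes can be sketched by generating their levels recursively, following the structure of the expression for $K(N)$ above: one node at level 2, plus two nodes (one level and two levels deeper) for every node of the half-size circuit. The function names are hypothetical, and the closed form used for comparison is the one quoted in the text for $K(N)$.

```python
def part_c_levels(n):
    """Levels of the N/2 - 1 part-C nodes of BK(N), for N a power of two."""
    if n < 4:
        return []                      # BK(2) has no part-C node
    levels = [2]                       # the node that contributes the leading 2
    for k in part_c_levels(n // 2):
        levels += [k + 1, k + 2]       # each half-size node yields two nodes
    return levels


def k_recurrence(n):
    """K(N) = 2 K(N/2) + 3N/4 - 1, with K(2) = 0."""
    return 0 if n < 4 else 2 * k_recurrence(n // 2) + 3 * n // 4 - 1


def k_closed(n):
    """K(N) = (3/4) N lg N - 5N/4 + 1, written over a common denominator."""
    lg_n = n.bit_length() - 1
    return (3 * n * lg_n - 5 * n + 4) // 4


levels16 = sorted(part_c_levels(16))
```

For $N=16$ the seven part $C$ nodes sit at levels 2 through 6 rather than all at the last level, which is the effect the table is meant to capture.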

Solving for $K(N)$, we obtain $K(N)=(3/4) N \lg N-5N/4+1$. Thus $\operatorname{Kcap}_{eff}(N)$, which is the sum of the capacitances of parts $A$, $B$, and $C$, can be written as follows:

$$
\begin{aligned}
\operatorname{Kcap}_{eff}(N) &= \operatorname{Kcap}_{eff}\left(\frac{N}{2}\right)+\left[\frac{3}{2} \cdot \frac{N}{2} \cdot \lg \frac{N}{2}+2 \cdot \frac{N}{2}-\lg \frac{N}{2}-1\right] \\
&= \operatorname{Kcap}_{eff}\left(\frac{N}{2^{2}}\right)+\left[\frac{3}{2} \cdot \frac{N}{2^{2}} \cdot \lg \frac{N}{2^{2}}+2 \cdot \frac{N}{2^{2}}-\lg \frac{N}{2^{2}}-1\right]+\left[\frac{3}{2} \cdot \frac{N}{2} \cdot \lg \frac{N}{2}+2 \cdot \frac{N}{2}-\lg \frac{N}{2}-1\right] \\
& \;\; \vdots \\
&= \operatorname{Kcap}_{eff}\left(\frac{N}{2^{n-1}}\right)+\left[\frac{3}{2} \cdot \frac{N}{2^{n-1}} \cdot \lg \frac{N}{2^{n-1}}+2 \cdot \frac{N}{2^{n-1}}-\lg \frac{N}{2^{n-1}}-1\right]+\ldots \\
& \quad +\left[\frac{3}{2} \cdot \frac{N}{2^{1}} \cdot \lg \frac{N}{2^{1}}+2 \cdot \frac{N}{2^{1}}-\lg \frac{N}{2^{1}}-1\right] \\
&= \operatorname{Kcap}_{eff}(2)+\left[\frac{3}{2} \cdot \frac{N}{2^{n-1}} \cdot \lg \frac{N}{2^{n-1}}+2 \cdot \frac{N}{2^{n-1}}-\lg \frac{N}{2^{n-1}}-1\right]+\ldots \\
& \quad +\left[\frac{3}{2} \cdot \frac{N}{2^{1}} \cdot \lg \frac{N}{2^{1}}+2 \cdot \frac{N}{2^{1}}-\lg \frac{N}{2^{1}}-1\right] \\
&= 1+\left[\frac{3}{2} \cdot \frac{N}{2^{n-1}} \cdot \lg \frac{N}{2^{n-1}}+2 \cdot \frac{N}{2^{n-1}}-\lg \frac{N}{2^{n-1}}-1\right]+\ldots+\left[\frac{3}{2} \cdot \frac{N}{2^{1}} \cdot \lg \frac{N}{2^{1}}+2 \cdot \frac{N}{2^{1}}-\lg \frac{N}{2^{1}}-1\right] \\
&= 1+\frac{3}{2}\left[\left(\frac{N}{2^{n-1}} \cdot \lg \frac{N}{2^{n-1}}\right)+\left(\frac{N}{2^{n-2}} \cdot \lg \frac{N}{2^{n-2}}\right)+\ldots+\left(\frac{N}{2^{1}} \cdot \lg \frac{N}{2^{1}}\right)\right] \\
& \quad +2\left[\frac{N}{2^{n-1}}+\frac{N}{2^{n-2}}+\ldots+\frac{N}{2^{1}}\right]-\left[\lg \frac{N}{2^{n-1}}+\lg \frac{N}{2^{n-2}}+\ldots+\lg \frac{N}{2^{1}}\right]-(n-1) \\
&= \frac{3}{2} \sum_{i=0}^{n-1} i 2^{i}+2 \sum_{i=0}^{n-1} 2^{i}-\sum_{i=0}^{n-1} i-n \\
&= \frac{3}{2}\left[2+n \cdot 2^{n}-2 \cdot 2^{n}\right]+2\left[2^{n}-1\right]-\frac{n(n-1)}{2}-n \\
&= \left[3+\frac{3}{2} n \cdot 2^{n}-3 \cdot 2^{n}\right]+\left[2 \cdot 2^{n}-2\right]-\frac{n^{2}}{2}+\frac{n}{2}-n \\
&= 1-N+\frac{3}{2} N \lg N-\frac{(\lg N)^{2}}{2}-\frac{\lg N}{2} \\
&= \mathrm{O}(N \lg N)
\end{aligned}
$$

We can also write $\operatorname{Kcap}_{eff}(N)$ in terms of size and depth. Thus,

$$
\begin{aligned}
\operatorname{Kcap}_{eff}(N) & =\left[1+\frac{3}{2} N \lg N\right]-\frac{1}{2}\left[2 N+(\lg N)^{2}+\lg N\right] \\
& =\left[2 N \lg N-(\lg N)^{2}-\frac{3}{2} \lg N-N+1\right]-\frac{1}{2} \lg N[N-\lg N-2] \\
& =\left[\frac{s d+s}{2}\right]-\left[\left(\frac{d}{4}+\frac{1}{2}\right) \cdot\left(\frac{s}{2}-\frac{d}{4}-\frac{3}{2}\right)\right] \\
& =\frac{3 s d}{8}+\frac{d^{2}}{16}+\frac{s}{4}+\frac{d}{2}+\frac{3}{4}
\end{aligned}
$$

Similar to the previous circuits, the constant output capacitance is $\mathrm{O}(s d)(\mathrm{O}(N \lg N))$.
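The Brent-Kung derivation can be cross-checked numerically. The sketch below evaluates the recurrence, the closed form, and the size-depth form for powers of two, in units of $C_{0}$. The values $s=2N-2-\lg N$ and $d=2\lg N-2$ used for the size and depth are not quoted in this section; they are assumptions inferred from the algebra of the size-depth rewrite above. Function names are hypothetical.

```python
from fractions import Fraction


def lg(n):
    """Exact base-2 logarithm of a power of two."""
    return n.bit_length() - 1


def kcap_bk_recurrence(n):
    """Kcap_eff(N) = Kcap_eff(N/2) + [(3/2) m lg m + 2m - lg m - 1], m = N/2."""
    if n == 2:
        return 1
    m = n // 2
    return kcap_bk_recurrence(m) + (3 * m * lg(m)) // 2 + 2 * m - lg(m) - 1


def kcap_bk_closed(n):
    """Closed form 1 - N + (3/2) N lg N - ((lg N)^2 + lg N) / 2."""
    return 1 - n + (3 * n * lg(n)) // 2 - (lg(n) ** 2 + lg(n)) // 2


def kcap_bk_size_depth(n):
    """3sd/8 + d^2/16 + s/4 + d/2 + 3/4, with assumed s and d (see lead-in)."""
    s = 2 * n - 2 - lg(n)
    d = 2 * lg(n) - 2
    return (Fraction(3 * s * d, 8) + Fraction(d * d, 16)
            + Fraction(s, 4) + Fraction(d, 2) + Fraction(3, 4))


table = [(n, kcap_bk_closed(n)) for n in (2, 4, 8, 16)]
```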

### 4.1.4 The Ladner-Fischer Parallel Prefix Circuit

As described in Chapter 2, Ladner and Fischer [LF80] introduced the family of circuits $LF_{k}(N)$, where $0 \leq k \leq\lceil\lg N\rceil$. Different values of $k$ give different prefix circuit structures. However, the $LF_{k}(N)$ circuits are bounded by the divide-and-conquer prefix circuit and the Brent-Kung prefix circuit. The $LF_{0}(N)$ prefix circuit has the shortest depth and the largest size among the family of circuits $LF_{k}(N)$. The $LF_{0}(N)$ prefix circuit also has the same depth as the divide-and-conquer circuit, and the two circuits have similar structures for small $N$, but the size of the $LF_{0}(N)$ circuit is smaller than that of the divide-and-conquer circuit when $N$ is larger. Thus, the constant output capacitance of the $LF_{k}(N)$ circuit is lower bounded by the constant output capacitance of the divide-and-conquer circuit. The $LF_{k}(N)$ prefix circuit behaves like the Brent-Kung prefix circuit when $k \geq \lg N-2$. Therefore, the upper bound of the constant output capacitance of the $LF_{k}(N)$ prefix circuit is that of the Brent-Kung circuit.

To summarize, the effective capacitance under the constant output capacitance assumption is in the range $\operatorname{Kcap}_{eff}(DC(N)) \leq \operatorname{Kcap}_{eff}\left(LF_{k}(N)\right) \leq \operatorname{Kcap}_{eff}(BK(N))$. That is, $N\left((\lg N)^{2}+\lg N\right) / 4 \leq \operatorname{Kcap}_{eff}\left(LF_{k}(N)\right) \leq[1+(3 N \lg N) / 2]-\left[2 N+(\lg N)^{2}+\lg N\right] / 2$.

### 4.1.5 The Snir Parallel Prefix Circuit

The Snir parallel prefix circuit, $S N(N)$, is composed of two parts as shown in Figure 4.6. The two parts consist of the compressed layered prefix circuit, $\operatorname{CR}\left(N_{1}\right)$, and the serial prefix circuit, $S\left(N_{2}\right)$, where $N=N_{1}+N_{2}-1$. Therefore, the capacitance is computed by summing the capacitance of these two parts.


Figure 4.6: The $\operatorname{SN}(N)$ circuit, $S N(N)=C R\left(N_{1}\right) \cdot S\left(N_{2}\right)$.

When $N=2^{n}$, the Brent-Kung prefix circuit is the compressed layered prefix circuit. Therefore, we can use the Brent-Kung prefix circuit's capacitance formula as the compressed layered prefix circuit's capacitance formula. When $N \neq 2^{n}$, this formula overestimates the capacitance by less than 7\% (see Appendix E).

In the Snir prefix circuit, the capacitance of the first part can be computed by using the formula from the Brent-Kung parallel prefix circuit described in Section 4.1.3 whereas the capacitance of the second part can be computed by starting from the level of the last output of Part 1 (i.e., $\left\lceil\lg N_{1}\right\rceil$ ). There are $N_{2}-1$ operations. Each operation is at a succeeding level. Therefore, the capacitance of the second part is given by

$$
\begin{aligned}
\operatorname{Kcap}_{eff}\left(S\left(N_{2}\right)\right) & =\sum_{i=1}^{N_{2}-1}\left(\left\lceil\lg N_{1}\right\rceil+i\right) \\
& =\left(\left(N_{2}-1\right)\left\lceil\lg N_{1}\right\rceil+\sum_{i=1}^{N_{2}-1} i\right) \\
& =\left(N_{2}\left\lceil\lg N_{1}\right\rceil-\left\lceil\lg N_{1}\right\rceil+\frac{N_{2}^{2}-N_{2}}{2}\right)
\end{aligned}
$$

The constant output capacitance of the circuit $S N(N)$ is given by

$$
\begin{aligned}
\operatorname{Kcap}_{eff}(\operatorname{SN}(N))= & \operatorname{Kcap}_{eff}\left(\operatorname{CR}\left(N_{1}\right)\right)+\operatorname{Kcap}_{eff}\left(S\left(N_{2}\right)\right) \\
= & {\left[1+\frac{3}{2} N_{1}\left\lceil\lg N_{1}\right\rceil\right]-\frac{1}{2}\left[2 N_{1}+\left\lceil\lg N_{1}\right\rceil^{2}+\left\lceil\lg N_{1}\right\rceil\right]+} \\
& N_{2}\left\lceil\lg N_{1}\right\rceil-\left\lceil\lg N_{1}\right\rceil+\frac{N_{2}^{2}-N_{2}}{2}
\end{aligned}
$$

Clearly, capacitance of the circuit $S N(N)$ is $\mathrm{O}(N \lg N)$.
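The two-part sum for $SN(N)$ can be evaluated as follows. The sketch below, in units of $C_{0}$, checks the closed form of the serial part against its defining sum and adds the Brent-Kung formula with $\lceil\lg N_{1}\rceil$ for the first part, as the text adopts. Function names are hypothetical, and `Fraction` is used because the term $\frac{3}{2} N_{1}\lceil\lg N_{1}\rceil$ need not be an integer when $N_{1}$ is not a power of two.

```python
from fractions import Fraction


def ceil_lg(n):
    """Exact ceil(lg n) for a positive integer n."""
    return (n - 1).bit_length()


def kcap_cr(n1):
    """Brent-Kung closed form evaluated with t = ceil(lg N1)."""
    t = ceil_lg(n1)
    return 1 + Fraction(3 * n1 * t, 2) - Fraction(2 * n1 + t * t + t, 2)


def kcap_tail_sum(n1, n2):
    """Serial part: sum over i = 1..N2-1 of (ceil(lg N1) + i)."""
    t = ceil_lg(n1)
    return sum(t + i for i in range(1, n2))


def kcap_tail_closed(n1, n2):
    """Closed form N2*t - t + (N2^2 - N2)/2 of the serial part."""
    t = ceil_lg(n1)
    return n2 * t - t + (n2 * n2 - n2) // 2


def kcap_sn(n1, n2):
    """Total constant-output capacitance of SN(N), N = N1 + N2 - 1."""
    return kcap_cr(n1) + kcap_tail_closed(n1, n2)


example = kcap_sn(8, 3)   # CR(8) followed by S(3)
```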

### 4.1.6 The Shih-Lin Parallel Prefix Circuit

The $SL(N)$ parallel prefix circuit [LS99] is composed of two parts consisting of the layered prefix circuit, $CR\left(N_{1}\right)$, and the serial prefix circuit, $S\left(N_{2}\right)$, where $N=N_{1}+N_{2}-1$ (see Figure 4.7). Therefore, the capacitance is computed by summing the capacitance of these two parts. As discussed in the previous section, the capacitance of the first part can be computed by using the formula for the Brent-Kung parallel prefix circuit described in Section 4.1.3, whereas the constant capacitance of the second part can be computed from the capacitance of the serial prefix circuit and the capacitance of the connecting nodes, starting from the level of the last output of Part 1 (i.e., $\left\lceil\lg N_{1}\right\rceil$). Therefore, the constant capacitance of the second part is, as before, $N_{2}\left\lceil\lg N_{1}\right\rceil-\left\lceil\lg N_{1}\right\rceil+\left(N_{2}^{2}-N_{2}\right) / 2$.


Figure 4.7: The $S L(N)$ circuit, $S L(N)=C R\left(N_{1}\right) \cdot S\left(N_{2}\right)$.

The constant output capacitance of the circuit $S L(N)$ is

$$
\begin{aligned}
\operatorname{Kcap}_{eff}(\operatorname{SL}(N))= & \operatorname{Kcap}_{eff}\left(CR\left(N_{1}\right)\right)+\operatorname{Kcap}_{eff}\left(S\left(N_{2}\right)\right) \\
= & {\left[1+\frac{3}{2} N_{1}\left\lceil\lg N_{1}\right\rceil\right]-\frac{1}{2}\left[2 N_{1}+\left\lceil\lg N_{1}\right\rceil^{2}+\left\lceil\lg N_{1}\right\rceil\right]+} \\
& N_{2}\left\lceil\lg N_{1}\right\rceil-\left\lceil\lg N_{1}\right\rceil+\frac{N_{2}^{2}-N_{2}}{2}
\end{aligned}
$$

Clearly, the effective capacitance of the circuit $S L(N)$ under the constant output capacitance assumption is also $\mathrm{O}(N \lg N)$.

### 4.1.7 The LYD Parallel Prefix Circuit

The LYD parallel prefix circuit, $\operatorname{LYD}(N)$, is composed of four parts: the layered prefix circuit $CR\left(N_{1}\right)$, $Q\left(N_{2}\right)$, $S\left(N_{3}\right)$, and $S\left(N_{4}\right)$, where $N=N_{1}+N_{2}+N_{3}+N_{4}$ (see Figure 4.8). Therefore, the capacitance is computed by summing the capacitance of these four parts. The capacitance of Part 1 can be computed by using the formula for the Brent-Kung parallel prefix circuit described in Section 4.1.3. The capacitance of Part 2, Part 3, and Part 4 can be computed by starting from the level of the last output of the first part, the second part, and the third part, respectively.

Part 1 is the $\operatorname{CR}\left(N_{1}\right)$ circuit. Thus, the capacitance of Part 1 is

$$
\left[1+\frac{3}{2} N_{1}\left(\lg N_{1}\right)\right]-\frac{1}{2}\left[2 N_{1}+\left(\lg N_{1}\right)^{2}+\left(\lg N_{1}\right)\right] .
$$

Part 2 is the $Q\left(N_{2}\right)$ circuit. Let $t$ be $\left\lceil\lg N_{1}\right\rceil$. For level $i=1$ to $t$, level $i$ has $(t-i+1)$ operation nodes. Thus, the capacitance of these levels is $\sum_{i=1}^{t} i \cdot(t-i+1)$.

At level $(t+1)$, there are $(t+1)$ operation nodes.
At level $(t+2)$, there are $t(t-1) / 2$ operation nodes.
Thus, the capacitance of Part 2 is

$$
\begin{aligned}
& =\sum_{i=1}^{t} i \cdot(t-i+1)+(t+1) \cdot\left(\left\lceil\lg N_{1}\right\rceil+1\right)+\frac{t(t-1)}{2} \cdot\left(\left\lceil\lg N_{1}\right\rceil+2\right) \\
& =\left(t \sum_{i=1}^{t} i-\sum_{i=1}^{t} i^{2}+\sum_{i=1}^{t} i\right)+\left(t \cdot\left\lceil\lg N_{1}\right\rceil+t+\left\lceil\lg N_{1}\right\rceil+1\right)+\frac{1}{2}\left(t^{2}\left\lceil\lg N_{1}\right\rceil+2 t^{2}-t\left\lceil\lg N_{1}\right\rceil-2 t\right) \\
& =\left(\frac{t^{3}+3 t^{2}+2 t}{6}\right)+\left(t \cdot\left\lceil\lg N_{1}\right\rceil+t+\left\lceil\lg N_{1}\right\rceil+1\right)+\frac{1}{2}\left(t^{2}\left\lceil\lg N_{1}\right\rceil+2 t^{2}-t\left\lceil\lg N_{1}\right\rceil-2 t\right) \\
& =\frac{2}{3} t^{3}+2 t^{2}+\frac{4}{3} t+1 \\
& =\frac{2}{3}\left\lceil\lg N_{1}\right\rceil^{3}+2\left\lceil\lg N_{1}\right\rceil^{2}+\frac{4}{3}\left\lceil\lg N_{1}\right\rceil+1
\end{aligned}
$$



Figure 4.8: The structure of $\operatorname{LYD}(N)$, derived from [LD94].

Part 3 is the $S\left(N_{3}\right)$ circuit. Thus, the capacitance of Part 3 is

$$
\begin{aligned}
& \sum_{i=1}^{N_{3}}\left(\left\lceil\lg N_{1}\right\rceil+1+i\right) \\
& =N_{3}\left\lceil\lg N_{1}\right\rceil+N_{3}+\frac{N_{3}\left(N_{3}+1\right)}{2}
\end{aligned}
$$

Part 4 is the $S\left(N_{4}\right)$ circuit. Thus, the capacitance of Part 4 is

$$
\begin{aligned}
& \sum_{i=1}^{N_{4}-1} i+N_{4}\left(\left\lceil\lg N_{1}\right\rceil+2+N_{3}\right) \\
& =\frac{N_{4}\left(N_{4}-1\right)}{2}+N_{4}\left(\left\lceil\lg N_{1}\right\rceil+2+N_{3}\right)
\end{aligned}
$$

Therefore, $L Y D(N)$ 's capacitance is calculated from the sum of Part 1, Part 2, Part 3, and Part 4 derived as follows

$$
\begin{aligned}
\operatorname{Kcap}_{\text{eff}}(N)= & \left[1+\frac{3}{2} N_{1} \lg N_{1}\right]-\frac{1}{2}\left[2 N_{1}+\left(\lg N_{1}\right)^{2}+\lg N_{1}\right]+ \\
& \left[\frac{2}{3}\left\lceil\lg N_{1}\right\rceil^{3}+2\left\lceil\lg N_{1}\right\rceil^{2}+\frac{4}{3}\left\lceil\lg N_{1}\right\rceil+1\right]+ \\
& \left[\left(N_{3}+N_{4}\right)\left(\left\lceil\lg N_{1}\right\rceil+\frac{3}{2}\right)+\frac{1}{2}\left(N_{3}^{2}+N_{4}^{2}\right)+N_{3} N_{4}\right]
\end{aligned}
$$

Clearly, the capacitance of the circuit $LYD(N)$ is also $O(N\lceil\lg N\rceil)$.
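
As a sanity check on the algebra for Parts 2 to 4, the following Python sketch compares the direct level-by-level sums with their closed forms. It is only an illustrative check: the function names are hypothetical, and the use of exact rational arithmetic (`fractions.Fraction`) is a choice made here, not part of the derivation.

```python
from fractions import Fraction


def part2_sum(t):
    """Direct sum for Q(N2) with t = ceil(lg N1): levels 1..t, t+1 and t+2."""
    levels_1_to_t = sum(i * (t - i + 1) for i in range(1, t + 1))
    level_t_plus_1 = (t + 1) * (t + 1)
    level_t_plus_2 = Fraction(t * (t - 1), 2) * (t + 2)
    return levels_1_to_t + level_t_plus_1 + level_t_plus_2


def part2_closed(t):
    """Closed form (2/3)t^3 + 2t^2 + (4/3)t + 1."""
    return Fraction(2, 3) * t**3 + 2 * t**2 + Fraction(4, 3) * t + 1


def parts34_sum(t, n3, n4):
    """Direct sums for the two serial parts S(N3) and S(N4)."""
    part3 = sum(t + 1 + i for i in range(1, n3 + 1))
    part4 = sum(range(1, n4)) + n4 * (t + 2 + n3)
    return part3 + part4


def parts34_closed(t, n3, n4):
    """Combined closed form used in the last bracket of Kcap_eff(N)."""
    return (n3 + n4) * (t + Fraction(3, 2)) + Fraction(n3**2 + n4**2, 2) + n3 * n4
```

Calling `part2_sum(t) == part2_closed(t)` for small `t`, and the analogous comparison for Parts 3 and 4, should return `True` if the simplification steps are consistent.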

Table 4.2 lists the expressions for the effective circuit capacitance of the prefix circuits considered, under the constant output capacitance assumption.

Table 4.2. Expression of the constant output capacitance.

| Prefix Circuit | $\operatorname{Kcap}_{\text{eff}}(N)$ |
| :---: | :---: |
| Serial | $\left\{\frac{N(N-1)}{2}\right\} C_{0}$ |
| Divide-and-Conquer | $\left\{\frac{N}{4}\left((\lg N)^{2}+\lg N\right)\right\} C_{0}$ |
| Brent-Kung | $\left\{1+\frac{3}{2} N \lg N-\frac{1}{2}\left[2 N+(\lg N)^{2}+\lg N\right]\right\} C_{0}$ |
| $LF_{k}$ | $\left\{1+\frac{3}{2} N \lg N-\frac{1}{2}\left[2 N+(\lg N)^{2}+\lg N\right]\right\} C_{0} \leq LF_{k} \leq\left\{\frac{N}{4}\left((\lg N)^{2}+\lg N\right)\right\} C_{0}$ |
| Snir, Shih-Lin | $\left\{\left[1+\frac{3}{2} N_{1} \lg N_{1}\right]-\frac{1}{2}\left[2 N_{1}+\left(\lg N_{1}\right)^{2}+\lg N_{1}\right]+\left[N_{2}\left\lceil\lg N_{1}\right\rceil-\left\lceil\lg N_{1}\right\rceil+\frac{N_{2}^{2}-N_{2}}{2}\right]\right\} C_{0}$ |
| LYD | $\begin{aligned} & \Big\{\left[1+\frac{3}{2} N_{1} \lg N_{1}\right]-\frac{1}{2}\left[2 N_{1}+\left(\lg N_{1}\right)^{2}+\lg N_{1}\right]+\left[\frac{2}{3}\left\lceil\lg N_{1}\right\rceil^{3}+2\left\lceil\lg N_{1}\right\rceil^{2}+\frac{4}{3}\left\lceil\lg N_{1}\right\rceil+1\right]+ \\ & \left[\left(N_{3}+N_{4}\right)\left(\left\lceil\lg N_{1}\right\rceil+\frac{3}{2}\right)+\frac{1}{2}\left(N_{3}^{2}+N_{4}^{2}\right)+N_{3} N_{4}\right]\Big\} C_{0} \end{aligned}$ |

### 4.2. Step 2 - Capacitance of Residual Circuit

We have assumed that a node with fan-out $k \geq 1$ has a physical output capacitance of $C_{0}+(k-1) C^{\prime}$. However, the capacitances computed in Section 4.1 for the various circuits are based on the assumption that the capacitance of each node is $C_{0}$ irrespective of the fan-out of the node. We still need to account for the component $(k-1) C^{\prime}$ of a node with fan-out $k>1$. To obtain this value, we introduce the concept of the residual circuit. The residual circuit of a prefix circuit is the circuit obtained by eliminating one of the fan-outs from each operation node of the given prefix circuit. For example, Figure 4.13 shows the residual circuit of the divide-and-conquer prefix circuit. This residual circuit is the result of removing one of the fan-outs (i.e., the vertical fan-out) from each operation node of the circuit in Figure 4.12. We can compute the capacitance of this residual circuit, $\operatorname{Rcap}_{\text{eff}}(N)$, by assuming a constant output capacitance (i.e., $C^{\prime}$) for all operation nodes. We then construct the residual circuit of the current residual circuit by removing one fan-out from each operation node and compute its residual output capacitance. We continue accumulating the capacitances after every reduction until there are no more links to remove. Thus, the total effective circuit capacitance of the prefix circuit under the linear output capacitance assumption is given by

$$
\operatorname{cap}_{e f f}(N)=K \operatorname{cap}_{e f f}(N) C_{0}+R \operatorname{cap}_{e f f}(N) C^{\prime}
$$
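
The reduction procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the dissertation: it assumes each operation node is described by a hypothetical `(level, fan_out)` pair and that a node at level $i$ contributes $i$ units per unit of output capacitance, as in the level-weighted sums used throughout this chapter. The helper `serial_nodes` further assumes that the last node of the serial circuit drives only its own output.

```python
def kcap(nodes):
    """Constant-output term: every node contributes its level once (units of C0)."""
    return sum(level for level, _ in nodes)


def rcap(nodes):
    """Residual term: repeatedly remove one fan-out per node and accumulate."""
    total = 0
    # First residual circuit: one fan-out removed from every node.
    remaining = [(level, fan_out - 1) for level, fan_out in nodes]
    while any(f > 0 for _, f in remaining):
        # Every node that still drives a link adds its level (units of C').
        total += sum(level for level, f in remaining if f > 0)
        remaining = [(level, f - 1) for level, f in remaining]
    return total


def cap_eff(nodes, c0, c_prime):
    """cap_eff = Kcap_eff * C0 + Rcap_eff * C'."""
    return kcap(nodes) * c0 + rcap(nodes) * c_prime


def serial_nodes(n):
    """Assumed description of S(n): a chain of n-1 nodes at levels 1..n-1,
    fan-out two everywhere except the last node, which has fan-out one."""
    return [(i, 2) for i in range(1, n - 1)] + [(n - 1, 1)]
```

Because each pass removes exactly one fan-out per node, the accumulated value equals the sum of `level * (fan_out - 1)` over all nodes, which is the $(k-1) C^{\prime}$ component weighted by level.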

### 4.2.1. The Serial Prefix Circuit

The layout of the serial prefix circuit for $N$ inputs, $S(N)$, with fan-out shown in solid lines is illustrated in Figure 4.9. Each operation node has a fan-out of exactly two. The residual circuit obtained after removing the vertical fan-out is shown in Figure 4.10. Each operation node, except the last one, has exactly one fan-out. As shown in Figure 4.11, the residual circuit with $N$ inputs can be built from the residual circuit with $N-1$ inputs with the following recurrence relation:

$$
\operatorname{Rcap}_{\text{eff}}(N)=\operatorname{Rcap}_{\text{eff}}(N-1)+\operatorname{depth}(S(N-1)) \cdot 1, \quad \text { with } \quad \operatorname{Rcap}_{\text{eff}}(2)=0
$$



Figure 4.9: The serial prefix circuit for $N$ inputs with fan-out shown in solid lines.


Figure 4.10: The residual circuit of the serial prefix circuit, $S(N)$, shown in solid lines.


Figure 4.11: An illustration of the residual circuit of the $S(N)$, built from $S(N-1)$.

Since $\operatorname{depth}(S(N-1))=N-2$, solving this recurrence, we get

$$
\begin{aligned}
\operatorname{Rcap}_{\text{eff}}(N) & =\operatorname{Rcap}_{\text{eff}}(N-1)+(N-2) \\
& =\operatorname{Rcap}_{\text{eff}}(N-2)+(N-3)+(N-2) \\
& \;\;\vdots \\
& =\operatorname{Rcap}_{\text{eff}}(N-N+2)+(N-N+1)+\ldots+(N-4)+(N-3)+(N-2) \\
& =\operatorname{Rcap}_{\text{eff}}(2)+1+\ldots+(N-4)+(N-3)+(N-2) \\
& =0+\sum_{i=1}^{N-2} i \\
& =\frac{(N-1)(N-2)}{2}
\end{aligned}
$$

The size $s$ and the depth $d$ of the serial circuit are both $N-1$. Therefore, $\operatorname{Rcap}_{\text{eff}}(N)$ can be written as

$$
\begin{aligned}
\operatorname{Rcap}_{\text{eff}}(N) & =\frac{(N-1)(N-2)}{2} \\
& =\frac{s(d-1)}{2} \\
& =\frac{s d-s}{2}
\end{aligned}
$$

Thus, using the linear output capacitance assumption, the effective capacitance for the serial prefix circuit is as follows.

$$
\begin{aligned}
& \operatorname{cap}_{\text{eff}}(N)=\left\{\frac{N(N-1)}{2}\right\} C_{0}+\left\{\frac{(N-1)(N-2)}{2}\right\} C^{\prime}, \text { or } \\
& \operatorname{cap}_{\text{eff}}(N)=\left\{\frac{s d+s}{2}\right\} C_{0}+\left\{\frac{s d-s}{2}\right\} C^{\prime},
\end{aligned}
$$

where $s$ and $d$ are the size and depth of the circuit, respectively, both equal to $N-1$. Hence, the serial prefix circuit has $\mathrm{O}(N)$ size, $\mathrm{O}(N)$ depth, and $\mathrm{O}\left(N^{2}\right)$ effective circuit capacitance under the linear output capacitance assumption.
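
The recurrence for the serial residual circuit and its closed form can be compared numerically. The sketch below is illustrative only; the function names are hypothetical and `c0` and `c_prime` stand for $C_{0}$ and $C^{\prime}$.

```python
def rcap_serial_recurrence(n):
    """Rcap_eff(N) = Rcap_eff(N-1) + depth(S(N-1)), with Rcap_eff(2) = 0."""
    total = 0
    for m in range(3, n + 1):
        total += m - 2  # depth(S(m-1)) = m - 2
    return total


def rcap_serial_closed(n):
    """Closed form (N-1)(N-2)/2."""
    return (n - 1) * (n - 2) // 2


def cap_eff_serial(n, c0, c_prime):
    """Size/depth form: {(sd+s)/2} C0 + {(sd-s)/2} C' with s = d = N-1."""
    s = d = n - 1
    return (s * d + s) / 2 * c0 + (s * d - s) / 2 * c_prime
```

For example, `rcap_serial_recurrence(8)` and `rcap_serial_closed(8)` both evaluate the sum $1+2+\ldots+6$.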

### 4.2.2. The Divide-and-Conquer Parallel Prefix Circuit

The layout of the divide-and-conquer prefix circuit for $N$ inputs, $DC(N)$, with fan-outs shown in solid lines, is illustrated in Figure 4.12. The operation node at level $\operatorname{depth}(DC(N/2))$ has the maximum fan-out, which is $(N/2+1)$. After removing the vertical fan-outs, the residual circuit is derived as shown in Figure 4.13. In the residual circuit, the operation node at level $\operatorname{depth}(DC(N/2))$ has the maximum fan-out, which is $(N/2)$.

Let $N=2^{n}$. Using the divide-and-conquer strategy, the residual circuit of $DC(N)$ with $N$ inputs can be computed from the circuit of $N/2$ inputs. Hence, we can write the recurrence to compute the capacitance of the residual circuit of the parallel prefix circuit $DC(N)$, according to the principle illustrated in Figure 4.13, as follows:

$$
\operatorname{Rcap}_{\text{eff}}(N)=2 \operatorname{Rcap}_{\text{eff}}\left(\frac{N}{2}\right)+\frac{N}{2} \lg \frac{N}{2}, \quad \text { with } \quad \operatorname{Rcap}_{\text{eff}}(2)=0
$$



Figure 4.12: The divide-and-conquer prefix circuit, $D C(N)$, with fan-outs shown in solid lines, derived from [LD94].


Figure 4.13: The residual circuit of the divide-and-conquer prefix circuit, $D C(N)$, shown in solid lines.

Note that the first part of $\operatorname{Rcap}_{\text{eff}}(N)$ is the load capacitance of the two circuits with $N/2$ inputs, while the second part is the cost of merging these two circuits.

Solving the recurrence, we get

$$
\begin{aligned}
\operatorname{Rcap}_{\text{eff}}(N)= & 2 \operatorname{Rcap}_{\text{eff}}\left(\frac{N}{2}\right)+\frac{N}{2} \lg \frac{N}{2} \\
= & 2^{2} \operatorname{Rcap}_{\text{eff}}\left(\frac{N}{2^{2}}\right)+2^{1} \frac{N}{2^{2}} \lg \frac{N}{2^{2}}+2^{0} \frac{N}{2^{1}} \lg \frac{N}{2^{1}} \\
& \vdots \\
= & 2^{n-1} \operatorname{Rcap}_{\text{eff}}\left(\frac{N}{2^{n-1}}\right)+2^{n-2} \frac{N}{2^{n-1}} \lg \frac{N}{2^{n-1}}+\ldots+2^{1} \frac{N}{2^{2}} \lg \frac{N}{2^{2}}+2^{0} \frac{N}{2^{1}} \lg \frac{N}{2^{1}} \\
= & 2^{n-1} \operatorname{Rcap}_{\text{eff}}(2)+2^{n-2} 2^{1} \lg 2^{1}+\ldots+2^{1} 2^{n-2} \lg 2^{n-2}+2^{0} 2^{n-1} \lg 2^{n-1} \\
= & 2^{n-1}(1)+2^{n-1}(2)+\ldots+2^{n-1}(n-2)+2^{n-1}(n-1) \\
= & 2^{n-1} \sum_{i=1}^{n-1} i \\
= & 2^{n-1} \frac{n(n-1)}{2} \\
= & \frac{N}{4}\left((\lg N)^{2}-\lg N\right) \\
= & \mathrm{O}\left(N(\lg N)^{2}\right)
\end{aligned}
$$

We can also write the capacitance of the residual circuit as a function of size and depth as follows.

$$
\begin{aligned}
\operatorname{Rcap}_{\text{eff}}(N) & =\frac{N}{4}\left((\lg N)^{2}-\lg N\right) \\
& =\frac{N}{4}(\lg N)^{2}-\frac{N}{4} \lg N+\frac{N}{4}(\lg N)^{2}-\frac{N}{4}(\lg N)^{2} \\
& =\frac{N}{2}(\lg N)^{2}-\frac{1}{2} \cdot \frac{N}{2} \lg N-\frac{1}{2} \cdot \frac{N}{2}(\lg N)^{2} \\
& =s d-\frac{1}{2} s-\frac{1}{2} s d \\
& =\frac{s d-s}{2}
\end{aligned}
$$

Thus, using the linear output capacitance assumption, the effective capacitance for the divide-and-conquer prefix circuit is as follows.

$$
\begin{aligned}
& \operatorname{cap}_{\text{eff}}(N)=\left\{\frac{N}{4}\left((\lg N)^{2}+\lg N\right)\right\} C_{0}+\left\{\frac{N}{4}\left((\lg N)^{2}-\lg N\right)\right\} C^{\prime}, \text { or } \\
& \operatorname{cap}_{\text{eff}}(N)=\left\{\frac{s d+s}{2}\right\} C_{0}+\left\{\frac{s d-s}{2}\right\} C^{\prime},
\end{aligned}
$$

where $s=\frac{N}{2} \lg N$ and $d=\lg N$.

To summarize, the divide-and-conquer prefix circuit has $\mathrm{O}(N \lg N)$ size, $\mathrm{O}(\lg N)$ depth, and $\mathrm{O}\left(N(\lg N)^{2}\right)$ effective circuit capacitance under the linear output capacitance assumption.
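
The divide-and-conquer recurrence can be checked against its closed form and against the size/depth form for powers of two. This is a small illustrative sketch with hypothetical function names, not part of the original derivation.

```python
def lg(n):
    """Base-2 logarithm of a power of two, as an integer."""
    return n.bit_length() - 1


def rcap_dc_recurrence(n):
    """Rcap_eff(N) = 2 Rcap_eff(N/2) + (N/2) lg(N/2), with Rcap_eff(2) = 0."""
    if n == 2:
        return 0
    half = n // 2
    return 2 * rcap_dc_recurrence(half) + half * lg(half)


def rcap_dc_closed(n):
    """Closed form (N/4)((lg N)^2 - lg N)."""
    k = lg(n)
    return n * (k * k - k) // 4


def rcap_dc_size_depth(n):
    """Size/depth form (sd - s)/2 with s = (N/2) lg N and d = lg N."""
    d = lg(n)
    s = n // 2 * d
    return (s * d - s) // 2
```

All three functions should agree for every $N=2^{n}$ with $n \geq 1$.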

### 4.2.3 The Brent-Kung Parallel Prefix Circuit

Let $N=2^{n}$. After removing the vertical fan-out from the $BK(N)$ circuit, the residual circuit of $BK(N)$ for $N$ inputs can be computed from the circuit of $N/2$ inputs, as shown in Figure 4.14. The recurrence relation for this residual circuit is, however, not as straightforward as for the previous two circuits. The number of remaining fan-outs of a node at level $i$ is the number of connection links to the node at level $i+1$. Thus, the problem is reduced to the $BK(N)$ circuit shifted up one level (i.e., the $i^{\text{th}}$ level is the $(i+1)^{\text{th}}$ level). The process to compute the residual circuit is like the process to compute the $BK(N)$ circuit in Section 4.1.3. Similarly, the major problem is the last step, where the output of $BK(N/2)$ is combined with half of the inputs, as illustrated in part $C$ of Figure 4.14. To determine exactly the level of each remaining fan-out of a node in part $C$, we construct a table (Table 4.3) for the residual circuit of $BK(N)$ corresponding to the residual circuit layout. The table is divided into three parts, $A$, $B$, and $C$, corresponding to the residual circuit layout in Figure 4.14.

Figure 4.14: The residual network of the Brent-Kung parallel prefix circuit, $BK(N)$, divided into 3 parts.

As in Section 4.1.3, an entry of the form $\left(x_{i}, i\right)$ in the table represents the fact that the node at level $i$ has $x_{i}$ fan-outs. The first row is divided into two parts: column 1 corresponds to part $B$, and column 2 corresponds to part $C$. The contribution of part $B$ to the residual network is zero since, after shifting up, all nodes at level 1 are removed. The contribution of part $A$ can be obtained easily by observing that all operation nodes of the residual circuit of $BK(N/2)$ are at one lower level, meaning that the level of each node in the residual circuit of $BK(N/2)$ is the same as its level in the $BK(N/2)$ circuit.

Let $l_{i}$ be the number of nodes at level $i$ in $BK(N/2)$. Then, the residual network of part $A$ is $\sum_{i=1}^{\operatorname{depth}(N/2)} i\, l_{i}=\operatorname{Kcap}_{\text{eff}}(N/2)$. To compute the capacitance for part $C$, note that part $C$ has $\frac{N}{2}-1$ operation nodes distributed at different levels of the circuit. Let the capacitance of part $C$ be denoted as $K(N)$.

Table 4.3: The residual circuit table for $B K(N)$

Let row $i$ of column 2 in Table 4.3 be $K\left(N/2^{k}\right)$, where $0 \leq i \leq \lg N-1$ and $0 \leq k \leq \lg N-1$. Each row represents the last level of the residual circuit of the $BK\left(N/2^{k}\right)$ circuit, for $0 \leq k \leq \lg N-1$, after level adjustment. For example, the first row (i.e., part $C$ in Table 4.3) represents the last level of the residual circuit of the $BK(N)$ circuit (i.e., part $C$ in Figure 4.14). The second row represents the last level of the residual circuit of the $BK(N/2)$ circuit (i.e., the subpart $C$ of part $A$ in Figure 4.14). The relationship between the rows of the table is as follows:

- The first entry of column 2 at row $i$ is generated from the entry at row $(i+1)$, located on the diagonal of row $i$ in column 1, as one operation node having the same circuit depth as the entry of row $(i+1)$ (see the line $\mid$). For example, the entry $(1, n-1)$ at row $\lg N-2$ is generated from the entry $\left(N/2^{n}, n-1\right)$ at row $\lg N-1$. This new output entry at row $i$ then produces two entries at row $(i-1)$: one operation node having the same circuit depth as the entry of row $i$ and one operation node having a circuit depth one greater than the entry of row $i$ (see the arrow $\uparrow$). For example, the entry $(1, n-1)$ at row $\lg N-2$ produces the entry $(1, n-1)$ and the entry $(1, n)$ at row $\lg N-3$.

Therefore, in column 2, the first row, $K(N)$ (i.e., part $C$ in Table 4.3), is derived from the second row, $K(N/2)$ (see Figure 4.15), and can be written as follows.

| $K(N)$ | $1$ | $k_{1}$ | $k_{1}+1$ | $k_{2}$ | $k_{2}+1$ | $\ldots$ | $k_{\frac{N}{4}-1}$ | $k_{\frac{N}{4}-1}+1$ | $2^{n}$ inputs |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $K\left(\frac{N}{2}\right)$ |  | $k_{1}$ |  | $k_{2}$ |  | $\ldots$ | $k_{\frac{N}{4}-1}$ |  | $2^{n-1}$ inputs |

Figure 4.15: Part $C$, the distribution of $N / 2-1$ nodes.

$$
\begin{aligned}
\text{Part } C & =1+k_{1}+\left(k_{1}+1\right)+k_{2}+\left(k_{2}+1\right)+\ldots+k_{\frac{N}{4}-1}+\left(k_{\frac{N}{4}-1}+1\right) \\
& =1+2\left(k_{1}+k_{2}+\ldots+k_{\frac{N}{4}-1}\right)+\left(\frac{N}{4}-1\right) \\
& =2 K\left(\frac{N}{2}\right)+\frac{N}{4}
\end{aligned}
$$

From Section 4.1.3, we know that

$$
K(N)=\frac{3}{4} N \lg N-\frac{5 N}{4}+1, \text { and }
$$

$$
\operatorname{Kcap}_{\text{eff}}(N)=\left[1+\frac{3}{2} N \lg N\right]-\frac{1}{2}\left[2 N+(\lg N)^{2}+\lg N\right].
$$

Thus, $\operatorname{Rcap}_{\text{eff}}(N)$, the capacitance of the residual circuit, can be written as follows:

$$
\begin{aligned}
& \operatorname{Rcap}_{\text{eff}}(N)=\operatorname{Kcap}_{\text{eff}}\left(\frac{N}{2}\right)+2 K\left(\frac{N}{2}\right)+\frac{N}{4} \\
& =\left(1+\frac{3}{2} \frac{N}{2} \lg \frac{N}{2}\right)-\frac{1}{2}\left(2 \frac{N}{2}+\left(\lg \frac{N}{2}\right)^{2}+\lg \frac{N}{2}\right)+2\left(\frac{3}{4} \frac{N}{2} \lg \frac{N}{2}-\frac{5}{4} \cdot \frac{N}{2}+1\right)+\frac{N}{4} \\
& =(1+2)+\left(\frac{3}{2}+\frac{3}{2}\right) \frac{N}{2} \lg \frac{N}{2}-\frac{1}{2}\left(\lg \frac{N}{2}\right)^{2}-\frac{1}{2} \lg \frac{N}{2}-\left(\frac{5}{2}+1\right) \frac{N}{2}+\frac{N}{4} \\
& =3+3\left(\frac{N}{2} \lg \frac{N}{2}\right)-\frac{1}{2}\left(\lg \frac{N}{2}\right)^{2}-\frac{1}{2}\left(\lg \frac{N}{2}\right)-\frac{3 N}{2} \\
& =3\left(1+\frac{N}{2} \lg \frac{N}{2}\right)-\frac{1}{2}\left(3 N+\left(\lg \frac{N}{2}\right)^{2}+\lg \frac{N}{2}\right)
\end{aligned}
$$

which gives the effective capacitance as:

$$
\begin{aligned}
\operatorname{cap}_{\text{eff}}(N)= & \left\{1+\frac{3}{2} N \lg N-\frac{1}{2}\left[2 N+(\lg N)^{2}+\lg N\right]\right\} C_{0}+ \\
& \left\{3\left(1+\frac{N}{2} \lg \frac{N}{2}\right)-\frac{1}{2}\left(3 N+\left(\lg \frac{N}{2}\right)^{2}+\lg \frac{N}{2}\right)\right\} C^{\prime}
\end{aligned}
$$

The effective circuit capacitance of the Brent-Kung prefix circuit under the linear output capacitance assumption is $O(N \lg N)$.
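
The simplification of $\operatorname{Rcap}_{\text{eff}}(N)$ for the Brent-Kung circuit can be verified by evaluating both sides for powers of two. The sketch below only checks the algebra (it does not build the circuit); the function names are hypothetical, and $K(N)$ and $\operatorname{Kcap}_{\text{eff}}(N)$ are the expressions quoted from Section 4.1.3.

```python
from fractions import Fraction


def lg(n):
    """Base-2 logarithm of a power of two, as an integer."""
    return n.bit_length() - 1


def k_part_c(n):
    """K(N) = (3/4) N lg N - (5/4) N + 1, as quoted from Section 4.1.3."""
    return Fraction(3, 4) * n * lg(n) - Fraction(5, 4) * n + 1


def kcap_bk(n):
    """Kcap_eff(N) = [1 + (3/2) N lg N] - (1/2)[2N + (lg N)^2 + lg N]."""
    k = lg(n)
    return 1 + Fraction(3, 2) * n * k - Fraction(1, 2) * (2 * n + k * k + k)


def rcap_bk_sum(n):
    """Part A + Part C: Kcap_eff(N/2) + 2 K(N/2) + N/4."""
    half = n // 2
    return kcap_bk(half) + 2 * k_part_c(half) + Fraction(n, 4)


def rcap_bk_closed(n):
    """Closed form 3(1 + (N/2) lg(N/2)) - (1/2)(3N + (lg(N/2))^2 + lg(N/2))."""
    half = n // 2
    m = lg(half)
    return 3 * (1 + half * m) - Fraction(1, 2) * (3 * n + m * m + m)
```

Exact fractions are used here so that the comparison `rcap_bk_sum(n) == rcap_bk_closed(n)` is not affected by floating-point rounding.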

### 4.2.4 The Ladner-Fischer Parallel Prefix Circuit

As described in Section 4.1.4, the effective capacitance of the $LF_{k}(N)$ family of prefix circuits is bounded by that of the divide-and-conquer prefix circuit and that of the Brent-Kung prefix circuit. For $N=2^{n}$, the Brent-Kung expressions derived above never exceed the corresponding divide-and-conquer expressions, so the $LF_{k}(N)$ circuit's effective capacitance is such that

$$
\begin{aligned}
& \left\{1+\frac{3}{2} N \lg N-\frac{1}{2}\left[2 N+(\lg N)^{2}+\lg N\right]\right\} C_{0}+\left\{3\left(1+\frac{N}{2} \lg \frac{N}{2}\right)-\frac{1}{2}\left(3 N+\left(\lg \frac{N}{2}\right)^{2}+\lg \frac{N}{2}\right)\right\} C^{\prime} \leq \operatorname{cap}_{\text{eff}}(N) \\
& \leq\left\{\frac{N}{4}\left((\lg N)^{2}+\lg N\right)\right\} C_{0}+\left\{\frac{N}{4}\left((\lg N)^{2}-\lg N\right)\right\} C^{\prime} .
\end{aligned}
$$

### 4.2.5 The Snir Parallel Prefix Circuit and The Shih-Lin Parallel Prefix Circuit

As in Chapter 2, the $SN(N)$ and $SL(N)$ parallel prefix circuits are each composed of two parts: the compressed layered prefix circuit, $CR\left(N_{1}\right)$, and the serial prefix circuit, $S\left(N_{2}\right)$, where $N=N_{1}+N_{2}-1$ (see Figure 4.16). Note that only the residual circuit is computed here. Therefore, the capacitance of the residual circuit is given by

$$
\operatorname{Rcap}_{\text{eff}}(N)=\operatorname{Rcap}_{\text{eff}}(\text{Part 1})+\operatorname{Rcap}_{\text{eff}}(\text{Part 2})
$$

We can use the formula for the Brent-Kung parallel prefix circuit in Section 4.2.3 to compute $\operatorname{Rcap}_{\text{eff}}(\text{Part 1})$. Thus,

$$
\operatorname{Rcap}_{\text{eff}}(\text{Part 1})=3\left(1+\frac{N_{1}}{2} \lg \frac{N_{1}}{2}\right)-\frac{1}{2}\left(3 N_{1}+\left(\lg \frac{N_{1}}{2}\right)^{2}+\lg \frac{N_{1}}{2}\right)
$$



Figure 4.16: The $SN(N)$ and $SL(N)$ prefix circuits.

$\operatorname{Rcap}_{\text{eff}}(\text{Part 2})$ (i.e., the serial circuit) can be computed by starting at the level of the last output of Part 1, which is $\left\lceil\lg N_{1}\right\rceil$. Since there are $N_{2}-1$ operations, and the circuit depth is also $N_{2}-1$, $\operatorname{Rcap}_{\text{eff}}(\text{Part 2})$ is given by:

$$
\begin{aligned}
\operatorname{Rcap}_{\text{eff}}(\text{Part 2}) & =\sum_{i=0}^{N_{2}-2}\left(\left\lceil\lg N_{1}\right\rceil+i\right) \\
& =\sum_{i=0}^{N_{2}-2}\left\lceil\lg N_{1}\right\rceil+\sum_{i=0}^{N_{2}-2} i \\
& =\left\lceil\lg N_{1}\right\rceil\left(N_{2}-2+1\right)+\frac{\left(N_{2}-2\right)\left(N_{2}-1\right)}{2} \\
& =\left\lceil\lg N_{1}\right\rceil\left(N_{2}-1\right)+\frac{1}{2}\left(N_{2}^{2}-3 N_{2}+2\right)
\end{aligned}
$$

The effective capacitance under the linear output capacitance assumption is

$$
\begin{aligned}
\operatorname{cap}_{\text{eff}}(N)= & \operatorname{Kcap}_{\text{eff}}(N) C_{0}+\operatorname{Rcap}_{\text{eff}}(N) C^{\prime} \\
= & \Big\{\left[1+\frac{3}{2} N_{1} \lg N_{1}\right]-\frac{1}{2}\left[2 N_{1}+\left(\lg N_{1}\right)^{2}+\lg N_{1}\right]+ \\
& \left[N_{2}\left\lceil\lg N_{1}\right\rceil-\left\lceil\lg N_{1}\right\rceil+\frac{N_{2}^{2}-N_{2}}{2}\right]\Big\} C_{0}+ \\
& \Big\{\left[3\left(1+\frac{N_{1}}{2} \lg \frac{N_{1}}{2}\right)-\frac{1}{2}\left(3 N_{1}+\left(\lg \frac{N_{1}}{2}\right)^{2}+\lg \frac{N_{1}}{2}\right)\right]+ \\
& \left[\left\lceil\lg N_{1}\right\rceil\left(N_{2}-1\right)+\frac{1}{2}\left(N_{2}^{2}-3 N_{2}+2\right)\right]\Big\} C^{\prime}
\end{aligned}
$$

This implies that the effective capacitance of the $SN(N)$ and $SL(N)$ prefix circuits under the linear output capacitance assumption is $\mathrm{O}(N\lceil\lg N\rceil)$.
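
The closed form for the serial part of the residual circuit can be compared with the direct sum it was derived from. The following sketch is illustrative only; the helper `ceil_lg` and the other function names are hypothetical.

```python
def ceil_lg(n):
    """ceil(lg n) for a positive integer n, using integer arithmetic."""
    return (n - 1).bit_length()


def rcap_part2_sum(n1, n2):
    """Direct sum over i = 0 .. N2-2 of (ceil(lg N1) + i)."""
    t = ceil_lg(n1)
    return sum(t + i for i in range(n2 - 1))


def rcap_part2_closed(n1, n2):
    """Closed form ceil(lg N1)(N2 - 1) + (N2^2 - 3 N2 + 2)/2."""
    t = ceil_lg(n1)
    return t * (n2 - 1) + (n2 * n2 - 3 * n2 + 2) // 2
```

The integer division is exact because $N_{2}^{2}-3 N_{2}+2=\left(N_{2}-1\right)\left(N_{2}-2\right)$ is always even.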

### 4.2.6 The LYD Parallel Prefix Circuit

The LYD parallel prefix circuit, $LYD(N)$, is composed of four parts: the layered prefix circuit $CR\left(N_{1}\right)$, $Q\left(N_{2}\right)$, $S\left(N_{3}\right)$, and $S\left(N_{4}\right)$, where $N=N_{1}+N_{2}+N_{3}+N_{4}$ (see Figure 4.17). Therefore, the residual circuit is computed by summing the capacitance of the residual circuit of these four parts. The capacitance of Part 1 can be computed by using the formula for the Brent-Kung parallel prefix circuit described in Section 4.2.3. The capacitance of Parts 2, 3, and 4 can be computed by starting at the level of the last output of the first, second, and third part, respectively.

Part 1 represents the circuit $C R\left(N_{1}\right)$. Thus the capacitance of Part 1 is equal to $3\left(1+\frac{N_{1}}{2} \lg \frac{N_{1}}{2}\right)-\frac{1}{2}\left(3 N_{1}+\left(\lg \frac{N_{1}}{2}\right)^{2}+\lg \frac{N_{1}}{2}\right)$.

Figure 4.17: The structure of $LYD(N)$, derived from [LD94].

Part 2 represents the circuit $Q\left(N_{2}\right)$. As in Section 4.1.7, let $t=\left\lceil\lg N_{1}\right\rceil$. For level $i=1$ to $t$, level $i$ has $(t-i+1)$ operation nodes. Then $\operatorname{Rcap}_{\text{eff}}(\text{Part 2})=\operatorname{Rcap}_{\text{eff}}(A)+\operatorname{Rcap}_{\text{eff}}(B)+\operatorname{Rcap}_{\text{eff}}(\text{ConnectionCircuit})$, where

$$
\begin{aligned}
\operatorname{Rcap}_{\text{eff}}(A) & =\sum_{i=2}^{t}\left(t i+2 i-i^{2}-t-1\right) \\
& =t \sum_{i=2}^{t} i+2 \sum_{i=2}^{t} i-\sum_{i=2}^{t} i^{2}-\sum_{i=2}^{t} t-\sum_{i=2}^{t} 1 \\
& =\frac{t \cdot t(t+1)}{2}+t(t+1)-\frac{t(t+1)(2 t+1)}{6}-t^{2}+t-t+1-t-2+1 \\
& =\frac{1}{6}\left(t^{3}-t\right)
\end{aligned}
$$

$$
\operatorname{Rcap}_{\text{eff}}(B)=t\left\lceil\lg N_{1}\right\rceil+\left\lceil\lg N_{1}\right\rceil,
$$

$$
\operatorname{Rcap}_{\text{eff}}(\text{ConnectionCircuit})=\frac{1}{2}\left[t^{2}\left\lceil\lg N_{1}\right\rceil+t^{2}-t\left\lceil\lg N_{1}\right\rceil-t\right].
$$

Hence,

$$
\operatorname{Rcap}_{\text{eff}}(\text{Part 2})=\frac{t^{3}}{6}+\frac{1}{2}\left[t^{2}+t^{2}\left\lceil\lg N_{1}\right\rceil+t\left\lceil\lg N_{1}\right\rceil\right]+\left\lceil\lg N_{1}\right\rceil-\frac{2}{3} t
$$

Since $t=\left\lceil\lg N_{1}\right\rceil$,

$$
\operatorname{Rcap}_{\text{eff}}(\text{Part 2})=\frac{2}{3}\left\lceil\lg N_{1}\right\rceil^{3}+\left\lceil\lg N_{1}\right\rceil^{2}+\frac{\left\lceil\lg N_{1}\right\rceil}{3}
$$

Part 3 represents the circuit $S\left(N_{3}\right)$. $\operatorname{Rcap}_{\text{eff}}(\text{Part 3})$ is given by

$$
\begin{aligned}
\operatorname{Rcap}_{\text {eff }}(\text { Part } 3) & =\sum_{i=1}^{N_{3}}\left(\left\lceil\lg N_{1}\right\rceil+i\right) \\
& =\sum_{i=1}^{N_{3}}\left\lceil\lg N_{1}\right\rceil+\sum_{i=1}^{N_{3}} i \\
& =N_{3}\left\lceil\lg N_{1}\right\rceil+\frac{\left(N_{3}+1\right) N_{3}}{2} \\
& =\frac{N_{3}}{2}\left(2\left\lceil\lg N_{1}\right\rceil+N_{3}+1\right)
\end{aligned}
$$

Part 4 represents the circuit $S\left(N_{4}\right)$. $\operatorname{Rcap}_{\text{eff}}(\text{Part 4})$ is given by

$$
\begin{aligned}
\operatorname{Rcap}_{e f f}(\text { Part } 4) & =\sum_{i=1}^{N_{4}-2} i+N_{4}\left(\left\lceil\lg N_{1}\right\rceil+N_{3}+1\right) \\
& =\frac{\left(N_{4}-2\right)\left(N_{4}-1\right)}{2}+N_{4}\left\lceil\lg N_{1}\right\rceil+N_{4} N_{3}+N_{4} \\
& =\frac{N_{4}^{2}}{2}-\frac{N_{4}}{2}+N_{4}\left\lceil\lg N_{1}\right\rceil+N_{4} N_{3}+1
\end{aligned}
$$

Thus, the effective capacitance of the $LYD(N)$ circuit is given by

$$
\begin{aligned}
\operatorname{cap}_{\text{eff}}(N)= & \Big\{\left[1+\frac{3}{2} N_{1} \lg N_{1}\right]-\frac{1}{2}\left[2 N_{1}+\left(\lg N_{1}\right)^{2}+\lg N_{1}\right]+ \\
& \left[\frac{2}{3}\left\lceil\lg N_{1}\right\rceil^{3}+2\left\lceil\lg N_{1}\right\rceil^{2}+\frac{4}{3}\left\lceil\lg N_{1}\right\rceil+1\right]+ \\
& \left[\left(N_{3}+N_{4}\right)\left(\left\lceil\lg N_{1}\right\rceil+\frac{3}{2}\right)+\frac{1}{2}\left(N_{3}^{2}+N_{4}^{2}\right)+N_{3} N_{4}\right]\Big\} C_{0}+ \\
& \Big\{\left[3\left(1+\frac{N_{1}}{2} \lg \frac{N_{1}}{2}\right)-\frac{1}{2}\left(3 N_{1}+\left(\lg \frac{N_{1}}{2}\right)^{2}+\lg \frac{N_{1}}{2}\right)\right]+ \\
& \left[\frac{2}{3}\left\lceil\lg N_{1}\right\rceil^{3}+\left\lceil\lg N_{1}\right\rceil^{2}+\frac{\left\lceil\lg N_{1}\right\rceil}{3}\right]+ \\
& \left[\frac{N_{3}}{2}\left(2\left\lceil\lg N_{1}\right\rceil+N_{3}+1\right)+\frac{N_{4}^{2}}{2}-\frac{N_{4}}{2}+N_{4}\left\lceil\lg N_{1}\right\rceil+N_{3} N_{4}+1\right]\Big\} C^{\prime}
\end{aligned}
$$

Again, the effective capacitance of the $LYD(N)$ circuit under the linear output capacitance assumption is $O(N\lceil\lg N\rceil)$.
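
The residual sums for Parts 2, 3, and 4 of $LYD(N)$ can be compared with their closed forms in the same way. The sketch below mirrors the summands as written in this section, takes $t=\left\lceil\lg N_{1}\right\rceil$ as an input, and assumes $N_{4} \geq 1$; all function names are hypothetical.

```python
from fractions import Fraction


def rcap_part2_sum(t):
    """Rcap(A) + Rcap(B) + Rcap(ConnectionCircuit) with ceil(lg N1) = t."""
    part_a = sum(t * i + 2 * i - i * i - t - 1 for i in range(2, t + 1))
    part_b = t * t + t
    connection = Fraction(t * t * t + t * t - t * t - t, 2)
    return part_a + part_b + connection


def rcap_part2_closed(t):
    """Closed form (2/3) t^3 + t^2 + t/3."""
    return Fraction(2, 3) * t**3 + t**2 + Fraction(t, 3)


def rcap_part3_sum(t, n3):
    return sum(t + i for i in range(1, n3 + 1))


def rcap_part3_closed(t, n3):
    return Fraction(n3, 2) * (2 * t + n3 + 1)


def rcap_part4_sum(t, n3, n4):
    # Sum of i for i = 1 .. N4-2, plus N4 (t + N3 + 1); assumes N4 >= 1.
    return sum(range(1, n4 - 1)) + n4 * (t + n3 + 1)


def rcap_part4_closed(t, n3, n4):
    return Fraction(n4 * n4, 2) - Fraction(n4, 2) + n4 * t + n4 * n3 + 1
```

For instance, with $t=3$ the three components of Part 2 are 4, 12, and 12, which add up to the value of the closed form.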

### 4.3. Comparison

Table 4.4 provides a comparison of the effective circuit capacitance of the different prefix circuits considered here. The serial prefix circuit has the largest effective circuit capacitance, $\mathrm{O}\left(N^{2}\right)$. All parallel prefix circuits have $\mathrm{O}(N \lg N)$ effective circuit capacitance, except the divide-and-conquer prefix circuit and $L F_{0}$ prefix circuit whose values are $O\left(N(\lg N)^{2}\right)$.

Table 4.4: The effective circuit capacitance of prefix circuits.

| Prefix Circuit | $\operatorname{cap}_{\text{eff}}(N)$ |
| :---: | :---: |
| Serial | $\left\{\frac{N(N-1)}{2}\right\} C_{0}+\left\{\frac{(N-1)(N-2)}{2}\right\} C^{\prime}$ |
| Divide-and-Conquer | $\left\{\frac{N}{4}\left((\lg N)^{2}+\lg N\right)\right\} C_{0}+\left\{\frac{N}{4}\left((\lg N)^{2}-\lg N\right)\right\} C^{\prime}$ |
| Brent-Kung | $\left\{1+\frac{3}{2} N \lg N-\frac{1}{2}\left[2 N+(\lg N)^{2}+\lg N\right]\right\} C_{0}+\left\{3\left(1+\frac{N}{2} \lg \frac{N}{2}\right)-\frac{1}{2}\left(3 N+\left(\lg \frac{N}{2}\right)^{2}+\lg \frac{N}{2}\right)\right\} C^{\prime}$ |
| $LF_{k}$ | $\begin{aligned} & \left\{1+\frac{3}{2} N \lg N-\frac{1}{2}\left[2 N+(\lg N)^{2}+\lg N\right]\right\} C_{0}+\left\{3\left(1+\frac{N}{2} \lg \frac{N}{2}\right)-\frac{1}{2}\left(3 N+\left(\lg \frac{N}{2}\right)^{2}+\lg \frac{N}{2}\right)\right\} C^{\prime} \leq LF_{k} \leq \\ & \left\{\frac{N}{4}\left((\lg N)^{2}+\lg N\right)\right\} C_{0}+\left\{\frac{N}{4}\left((\lg N)^{2}-\lg N\right)\right\} C^{\prime} \end{aligned}$ |
| Snir, Shih-Lin | $\begin{aligned} & \Big\{\left[1+\frac{3}{2} N_{1} \lg N_{1}\right]-\frac{1}{2}\left[2 N_{1}+\left(\lg N_{1}\right)^{2}+\lg N_{1}\right]+\left[N_{2}\left\lceil\lg N_{1}\right\rceil-\left\lceil\lg N_{1}\right\rceil+\frac{N_{2}^{2}-N_{2}}{2}\right]\Big\} C_{0}+ \\ & \Big\{\left[3\left(1+\frac{N_{1}}{2} \lg \frac{N_{1}}{2}\right)-\frac{1}{2}\left(3 N_{1}+\left(\lg \frac{N_{1}}{2}\right)^{2}+\lg \frac{N_{1}}{2}\right)\right]+\left[\left\lceil\lg N_{1}\right\rceil\left(N_{2}-1\right)+\frac{1}{2}\left(N_{2}^{2}-3 N_{2}+2\right)\right]\Big\} C^{\prime} \end{aligned}$ |
| LYD | $\begin{aligned} & \Big\{\left[1+\frac{3}{2} N_{1} \lg N_{1}\right]-\frac{1}{2}\left[2 N_{1}+\left(\lg N_{1}\right)^{2}+\lg N_{1}\right]+\left[\frac{2}{3}\left\lceil\lg N_{1}\right\rceil^{3}+2\left\lceil\lg N_{1}\right\rceil^{2}+\frac{4}{3}\left\lceil\lg N_{1}\right\rceil+1\right]+\left[\left(N_{3}+N_{4}\right)\left(\left\lceil\lg N_{1}\right\rceil+\frac{3}{2}\right)+\frac{1}{2}\left(N_{3}^{2}+N_{4}^{2}\right)+N_{3} N_{4}\right]\Big\} C_{0}+ \\ & \Big\{\left[3\left(1+\frac{N_{1}}{2} \lg \frac{N_{1}}{2}\right)-\frac{1}{2}\left(3 N_{1}+\left(\lg \frac{N_{1}}{2}\right)^{2}+\lg \frac{N_{1}}{2}\right)\right]+\left[\frac{2}{3}\left\lceil\lg N_{1}\right\rceil^{3}+\left\lceil\lg N_{1}\right\rceil^{2}+\frac{\left\lceil\lg N_{1}\right\rceil}{3}\right]+\left[\frac{N_{3}}{2}\left(2\left\lceil\lg N_{1}\right\rceil+N_{3}+1\right)+\frac{N_{4}^{2}}{2}-\frac{N_{4}}{2}+N_{4}\left\lceil\lg N_{1}\right\rceil+N_{3} N_{4}+1\right]\Big\} C^{\prime} \end{aligned}$ |
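
To illustrate the growth rates stated in the comparison, the expressions for the serial, divide-and-conquer, and Brent-Kung circuits can be evaluated and divided by $N \lg N$. The sketch below is illustrative only: it assumes $C_{0}=C^{\prime}=1$ by default (these values are not given in this chapter), and the function names are hypothetical.

```python
def lg(n):
    """Base-2 logarithm of a power of two, as an integer."""
    return n.bit_length() - 1


def cap_eff_serial(n, c0=1.0, c_prime=1.0):
    return n * (n - 1) / 2 * c0 + (n - 1) * (n - 2) / 2 * c_prime


def cap_eff_dc(n, c0=1.0, c_prime=1.0):
    k = lg(n)
    return n / 4 * (k * k + k) * c0 + n / 4 * (k * k - k) * c_prime


def cap_eff_bk(n, c0=1.0, c_prime=1.0):
    k, half = lg(n), n // 2
    m = lg(half)
    kcap = 1 + 1.5 * n * k - 0.5 * (2 * n + k * k + k)
    rcap = 3 * (1 + half * m) - 0.5 * (3 * n + m * m + m)
    return kcap * c0 + rcap * c_prime


# Ratio to N lg N: grows for the serial and divide-and-conquer circuits,
# stays bounded for the Brent-Kung circuit.
for n in (8, 16, 32, 64):
    scale = n * lg(n)
    print(n, cap_eff_serial(n) / scale, cap_eff_dc(n) / scale, cap_eff_bk(n) / scale)
```

With equal unit capacitances the divide-and-conquer ratio is exactly $\lg N / 2$, which makes the extra logarithmic factor visible.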

## CHAPTER 5

## POWER-SPEED TRADE-OFF IN PREFIX CIRCUITS

In Chapter 4, a power model for prefix circuits was proposed. One way to validate the model is through simulation. This chapter describes the circuit simulations we conducted to check whether the behavior of the prefix circuits matches the predictions of our linear output capacitance assumption. These simulations allow circuit designers to choose the best prefix circuit for a particular application. The degrees of freedom studied include different prefix circuit designs and voltage scaling. Voltage scaling is used because power consumption is a quadratic function of the voltage.

To investigate the linear output capacitance assumption, we implemented XOR gates under various prefix circuits at a fixed supply voltage using PSpice. The power consumption of these circuits was measured and then compared with the power consumption estimated using the linear output capacitance assumption. After observing the behavior of the power consumption of these prefix circuits, the simulation was extended to study the effect of voltage reduction on power consumption. The 64-bit XOR gates implemented with different prefix circuits were simulated starting at a power supply voltage of 2.8 V and then scaling down to 1.4 V. The range for the supply voltage is based on current technology and market trends [RCN01]. The possible decrease in power consumption under different circuit constraints has also been investigated to see which circuit is more appropriate for a desired throughput.

### 5.1 Prefix Circuits at Fixed Voltage

In this section, we present simulation results for the 8-bit, 16-bit, 32-bit, and 64-bit XOR gates of seven prefix circuits ($DC(N)$, $BK(N)$, $LF_{0}(N)$, $LF_{1}(N)$, $SN(N)$, $SL(N)$, and $LYD(N)$). Simulations were first carried out using PSpice at a power supply voltage of 2.8 V to investigate the effect of the various prefix circuits on the circuit power consumption. The simulation results for the 32-bit XOR with different prefix circuits, using the worst-case input for the serial prefix circuit (i.e., the first input is equal to 0 and the other inputs switch $0 \rightarrow 1$), are shown in Figure 5.1. The result in Figure 5.2 is calculated from the formula $P(\text{normalized})=\operatorname{cap}_{\text{eff}}(N) V_{DD}^{2} f /\left(C^{\prime} f\right)$. Consistent with our model analysis in Chapter 4, amongst all the prefix circuits, the serial prefix circuit consumes the most power because it has the longest ripple (the maximum number of switching events); its power consumption is much larger than that of the other circuits. Amongst the remaining circuits, the results obtained from simulations (Figure 5.1) and from the theoretical model (Figure 5.2) show that the divide-and-conquer prefix circuit consumes the most power, followed by the $LF_{0}$ and the LYD prefix circuits.


Figure 5.1: Power consumption of the 32-bit XOR parallel prefix circuits, obtained through PSpice simulation.


Figure 5.2: Estimated power consumption of prefix circuits when $N=32$ bits.
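
The normalized estimate $P(\text{normalized})=\operatorname{cap}_{\text{eff}}(N) V_{DD}^{2} f /\left(C^{\prime} f\right)$ can be evaluated directly from the closed forms of Chapter 4. The sketch below is hedged: the ratio $C_{0} / C^{\prime}$ is not stated in this section, so it appears as a free parameter `c0_over_cprime` (an assumption), the function names are hypothetical, and only the serial, divide-and-conquer, and Brent-Kung expressions are included.

```python
def lg(n):
    """Base-2 logarithm of a power of two, as an integer."""
    return n.bit_length() - 1


def kcap_rcap_serial(n):
    return n * (n - 1) / 2, (n - 1) * (n - 2) / 2


def kcap_rcap_dc(n):
    k = lg(n)
    return n / 4 * (k * k + k), n / 4 * (k * k - k)


def kcap_rcap_bk(n):
    k, half = lg(n), n // 2
    m = lg(half)
    kcap = 1 + 1.5 * n * k - 0.5 * (2 * n + k * k + k)
    rcap = 3 * (1 + half * m) - 0.5 * (3 * n + m * m + m)
    return kcap, rcap


def normalized_power(kcap, rcap, vdd, c0_over_cprime=1.0):
    """cap_eff * VDD^2 * f / (C' * f); the frequency f cancels out."""
    return (kcap * c0_over_cprime + rcap) * vdd**2
```

For example, `normalized_power(*kcap_rcap_dc(32), vdd=2.8)` gives the estimate for the 32-bit divide-and-conquer circuit under the assumed ratio. Because the expression is quadratic in `vdd`, halving the supply voltage from 2.8 V to 1.4 V divides the estimate by four.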

Comparing Figures 5.1 and 5.2, we find that our estimated values have the same distribution as the values obtained by simulation. There is, however, one discrepancy: the ratio of the power consumption of the serial circuit to that of the parallel prefix circuits is much greater in the model estimate than in the simulation results. This may be because we did not consider static power consumption in our estimation, which depends on the gate technology and the circuit size. The size of the parallel prefix circuits is almost two times as large as that of the serial circuit. Thus, in simulation, the static power consumption component is more pronounced for the parallel prefix circuits than for the serial circuit. This reduces the power consumption ratio between the serial and parallel prefix circuits in simulation. Figures 5.3(a) and 5.3(b) plot the simulation result and a modified estimation obtained by adding a static power consumption component to the original estimation. We see that the simulation results in this case agree with the estimated values for the parallel prefix circuits.


Figure 5.3: Comparison between simulation results and modified estimation results for $N=32$ bits. The modified estimation enhances the original estimation by including a power component proportional to circuit size.

The simulation and theoretical results for the 8-bit, 16-bit, 32-bit, and 64-bit XOR parallel prefix circuits with different designs are shown in Figures 5.4 and 5.5, respectively. Amongst the parallel prefix circuits, the divide-and-conquer prefix circuit consumes the most power, followed by the $L F_{0}$ prefix circuit and the LYD prefix circuit. The power consumption of the Shih-Lin and Snir prefix circuits is similar to that of the Brent-Kung prefix circuit. Comparing the simulation results (Figure 5.4) with the theoretical results (Figure 5.5), we see that the linear output capacitance assumption can be used reliably to predict the power consumption of prefix circuits.


Figure 5.4: Power consumption of the XOR parallel prefix circuits at fixed voltage, obtained through PSpice simulation.


Figure 5.5: Estimated power consumption of parallel prefix circuits with fixed voltage.

### 5.2 Effects of Voltage Scaling on Prefix Circuits

To study the effect of voltage scaling on power consumption in circuits designed for reduced delay, the following experiment was conducted. The 64-bit XOR gate under the seven parallel prefix circuit implementations introduced in Chapter 2 (divide-and-conquer, Brent-Kung, $L F_{0}$, $L F_{1}$, Snir, Shih-Lin, and LYD) was simulated to measure the power consumption and the circuit delay at supply voltages ranging from 2.8 V down to 1.4 V [RCN01], in order to see the effect of speed on power consumption under different circuit constraints.

Before presenting the results of simulation studies, an overview of the effects of scaling on supply voltage is given. As noted in Chapter 3, the average power consumption in a CMOS module can be written as follows:

$$
\begin{equation*}
P_{\text {switching }}=C_{e f f} V_{D D}^{2} f . \tag{5.1}
\end{equation*}
$$



Figure 5.6: Plot of supply voltage vs. normalized delay [CB95].

This equation indicates that the most effective way to reduce power consumption is to operate the circuit at a lower $V_{D D}$, which yields a quadratic reduction in power. However, as seen from Figure 5.6, the circuit delay generally increases as $V_{D D}$ decreases, and hence the system throughput drops. The relationship between circuit delay, $T_{p}$, and supply voltage, $V_{D D}$, is modeled as follows [Mac96]:

$$
\begin{equation*}
T_{p}=\frac{C_{L} V_{D D}}{k(W / L)\left(V_{D D}-V_{t}\right)^{2}} \tag{5.2}
\end{equation*}
$$

where $C_{L}$ is the gate capacitance, $V_{t}$ is the threshold voltage, $k$ is a technology-dependent parameter, and $W$ and $L$ are the channel width and length of the transistors, respectively. According to Eq. 5.2, $T_{p}$ increases as $V_{D D}$ approaches $V_{t}$. A sharp increase in delay can be observed when $V_{D D} \leq 2$ V [RCN01].

Thus, lowering the supply voltage reduces power consumption on the one hand but reduces throughput on the other. Looking closely at Eq. 5.1 and Eq. 5.2, we observe that although power consumption decreases quadratically with a decrease in supply voltage, the time-delay increases only roughly in inverse proportion to the supply voltage, as long as $V_{D D}$ stays well above the threshold voltage. Therefore, a commonly used technique to reduce power consumption without loss in throughput is to introduce parallelism [CB95]. Parallelism reduces the time-delay in proportion to the effective degree of parallelism. Hence we can lower the supply voltage correspondingly (according to Figure 5.6) and maintain the same level of throughput with lower overall power consumption. Unfortunately, introducing parallelism increases the number of computation nodes in many circuits, which, in turn, increases the power consumption. Because of the size-depth trade-off characteristic of prefix circuits (Chapter 2), we can take advantage of parallelism only to the extent that it reduces circuit depth.
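The trade-off expressed by Eq. 5.1 and Eq. 5.2 can be illustrated numerically. The following Python sketch evaluates both equations relative to a 2.8 V reference. The threshold voltage `V_T = 0.7` V and the unit values used for the capacitance, frequency, and technology constants are assumptions chosen only for illustration; they are not parameters of the simulated circuits, and the normalized delay actually used in this chapter is read from Figure 5.6.

```python
# Illustrative sketch of Eq. 5.1 and Eq. 5.2 (assumed constants, not thesis data).

V_REF = 2.8   # reference supply voltage used throughout this chapter (volts)
V_T = 0.7     # ASSUMED threshold voltage (volts), for illustration only


def switching_power(c_eff, v_dd, f):
    """Eq. 5.1: P_switching = C_eff * V_DD^2 * f."""
    return c_eff * v_dd ** 2 * f


def gate_delay(v_dd, v_t=V_T, c_l=1.0, k_w_over_l=1.0):
    """Eq. 5.2: T_p = C_L * V_DD / (k (W/L) (V_DD - V_t)^2)."""
    if v_dd <= v_t:
        raise ValueError("V_DD must exceed the threshold voltage")
    return c_l * v_dd / (k_w_over_l * (v_dd - v_t) ** 2)


def normalized(v_dd):
    """Return (power, delay) at v_dd relative to their values at V_REF."""
    power = switching_power(1.0, v_dd, 1.0) / switching_power(1.0, V_REF, 1.0)
    delay = gate_delay(v_dd) / gate_delay(V_REF)
    return power, delay


if __name__ == "__main__":
    for v in (2.8, 2.4, 2.0, 1.7, 1.4):
        power, delay = normalized(v)
        print(f"V_DD = {v:.1f} V: power x{power:.2f}, delay x{delay:.2f}")
```

Under these assumed constants, halving the supply voltage cuts the switching power to one quarter while the delay grows by a larger factor than two, because the denominator of Eq. 5.2 shrinks as $V_{D D}$ approaches the threshold voltage.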

## Theoretical Results

Figures 5.7, 5.8, and 5.9 give the estimated delay, power consumption, and power-delay product for the 64-bit parallel prefix circuits obtained from the theoretical model. Figure 5.7 shows the result obtained by assuming the circuits' delay to be proportional to the circuits' depth and applying the normalized delay from Figure 5.6 to account for the effect of the supply voltage on the delay. The estimated power consumption of the circuits considered is shown in Figure 5.8. The divide-and-conquer circuit, which has the shortest depth and the largest size, consumes the most power. The Brent-Kung prefix circuit has the highest power-delay product, while the divide-and-conquer and $L F_{0}$ prefix circuits have power-delay products lower than those of the Brent-Kung, Snir, Shih-Lin, and LYD prefix circuits.

Figure 5.7: Estimated delay of parallel prefix circuits when $N=64$.

Figure 5.8: Estimated power consumption of parallel prefix circuits when $N=64$.

Figure 5.9: Estimated power-delay product of parallel prefix circuits when $N=64$.

Table 5.1 shows the estimated power consumption of the different prefix circuits at fixed and reduced supply voltages when $N=64$. The power is estimated using Eq. 4.1, $P=\operatorname{cap}_{eff}(N) V_{D D}^{2} f$, where $\operatorname{cap}_{eff}(N)$ is the effective circuit capacitance. For this study we used $C_{0}=0.9$ and $C^{\prime}=0.3$ [Smi97]. When the supply voltage is fixed at 2.8 V, the serial prefix circuit consumes more power than any other circuit.

To lower power consumption by reducing the supply voltage, let us assume a fixed acceptable delay. Further, assume that time-delay is proportional to depth and that a delay proportional to a depth of 10 at $V_{D D}=2.8$ V is acceptable. Thus the voltage of the Brent-Kung and Snir circuits cannot be lowered, and the delay of the serial circuit is not acceptable. The supply voltages of the other five prefix circuits can be dropped below 2.8 V while still achieving the acceptable delay. For example, because the delay of the divide-and-conquer prefix circuit is proportional to 6 at 2.8 V, its voltage can be dropped from 2.8 V to 1.48 V to obtain a time-delay proportional to a depth of 10. The operating frequency is then scaled by a factor of 0.6. Since $C_{0}=3 C^{\prime}$, the effective capacitance is $672 C_{0}+480 C^{\prime}=2{,}496 C^{\prime}$, and the normalized power consumption of the divide-and-conquer prefix circuit is:

$$
P(\text { normalized })=\operatorname{cap}_{eff}(N) V_{D D}^{2} f /\left(C^{\prime} f\right)=\left(2,496 C^{\prime}\right)(1.48)^{2}(0.6 f) /\left(C^{\prime} f\right) \approx 3,280 .
$$

Table 5.1: Estimated power consumption based on Eq. 4.1 for various prefix circuits for $N=64$.

| Prefix Circuit | Depth | $\operatorname{cap}_{eff}(64)$ | Power (normalized) at $V_{DD}=2.8$ V | New power (normalized) after reducing $V_{DD}$ |
| :---: | :---: | :---: | :---: | :---: |
| Serial | 63 | $2016 C_{0}+1953 C^{\prime}$ | 62,728 | - |
| Divide-and-Conquer | 6 | $672 C_{0}+480 C^{\prime}$ | 19,569 | 3,280 ($V_{DD}=1.48$ V) |
| Brent-Kung | 10 | $492 C_{0}+372 C^{\prime}$ | 14,488 | 14,488 ($V_{DD}=2.8$ V) |
| $L F_{0}$ | 6 | $625 C_{0}+457 C^{\prime}$ | 18,283 | 3,065 ($V_{DD}=1.48$ V) |
| $L F_{1}$ | 7 | $527 C_{0}+390 C^{\prime}$ | 15,453 | 3,987 ($V_{DD}=1.7$ V) |
| Snir | 10 | $487 C_{0}+371 C^{\prime}$ | 14,363 | 14,363 ($V_{DD}=2.8$ V) |
| Shih-Lin | 9 | $487 C_{0}+370 C^{\prime}$ | 14,355 | 9,491 ($V_{DD}=2.4$ V) |
| LYD | 8 | $528 C_{0}+410 C^{\prime}$ | 15,633 | 6,381 ($V_{DD}=2.0$ V) |

After scaling the supply voltage, there is a power improvement in the circuits having depth shorter than 10. Among these circuits, the $L F_{0}$ prefix circuit has a major reduction in power due to its shortest depth.
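The arithmetic behind Table 5.1 can be reproduced with a short script. The sketch below is not the author's tool; it assumes that the frequency factor after scaling equals depth/10 (generalizing the factor 0.6 = 6/10 of the worked divide-and-conquer example), that the Brent-Kung and Snir circuits stay at 2.8 V because their voltage cannot be lowered, and that the Shih-Lin circuit runs at 2.4 V, a value inferred from its tabulated power rather than read directly from the source.

```python
# Sketch reproducing the normalized power estimates of Table 5.1 (N = 64).

C0, C_PRIME = 0.9, 0.3      # capacitance constants used in the text [Smi97]
V_REF = 2.8                 # fixed supply voltage (volts)
TARGET_DEPTH = 10           # acceptable delay, expressed as a depth at 2.8 V

# (name, depth, coefficient of C0, coefficient of C', reduced V_DD or None)
# Reduced voltages for Brent-Kung, Snir and Shih-Lin are ASSUMED as explained above.
CIRCUITS = [
    ("Serial", 63, 2016, 1953, None),
    ("Divide-and-Conquer", 6, 672, 480, 1.48),
    ("Brent-Kung", 10, 492, 372, 2.8),
    ("LF0", 6, 625, 457, 1.48),
    ("LF1", 7, 527, 390, 1.7),
    ("Snir", 10, 487, 371, 2.8),
    ("Shih-Lin", 9, 487, 370, 2.4),
    ("LYD", 8, 528, 410, 2.0),
]


def normalized_power(a, b, v_dd, freq_factor=1.0):
    """P(normalized) = cap_eff * V_DD^2 * f / (C' f), with cap_eff = a*C0 + b*C'."""
    cap_eff = a * C0 + b * C_PRIME
    return cap_eff * v_dd ** 2 * freq_factor / C_PRIME


def table_row(name, depth, a, b, v_new):
    """Return (name, power at 2.8 V, power after voltage scaling or None)."""
    fixed = normalized_power(a, b, V_REF)
    if v_new is None:                      # delay not acceptable, no scaled entry
        return name, fixed, None
    scaled = normalized_power(a, b, v_new, depth / TARGET_DEPTH)
    return name, fixed, scaled


if __name__ == "__main__":
    for circuit in CIRCUITS:
        name, fixed, scaled = table_row(*circuit)
        scaled_text = "-" if scaled is None else f"{scaled:,.0f}"
        print(f"{name:20s} {fixed:10,.0f} {scaled_text:>10s}")
```

With these assumptions the computed values agree with the tabulated ones to within rounding.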

## Simulation Results

Figures 5.10, 5.11, and 5.12 give the delay, power consumption, and power-delay product for the 64-bit XOR parallel prefix circuits obtained through PSpice simulation over random inputs. As expected, amongst the parallel prefix circuits considered, the divide-and-conquer prefix circuit consumes the most power. Also, although the delay of the divide-and-conquer prefix circuit is the smallest for some values of the supply voltage, this is not so at lower voltages. This may be due to its very high fan-out compared to the others ($\mathrm{O}(N)$ vs. $\mathrm{O}(\lg N)$). As the supply voltage is reduced, power consumption is also reduced. Comparing the model predictions (Figure 5.8) with the simulation results (Figure 5.11), we find that the linear output capacitance assumption gives results similar to those of the PSpice simulation.

Figure 5.10: Delay of the 64-bit XOR parallel prefix circuits, obtained through PSpice simulation.

Figure 5.11: Power consumption of the 64-bit XOR parallel prefix circuits, obtained through PSpice simulation.

Figure 5.12: Power-delay product of the 64-bit XOR parallel prefix circuits, obtained through PSpice simulation.

From the point of view of the power-delay product metric, the LYD prefix circuit is found to be the best across the entire voltage scaling. This means that the circuit provides the best trade-off between power and delay. Another result of simulation studies shows that the power-delay product of the divide-and-conquer prefix circuit is the highest, followed by that of the $L F_{0}$ prefix circuit. This is at variance with our model prediction and may be due to the fact that these circuits have a very high fan-out (see Table 2.1 for fan-out). In our model, we do not take into account the effect of fan-out on the delay.

Also according to the simulation, with the voltage-scaling technique, the LYD prefix circuit has the lowest power consumption of all the circuits. For example, let us assume the maximum acceptable delay is $6.4 \mu \mathrm{~s}$. From Figures 5.10 and 5.11, to achieve this time-delay, the supply voltages of the divide-and-conquer, $L F_{0}$, $L F_{1}$, Shih-Lin, and LYD prefix circuits can be 1.8 V, 1.78 V, 1.78 V, 2 V, and 1.8 V, respectively. The corresponding power consumption is 2.25, 1.94, 1.59, 1.64, and 1.44 W, respectively. This shows that, with an appropriately chosen supply voltage, using the LYD prefix circuit instead of the divide-and-conquer prefix circuit reduces power by a factor of about 1.6 ($2.25 / 1.44 \approx 1.56$) without loss of speed.

### 5.3 Summary

This chapter presented a comparative study of different parallel prefix circuits from the point of view of the power-speed trade-off. The power consumption and the power-delay product of seven parallel prefix circuits were compared. We have shown that the linear output capacitance assumption provides results that are consistent with those obtained by PSpice simulation. The model enables us to understand the power consumption behavior of prefix circuits and to pick a suitable prefix circuit for a given acceptable power consumption and/or time-delay. We have also shown that parallelism at a certain level, coupled with a low supply voltage, can be used to reduce the power consumption of a prefix circuit without throughput loss. Our analysis, combined with PSpice simulations, shows that amongst the parallel prefix circuits the divide-and-conquer prefix circuit consumes the most power in spite of having the shortest depth and the highest parallelism. Also, according to PSpice simulation, the trade-off between power and delay of the LYD prefix circuit seems to be the best of all the circuits considered.

The main discrepancy between the model and the simulation results lies in the power-delay product metric. This may be because the fan-out of the divide-and-conquer prefix circuit and the $L F_{0}$ prefix circuit is very high compared to that of the other prefix circuits. In this analysis, we have assumed that the delay is uniquely determined by the depth of the circuit. The simulation results for the divide-and-conquer prefix circuit in particular indicate that large fan-out, in addition to contributing to larger power consumption, may also indirectly affect the time-delay. Modeling this interaction between high fan-out and time-delay is an interesting problem.

## CHAPTER 6

## ADDITION CIRCUITS

In this chapter we study the application of prefix circuits in adders and look at the power consumption characteristics when different prefix circuits are used. The addition of two binary numbers is of great interest to digital designers since it is the operation most commonly used within other operations (e.g., counting, multiplication, division, etc.). Many researchers have investigated various implementations of adder circuits; examples are [BD01, BL01, FB01, FL00, Lin81]. For a general introduction refer to [HP90, Hwa79, Kor93, Omo94]. There are a number of ways of formulating the process of binary addition. Each provides a different insight and thus suggests a different implementation. Although each implementation serves different requirements, all of them focus on calculating the carry bits quickly, since the key to fast addition is the fast calculation of all the carries. In one of these implementations, the addition of two binary numbers is expressed as a prefix problem by transforming the computation of all carry bits into a prefix computation. An adder using this technique is called a prefix adder.

Prefix adders have been addressed in various papers. For example, [JS95] investigated the speed and area of a 32-bit Ladner-Fischer prefix adder. [AD98] introduced a new irregular parallel prefix adder. [BL01, ZS01] introduced algorithms to construct parallel prefix adders with low area-delay product. A similar study [Kno01] explored the area and speed of two parallel prefix adder designs. Previous work on low-power adders can be found in many studies. Using the same CMOS technology, [Cal96, NIO96] compared the power consumption of various adder architectures (i.e., ripple carry adder, carry look-ahead adder (CLA), etc.). [KBL95] studied different technologies (i.e., full static CMOS, complementary pass-transistor logic, double pass-transistor logic, etc.) with a given adder architecture for low-power adder design. Besides considering different technologies and adder architectures, another approach employs transistor sizing for a low-power full-adder design [BAW00, Rad01]. Although we will measure the power consumption of various adder implementations, our objective is different from theirs. In our study, we concentrate on investigating and comparing the power consumption of prefix adders based on Brent's algorithm [Bre70], using some of the prefix circuits from Chapter 2, with different sizes. Brent's algorithm transforms the carry computations into a prefix problem and hence is an ideal candidate for studying prefix circuits.

In the following, the basics of an adder are given in detail. Then, the method of implementing fast parallel prefix addition based on Brent's algorithm [Bre70, LD90] is explored, and details of how the computation of all carry bits is transformed into a prefix computation are given.

### 6.1 Adder: Theory

A circuit for adding two binary digits and a carry bit is called a full-adder (FA). Figure 6.1 shows the block diagram of the full-adder. The FA circuit takes $x, y$, and $z$ as inputs and produces $s$ and $c$ as outputs. Table 6.1 is the truth table for the full-adder. When all input bits are 0, both outputs are 0. When an odd number of inputs equals 1, the $s$ output is 1. The $c$ output is 1 if two or three inputs equal 1. A possible set of algebraic expressions for the two output variables, derived from the K-maps (Figure 6.2), is given in (6.1.1).


Figure 6.1: The Block diagram of the full-adder circuit.

Table 6.1: Adder truth table.

| Input $x$ | Input $y$ | Input $z$ | Output $c$ | Output $s$ |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 1 |
| 0 | 1 | 0 | 0 | 1 |
| 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 |



Figure 6.2: The K-Maps for the full-adder circuit.

$$
\left.\begin{array}{l}
s=x \oplus y \oplus z  \tag{6.1.1}\\
c=(x \wedge y) \vee(x \oplus y) \wedge z
\end{array}\right\}
$$

where $\oplus, \vee$, and $\wedge$ are logical exclusive-OR, (inclusive) OR, and AND operations, respectively. Note that the output $\boldsymbol{c}$ can be expressed in any one of the following forms:

- $c=(x \wedge y) \vee(x \vee y) \wedge z$,
- $c=(x \wedge y) \vee(x \wedge z) \vee(y \wedge z)$, and
- $c=(x \vee y) \wedge((x \wedge y) \vee z)$.
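The equivalence of these carry expressions can be checked exhaustively over the eight rows of Table 6.1. The Python sketch below is only an illustration; it assumes that $\wedge$ binds more tightly than $\vee$ in the expressions written without full parentheses, and the function names are hypothetical.

```python
from itertools import product


def full_adder(x, y, z):
    """Eq. 6.1.1: sum bit s and carry bit c of a full-adder."""
    s = x ^ y ^ z
    c = (x & y) | ((x ^ y) & z)
    return s, c


def carry_forms(x, y, z):
    """Carry bit computed by Eq. 6.1.1 and by the three alternative forms."""
    return (
        (x & y) | ((x ^ y) & z),        # form used in Eq. 6.1.1
        (x & y) | ((x | y) & z),        # inclusive OR in place of XOR
        (x & y) | (x & z) | (y & z),    # majority function
        (x | y) & ((x & y) | z),        # propagate AND (generate OR carry-in)
    )


if __name__ == "__main__":
    for x, y, z in product((0, 1), repeat=3):
        s, c = full_adder(x, y, z)
        # The two output bits encode the arithmetic sum of the three inputs.
        assert 2 * c + s == x + y + z
        # All four carry expressions agree on every input combination.
        assert len(set(carry_forms(x, y, z))) == 1
        print(x, y, z, "->", c, s)
```

The last form is the one that reappears in Section 6.2 as $c_{i}=p_{i} \wedge\left(g_{i} \vee c_{i-1}\right)$.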

One addition circuit for $N$-bit integers is a chain of $N$ full-adders, as shown in Figure 6.3. Let $a=a_{N} a_{N-1} \ldots a_{2} a_{1}, b=b_{N} b_{N-1} \ldots b_{2} b_{1}$, and $s=s_{N} \ldots s_{2} s_{1}$ be $N$-bit integers. Let $s$ be the sum of $a$ and $b$. We sum the binary bits from right to left, propagating any carry from $\mathrm{FA}_{i}$ to $\mathrm{FA}_{i+1}$, for $1 \leq i<N$ (see Figure 6.3). In the $i^{\text {th }}$ FA, we take as inputs the bits $a_{i}$ and $b_{i}$ and the carry-in bit $c_{i-1}$ to produce the sum bit $s_{i}$ and the carry-out bit $c_{i}$. The carry-out bit $c_{i}$ from the $i^{\text {th }}$ FA is the carry-in bit of the $(i+1)^{\text {th }}$ FA. Since there is no carry into the first position, we assume that $c_{0}=0$. The carry-out $c_{N}$ is the sum bit $s_{N+1}$. Therefore, in general, for $1 \leq i \leq N$,

$$
\left.\begin{array}{l}
s_{i}=a_{i} \oplus b_{i} \oplus c_{i-1}  \tag{6.1.2}\\
c_{i}=\left(a_{i} \wedge b_{i}\right) \vee\left(a_{i} \oplus b_{i}\right) \wedge c_{i-1}
\end{array}\right\}
$$

where $c_{0}=0$. This circuit is called the ripple-carry adder. Each full-adder takes three stages and five logic elements (Figure 6.4). For the chain of $N$ full-adders, $5 N-3$ logic elements and $2 N-1$ stages are needed to compute the output $s$. Hence, the circuit has $O(N)$ size and $O(N)$ time. As seen in (6.1.2), to compute the sum bits $s_{i}$ $(1 \leq i \leq N)$ in parallel, we need the carry bits $c_{i}$ $(1 \leq i<N)$. Therefore, the sooner the carry bits $c_{i}$ are known, the faster the addition circuit is. In other words, the key to parallel addition is parallelizing the computation of all the carry bits.
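The ripple-carry adder of (6.1.2) can be modeled bit by bit. The following Python sketch is a behavioral illustration only (it says nothing about gate counts or timing); the helper names `to_bits` and `from_bits` are hypothetical, and bit lists are stored least significant bit first so that index 0 holds $a_{1}$.

```python
def to_bits(value, n):
    """Return the n low-order bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Inverse of to_bits: rebuild the integer from an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a_bits, b_bits):
    """Apply Eq. 6.1.2 serially; return (sum bits s_1..s_N, carry-out c_N)."""
    if len(a_bits) != len(b_bits):
        raise ValueError("operands must have the same number of bits")
    carry = 0                                   # c_0 = 0
    sum_bits = []
    for a_i, b_i in zip(a_bits, b_bits):
        sum_bits.append(a_i ^ b_i ^ carry)      # s_i = a_i XOR b_i XOR c_{i-1}
        carry = (a_i & b_i) | ((a_i ^ b_i) & carry)  # c_i
    return sum_bits, carry


if __name__ == "__main__":
    n = 8
    s, c_n = ripple_carry_add(to_bits(200, n), to_bits(100, n))
    # The carry-out c_N acts as the sum bit s_{N+1}.
    print(from_bits(s) + (c_n << n))
```

Each loop iteration depends on the carry of the previous one, which is exactly the serial dependency that the parallel schemes of the next section remove.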


Figure 6.3: A chain of $N$ full-adders.


Figure 6.4: A full-adder circuit.

### 6.2 Parallel Addition

Let $a=a_{N} a_{N-1} \ldots a_{2} a_{1}$, and $b=b_{N} b_{N-1} \ldots b_{2} b_{1}$ be two integers to be added. Let $s=(a+b) \bmod 2^{N}$, where $s=s_{N} \ldots s_{2} s_{1}$. Therefore,

$$
\left.\begin{array}{l}
s_{i}=a_{i} \oplus b_{i} \oplus c_{i-1}  \tag{6.2.1}\\
c_{0}=0 \\
c_{i}=p_{i} \wedge\left(g_{i} \vee c_{i-1}\right) \\
p_{i}=a_{i} \vee b_{i} \\
g_{i}=a_{i} \wedge b_{i}
\end{array}\right\}
$$

for $1 \leq i \leq N$. The carry bit $c_{i}$ is the carry out of the $i^{\text {th }}$ bit position, $p_{i}$ is the carry propagate condition, and $g_{i}$ is the carry generate condition. As discussed before, we need the carry bit $c_{i-1}$ $(1 \leq i \leq N)$ to compute the sum bit $s_{i}$. Expanding the recurrence $c_{i}=p_{i} \wedge\left(g_{i} \vee c_{i-1}\right)$ in (6.2.1) with $c_{0}=0$, we obtain

$$
\left.\begin{array}{l}
c_{1}=p_{1} \wedge g_{1}  \tag{6.2.2}\\
c_{2}=p_{2} \wedge\left(g_{2} \vee\left(p_{1} \wedge g_{1}\right)\right) \\
\vdots \\
c_{i}=p_{i} \wedge\left(g_{i} \vee\left(p_{i-1} \wedge\left(g_{i-1} \vee \ldots \vee\left(p_{1} \wedge g_{1}\right) \ldots\right)\right)\right) .
\end{array}\right\}
$$

The implementation of the fast addition is carried out in three stages: the preprocessing stage, the carry computation stage, and the postprocessing stage (see Figure 6.5). The preprocessing stage computes the carry propagate bits $p_{i}$ and the carry generate bits $g_{i}$ in parallel in just one unit step (that is, $p_{i}=a_{i} \vee b_{i}$ and $g_{i}=a_{i} \wedge b_{i}$, for $1 \leq i \leq N$). In the carry computation stage, the calculation of all carry bits is converted into a prefix circuit problem, which is discussed later in this section. The inputs of the prefix circuit for carry calculation are the carry propagate bits and the carry generate bits from the preprocessing stage. Once all the carry bits are known, the postprocessing stage produces the sum bits in two steps (that is, $s_{i}=a_{i} \oplus b_{i} \oplus c_{i-1}$, for $1 \leq i \leq N$). The time spent in the preprocessing and postprocessing stages is negligible compared to the computation time of the carries. As a result, computing all carry bits quickly is the key to high-speed addition. Therefore, we will concentrate on the carry computation for the rest of this chapter.
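The three stages can be sketched as three small functions. In the Python illustration below, the carry stage simply evaluates the recurrence of (6.2.1) serially; this is an assumed stand-in for the prefix circuit discussed later, used here only so that the pipeline can be run end to end. Function names are hypothetical and bit lists are LSB first.

```python
def preprocess(a_bits, b_bits):
    """Stage 1: p_i = a_i OR b_i and g_i = a_i AND b_i for every bit position."""
    p = [a | b for a, b in zip(a_bits, b_bits)]
    g = [a & b for a, b in zip(a_bits, b_bits)]
    return p, g


def carry_stage(p, g):
    """Stage 2 (serial stand-in): c_i = p_i AND (g_i OR c_{i-1}), c_0 = 0.

    Returns [c_0, c_1, ..., c_N]. A prefix circuit would produce the same
    values with smaller depth.
    """
    c = [0]
    for p_i, g_i in zip(p, g):
        c.append(p_i & (g_i | c[-1]))
    return c


def postprocess(a_bits, b_bits, c):
    """Stage 3: s_i = a_i XOR b_i XOR c_{i-1} (zip stops after c_{N-1})."""
    return [a ^ b ^ c_prev for a, b, c_prev in zip(a_bits, b_bits, c)]


def add_mod_2n(a, b, n):
    """Return (a + b) mod 2^n using the three stages above."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    p, g = preprocess(a_bits, b_bits)
    c = carry_stage(p, g)
    s = postprocess(a_bits, b_bits, c)
    return sum(bit << i for i, bit in enumerate(s))


if __name__ == "__main__":
    print(add_mod_2n(200, 100, 8))   # (200 + 100) mod 256
```

Only `carry_stage` has a data dependency between bit positions, which is why the rest of the chapter concentrates on it.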


Figure 6.5: Three stages of the implementation of the fast adder.

### 6.2.1 Binary Addition as a Prefix Problem

Brent [Bre70] has presented the upper bound on the computation of the carry for the parallel addition of two $\boldsymbol{N}$-bit integers. The following discussions are derived from [Bre70, LD90]. Let $T_{A}(N)$ be the time required to add two $N$-bit binary numbers. Then from the above discussion

$$
T_{A}(N)=T_{c}(N-1)+3
$$

## Carry Computation

A schematic diagram of Brent's algorithm for computing $c_{N}$ is given in Figure 6.7. To expedite the carry computation, Brent uses the following strategy to compute $c_{N}$:

- Compute all $p_{i}\left(p_{i}=a_{i} \vee b_{i}\right)$ and $g_{i}\left(g_{i}=a_{i} \wedge b_{i}\right)$.
- Partition all $p_{i}$ and $g_{i}$ into $r$ groups; each group has $q$ members, where $N=r q$ (see Figure 6.6).


Figure 6.6: The partition of all $p_{i}$ and $g_{i}$ into $r$ groups with $q$ members each.

- Brent's algorithm expresses the carry $c_{N}$ as follows.

Let $r \geq 1$ and $q \geq 1$ be integers such that $N=r q$.
Let

$$
P_{i}=p_{i q} \wedge \ldots \wedge p_{(i-1) q+1}
$$

$$
\begin{gathered}
D_{i}=P_{r} \wedge \ldots \wedge P_{i+1}, \quad D_{r}=1, \\
E_{i}=p_{i q} \wedge\left(g_{i q} \vee \ldots\left(p_{(i-1) q+1} \wedge g_{(i-1) q+1}\right) \ldots\right),
\end{gathered}
$$

and

$$
F_{i}=D_{i} \wedge E_{i},
$$



Figure 6.7: A parallel scheme for computing the carry, derived from [LD90].
then, by associativity, commutativity, and distributivity,

$$
c_{N}=F_{r} \vee F_{r-1} \vee \ldots \vee F_{2} \vee F_{1} .
$$

By the time the value of $c_{N}$ is obtained, all other $c_{i}$'s have also been obtained. Note that $P_{i}$ is the product of the carry propagates of group $i$, $D_{i}$ is the output of a prefix circuit (i.e., the product $P_{i+1}: P_{r}$), and $E_{i}$ is the carry out of block $i$ $(1 \leq i \leq r)$. The upper bound on the computation time of the last carry bit, $T_{c}(N)$, is given by

$$
T_{c}(N) \leq 1+\lceil\lg r\rceil+\max \left\{T_{c}(q),\lceil\lg q\rceil+\text { PrefixComputationTime }\right\} .
$$

As can be seen from Figure 6.7, the running time of the algorithm depends on the number of blocks, the block size, and the choice of the prefix circuit. The detailed proof is given in [LD90].
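The block decomposition can be checked against the serial recurrence of (6.2.1). The Python sketch below is a functional illustration under stated assumptions, not a model of the circuit of Figure 6.7: it evaluates $P_{i}$, $D_{i}$, $E_{i}$, and $F_{i}$ sequentially and therefore says nothing about depth or timing. Function names are hypothetical, and lists are LSB first so that index 0 holds $p_{1}$ and $g_{1}$.

```python
from functools import reduce


def serial_carry(p, g):
    """c_N from c_i = p_i AND (g_i OR c_{i-1}) with c_0 = 0 (Eq. 6.2.1)."""
    c = 0
    for p_i, g_i in zip(p, g):
        c = p_i & (g_i | c)
    return c


def brent_carry(p, g, r, q):
    """c_N = F_r OR ... OR F_1 for N = r*q groups of q propagate/generate bits."""
    if len(p) != len(g) or r * q != len(p):
        raise ValueError("need len(p) == len(g) == r * q")
    groups = [(p[i * q:(i + 1) * q], g[i * q:(i + 1) * q]) for i in range(r)]
    # P_i: AND of the propagate bits of group i.
    P = [reduce(lambda x, y: x & y, gp) for gp, _ in groups]
    # E_i: carry out of group i when its carry-in is 0.
    E = [serial_carry(gp, gg) for gp, gg in groups]
    c_n = 0
    D = 1                              # D_r = 1
    for i in reversed(range(r)):       # groups r, r-1, ..., 1
        c_n |= D & E[i]                # F_i = D_i AND E_i
        D &= P[i]                      # D_{i-1} = D_i AND P_i
    return c_n


def propagate_generate(a, b, n):
    """p_i = a_i OR b_i and g_i = a_i AND b_i for the n low-order bits."""
    bits = [((a >> i) & 1, (b >> i) & 1) for i in range(n)]
    return [x | y for x, y in bits], [x & y for x, y in bits]


if __name__ == "__main__":
    p, g = propagate_generate(0b10110111, 0b01101001, 8)
    for r, q in ((1, 8), (2, 4), (4, 2), (8, 1)):
        print(r, q, brent_carry(p, g, r, q), serial_carry(p, g))
```

For every factorization $N=r q$ the two functions return the same bit; only the grouping of the operations, and hence the achievable parallelism, differs.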

The following example illustrates these quantities. Let $N=8$. There are four possible choices to partition all $p_{i}$ and $\boldsymbol{g}_{i}$ :

- $r=1$ and $q=8$
- $r=2$ and $q=4$
- $r=4$ and $q=2$
- $r=8$ and $q=1$

Case 1: $r=1$ and $q=8$.
Then,

$$
\begin{aligned}
P_{1} & =p_{8} \wedge p_{7} \wedge p_{6} \wedge p_{5} \wedge p_{4} \wedge p_{3} \wedge p_{2} \wedge p_{1} \\
D_{1} & =1 \\
E_{1} & =p_{8} \wedge(g_{8} \vee(p_{7} \wedge(g_{7} \vee(p_{6} \wedge(g_{6} \vee(p_{5} \wedge(g_{5} \vee(p_{4} \wedge(g_{4} \vee(p_{3} \wedge(g_{3} \vee(p_{2} \wedge(g_{2} \vee(p_{1} \wedge g_{1})))))))))))))) \\
F_{1} & =D_{1} \wedge E_{1}=E_{1}
\end{aligned}
$$

Then,

$$
c_{8}=F_{1}=E_{1}
$$

All other $c_{i}$, for $1 \leq i \leq 7$, are also available when $c_{8}$ is completed.

$$
\begin{aligned}
c_{7} & =p_{7} \wedge(g_{7} \vee(p_{6} \wedge(g_{6} \vee(p_{5} \wedge(g_{5} \vee(p_{4} \wedge(g_{4} \vee(p_{3} \wedge(g_{3} \vee(p_{2} \wedge(g_{2} \vee(p_{1} \wedge g_{1})))))))))))) \\
c_{6} & =p_{6} \wedge(g_{6} \vee(p_{5} \wedge(g_{5} \vee(p_{4} \wedge(g_{4} \vee(p_{3} \wedge(g_{3} \vee(p_{2} \wedge(g_{2} \vee(p_{1} \wedge g_{1})))))))))) \\
c_{5} & =p_{5} \wedge(g_{5} \vee(p_{4} \wedge(g_{4} \vee(p_{3} \wedge(g_{3} \vee(p_{2} \wedge(g_{2} \vee(p_{1} \wedge g_{1})))))))) \\
c_{4} & =p_{4} \wedge(g_{4} \vee(p_{3} \wedge(g_{3} \vee(p_{2} \wedge(g_{2} \vee(p_{1} \wedge g_{1})))))) \\
c_{3} & =p_{3} \wedge(g_{3} \vee(p_{2} \wedge(g_{2} \vee(p_{1} \wedge g_{1})))) \\
c_{2} & =p_{2} \wedge(g_{2} \vee(p_{1} \wedge g_{1})) \\
c_{1} & =p_{1} \wedge g_{1}
\end{aligned}
$$

Note that Case 1 is just a serial computation of $c_{1}, c_{2}, \ldots$, and $c_{8}$.

Case 2: $r=2$ and $q=4$.
Then,

$$
\begin{aligned}
& P_{1}=p_{4} \wedge p_{3} \wedge p_{2} \wedge p_{1}, \quad P_{2}=p_{8} \wedge p_{7} \wedge p_{6} \wedge p_{5} \\
& D_{1}=P_{2}, \quad D_{2}=1 \\
& E_{1}=p_{4} \wedge(g_{4} \vee(p_{3} \wedge(g_{3} \vee(p_{2} \wedge(g_{2} \vee(p_{1} \wedge g_{1})))))) \\
& E_{2}=p_{8} \wedge(g_{8} \vee(p_{7} \wedge(g_{7} \vee(p_{6} \wedge(g_{6} \vee(p_{5} \wedge g_{5}))))))
\end{aligned}
$$
and

$$
\begin{aligned}
& F_{1}=D_{1} \wedge E_{1}=\left[P_{2}\right] \wedge E_{1} \\
& F_{2}=D_{2} \wedge E_{2}=E_{2} .
\end{aligned}
$$

Then,

$$
\begin{array}{rlr}
c_{8} & =F_{2} \vee F_{1} \\
& =E_{2} \quad \vee \quad\left[P_{2}\right] \wedge E_{1}
\end{array}
$$

All other $c_{i}$, for $1 \leq i \leq 7$, are also obtained as a byproduct of the $c_{8}$ computation.

$$
\begin{array}{lll}
c_{7}=\left[p_{7} \wedge\left(g_{7} \vee\left(p_{6} \wedge\left(g_{6} \vee\left(p_{5} \wedge g_{5}\right)\right)\right)\right)\right] & \vee & {\left[p_{7} \wedge p_{6} \wedge p_{5}\right] \wedge E_{1}} \\
c_{6}=\left[p_{6} \wedge\left(g_{6} \vee\left(p_{5} \wedge g_{5}\right)\right)\right] & \vee & {\left[p_{6} \wedge p_{5}\right] \wedge E_{1}} \\
c_{5}=\left[p_{5} \wedge g_{5}\right] & \vee & {\left[p_{5}\right] \wedge E_{1}} \\
c_{4}=E_{1} & & \\
c_{3}=\left[p_{3} \wedge\left(g_{3} \vee\left(p_{2} \wedge\left(g_{2} \vee\left(p_{1} \wedge g_{1}\right)\right)\right)\right)\right] & \\
c_{2}=\left[p_{2} \wedge\left(g_{2} \vee\left(p_{1} \wedge g_{1}\right)\right)\right] & \\
c_{1}=\left[p_{1} \wedge g_{1}\right] &
\end{array}
$$

Case 3: $r=4$ and $q=2$.
Then,

$$
\begin{aligned}
& P_{1}=p_{2} \wedge p_{1}, \quad P_{2}=p_{4} \wedge p_{3}, \quad P_{3}=p_{6} \wedge p_{5}, \quad P_{4}=p_{8} \wedge p_{7} \\
& D_{1}=P_{4} \wedge P_{3} \wedge P_{2}, \quad D_{2}=P_{4} \wedge P_{3}, \quad D_{3}=P_{4}, \quad D_{4}=1 \\
& E_{1}=p_{2} \wedge\left(g_{2} \vee\left(p_{1} \wedge g_{1}\right)\right) \quad E_{2}=p_{4} \wedge\left(g_{4} \vee\left(p_{3} \wedge g_{3}\right)\right) \\
& E_{3}=p_{6} \wedge\left(g_{6} \vee\left(p_{5} \wedge g_{5}\right)\right) \quad E_{4}=p_{8} \wedge\left(g_{8} \vee\left(p_{7} \wedge g_{7}\right)\right)
\end{aligned}
$$

and

$$
\begin{aligned}
& F_{1}=D_{1} \wedge E_{1}=\left[P_{4} \wedge P_{3} \wedge P_{2}\right] \wedge E_{1} \\
& F_{2}=D_{2} \wedge E_{2}=\left[P_{4} \wedge P_{3}\right] \wedge E_{2} \\
& F_{3}=D_{3} \wedge E_{3}=\left[P_{4}\right] \wedge E_{3} \\
& F_{4}=D_{4} \wedge E_{4}=E_{4}
\end{aligned}
$$

Then,

$$
\begin{aligned}
c_{8} & =F_{4} \vee F_{3} \vee F_{2} \vee F_{1} \\
& =E_{4} \quad \vee \quad\left[P_{4}\right] \wedge E_{3} \quad \vee \quad\left[P_{4} \wedge P_{3}\right] \wedge E_{2} \quad \vee \quad\left[P_{4} \wedge P_{3} \wedge P_{2}\right] \wedge E_{1}
\end{aligned}
$$

All other $c_{i}$, for $1 \leq i \leq 7$, are also obtained as a byproduct of the $c_{8}$ computation.

$$
\begin{aligned}
c_{7} & =\left[p_{7} \wedge g_{7}\right] \quad \vee \quad\left[p_{7}\right] \wedge E_{3} \quad \vee \quad\left[p_{7} \wedge P_{3}\right] \wedge E_{2} \quad \vee \quad\left[p_{7} \wedge P_{3} \wedge P_{2}\right] \wedge E_{1} \\
c_{6} & =E_{3} \quad \vee \quad\left[P_{3}\right] \wedge E_{2} \quad \vee \quad\left[P_{3} \wedge P_{2}\right] \wedge E_{1} \\
c_{5} & =\left[p_{5} \wedge g_{5}\right] \quad \vee \quad\left[p_{5}\right] \wedge E_{2} \quad \vee \quad\left[p_{5} \wedge P_{2}\right] \wedge E_{1} \\
c_{4} & =E_{2} \quad \vee \quad\left[P_{2}\right] \wedge E_{1} \\
c_{3} & =\left[p_{3} \wedge g_{3}\right] \quad \vee \quad\left[p_{3}\right] \wedge E_{1} \\
c_{2} & =E_{1} \\
c_{1} & =\left[p_{1} \wedge g_{1}\right]
\end{aligned}
$$

Case 4: $r=8$ and $q=1$.
Then,

$$
\begin{array}{llll}
P_{1}=p_{1}, & P_{2}=p_{2}, & P_{3}=p_{3}, & P_{4}=p_{4}, \\
P_{5}=p_{5}, & P_{6}=p_{6}, & P_{7}=p_{7}, & P_{8}=p_{8}
\end{array}
$$

$$
\begin{aligned}
& D_{1}=P_{8} \wedge P_{7} \wedge P_{6} \wedge P_{5} \wedge P_{4} \wedge P_{3} \wedge P_{2}, \\
& D_{2}=P_{8} \wedge P_{7} \wedge P_{6} \wedge P_{5} \wedge P_{4} \wedge P_{3}, \\
& D_{3}=P_{8} \wedge P_{7} \wedge P_{6} \wedge P_{5} \wedge P_{4}, \\
& D_{4}=P_{8} \wedge P_{7} \wedge P_{6} \wedge P_{5}, \\
& D_{5}=P_{8} \wedge P_{7} \wedge P_{6}, \\
& D_{6}=P_{8} \wedge P_{7}, \\
& D_{7}=P_{8}, \\
& D_{8}=1 \\
& E_{1}=p_{1} \wedge g_{1}, \quad E_{2}=p_{2} \wedge g_{2}, \quad E_{3}=p_{3} \wedge g_{3}, \quad E_{4}=p_{4} \wedge g_{4}, \\
& E_{5}=p_{5} \wedge g_{5}, \quad E_{6}=p_{6} \wedge g_{6}, \quad E_{7}=p_{7} \wedge g_{7}, \quad E_{8}=p_{8} \wedge g_{8}
\end{aligned}
$$

and

$$
\begin{aligned}
& F_{1}=D_{1} \wedge E_{1}=\left[P_{8} \wedge P_{7} \wedge P_{6} \wedge P_{5} \wedge P_{4} \wedge P_{3} \wedge P_{2}\right] \wedge E_{1} \\
& F_{2}=D_{2} \wedge E_{2}=\left[P_{8} \wedge P_{7} \wedge P_{6} \wedge P_{5} \wedge P_{4} \wedge P_{3}\right] \wedge E_{2} \\
& F_{3}=D_{3} \wedge E_{3}=\left[P_{8} \wedge P_{7} \wedge P_{6} \wedge P_{5} \wedge P_{4}\right] \wedge E_{3} \\
& F_{4}=D_{4} \wedge E_{4}=\left[P_{8} \wedge P_{7} \wedge P_{6} \wedge P_{5}\right] \wedge E_{4} \\
& F_{5}=D_{5} \wedge E_{5}=\left[P_{8} \wedge P_{7} \wedge P_{6}\right] \wedge E_{5} \\
& F_{6}=D_{6} \wedge E_{6}=\left[P_{8} \wedge P_{7}\right] \wedge E_{6} \\
& F_{7}=D_{7} \wedge E_{7}=\left[P_{8}\right] \wedge E_{7} \\
& F_{8}=D_{8} \wedge E_{8}=E_{8}
\end{aligned}
$$

Then,

$$
\begin{aligned}
c_{8}= & F_{8} \vee F_{7} \vee F_{6} \vee F_{5} \vee F_{4} \vee F_{3} \vee F_{2} \vee F_{1} \\
= & E_{8} \quad \vee \quad\left[P_{8}\right] \wedge E_{7} \quad \vee \quad\left[P_{8} \wedge P_{7}\right] \wedge E_{6} \quad \vee \quad\left[P_{8} \wedge P_{7} \wedge P_{6}\right] \wedge E_{5} \quad \vee \\
& {\left[P_{8} \wedge P_{7} \wedge P_{6} \wedge P_{5}\right] \wedge E_{4} \quad \vee \quad\left[P_{8} \wedge P_{7} \wedge P_{6} \wedge P_{5} \wedge P_{4}\right] \wedge E_{3} \quad \vee} \\
& {\left[P_{8} \wedge P_{7} \wedge P_{6} \wedge P_{5} \wedge P_{4} \wedge P_{3}\right] \wedge E_{2} \quad \vee \quad\left[P_{8} \wedge P_{7} \wedge P_{6} \wedge P_{5} \wedge P_{4} \wedge P_{3} \wedge P_{2}\right] \wedge E_{1}}
\end{aligned}
$$

All other $c_{i}$, for $1 \leq i \leq 7$, are also obtained as a byproduct of the $c_{8}$ computation.

$$
\begin{aligned}
c_{7}= & E_{7} \vee\left[P_{7}\right] \wedge E_{6} \vee\left[P_{7} \wedge P_{6}\right] \wedge E_{5} \vee\left[P_{7} \wedge P_{6} \wedge P_{5}\right] \wedge E_{4} \\
& \vee\left[P_{7} \wedge P_{6} \wedge P_{5} \wedge P_{4}\right] \wedge E_{3} \vee\left[P_{7} \wedge P_{6} \wedge P_{5} \wedge P_{4} \wedge P_{3}\right] \wedge E_{2} \\
& \vee\left[P_{7} \wedge P_{6} \wedge P_{5} \wedge P_{4} \wedge P_{3} \wedge P_{2}\right] \wedge E_{1} \\
c_{6}= & E_{6} \vee\left[P_{6}\right] \wedge E_{5} \vee\left[P_{6} \wedge P_{5}\right] \wedge E_{4} \vee\left[P_{6} \wedge P_{5} \wedge P_{4}\right] \wedge E_{3} \\
& \vee\left[P_{6} \wedge P_{5} \wedge P_{4} \wedge P_{3}\right] \wedge E_{2} \vee\left[P_{6} \wedge P_{5} \wedge P_{4} \wedge P_{3} \wedge P_{2}\right] \wedge E_{1} \\
c_{5}= & E_{5} \vee\left[P_{5}\right] \wedge E_{4} \vee\left[P_{5} \wedge P_{4}\right] \wedge E_{3} \vee\left[P_{5} \wedge P_{4} \wedge P_{3}\right] \wedge E_{2} \\
& \vee\left[P_{5} \wedge P_{4} \wedge P_{3} \wedge P_{2}\right] \wedge E_{1} \\
c_{4}= & E_{4} \vee\left[P_{4}\right] \wedge E_{3} \vee\left[P_{4} \wedge P_{3}\right] \wedge E_{2} \vee\left[P_{4} \wedge P_{3} \wedge P_{2}\right] \wedge E_{1} \\
c_{3}= & E_{3} \vee\left[P_{3}\right] \wedge E_{2} \vee\left[P_{3} \wedge P_{2}\right] \wedge E_{1} \\
c_{2}= & E_{2} \vee\left[P_{2}\right] \wedge E_{1} \\
c_{1}= & E_{1}
\end{aligned}
$$
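The fully expanded case ($r=8$, $q=1$) can be checked numerically. The following Python sketch is illustrative only and rests on assumptions not stated in this passage: that $p_{i}=a_{i} \vee b_{i}$ and $g_{i}=a_{i} \wedge b_{i}$ (consistent with the AND/OR counts in the gate-count table), that there is no carry-in, and that $s_{i}=a_{i} \oplus b_{i} \oplus c_{i-1}$. The function names `carries_flat` and `add` are hypothetical.

```python
def carries_flat(a_bits, b_bits):
    """Carries c_1..c_n from the expanded form
    c_i = OR over j <= i of [P_i ^ ... ^ P_{j+1}] ^ E_j.
    a_bits[0] and b_bits[0] hold bit 1 (the least significant bit)."""
    n = len(a_bits)
    p = [x | y for x, y in zip(a_bits, b_bits)]   # assumed: p_i = a_i OR b_i
    g = [x & y for x, y in zip(a_bits, b_bits)]   # assumed: g_i = a_i AND b_i
    e = [pi & gi for pi, gi in zip(p, g)]         # E_i = p_i AND g_i
    carries = []
    for i in range(n):
        c = 0
        for j in range(i + 1):
            d = 1                                 # empty product is 1 (like D_8 = 1)
            for k in range(j + 1, i + 1):
                d &= p[k]
            c |= d & e[j]
        carries.append(c)
    return carries


def add(a, b, n=8):
    """Add two n-bit integers using the carries above (no carry-in assumed)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    c = [0] + carries_flat(a_bits, b_bits)        # c[0] = 0 is the assumed carry-in
    s = [a_bits[i] ^ b_bits[i] ^ c[i] for i in range(n)]
    return sum(bit << i for i, bit in enumerate(s)) + (c[n] << n)
```

Under these assumptions, `add(a, b)` should agree with ordinary integer addition for all 8-bit operands, which makes it a convenient check on the expanded expressions for $c_{1}, \ldots, c_{8}$.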

In general, given $N=2^{n}$, let $r \geq 1$ and $q \geq 1$ be integers such that $N=r q$. The following holds.

- Decomposition: $r$ blocks, each with $q$ elements, where

$$
r=2^{i} \text { and } q=2^{n-i} \quad \text { for } \quad 0 \leq i \leq n .
$$

  - Number of possible parallel implementations $=n$.

- Family of prefix circuits:
  - For each $r>1$, the number of prefix circuits built $=r-1$.
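The decomposition rule can be enumerated directly. This is a minimal Python sketch, assuming $N$ is a power of two as stated above; the names `decompositions` and `family_size` are hypothetical helpers introduced only for illustration.

```python
def decompositions(N):
    """Return all (r, q) with r = 2**i and q = N // r, for N a power of two."""
    if N < 1 or N & (N - 1):
        raise ValueError("N must be a power of two")
    pairs = []
    r = 1
    while r <= N:
        pairs.append((r, N // r))
        r *= 2
    return pairs


def family_size(r):
    """Number of prefix circuits in the family for r > 1 blocks."""
    return r - 1


# The parallel implementations are those with more than one block (r > 1).
parallel = [(r, q) for r, q in decompositions(8) if r > 1]
```

For $N=8$ the enumeration contains the serial case $(r, q)=(1,8)$ plus $n=3$ parallel cases, and a 64-bit adder with block size four ($r=16$) has a family of fifteen prefix circuits.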

The computation time for an $N$-bit adder is at least

$$
4+\lceil\lg r\rceil+\max \left\{T_{c}(q),\lceil\lg q\rceil+\text { PrefixComputationTime }\right\}
$$

Table 6.1 lists all the operations used in the calculation of an $N$-bit adder. The total number of AND gates used depends on the prefix circuit. The degree of parallelism depends on the size of the block and the number of blocks: the bigger the block size, the lower the degree of parallelism. When the size of the block is $N$, all carry bits are computed serially.

Table 6.2: Gate count of an $N$-bit adder.

| Gate Count | AND | OR | XOR |
| :--- | :---: | :---: | :---: |
| Computing $p_{i}$ and $g_{i}$ for $1 \leq i \leq N$ | $N$ | $N$ | - |
| Computing $E_{i}$ for $1 \leq i \leq r$ | $N$ | $r(q-1)$ | - |
| Prefix Circuit | $\frac{q(r-1) r}{2}$ | - | - |
| Combining $E_{i}$ and Prefix Circuit | - | $\frac{q(r-1) r}{2}$ | - |
| Calculating $c_{i}$ for $1 \leq i \leq N$ | - | - | $2 N$ |
| Calculating $s_{i}$ for $1 \leq i \leq N$ |  |  | - |
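As a quick consistency check, the OR and XOR columns of the gate-count table above can be summed into closed forms. The sketch below assumes that the column totals are simply the sums of the table rows; the function names are illustrative. The AND count is omitted because it depends on the prefix circuit chosen.

```python
def or_gate_count(r, q):
    """Total OR gates for an N-bit adder with r blocks of q elements (N = r*q):
    N + r(q - 1) + q(r - 1)r / 2, the sum of the OR column."""
    n = r * q
    return n + r * (q - 1) + q * (r - 1) * r // 2


def xor_gate_count(r, q):
    """Total XOR gates: 2N, independent of the blocking scheme."""
    return 2 * r * q


# OR-gate totals for the four 8-bit decompositions (r, q).
or_counts_8bit = [or_gate_count(r, 8 // r) for r in (1, 2, 4, 8)]
```

For the 8-bit adder these expressions give 15, 18, 24, and 36 OR gates for $r=1,2,4,8$ and 16 XOR gates throughout, which agrees with the OR and XOR columns reported for the 8-bit designs in Table 7.1.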

## CHAPTER 7

## SIMULATION RESULTS

In Chapter 5, the performance of the parallel prefix circuits described in Chapter 2 was investigated in terms of time-delay, power consumption, and power-delay product. In this chapter, we extend the investigation to their application in prefix adders. A binary adder using prefix circuits of varying sizes was proposed by Brent [Bre70], and we use Brent's algorithm in our investigation. We compare the number of operations used, the time-delay, the power consumption, and the power-delay product in order to determine the effect of varying the size of different parallel prefix circuits on speed and power consumption.

### 7.1 Effect of Block Size on Adder Implementation

The 8-, 16-, 32-, and 64-bit adders with varying block sizes for computation of carries were simulated. The simulation was carried out using PSpice at a power supply voltage of 3V. The divide-and-conquer prefix circuit is the candidate circuit for the study. The simulation results are shown in Table 7.1 and Figure 7.1.

Table 7.1 compares the exact gate count, time-delay, power consumption, and power-delay product for all prefix adders studied. Figure 7.1 shows the same comparisons graphically. The comparison results obtained from PSpice simulation allow us to conclude that there is a difference in speed and power between

Table 7.1: Gate count, delay time, power consumption, and power-delay product of different designs of 8-, 16-, 32-, and 64-bit adders using the divide-and-conquer prefix circuit.

8-bit adder

| Number of Blocks | Block Size | AND Gates | OR Gates | XOR Gates | Delay (us) | Power Consumption (uW) | Power-delay Product |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 8 | 16 | 15 | 16 | 8.14 | 0.99 | 8.05 |
| 2 | 4 | 24 | 18 | 16 | 6.01 | 1.15 | 6.90 |
| 4 | 2 | 37 | 24 | 16 | 4.50 | 1.46 | 6.56 |
| 8 | 1 | 65 | 36 | 16 | 4.02 | 2.10 | 8.42 |

16-bit adder

| Number of Blocks | Block Size | AND Gates | OR Gates | XOR Gates | Delay (us) | Power Consumption (uW) | Power-delay Product |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 16 | 32 | 31 | 32 | 16.43 | 1.99 | 32.65 |
| 2 | 8 | 52 | 38 | 32 | 10.20 | 2.39 | 24.31 |
| 4 | 4 | 80 | 52 | 32 | 6.60 | 3.07 | 20.22 |
| 8 | 2 | 137 | 80 | 32 | 5.08 | 4.32 | 21.9 |
| 16 | 1 | 257 | 136 | 32 | 4.51 | 5.08 | 22.88 |

32-bit adder

| Number of Blocks | Block Size | AND Gates | OR Gates | XOR Gates | Delay (us) | Power Consumption (uW) | Power-delay Product |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 32 | 64 | 63 | 64 | 33.12 | 3.96 | 131.17 |
| 2 | 16 | 112 | 78 | 64 | 18.55 | 4.85 | 89.91 |
| 4 | 8 | 172 | 108 | 64 | 10.79 | 6.30 | 68.01 |
| 8 | 4 | 288 | 168 | 64 | 7.19 | 9.13 | 65.66 |
| 16 | 2 | 529 | 288 | 64 | 5.68 | 14.84 | 84.33 |

64-bit adder

| Number of Blocks | Block Size | AND Gates | OR Gates | XOR Gates | Delay (us) | Power Consumption (uW) | Power-delay Product |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 4 | 16 | 368 | 220 | 128 | 19.19 | 13.57 | 260.30 |
| 8 | 8 | 604 | 334 | 128 | 11.41 | 19.65 | 224.17 |
| 16 | 4 | 1088 | 592 | 128 | 7.81 | 31.36 | 244.78 |

[Figure: four panels, one each for the 8-bit, 16-bit, 32-bit, and 64-bit adders.]

Figure 7.1: Plots of delay time, power consumption, and power-delay product of different designs of 8-, 16-, 32-, and 64-bit adders using the divide-and-conquer prefix circuit.

different blocking schemes. With regard to the power-delay product, it is interesting to observe that the best-performing implementation lies near the middle of the range of possible block sizes. The optimum block size for the 8-bit adder is two. A block size of four is optimum for the 16-bit and 32-bit adders, while a block size of eight is optimum for the 64-bit adder.
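The optimum block sizes quoted above can be re-derived from the delay and power columns of Table 7.1. The following Python sketch is illustrative only: the dictionary layout and the function name `optimum_block_size` are assumptions, and the product is recomputed from the rounded table entries, so it may differ in the last digit from the tabulated power-delay product.

```python
# (delay in us, power in uW) copied from Table 7.1, keyed by adder width
# and then by block size. The data layout is an illustrative assumption.
TABLE_7_1 = {
    8: {8: (8.14, 0.99), 4: (6.01, 1.15), 2: (4.50, 1.46), 1: (4.02, 2.10)},
    16: {16: (16.43, 1.99), 8: (10.20, 2.39), 4: (6.60, 3.07),
         2: (5.08, 4.32), 1: (4.51, 5.08)},
    32: {32: (33.12, 3.96), 16: (18.55, 4.85), 8: (10.79, 6.30),
         4: (7.19, 9.13), 2: (5.68, 14.84)},
    64: {16: (19.19, 13.57), 8: (11.41, 19.65), 4: (7.81, 31.36)},
}


def optimum_block_size(width):
    """Block size with the smallest delay x power product for a given width."""
    rows = TABLE_7_1[width]
    return min(rows, key=lambda q: rows[q][0] * rows[q][1])


optima = {width: optimum_block_size(width) for width in TABLE_7_1}
```

Running this over the four adder widths selects block sizes of two, four, four, and eight for the 8-, 16-, 32-, and 64-bit adders, respectively.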

The effect of the blocking schemes is summarized in Figure 7.2. As the block size increases, the computation takes longer to complete but consumes less power. A larger block size performs fewer operations and has a lower degree of parallelism. For example, an $N$-bit adder implemented with a block size of $N$ has the smallest number of operations and the lowest power consumption compared to implementations using smaller block sizes.


Figure 7.2: The illustration of the effect of the block size on other factors.

Unfortunately, it is also the slowest. The implementation with the smallest block size gives the fastest adder for every input length, at the cost of a large number of operations and high power consumption. From the power consumption point of view, the implementation with the largest block size is therefore the most efficient one. The opposite holds if circuit delay is what matters. However, both the largest and the smallest block sizes show a poor power-delay product.

### 7.2 Effect of Prefix Circuit on Adder Implementation

In this section, three different prefix circuits (the divide-and-conquer, the Shih-Lin, and the LYD prefix circuits) were chosen as candidates for carry computation in our simulation study. This is because, in Chapter 5, we found that the divide-and-conquer prefix circuit is the fastest prefix circuit while the LYD prefix circuit gives the best performance in terms of power-delay product. The Shih-Lin prefix circuit is a (size, depth)-optimal prefix circuit [LS99]. Recall that there are two issues to be considered in the process of carry computation. The first is the computation of the prefix circuit inside the block, and the second is the computation of a family of prefix circuits.

In the simulation, the three best block sizes in terms of power-delay product (four, eight, and sixteen) are considered. Figure 7.3 shows the simulation results for the 64-bit adder with these three block sizes, implemented with three different prefix circuits. The optimum block size in terms of the power-delay product turns out to be eight for each of the three prefix circuits. The power-delay products of the LYD and Shih-Lin prefix circuits are similar. The divide-and-conquer prefix circuit has the highest power-delay product when the block size is eight or sixteen. However, when the block size is reduced to four, the power-delay product of the divide-and-conquer prefix circuit becomes closer to those of the LYD and Shih-Lin prefix circuits. This is due to the strong impact on power consumption of using the divide-and-conquer prefix circuit in the computation of the family of prefix circuits. The computation benefits from the divide-and-conquer prefix circuit's well-organized structure, which allows efficient implementation.

The size of the family of prefix circuits depends on the block size: the smaller the block size, the bigger the family of prefix circuits. For example, for the 64-bit adder, a block size of four has fifteen prefix circuits in the family, whereas a block size of sixteen has only three. When the family of prefix circuits is larger, there is a greater chance that its members share structure with other members. This results in lower power consumption because the number of computation nodes is reduced. With a smaller family of prefix circuits, on the other hand, there is little sharing.


Figure 7.3: The plot of power-delay product of the divide-and-conquer, the LYD, and the Shih-Lin prefix circuits.

The choice of the prefix circuit inside the block also dominates the power consumed, especially when the block size is large. To see its effect on the power-delay product, consider the power-delay product of the 64-bit adder implemented with a block size of sixteen using four different prefix circuits, shown in Figure 7.4. The figure shows that the (size, depth)-optimal prefix circuits (i.e., the Shih-Lin and the LYD prefix circuits) have a smaller power-delay product than the (size, depth)-non-optimal prefix circuits (i.e., the divide-and-conquer and the Brent-Kung prefix circuits). As in the simulation results in Chapter 5, the divide-and-conquer prefix circuit has the highest power-delay product, followed by the Brent-Kung, the Shih-Lin, and the LYD prefix circuits.


Figure 7.4: The plot of power-delay product of four prefix circuits used for carry calculation in a 64-bit adder implemented with a block size of sixteen.

### 7.3 Summary

A binary adder implemented with different blocking schemes consumes different levels of power. According to Brent's algorithm [Bre70], there are $\lg N$ ways to implement parallel $N$-bit adders. In terms of power-delay product, our simulation results show that the optimum block size falls somewhere in the middle of all the block sizes. In order to implement a low-power prefix adder [Bre70], the LYD prefix circuit is a good candidate for implementing the prefix circuit inside the block, while a prefix circuit with a well-organized structure is a good candidate for implementing the family of prefix circuits.

## CHAPTER 8

## CONCLUSIONS

The three most widely accepted metrics for measuring the quality of a circuit are its area, speed, and power consumption. Optimizing area and speed has been considered important for a long time, but minimizing power consumption has gained prominence only recently. Power consumption is an important issue in both portable and non-portable systems.

When capacitance and operating frequency are fixed, the dominant source of power consumption depends on the supply voltage and the switching activity. Therefore, reducing the voltage and the switching activity reduces power. However, a reduction in voltage may result in longer delays and reduced throughput. The reduction in throughput can be overcome by parallelism, but because of the size-depth trade-off characteristic of prefix circuits, parallelism can be increased only up to a certain level.

Different circuit structures induce different switching activity. As a result, different circuit architectures performing the same function can consume different amounts of power. Therefore, implementations of the various prefix circuits in an application will have different power consumptions as well. Usually, a circuit architecture with longer depth will consume more power than one with shorter depth. However, due to the size-depth trade-off characteristic of prefix circuits, the switching activity in a prefix circuit depends not only on its logic depth but also on the number of operation nodes at each level. A circuit with shorter depth and more nodes might have more switching activity than one with longer depth and fewer nodes.

In this dissertation we conducted a comparative study of various prefix circuits from the point of view of the power-speed trade-off. The dissertation presented a linear output capacitance model for estimating power consumption in seven families of prefix circuits. This model helps direct the design at a high level. Results obtained from the model and simulations refute several commonly held beliefs about the consumption of power in prefix circuits (e.g., that a circuit with shorter depth consumes less power than a circuit with longer depth), and also lend insight into possible prefix circuits for future power-prediction prefix circuit applications. In addition, based on the model and simulations, we have investigated the possible decrease in power consumption with the use of low supply voltages while maintaining the original performance level under different prefix circuits. For example, the simulation results have shown that power reductions of about 1.6 times can be obtained without throughput loss by using the LYD prefix circuit instead of the divide-and-conquer prefix circuit. Finally, the 8-, 16-, 32-, and 64-bit prefix adders were implemented and simulated under different blocking schemes. With regard to power-delay product, we found that the optimum block size falls somewhere around the middle of the various possible block sizes. For example, the results show that the optimum block size is two for the 8-bit adder, four for the 16-bit and 32-bit adders, and eight for the 64-bit adder. In order to implement a low-power prefix adder based on Brent's algorithm [Bre70], a (size, depth)-optimal prefix circuit is a good candidate for implementing the prefix circuit inside the block, while a prefix circuit with a well-organized structure is a good candidate for implementing the family of prefix circuits needed in the prefix adder.

There are several open questions for future work.

- Power modeling of prefix circuits with bounded fan-out.
- The effect of fan-out on time-delay.
- New prefix circuit that has a structure that benefits the computation of a family of prefix circuits.
- Pipelining implementation for low-power prefix circuit.


## BIBLIOGRAPHY

[AD98] C. Arjhan, and R. G. Deshmukh, "A Novel Fault-Model For Regular And Irregular Parallel-prefix Adders", Proceedings, IEEE on Southeastcon, pp. 397-400, 1998.
[BAW00] H. T. Bui, A. Al-Sheraidha, and Y. Wang, "Design and Analysis of 10-Transistor Full Adders Using Novel XOR-XNOR Gates", IEEE Proceedings of ICSP, pp. 619-622, 2000.
[BD01] V. A. Bartlett, and A. G. Dempster, "Using Carry-save Adders in Low-power Multiplier Blocks", The 2001 IEEE International Symposium on Circuits and Systems, Vol. 4, pp. 222-225, 2001.
[Bel01] C. Belady, "Cooling and Power Consideration for Semiconductors Into the Next Century", Proceedings of the 2001 International Symposium on Low Power Electronics and Design, pp. 100-105, 2001.
[BL01] A. Beaumont-Smith, and C. C. Lim, "Parallel Prefix Adder Design", Proceedings of the $15^{\text{th}}$ IEEE Symposium on Computer Arithmetic, pp. 218-225, 2001.
[Boy99] R. L. Boylestad, Introductory Circuit Analysis, Prentice Hall, 1999.
[Bre70] R. P. Brent, "On the Addition of Binary Numbers", IEEE Transactions on Computers, Vol. 19, pp. 758-759, 1970.
[BK82] R. P. Brent, and H. T. Kung, "A Regular Layout for Parallel Adders", IEEE Transactions on Computers, Vol. 31, pp. 260-264, 1982.
[BM00] L. Benini, and G. Micheli, "System-level Power Optimization: Techniques and Tools", ACM Transactions on Design Automation of Electronic Systems (TODAES), Vol. 5(2), April 2000.
[Cad00] Cadence Design Systems, Inc., PSpice User's Guide Manual, Version 9.2, San Jose, CA, January 2000.
[Cal96] T. K. Callaway, Area, Delay, and Power Modeling of CMOS Adders and Multipliers, Ph.D. Dissertation, The University of Texas at Austin, 1996.
[CB95] A. P. Chandrakasan, and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer Academic Publishers, Norwell, MA, 1995.
[FB01] A. A. Fayed, and M. A. Bayoumi, "A Low Power 10-transistor Full Adder Cell For Embedded Architectures", The 2001 IEEE International Symposium on Circuits and Systems, Vol. 4, pp. 226-229, 2001.
[FL00] S. B. Furber, and J. Liu, "A Novel Area-efficient Binary Adder", Conference Record of the Thirty-Fourth Asilomar Conference on Signals, Systems and Computers, Vol. 1, pp. 119-123, 2000.
[GNHF01] K. I. Geisler, T. D. Nielsen, D. F. Hall, and R. Frowd, "The Rise of Energy Delivery Management Systems", IEEE Transmission and Distribution Conference and Exposition, Vol. 2, pp. 895-900, 2001.
[HP90] J. Hennessy, and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, 1990.
[Hub00] P. Huber, "Why 99.9 percent is not good enough", ACM: Ubiquity, New York, 2000, http://www.acm.org/ubiquity/interviews/p huber 1.html.
[Hwa79] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, New York, John Wiley and Sons, 1979.
[JS95] K. J. Janik, and L. Shih-Lien, "VLSI Implementation of a 32-bit Kozen Formulation Ladner/Fischer Parallel Prefix Adder", Proceedings of the $8^{\text{th}}$ Annual IEEE International, pp. 57-59, 1995.
[Kno01] S. Knowles, "A Family of Adders", Proceedings of the $15^{\text{th}}$ IEEE Symposium on Computer Arithmetic, pp. 277-281, 2001.
[KBL95] U. Ko, P. T. Balsara, and W. Lee, "Low-Power Design Techniques for High-Performance CMOS Adders", IEEE Transactions on Very Large Scale Integration Systems, Vol. 3(2), 1995.
[Kor93] I. Koren, Computer Arithmetic Algorithms, Englewood Cliffs, N.J., Prentice Hall, 1993.
[LF80] R. E. Ladner, and M. J. Fischer, "Parallel Prefix Computation", Journal of ACM, Vol. 27, pp. 831-838, 1980.
[LYD87] S. Lakshmivarahan, C. M. Yang, and S. K. Dhall, "Optimal Parallel Prefix Circuits with $(\text{size}+\text{depth})=2N-2$ and $\lceil\log N\rceil \leq \text{depth} \leq \lceil 2 \log N\rceil-3$", Proceedings of the International Conference on Parallel Processing, pp. 58-65, 1987.
[LD90] S. Lakshmivarahan, and S. K. Dhall, Analysis and Design of Parallel Algorithms: Arithmetic and Matrix Problems, McGraw Hill, New York, NY, 1990.
[LD94] S. Lakshmivarahan, and S. K. Dhall, Parallel Computing Using the Prefix Problem, Oxford University Press, New York, NY, 1994.
[LS99] Y. M. Lin, and C. C. Shih, "A New Class of Depth-Size Optimal Parallel Prefix Circuits", Journal of Supercomputing, Vol. 14, pp. 39-52, 1999.
[Lin81] H. Ling, "High-Speed Binary Adder", IBM J. Research and Development, Vol. 25, pp. 156-166, May 1981.
[Mac96] E. Macii, "RT and Algorithmic-Level Optimization for Low Power", Low Power Design in Deep Submicron Electronics, Kluwer Academic Publishers, pp. 355-379, 1996.
[Mil00] M. Mills, "Kyoto and the Internet: The Energy Implications of the Digital Economy", Testimony before the Subcommittee on National Economic Growth, Natural Resources, and Regulatory Affairs, U.S. House of Representatives, Washington, DC, February 2000, http://www.house.gov/reform/neg/hearings/020200/mills.htm.
[NIO96] C. Nagendra, M. J. Irwin, and R. M. Owens, "Area-Time-Power Tradeoffs in Parallel Adders", IEEE Transactions on Circuits and Systems, Vol. 43(10), pp. 689-702, 1996.
[Omo94] A. R. Omondi, Computer Arithmetic Systems, Algorithms, Architecture and Implementations, Prentice Hall, 1994.
[RCN01] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective, early draft of the $2^{\text{nd}}$ edition, April 2001, http://bwrc.eecs.berkeley.edu/Classes//cBook/2ndEdition.html.
[Rad01] D. Radhakrishnan, "Low-voltage Low-power CMOS Full Adder", IEEE Proceedings on Circuits, Devices and Systems, Vol. 148, pp. 19-24, 2001.
[RP96] J. M. Rabaey, and M. Pedram, Low Power Design Methodologies, Kluwer Academic Publishers, Boston, 1996.
[RP00] K. Roy, and S. Prasad, "Low-power CMOS VLSI Circuit Design", John Wiley, New York, 2000.
[Smi97] M. Smith, Application-Specific Integrated Circuits, Addison Wesley, Menlo Park, CA, 1997.
[Sni86] M. Snir, "Depth-Size Tradeoffs for Parallel Prefix Computation", Journal of Algorithms, Vol. 17, pp. 185-201, 1986.
[WE93] N. H. E. Weste, and K. Eshraghian, Principles of CMOS VLSI Design: A System Perspective, Addison-Wesley, MA, 1993.
[ZS01] M. Ziegler, and M. Stan, "Optimal logarithmic adder structures with a fanout of two for minimizing the area-delay product", The 2001 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 657-660, Vol. 2, 2001.

## APPENDIX A

## $\boldsymbol{R C}$ network

Recall that there are three sources of power dissipation in a digital CMOS circuit. The major source of power dissipation is the logic transitions. As the nodes in a digital CMOS circuit transition back and forth between the two logic levels, the parasitic capacitances are charged and discharged.


Figure A.1: Inverter.

Most of the models used to explain the power consumption behavior of ICs are based on equations derived from the analysis of the CMOS inverter (see Figure A.1) [RCN01]. An understanding of its behavior can be extended to explain the behavior of more complicated designs such as NAND gates and adders. Hence, an overview of the CMOS inverter is presented here. Refer to [RCN01, WE93] for more detail.

In the switch model [RCN01], a transistor acts as either a switch or a resistor, depending on the value of its gate-to-source voltage, $V_{GS}$, and its threshold voltage, $V_{T}$. The transistor acts like an open switch when $\left|V_{GS}\right|<\left|V_{T}\right|$ and like a resistor when $\left|V_{GS}\right|>\left|V_{T}\right|$ (see Figure A.2).


Figure A.2: Switch model of CMOS transistor.

When a step input is applied to an inverting gate, the load capacitor charges and discharges in response. When the input goes from its high level to its low level ($V_{in}$ going from $V$ to 0), the $P$-type transistor acts like the resistor and the $N$-type transistor acts like the open switch (see Figure A.3(a)). An $RC$ network is formed. The capacitor charges toward the supply voltage through the resistor until $V_{out}=V$. This action is analogous to connecting a supply voltage to the $RC$ network, as illustrated in Figure A.3(a). When the input goes from its low level back to its high level ($V_{in}$ going from 0 to $V$), the $P$-type transistor acts like the open switch and the $N$-type transistor acts like the resistor (see Figure A.3(b)). An $RC$ network is formed. The $N$-type transistor provides a current path to ground. The capacitor discharges to ground until $V_{out}=0$. This action is analogous to replacing the NMOS with the $RC$ network, as illustrated in Figure A.3(b).


Figure A.3: The equivalent action of an inverting gate when a step input charges and discharges the capacitor.

When a capacitor charges or discharges through a resistor $R$, a certain time is required for the capacitor to charge or discharge fully. The rate at which the capacitor charges or discharges is determined by the time constant $RC$ (i.e., this is the first-order analysis of digital circuits). The product $RC$ has units of time, as the following derivation shows:

$$
R C=\left(\frac{V}{I}\right)\left(\frac{Q}{V}\right)=\left(\frac{V}{Q / t}\right)\left(\frac{Q}{V}\right)=t
$$

Its symbol is $\tau$ (Greek letter tau). Thus,

$$
\tau=R C .
$$

Its unit of measure is the second. Note that the value of $RC$ will never be greater than a few seconds because $C$ is very small (i.e., it is usually measured in microfarads or picofarads), unless $R$ is very large.

A capacitor charges and discharges following an exponential curve, as shown in Figure A.4. In the charging phase, the curve is an increasing exponential, while in the discharging phase it is a decreasing exponential. The voltage across a charging capacitor at any instant of time is given as follows:

$$
V_{out}(t)=\left(1-e^{-t / \tau}\right) V .
$$

In the discharging phase, the voltage across the capacitor is the following:

$$
V_{out}(t)=e^{-t / \tau} V .
$$

It takes approximately five time constants to fully charge or discharge a capacitor [Boy99].

Considering the charging phase, the factor $\left(1-e^{-t / \tau}\right)$ is an exponential function of the form $\left(1-e^{-x}\right)$, where $x=t / \tau$ and $e=2.71828 \ldots$. A plot of $\left(1-e^{-x}\right)$ for $x \geq 0$ appears in Figure A.4. The time to reach the $50 \%$ point, the propagation delay $\left(t_{p}\right)$, is $0.69 \tau$.
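The two numerical claims above (the $0.69 \tau$ propagation delay and the five-time-constant rule) follow directly from the exponential expressions. The Python sketch below evaluates them; the function names and the example values of $R$ and $C$ are hypothetical and chosen only for illustration.

```python
import math


def charging_voltage(t, tau, v_supply):
    """V_out(t) = (1 - exp(-t / tau)) * V for the charging phase."""
    return (1.0 - math.exp(-t / tau)) * v_supply


def discharging_voltage(t, tau, v_supply):
    """V_out(t) = exp(-t / tau) * V for the discharging phase."""
    return math.exp(-t / tau) * v_supply


def time_to_fraction(fraction, tau):
    """Time for the charging curve to reach `fraction` of V (0 <= fraction < 1)."""
    return -tau * math.log(1.0 - fraction)


# Hypothetical example values: R = 1 kOhm, C = 1 pF, so tau = RC = 1 ns.
tau = 1e3 * 1e-12
t_p = time_to_fraction(0.5, tau)                   # equals ln(2) * tau
level_after_5_tau = charging_voltage(5 * tau, tau, 1.0)
```

Solving $1-e^{-t / \tau}=0.5$ gives $t=\tau \ln 2 \approx 0.69 \tau$, and after $5 \tau$ the capacitor has reached more than $99 \%$ of its final voltage, which is why five time constants is treated as fully charged.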


Figure A.4: Charging and discharging exponential curves for an RC network.

A simple first-order derivation of $t_{p}$ is given by [CB95, RCN01]

$$
t_{p}=0.69 C_{L} R \approx \frac{C_{L} V_{DD}}{I}=\frac{C_{L} V_{DD}}{k(W / L)\left(V_{DD}-V_{T}\right)^{2}},
$$

where $C_{L}$ is the gate capacitance, $V_{DD}$ is the supply voltage, $V_{T}$ is the threshold voltage, $k$ is a technology-dependent parameter, and $W$ and $L$ are the channel width and length of the transistors, respectively. Clearly, the circuit delay is a function of supply


Figure A.5: The plot of the delay vs. $V_{DD}$.

voltage. Figure A.5 shows the plot of the circuit delay versus supply voltage, and the figure suggests that the propagation delay depends monotonically on the supply voltage. As the supply voltage is reduced, the delay of CMOS circuits increases monotonically.
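The monotonic dependence can be illustrated by evaluating the first-order expression $C_{L} V_{DD} /\left(k(W / L)\left(V_{DD}-V_{T}\right)^{2}\right)$ over a range of supply voltages. In the Python sketch below, every parameter value is a hypothetical placeholder chosen for illustration, not a value taken from the simulations, and the expression is only meaningful for $V_{DD}>V_{T}$.

```python
def propagation_delay(v_dd, c_load, k, w_over_l, v_t):
    """First-order delay estimate C_L * V_DD / (k * (W/L) * (V_DD - V_T)**2)."""
    if v_dd <= v_t:
        raise ValueError("the model only applies for V_DD > V_T")
    return c_load * v_dd / (k * w_over_l * (v_dd - v_t) ** 2)


# Hypothetical parameters, for illustration only.
C_LOAD = 50e-15      # farads
K = 100e-6           # A / V^2
W_OVER_L = 2.0
V_T = 0.7            # volts

supply_voltages = [1.5, 2.0, 2.5, 3.0, 3.3]
delays = [propagation_delay(v, C_LOAD, K, W_OVER_L, V_T) for v in supply_voltages]
```

Because the derivative of $V_{DD} /\left(V_{DD}-V_{T}\right)^{2}$ with respect to $V_{DD}$ is negative for all $V_{DD}>V_{T}$, the computed delays shrink as the supply voltage rises, and grow as it is lowered.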

Note that $t_{p}$ is an artificial gate quality metric, which is used to compare different semiconductor technologies or logic design styles [RCN01]. The $t_{p}$ of a gate defines how quickly the gate responds to a change at its input.

## Power Consumption due to Switching

As discussed previously, the power consumption in a CMOS circuit comes mainly from the dynamic source due to switching, and it can be calculated as the product of the current flow and the voltage difference. Consider a CMOS inverter. When the $P$-type transistor is charging a capacitance, $C_{eff}$, at a frequency, $f=1 / T$, with supply voltage, $V_{DD}$, the current through the transistor is $C_{eff}(dV / dt)$. The instantaneous power is thus $C_{eff} V(dV / dt)$ for one-half the period of the input, $T / 2=1 /(2 f)$. The power consumed at the $P$-type transistor can be calculated by the equation

$$
\frac{1}{T} \int_{0}^{T / 2} C_{eff} V\left(\frac{dV}{dt}\right) dt=\frac{C_{eff}}{T} \int_{0}^{V_{DD}} V \, dV=\frac{1}{2 T} C_{eff} V_{DD}^{2} .
$$

When the $N$-type transistor discharges the capacitor, the power dissipation is equal, making the total power consumption [CB95, WE93]

$$
P_{\text{switching}}=C_{eff} V_{DD}^{2} f .
$$

Let $p_{f}$ be the switching activity per clock cycle and $C_{L}$ be the amount of load capacitance switched. Thus, $C_{eff}=p_{f} C_{L}$. The average dynamic power consumption of a CMOS gate due to the switching current is equal to

$$
\begin{equation*}
P_{\text{switching}}=p_{f} C_{L} V_{DD}^{2} f \tag{1}
\end{equation*}
$$

Since the time integral of power is energy ($\Delta E$), it follows that

$$
\Delta E=P_{\text{switching}} \times T .
$$
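Equation (1) and the energy relation can be evaluated directly. The Python sketch below uses hypothetical values for the switching activity, load capacitance, supply voltage, and frequency; only the formulas themselves come from the text, and the function names are illustrative.

```python
def switching_power(p_f, c_load, v_dd, f):
    """Average dynamic power, Equation (1): p_f * C_L * V_DD**2 * f."""
    return p_f * c_load * v_dd ** 2 * f


def energy_per_period(p_f, c_load, v_dd, f):
    """Energy over one period T = 1 / f: Delta E = P_switching * T."""
    return switching_power(p_f, c_load, v_dd, f) * (1.0 / f)


# Hypothetical example: activity 0.5, 1 pF load, 3 V supply, 1 MHz clock.
p_3v = switching_power(0.5, 1e-12, 3.0, 1e6)
p_1v5 = switching_power(0.5, 1e-12, 1.5, 1e6)
```

Because of the quadratic dependence on $V_{DD}$, halving the supply voltage in this example reduces the switching power to one quarter, while the energy per period is independent of the frequency.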

As an example of finding the power consumption of a circuit, the following plot shows the waveforms of the CMOS inverter, simulated using the circuit simulator PSpice (see Figure A.6). From top to bottom, the waveforms represent the inverter input voltage, the inverter output voltage, and the inverter dynamic power consumption. CMOS is very power efficient because it consumes power only during the brief periods of switching (see Figure A.6). We can conclude that the power consumption of CMOS logic is directly proportional to the switching frequency. Figure A.7 shows the inverter's energy. Energy is the time integral of power, so the average power can be computed as follows.

$$
P_{\text{switching}}=\int_{0}^{T} I V_{DD} \, dt / T
$$

Thus, the power consumed by the inverter in Figure A.7 is equal to $\frac{\Delta E}{T}=\frac{8.817 u}{2 m s}$ watts.


Figure A.6: CMOS inverter's input and output waveforms.


Figure A.7: CMOS inverter's power and energy waveforms.

## APPENDIX B

## CMOS Gates



Figure B.1: A CMOS inverter


Figure B.2: A CMOS AND gate


Figure B.3: A CMOS OR gate


Figure B.4: A CMOS XOR gate

## APPENDIX C

Because of the size of the circuits, only the 8-bit XOR gate implementations are shown as examples.


Figure C.1: The worst-case input of the XOR gate (i.e., the first input is equal to 0 and the other inputs switch $0 \rightarrow 1$), causing the output to ripple the most.


Figure C.2: The 8-bit XOR gate implemented with the serial prefix circuit.


Figure C.3: The outputs of 8-bit XOR gates implemented with the serial prefix circuit, showing the longest ripple (the maximum number of switching).


Figure C.4: Delay of 8-bit XOR gates implemented with the serial prefix circuit from PSpice simulation.


Figure C.5: Energy of 8-bit XOR gates implemented with the serial prefix circuit from PSpice simulation.


Figure C.6: The 8-bit XOR gate implemented with the divide-and-conquer prefix circuit.


Figure C.7: The outputs of 8-bit XOR gates implemented with the divide-and-conquer prefix circuit.


Figure C.8: Delay of 8-bit XOR gates implemented with the divide-and-conquer prefix circuit from PSpice simulation.


Figure C.9: Energy of 8-bit XOR gates implemented with the divide-and-conquer prefix circuit from PSpice simulation.

## APPENDIX D

The implementation of a prefix adder is divided into three parts: preprocessing, carry computation, and postprocessing. Preprocessing and postprocessing are shown first. Then carry computation with different block sizes (i.e., four possible ways) is shown. Because of the large size of the circuits, only the 8-bit prefix adder implementations are shown.

## Preprocessing and Postprocessing



Figure D.1: Preprocessing: carry propagate bits and carry generate bits


Figure D.2: Postprocessing: $s_{i}=a_{i} \oplus b_{i} \oplus c_{i-1}$.

## Carry Computation

## 1. R1Q8



Figure D.3: The implementation of $E_{i}, i=1$.


Figure D.4: The implementation of carry bits.

## 2. R2Q4



Figure D.5: The implementation of $E_{i}, 1 \leq i \leq 2$.


Figure D.6: The implementation of prefix circuit with 4 inputs.


Figure D.7: The implementation of carry bits.

## 3. R4Q2



Figure D.8: The implementation of $E_{i}, 1 \leq i \leq 4$.


Figure D.9: The implementation of prefix circuit.


Figure D.10: The implementation of carry bits.

## 4. R8Q1



Figure D.11: The implementation of $E_{i}, 1 \leq i \leq 8$.


Figure D.12: The implementation of prefix circuit.


Figure D.13: The implementation of carry bits.

## APPENDIX E

Table E.1: A comparison of the exact capacitance values and the estimated capacitance values of the Brent-Kung prefix circuit.

| N | Estimate ($\mathrm{C}_{0}$) | Exact ($\mathrm{C}_{0}$) | \%Error |
| :---: | :---: | :---: | :---: |
| 2 | 1 | 1 | 0.00\% |
| 3 | 3.0838 | 3 | 2.79\% |
| 4 | 6 | 6 | 0.00\% |
| 5 | 9.5578 | 9 | 6.20\% |
| 6 | 13.6312 | 13 | 4.86\% |
| 7 | 18.1329 | 18 | 0.74\% |
| 8 | 23 | 23 | 0.00\% |
| 9 | 28.1848 | 27 | 4.39\% |
| 10 | 33.6504 | 32 | 5.16\% |
| 11 | 39.3671 | 38 | 3.60\% |
| 12 | 45.3109 | 44 | 2.98\% |
| 13 | 51.4617 | 51 | 0.91\% |
| 14 | 57.8028 | 57 | 1.41\% |
| 15 | 64.3197 | 64 | 0.50\% |
| 16 | 71 | 71 | 0.00\% |
| 17 | 77.8329 | 76 | 2.41\% |
| 18 | 84.8089 | 82 | 3.43\% |
| 19 | 91.9195 | 89 | 3.28\% |
| 20 | 99.1573 | 96 | 3.29\% |
| 21 | 106.5156 | 104 | 2.42\% |
| 22 | 113.9883 | 111 | 2.69\% |
| 23 | 121.5698 | 119 | 2.16\% |
| 24 | 129.2552 | 127 | 1.78\% |
| 25 | 137.0400 | 136 | 0.76\% |
| 26 | 144.9199 | 143 | 1.34\% |
| 27 | 152.8910 | 151 | 1.25\% |
| 28 | 160.9499 | 159 | 1.23\% |
| 29 | 169.0932 | 168 | 0.65\% |
| 30 | 177.3178 | 176 | 0.75\% |
| 31 | 185.6210 | 185 | 0.34\% |
| 32 | 194 | 194 | 0.00\% |
| 33 | 202.4524 | 200 | 1.23\% |
| 34 | 210.9757 | 207 | 1.92\% |
| 35 | 219.5679 | 215 | 2.12\% |
| 36 | 228.2269 | 223 | 2.34\% |
| 37 | 236.9507 | 232 | 2.13\% |
| 38 | 245.7375 | 240 | 2.39\% |
| 39 | 254.5856 | 249 | 2.24\% |
| 40 | 263.4933 | 258 | 2.13\% |
| 41 | 272.4590 | 268 | 1.66\% |
| 42 | 281.4813 | 276 | 1.99\% |
| 43 | 290.5588 | 285 | 1.95\% |
| 44 | 299.6901 | 294 | 1.94\% |
| 45 | 308.8739 | 304 | 1.60\% |
| 46 | 318.1091 | 313 | 1.63\% |
| 47 | 327.3945 | 323 | 1.36\% |
| 48 | 336.7289 | 333 | 1.12\% |
| 49 | 346.1113 | 344 | 0.61\% |
| 50 | 355.5407 | 352 | 1.01\% |
| 51 | 365.0161 | 361 | 1.11\% |
| 52 | 374.5366 | 370 | 1.23\% |
| 53 | 384.1012 | 380 | 1.08\% |
| 54 | 393.7091 | 389 | 1.21\% |
| 55 | 403.3594 | 399 | 1.09\% |
| 56 | 413.0515 | 409 | 0.99\% |
| 57 | 422.7843 | 420 | 0.66\% |
| 58 | 432.5574 | 429 | 0.83\% |
| 59 | 442.3698 | 439 | 0.77\% |
| 60 | 452.2210 | 449 | 0.72\% |
| 61 | 462.1103 | 460 | 0.46\% |
| 62 | 472.0369 | 470 | 0.43\% |
| 63 | 482.0004 | 481 | 0.21\% |
| 64 | 492 | 492 | 0.00\% |
| 65 | 502.0352208 | 499 | 0.61\% |
| 66 | 512.1054706 | 507 | 1.01\% |
| 67 | 522.2102 | 516 | 1.20\% |
| 68 | 532.3488765 | 525 | 1.40\% |
| 69 | 542.5209835 | 535 | 1.41\% |
| 70 | 552.7260201 | 544 | 1.60\% |
| 71 | 562.9634999 | 554 | 1.62\% |
| 72 | 573.2329504 | 564 | 1.64\% |
| 73 | 583.5339129 | 575 | 1.48\% |
| 74 | 593.8659414 | 584 | 1.69\% |
| 75 | 604.2286022 | 594 | 1.72\% |
| 76 | 614.6214737 | 604 | 1.76\% |
| 77 | 625.0441454 | 615 | 1.63\% |
| 78 | 635.496218 | 625 | 1.68\% |
| 79 | 645.9773024 | 636 | 1.57\% |
| 80 | 656.4870199 | 647 | 1.47\% |
| 81 | 667.0250013 | 659 | 1.22\% |
| 82 | 677.5908868 | 668 | 1.44\% |
| 83 | 688.1843256 | 678 | 1.50\% |
| 84 | 698.8049755 | 688 | 1.57\% |
| 85 | 709.4525028 | 699 | 1.50\% |
| 86 | 720.1265816 | 709 | 1.57\% |
| 87 | 730.826894 | 720 | 1.50\% |
| 88 | 741.5531294 | 731 | 1.44\% |
| 89 | 752.3049846 | 743 | 1.25\% |
| 90 | 763.0821631 | 753 | 1.34\% |
| 91 | 773.8843755 | 764 | 1.29\% |
| 92 | 784.7113387 | 775 | 1.25\% |
| 93 | 795.5627758 | 787 | 1.09\% |
| 94 | 806.4384162 | 798 | 1.06\% |
| 95 | 817.337995 | 810 | 0.91\% |
| 96 | 828.2612533 | 822 | 0.76\% |
| 97 | 839.2079374 | 835 | 0.50\% |
| 98 | 850.177799 | 844 | 0.73\% |
| 99 | 861.1705952 | 854 | 0.84\% |
| 100 | 872.1860878 | 864 | 0.95\% |
| 101 | 883.2240438 | 875 | 0.94\% |
| 102 | 894.2842347 | 885 | 1.05\% |
| 103 | 905.3664365 | 896 | 1.05\% |
| 104 | 916.47043 | 907 | 1.04\% |
| 105 | 927.5959998 | 919 | 0.94\% |
| 106 | 938.7429352 | 929 | 1.05\% |
| 107 | 949.9110293 | 940 | 1.05\% |
| 108 | 961.100079 | 951 | 1.06\% |
| 109 | 972.3098854 | 963 | 0.97\% |
| 110 | 983.5402531 | 974 | 0.98\% |
| 111 | 994.7909903 | 986 | 0.89\% |
| 112 | 1006.061909 | 998 | 0.81\% |
| 113 | 1017.352824 | 1011 | 0.63\% |
| 114 | 1028.663554 | 1021 | 0.75\% |
| 115 | 1039.993922 | 1032 | 0.77\% |
| 116 | 1051.343751 | 1043 | 0.80\% |
| 117 | 1062.71287 | 1055 | 0.73\% |
| 118 | 1074.101111 | 1066 | 0.76\% |
| 119 | 1085.508306 | 1078 | 0.70\% |
| 120 | 1096.934293 | 1090 | 0.64\% |
| 121 | 1108.378912 | 1103 | 0.49\% |
| 122 | 1119.842004 | 1114 | 0.52\% |
| 123 | 1131.323415 | 1126 | 0.47\% |
| 124 | 1142.822992 | 1138 | 0.42\% |
| 125 | 1154.340586 | 1151 | 0.29\% |
| 126 | 1165.876048 | 1163 | 0.25\% |
| 127 | 1177.429234 | 1176 | 0.12\% |
| 128 | 1189 | 1189 | 0.00\% |
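The \%Error column appears to be the difference between the estimate and the exact value, relative to the exact value. The following sketch, which assumes that definition and uses a hypothetical function name, checks it against a few rows copied from Table E.1.

```python
def percent_error(estimate, exact):
    """Relative error of the estimate with respect to the exact value, in percent."""
    return abs(estimate - exact) / exact * 100.0


# (N, estimated C0, exact C0) taken from Table E.1.
rows = [
    (3, 3.0838, 3),
    (5, 9.5578, 9),
    (64, 492, 492),
    (100, 872.1860878, 864),
]

for n, estimate, exact in rows:
    print(f"N={n:3d}  error={percent_error(estimate, exact):.2f}%")
```

The largest error among the tabulated values occurs at $N=5$, and the error is zero whenever $N$ is a power of two.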

