# The Area-Time Complexity of Binary Multiplication 

R. P. BRENT

The Australian National University, Canberra, Australia
AND
H. T. KUNG

Carnegie-Mellon University, Pittsburgh, Pennsylvania

ABSTRACT. The problem of performing multiplication of $n$-bit binary numbers on a chip is considered. Let $A$ denote the chip area and $T$ the time required to perform multiplication. By using a model of computation which is a realistic approximation to current and anticipated LSI or VLSI technology, it is shown that

$$
\left(\frac{A}{A_{0}}\right)\left(\frac{T}{T_{0}}\right)^{2 \alpha} \geq n^{1+\alpha}
$$

for all $\alpha \in[0,1]$, where $A_{0}$ and $T_{0}$ are positive constants which depend on the technology but are independent of $n$. The exponent $1+\alpha$ is the best possible. A consequence of this result is that binary multiplication is "harder" than binary addition. More precisely, if $\left(A T^{2 \alpha}\right)_{\mathrm{M}}(n)$ and $\left(A T^{2 \alpha}\right)_{\mathrm{A}}(n)$ denote the minimum area-time complexity for $n$-bit binary multiplication and addition, respectively, then

$$
\frac{\left(A T^{2 \alpha}\right)_{\mathrm{M}}(n)}{\left(A T^{2 \alpha}\right)_{\mathrm{A}}(n)}=\left\{\begin{array}{ll}
\Omega\left(n^{1-\alpha}\right) & \text { for } 0 \leq \alpha \leq \frac{1}{2} \\
\Omega\left(\frac{n^{\alpha}}{\log ^{2 \alpha} n}\right) & \text { for } \frac{1}{2}<\alpha \leq 1 \\
\Omega\left(\frac{n}{\log ^{2 \alpha} n}\right) & \text { for } \alpha>1
\end{array}\right\}\left(=\Omega\left(n^{1 / 2}\right) \text { for all } \alpha \geq 0\right)
$$

KEY WORDS AND PHRASES: area-time complexity, binary multiplication, chip design, chip layout, circuit design, combinational logic, chip complexity, lower bounds, VLSI

CR CATEGORIES: 5.25, 6.1, 6.32

## 1. Introduction

We are interested in the design of multipliers suitable for implementation in VLSI chips. The multiplication problem has been considered by several authors (see, e.g., [8, 10, 17, 19, 25, 27]). Much attention has been paid to the trade-off between time and the number of gates, but until recently little attention has been paid to the problem of connecting the gates in an economical and regular way to minimize chip area and design costs. In this paper we give lower and upper bounds on the area-time product for multiplication circuits, assuming a model of computation which is intended to approximate current and anticipated LSI or VLSI technology. Details of the model are given in Section 2.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
This research was supported in part by the National Science Foundation under Grant MCS 78-236-76 and the Office of Naval Research under Contracts N00014-76-C-0370 and N00014-80-C-0236. Most of this work was carried out at the Australian National University while H. T. Kung was there as a Visiting Fellow during May 1979.
Authors' addresses: R. P. Brent, Department of Computer Science, The Australian National University, Canberra, A.C.T. 2600, Australia; H. T. Kung, Department of Computer Science, Schenley Park, Carnegie-Mellon University, Pittsburgh, PA 15213.
© 1981 ACM 0004-5411/81/0700-0521 \$00.75

The lower bound on $A T$, where $A$ is the chip area and $T$ the time to perform $n$-bit binary multiplication on the chip, is the special case $\alpha=\frac{1}{2}$ of a more general lower bound

$$
\begin{equation*}
A T^{2 \alpha}=\Omega\left(n^{1+\alpha}\right) \tag{1.1}
\end{equation*}
$$

which is valid for all $\alpha \in[0,1]$. We establish this general result in Section 3. The case $\alpha=1$ was established independently by Abelson and Andreae [1] using a more restrictive model than ours (see also [21]). In Section 4 we sketch a design for $n$-bit multiplication that gives the upper bound

$$
\begin{equation*}
A T^{2 \alpha}=O\left(n^{1+\alpha} \log ^{1+2 \alpha} n\right) \tag{1.2}
\end{equation*}
$$

for all $\alpha \geq 0 .{ }^{1}$ Thus the exponent $1+\alpha$ of $n$ in (1.1) and (1.2) is tight for $\alpha \in[0,1]$.
In [3] we give upper bounds on $A$ and $T$ for the problem of adding $n$-bit binary numbers. From (1.1) and the results of [3] we conclude in Section 5 that binary multiplication is harder than binary addition if the complexity measure is $A T^{2 \alpha}$, for any $\alpha \geq 0$ (see also [7]).

## 2. The Computational Model and Basic Assumptions

We assume the existence of circuit elements or "gates" which compute a logical function of two inputs in constant time and occupy at least a constant minimum area. Gates are connected by wires which have constant minimum width (equivalently, the wires must be separated by at least some minimal spacing). Our measure of the cost of a design is the area rather than the number of gates required. This is an important difference between our model and earlier models of $[4,26]$ and others. For motivation and discussion of models similar to ours, see [12, 23].

To prove the results of this paper, various subsets of the following assumptions A1 through A8 are used. Comments and justification are given following the statement of each assumption.

A1. The computation is performed in a convex planar region $R$ of area $A$.
Because of heat-dissipation, packing, and testing requirements, a two-dimensional planar model is reasonable. The convexity assumption is not restrictive in the sense that almost all existing chips or useful modular designs do have convex boundaries for packaging or modularity reasons. (The convexity assumption can be removed for part of Theorem 3.1 below by using a different proof.)

A2. Wires have minimal width $\lambda>0$.
$\lambda$ is assumed constant, but in applications of our results it will of course depend on the technology. We also assume $R$ has width at least $\lambda$ in every direction.

A3. At most $\nu \geq 2$ wires can overlap (or intersect) at any point in $R$.
A chip may consist of $\nu$ layers. Wire crossings through different layers are allowed. In fact, transistors are typically formed by crossovers of wires. Since $\nu \geq 2$, the graph of wires (edges) and gates (nodes) need not be planar in a graph-theoretic sense.

A4. I/O ports each contain a $\lambda \times \lambda$ square and thus have area at least $\rho \geq \lambda^{2}$. An I/O port can be multiplexed to handle more than one input or output variable.
If $R$ is a complete chip, $\rho$ will be large compared to $\lambda^{2}$. If $R$ is only part of a chip and I/O is to other regions on the chip, $\rho$ could be of order $\lambda^{2}$. We do not require each input (or output) variable to appear in a distinct input (or output) port, as required in [23]. I/O ports may be multiplexed as they often are in practice.
A5. A bit requires minimal time $\tau>0$ to propagate along a wire or to be transmitted through an I/O port. The time for one gate computation and an arbitrary fanout of the result is included in $\tau$.

Since dimensions are limited by the minimal wire width $\lambda$ and minimal gate area, a minimal propagation time is reasonable. We do not need to assume that the propagation time increases with the length of the wire. With the (small) sizes of chips we now have or anticipate, the propagation time, which is the time needed to charge or discharge a wire, is limited by the wire capacitance rather than the velocity of light. A longer wire will generally have a larger capacitance and thus require a larger driver to maintain constant propagation time, but the driver area need not exceed a fixed percentage of the wire area and so can be ignored if $\lambda$ is increased slightly; see [15]. Although it would be reasonable to assume bounded fanout, we do not need this assumption for proving lower bounds. When proving upper bounds, we do assume bounded fanout.

A6. The times and locations at which input and output bits are available are fixed and independent of the values of the input bits.
When proving upper bounds in Section 4, we further assume that if $a_{i}$ and $a_{j}$ are any two bits in an operand such that $a_{i}$ is more significant than $a_{j}$, then $a_{i}$ is not input to (or output from) the chip before $a_{j}$, but they are allowed to be input to (or output from) the chip in parallel.
A7. Storage for one bit of information takes area at least $\beta>0$.
$\beta$ is typically several times larger than $\lambda^{2}$.
A8. Each input bit is available only once.
There is no free memory outside $R$. If the same input bit is required at different times, it must be stored within $R$, taking area at least $\beta$ (see A7).

## 3. Lower Bound Results

Let $p=p_{2 n} \cdots p_{1}$ be the $2 n$-bit product of $n$-bit integers $a=a_{n} \cdots a_{1}$ and $b=b_{n} \cdots b_{1}$.
3.1 Lower Bounds for Shifting Circuits. When $b=2^{j}$, $p$ is $a$ shifted $j$ bits to the left. Thus any multiplier circuit must also be a shifting circuit capable of performing $j$-bit shifts for all $0 \leq j \leq n-1$.

Theorem 3.1. Under assumptions A1-A6 of Section 2, any chip that is capable of performing the shifts described above must satisfy

$$
\begin{equation*}
A T^{2} \geq K_{1} n^{2} \tag{3.1}
\end{equation*}
$$

and

$$
\begin{equation*}
A T \geq K_{2} L n \tag{3.2}
\end{equation*}
$$

where

$$
\begin{align*}
& K_{1}=2\left[\frac{\lambda \tau\left(9-4 \cdot 5^{1 / 2}\right)}{\nu}\right]^{2}  \tag{3.3}\\
& K_{2}=\frac{\lambda \tau\left(9-4 \cdot 5^{1 / 2}\right)}{\pi \nu}
\end{align*}
$$

and $L$ is the perimeter of the chip.
Before proving Theorem 3.1 we need two Lemmas.
Lemma 3.1. For any convex planar figure with area $A$, perimeter $L$, diameter $D$, and chord of length $C$ perpendicular to a chord whose length is the diameter $D$,

$$
\begin{equation*}
A \geq \frac{C D}{2} \tag{3.4}
\end{equation*}
$$

and

$$
\begin{equation*}
A \geq \frac{C L}{2 \pi} \tag{3.5}
\end{equation*}
$$

Proof. The results follow from well-known inequalities for convex figures. For a proof (and a definition of "diameter," etc.) see, for example, [28].

Lemma 3.2.

$$
\min _{0 \leq r<1} \max \left(2 r,(1-r)^{2} / 8\right)=2\left(9-4 \cdot 5^{1 / 2}\right)
$$

Proof. It is easy to verify that the minimum occurs when $16 r=(1-r)^{2}$, and the only root of this equation in $[0,1]$ is $r=9-4 \cdot 5^{1 / 2}$.
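
The lemma can be checked numerically. The sketch below (function names are ours, chosen for illustration) evaluates the objective at the closed-form root $r=9-4 \cdot 5^{1/2}$ and compares it with a brute-force search over a uniform grid on $[0,1)$.

```python
import math


def lemma_3_2_objective(r: float) -> float:
    """max(2r, (1 - r)^2 / 8), the quantity minimized in Lemma 3.2."""
    return max(2.0 * r, (1.0 - r) ** 2 / 8.0)


# Closed-form minimizer: the root of 16 r = (1 - r)^2 that lies in [0, 1].
R_STAR = 9.0 - 4.0 * math.sqrt(5.0)


def grid_minimum(steps: int = 100_000) -> float:
    """Brute-force minimum of the objective over a uniform grid on [0, 1)."""
    return min(lemma_3_2_objective(i / steps) for i in range(steps))


if __name__ == "__main__":
    print("r* =", R_STAR)
    print("objective at r* =", lemma_3_2_objective(R_STAR))
    print("grid minimum    =", grid_minimum())
```

The grid search can only overestimate the true minimum, so agreement to within the grid resolution is what one should expect.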

Proof of Theorem 3.1. Consider any chip that can perform $j$-bit shifts for all $0 \leq j \leq n-1$. By assumption A1, the chip forms a convex region $R$. Let $D$ be the diameter of $R$, and $Y$ a chord of length $D$.

Let $S=\left\{p_{2 n-1}, \ldots, p_{n}\right\}$ and $M$ be the maximum number of elements of $S$ sharing or multiplexing one output port of the chip. By assumption A4, an I/O port has area at least $\rho \geq \lambda^{2}$. We represent each I/O port by an infinitesimal point on the port. On the basis of these representatives of I/O ports, we partition the chip by a chord $X$ perpendicular to $Y$ as follows. The chord $X$ divides $S$ into two subsets $S_{1}$ and $S_{2}$ such that representatives of the output ports for elements of $S_{1}$ lie on one side of $X$ and those for elements of $S_{2}$ lie on the other side of $X$. (Since representatives of I/O ports are of infinitesimal size, we can assume that by an infinitesimal perturbation from the perpendicular to $Y$, $X$ does not intersect any of them.) By "sliding" the intersection of $X$ and $Y$ along $Y$, we can arrange that

$$
\begin{equation*}
\left|S_{i}\right| \leq\left\lfloor\frac{n+M}{2}\right\rfloor \tag{3.6}
\end{equation*}
$$

for $i=1$ and 2. For notational convenience we use $d$ to denote $\lfloor(n+M) / 2\rfloor$. When the $j$-bit shift is performed, $p_{i+j}$ takes the value of $a_{i}$. For $d \leq i \leq n$, the $i$th row in Table I indicates the $p_{i}$'s that take the value of $a_{i}$ under $j$-bit shifts for all $n-i \leq j \leq n-1$. Note that in the table all the $p_{i}$'s belong to $S$, which is divided into two parts by the chord $X$.

TABLE I. The Dependence of the $p_{i}$'s on the $a_{i}$'s Under Various Shifts

| $a_{i}$ | $j$ |  |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | 0 | 1 | 2 | $\cdots$ | $n-d-1$ | $n-d$ | $\cdots$ | $n-2$ | $n-1$ |
| $a_{d}$ |  |  |  |  |  | $p_{n}$ | $\cdots$ | $p_{n+d-2}$ | $p_{n+d-1}$ |
| $a_{d+1}$ |  |  |  |  | $p_{n}$ | $\cdots$ |  | $p_{n+d-1}$ | $p_{n+d}$ |
| $\vdots$ |  |  |  |  |  |  |  |  |  |
| $a_{n-1}$ |  | $p_{n}$ | $p_{n+1}$ | $\cdots$ |  |  |  | $p_{2 n-3}$ | $p_{2 n-2}$ |
| $a_{n}$ | $p_{n}$ | $p_{n+1}$ | $p_{n+2}$ | $\cdots$ |  |  |  | $p_{2 n-2}$ | $p_{2 n-1}$ |

By (3.6), in the $i$th row of the table there are at most $d$ of the $p_{i}$'s for which the representatives of the output ports lie on the same side of $X$ as the representative of the input port for $a_{i}$. Consequently, in the $i$th row there are at least $i-d$ of the $p_{i}$'s for which the representatives of the output ports do not lie on the same side of $X$ as the representative of the input port for $a_{i}$. For all rows in the table, there are a total of at least $\sum_{i=d}^{n}(i-d) \geq(n-M)^{2} / 8$ such $p_{i}$'s. This implies that one of the $n$ columns in the table, say the $j$th column, must have at least $(n-M)^{2} /(8 n)$ such $p_{i}$'s. In other words, if

$$
I=\left\{i \mid i \in\{d, d+1, \ldots, n\} \text { and the representative of the input port for } a_{i} \text { does not lie on the same side of } X \text { as that of the output port for } p_{i+j}\right\},
$$

then

$$
|I| \geq \frac{(n-M)^{2}}{8 n}
$$
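
The counting step above can be sanity-checked by brute force. The following sketch (helper names are hypothetical) computes $d=\lfloor(n+M)/2\rfloor$, the row total $\sum_{i=d}^{n}(i-d)$, and the pigeonhole bound on the fullest column, and confirms the two inequalities for every pair $1 \leq M \leq n$ up to a small limit.

```python
def row_total(n: int, M: int) -> int:
    """Sum over rows i = d..n of (i - d), with d = floor((n + M) / 2)."""
    d = (n + M) // 2
    return sum(i - d for i in range(d, n + 1))


def fullest_column_bound(n: int, M: int) -> int:
    """Pigeonhole: some one of the n columns holds at least ceil(total / n) entries."""
    return -(-row_total(n, M) // n)


def inequalities_hold(n_max: int) -> bool:
    """Check 8 * total >= (n - M)^2 and 8 * n * column >= (n - M)^2."""
    for n in range(1, n_max + 1):
        for M in range(1, n + 1):
            gap_sq = (n - M) ** 2
            if 8 * row_total(n, M) < gap_sq:
                return False
            if 8 * n * fullest_column_bound(n, M) < gap_sq:
                return False
    return True


if __name__ == "__main__":
    print(row_total(8, 2), fullest_column_bound(8, 2), inequalities_hold(60))
```

Integer arithmetic is used throughout so that the comparison with $(n-M)^{2}/8$ involves no rounding.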

For $i \in I$, the input port for $a_{i}$ or the output port for $p_{i+j}$ may intersect the chord $X$, although their representatives do not. Define

$$
\begin{aligned}
I^{\prime}= & \left\{i \mid i \in I, \text { and the chord } X \text { intersects the input port for } a_{i}\right. \text { or the } \\
& \text { output port for } \left.p_{i+j}, \text { or both }\right\} .
\end{aligned}
$$

Then

$$
\begin{aligned}
I-I^{\prime}= & \left\{i \mid i \in\{d, d+1, \ldots, n\}, \text { and the input port for } a_{i}\right. \text { and the } \\
& \text { output port for } p_{i+j} \text { do not intersect } X \text { and they lie on different } \\
& \text { sides of } X\} .
\end{aligned}
$$

Consider the computation of the $j$-bit shift. Note that the $j$-bit shift, which maps $a_{i}$ to $p_{i+j}$ for $i=1, \ldots, n$, is an identity mapping. Hence, before the shift is complete, at least $\left|I-I^{\prime}\right|$ bits of information about $a_{i}$, $i \in I-I^{\prime}$, must cross $X$ for computing $p_{i+j}$, $i \in I-I^{\prime}$, and at least $\left|I^{\prime}\right|$ bits of information about $a_{i}$, $i \in I^{\prime}$, must be input to or output from some I/O ports intersecting $X$ for computing $p_{i+j}$, $i \in I^{\prime}$. Suppose that the chord $X$ is of length $C$. Then by assumptions A2-A4, at most $\nu C / \lambda$ wires or I/O ports cross $X$. Thus, by assumption A5, the time $T$ to perform the $j$-bit shift must satisfy the inequality

$$
\left(\frac{\nu C}{\lambda}\right)\left(\frac{T}{\tau}\right) \geq\left|I-I^{\prime}\right|+\left|I^{\prime}\right|=|I| \geq \frac{(n-M)^{2}}{8 n}
$$

or

$$
\begin{equation*}
T \geq \frac{(\lambda \tau / \nu C) n \cdot(1-r)^{2}}{8} \tag{3.7}
\end{equation*}
$$

where $r=M / n$. Since $M$ outputs come through one output port, assumption A5 gives

$$
\begin{equation*}
T \geq M \tau=\tau n r \tag{3.8}
\end{equation*}
$$

First suppose $M<n$. Then at least one wire or one I/O port crosses $X$, and assumptions A2 and A4 give

$$
\begin{equation*}
C \geq \lambda \tag{3.9}
\end{equation*}
$$

By assumption A3, $\nu \geq 2$. Combining this with (3.8) and (3.9) gives

$$
\begin{equation*}
T \geq \tau n r=\left(\frac{2 C \tau}{2 C}\right) n r \geq\left(\frac{\lambda \tau}{\nu C}\right) n \cdot 2 r \tag{3.10}
\end{equation*}
$$

From (3.7) and (3.10) it follows by Lemma 3.2 that

$$
\begin{equation*}
T \geq\left(\frac{2 K_{0}}{C}\right) n \tag{3.11}
\end{equation*}
$$

where

$$
K_{0}=\frac{\lambda \tau\left(9-4 \cdot 5^{1 / 2}\right)}{\nu}
$$

so by (3.4),

$$
\begin{equation*}
A T^{2} \geq\left(\frac{C D}{2}\right)\left(\frac{2 K_{0}}{C}\right)^{2} n^{2} \geq 2 K_{0}^{2} n^{2} \tag{3.12}
\end{equation*}
$$

since $D \geq C$. Suppose on the other hand that $M=n$. Then $r=1$. Since there is at least one output port, assumption A4 gives $A \geq \rho \geq \lambda^{2}$, so by (3.8),

$$
\begin{equation*}
A T^{2} \geq(\lambda \tau n)^{2}>2 K_{0}^{2} n^{2} \tag{3.13}
\end{equation*}
$$

Result (3.1) follows from (3.12) and (3.13).
Result (3.2) follows in a similar way. If $M<n$, combining (3.11) with (3.5) gives

$$
\begin{equation*}
A T \geq\left(\frac{C L}{2 \pi}\right)\left(\frac{2 K_{0}}{C}\right) n=K_{2} L n \tag{3.14}
\end{equation*}
$$

Suppose on the other hand that $M=n$. By assumption A2, $R$ has width at least $\lambda$ in every direction, so we can choose a chord that is of length $C \geq \lambda$ and is perpendicular to $Y$. By (3.5) and (3.8) with $r=1$, we have

$$
A T \geq\left(\frac{C L}{2 \pi}\right)(\tau n)
$$

which gives

$$
A T \geq K_{2} L n
$$

Since any circuit that performs integer multiplications must also be able to perform shifts, (3.1) and (3.2) hold for any $n$-bit multiplication chip.
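
To get a feel for the constants in Theorem 3.1, the sketch below evaluates $K_{0}$, $K_{1}$, and $K_{2}$ for given technology parameters and turns (3.1) and (3.2) into lower bounds on $T$. The `Technology` container and all function names are our own illustrative choices, and the parameter values in the usage line are arbitrary placeholders rather than data for any real process.

```python
import math
from dataclasses import dataclass

# r* = 9 - 4 * sqrt(5), the minimizer from Lemma 3.2.
R_STAR = 9.0 - 4.0 * math.sqrt(5.0)


@dataclass(frozen=True)
class Technology:
    lam: float  # minimal wire width, lambda in assumption A2
    tau: float  # minimal propagation time, tau in assumption A5
    nu: int     # maximal number of overlapping wires, nu >= 2 in assumption A3


def k0(t: Technology) -> float:
    return t.lam * t.tau * R_STAR / t.nu


def k1(t: Technology) -> float:
    """K_1 = 2 K_0^2, the constant in A T^2 >= K_1 n^2."""
    return 2.0 * k0(t) ** 2


def k2(t: Technology) -> float:
    """K_2 = K_0 / pi, the constant in A T >= K_2 L n."""
    return k0(t) / math.pi


def min_time_from_area(t: Technology, n: int, area: float) -> float:
    """Smallest T allowed by (3.1) for a chip of the given area."""
    return math.sqrt(k1(t) / area) * n


def min_time_from_perimeter(t: Technology, n: int, area: float, perimeter: float) -> float:
    """Smallest T allowed by (3.2) for a chip of the given area and perimeter."""
    return k2(t) * perimeter * n / area


if __name__ == "__main__":
    tech = Technology(lam=1.0, tau=1.0, nu=2)  # placeholder units
    print(k1(tech), k2(tech), min_time_from_area(tech, n=64, area=100.0))
```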

Result (3.2) can sometimes give useful lower bounds which are based on the I/O characteristics of a multiplication or shifting chip. If at one time the chip inputs or outputs a total of $z$ bits along its boundary, then by assumptions A3 and A4, $L \geq$ $z \lambda / \nu$, and (3.2) gives $A T \geq K_{2}(\lambda z / \nu) n$. Thus for any multiplication scheme that accepts, say $\Omega\left(n^{1 / 2}\right)$ input bits simultaneously along the chip boundary, we know immediately that $A T=\Omega\left(n^{3 / 2}\right)$ (cf. the multiplication scheme in Section 4).

Result (3.1) (with a smaller constant for $K_{1}$) could have been established by a proof parallel to that used by Thompson [23] for the discrete Fourier transform problem. In fact, using his result that relates the area of a graph to its minimum bisection width, one can derive (3.1) without the convexity assumption in A1. Our proof above represents a new approach that incorporates geometric properties of the chip boundary in the lower bound proof. We feel that the extra convexity assumption we make is not restrictive, since most existing chips do have convex boundaries for packaging reasons. Furthermore, we note that the convexity assumption is needed for establishing results such as (3.2) that relate $A T$ to the perimeter $L$. In [6], under a similar convexity assumption, tight lower bounds on the minimum area required to lay out complete binary (or $t$-ary) trees are obtained.

An interesting corollary of Theorem 3.1 is that lower bounds in (3.1) and (3.2) hold for chips that perform floating-point additions, for which shifts are needed to equalize exponents. This explains why the area-time requirements for floating-point addition are much higher than those for integer addition, as observed in practical implementations. (Charles Leiserson at CMU first pointed out to one of the authors the application of Theorem 3.1 to floating-point addition.)
3.2 A Lower Bound on the Area for Multiplier Circuits. In Theorem 3.1 we gave lower bounds on $A T^{2}$ and $A T$ for shifting circuits. Now, using different techniques, we give a lower bound on $A$ for multiplier circuits.

Theorem 3.2. Under assumptions A4 and A6-A8, any $n$-bit multiplication chip must satisfy

$$
A \geq A_{0} n
$$

where

$$
\begin{equation*}
A_{0}=\frac{5}{6}\left(\frac{\beta \rho}{\beta+\rho}\right) \tag{3.15}
\end{equation*}
$$

Let $\Phi_{N}=\{i j \mid 0 \leq i<N, 0 \leq j<N\}$ be the set of all integers which can be written as a product of two factors, each less than $N$, and let $\mu(N)=\left|\Phi_{N}\right|$ be the cardinality of $\Phi_{N}$. For example, $\Phi_{4}=\{0,1,2,3,4,6,9\}$ and $\mu(4)=7$. Before proving Theorem 3.2 we need lower bounds on $\mu(N)$ and a related function,

$$
\begin{equation*}
\delta(n)=\frac{\left\lceil\log \mu\left(2^{n}\right)+1-n\right\rceil}{n} \tag{3.16}
\end{equation*}
$$
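
For small $n$ both functions can be computed directly from the definitions. The sketch below is a brute-force illustration with names of our choosing; it assumes that $\log$ in (3.16) is the base-2 logarithm, which is our reading (it is consistent with the values listed in Table II) and is not stated explicitly at this point.

```python
from fractions import Fraction


def mu(N: int) -> int:
    """Number of distinct products i * j with 0 <= i, j < N."""
    # Multiplication is commutative, so j >= i already covers every product.
    return len({i * j for i in range(N) for j in range(i, N)})


def delta(n: int) -> Fraction:
    """delta(n) = ceil(log2(mu(2^n)) + 1 - n) / n, as in (3.16)."""
    m = mu(2 ** n)
    # (m - 1).bit_length() equals ceil(log2(m)) exactly for every integer m >= 1,
    # and adding the integer 1 - n commutes with the ceiling.
    ceil_log2 = (m - 1).bit_length()
    return Fraction(ceil_log2 + 1 - n, n)


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, mu(2 ** n), delta(n))
```

The running time grows like $4^{n}$, so this direct approach is only practical for small $n$.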

Lemma 3.3.

$$
\mu(N) \geq \sigma(N)
$$

where $\sigma(N)=\sum_{p \in P_{N-1}} p$ and $P_{N-1}$ is the set of prime numbers smaller than $N$.
Proof. The numbers $p j$ are distinct if $p \in P_{N-1}$ and $1 \leq j \leq p$. Thus the result follows from the definition of $\mu(N)$.
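
The construction in this proof is easy to test. In the sketch below (helper names are ours), `witness_products` lists the numbers $pj$ used in the proof; the check confirms that they are pairwise distinct, that there are $\sigma(N)$ of them, and that they all lie in $\Phi_{N}$.

```python
def primes_below(N: int) -> list:
    """All primes p with p < N, by trial division (adequate for small N)."""
    return [p for p in range(2, N)
            if all(p % d for d in range(2, int(p ** 0.5) + 1))]


def phi(N: int) -> set:
    """The set Phi_N of products i * j with 0 <= i, j < N."""
    return {i * j for i in range(N) for j in range(N)}


def witness_products(N: int) -> list:
    """The numbers p * j with p prime, p < N, and 1 <= j <= p."""
    return [p * j for p in primes_below(N) for j in range(1, p + 1)]


def lemma_3_3_holds(N: int) -> bool:
    w = witness_products(N)
    sigma = sum(primes_below(N))
    return len(w) == sigma and len(set(w)) == sigma and set(w) <= phi(N)


if __name__ == "__main__":
    print(all(lemma_3_3_holds(N) for N in range(2, 80)))
```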

Lemma 3.4. For all $N \geq 4$,

$$
\mu(N) \geq \frac{N^{2}}{2 \ln N}
$$

Proof. Using a slight modification of Theorem 1 and eq. (4.13) of [18], we can show that for all $N \geq 348$,

$$
\sigma(N)>\frac{N^{2}}{2 \ln N}
$$

Thus the result for $N \geq 348$ follows from Lemma 3.3. For $4 \leq N \leq 347$ the result may be verified by a straightforward computation.
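
The finite range mentioned in the proof can be checked as follows. This is a sketch with a function name of our choosing; it builds $\mu(N)$ incrementally by adding the products that involve the new factor $N-1$ at each step.

```python
import math


def mu_table(n_max: int) -> list:
    """Return a list t with t[N] = mu(N) for 0 <= N <= n_max."""
    products = set()
    table = [0]  # mu(0) = 0: there are no admissible factors
    for N in range(1, n_max + 1):
        new_factor = N - 1
        products.update(new_factor * i for i in range(N))
        table.append(len(products))
    return table


def lemma_3_4_holds(N: int, table: list) -> bool:
    return table[N] >= N * N / (2.0 * math.log(N))


if __name__ == "__main__":
    t = mu_table(347)
    print(all(lemma_3_4_holds(N, t) for N in range(4, 348)))
    # The bound fails just below the stated range, which is why N >= 4 is required.
    print(lemma_3_4_holds(3, t))
```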

TABLE II. $\mu\left(2^{n}\right)$ and Related Functions for $1 \leq n \leq 17$

| $n$ | $\mu\left(2^{n}\right)$ | $\mu\left(2^{n}\right) / \mu^{*}\left(2^{n}\right)$ | $\delta(n)$ |
| :---: | ---: | :---: | :---: |
| 1 | 2 | 0.355000 | 1 |
| 2 | 7 | 0.748125 | 1 |
| 3 | 26 | 0.932329 | 1 |
| 4 | 90 | 0.952734 | 1 |
| 5 | 340 | 1.006695 | 1 |
| 6 | 1,238 | 0.995890 | 1 |
| 7 | 4,647 | 0.997629 | 1 |
| 8 | 17,578 | 0.995092 | 1 |
| 9 | 67,592 | 1.000412 | 1 |
| 10 | 259,768 | 0.998846 | $9 / 10$ |
| 11 | 1,004,348 | 0.998392 | $10 / 11$ |
| 12 | 3,902,357 | 0.999002 | $11 / 12$ |
| 13 | 15,202,050 | 0.999089 | $12 / 13$ |
| 14 | 59,410,557 | 0.999788 | $13 / 14$ |
| 15 | 232,483,840 | 0.999637 | $14 / 15$ |
| 16 | 911,689,012 | 0.999788 | $15 / 16$ |
| 17 | 3,581,049,040 | 1.000005 | $16 / 17$ |

Lemma 3.5. If $\delta(n)$ is defined by (3.16), then for all $n \geq 1$,

$$
\delta(n) \geq \frac{5}{6} .
$$

Proof. From Lemma 3.4,

$$
\begin{equation*}
\delta(n) \geq \frac{\lceil n-\log (n \ln 2)\rceil}{n} \tag{3.17}
\end{equation*}
$$

and it is easy to verify that the right side of (3.17) is at least $\frac{5}{6}$ for all $n \geq 18$. (There is equality for $n=18$ and $n=24$.) For $1 \leq n \leq 17$, direct computation shows that $\delta(n) \geq \frac{9}{10}$.
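
The claim about the right side of (3.17) can be checked over a finite range. The sketch below (function name ours) assumes base-2 logarithms for $\log$ and uses floating-point arithmetic inside the ceiling, so it is a numerical check rather than a proof; beyond the tested range the bound follows because $n/6$ grows faster than $\log (n \ln 2)$.

```python
import math
from fractions import Fraction


def rhs_3_17(n: int) -> Fraction:
    """ceil(n - log2(n ln 2)) / n, the right side of (3.17)."""
    return Fraction(math.ceil(n - math.log2(n * math.log(2))), n)


if __name__ == "__main__":
    values = {n: rhs_3_17(n) for n in range(18, 3000)}
    print(min(values.values()))
    print([n for n, v in values.items() if v == Fraction(5, 6)])
```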

Table II gives $\mu\left(2^{n}\right), \mu\left(2^{n}\right) / \mu^{*}\left(2^{n}\right)$, and $\delta(n)$ for $n=1,2, \ldots, 17$, where

$$
\mu^{*}(N)=\frac{N^{2}}{0.71+\log \log N}
$$

is an empirical approximation to $\mu(N)$. For $5 \leq n \leq 17$, the approximation error is less than 1 percent. If this remained true for $n>17$, it would follow that $\delta(n) \geq \frac{9}{10}$, and the constant $\frac{5}{6}$ in Lemma 3.5 and Theorem 3.2 could be increased. On the basis of the empirical evidence we conjecture that

$$
\lim _{N \rightarrow \infty}\left(\frac{\mu(N) \log \log N}{N^{2}}\right)=1 \quad \text { and } \quad \delta(n) \geq \frac{9}{10}
$$

for all $n \geq 1$.
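
The ratio column of Table II can be reproduced from the definition of $\mu^{*}$. In the sketch below the values of $\mu(2^{n})$ are copied from Table II for a few $n$, and both logarithms in $\mu^{*}$ are taken to base 2; that base is our assumption, adopted because it reproduces the tabulated ratios.

```python
import math

# mu(2^n) for selected n, copied from Table II.
MU_FROM_TABLE = {1: 2, 2: 7, 4: 90, 8: 17_578, 16: 911_689_012}


def mu_star(N: int) -> float:
    """Empirical approximation N^2 / (0.71 + log log N), with base-2 logarithms."""
    return N * N / (0.71 + math.log2(math.log2(N)))


def ratio(n: int) -> float:
    """mu(2^n) / mu*(2^n) for an n listed in MU_FROM_TABLE."""
    return MU_FROM_TABLE[n] / mu_star(2 ** n)


if __name__ == "__main__":
    for n in sorted(MU_FROM_TABLE):
        print(n, round(ratio(n), 6))
```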
Proof of Theorem 3.2. If $n=1$, there is at least one output port, so $A \geq \rho$, and the result holds. Hence, suppose that $n \geq 2$.

Consider the state of the computation just before the last input bit(s) is accepted. Let $m$ be the number of input bits still to be accepted, so $1 \leq m \leq 2 n$.

It is easy to show that there are some inputs $a$ and $b$ such that the output bits $p_{2 n}, \ldots, p_{n}$ are not determined by the $2 n-m$ input bits already accepted. Thus, by assumption A6, at most $n-1$ bits, $p_{n-1}, \ldots, p_{1}$, have been output.

Suppose that $s$ bits of information are stored in $R$ at this instant. Then we must have, by assumption A8,

$$
\mu\left(2^{n}\right) \leq 2^{m+(n-1)+s}
$$

or the circuit could not produce all $\mu\left(2^{n}\right)$ possible outputs and would fail for certain inputs. Thus

$$
m+s \geq\left\lceil\log \mu\left(2^{n}\right)+1-n\right\rceil=n \delta(n)
$$

and, from Lemma 3.5,

$$
\begin{equation*}
m+s \geq \frac{5 n}{6} \tag{3.18}
\end{equation*}
$$

By assumption A7,

$$
\begin{equation*}
A \geq \beta s \tag{3.19}
\end{equation*}
$$

Since a port can accept only one bit at a time, the last $m$ bits must be input through $m$ different ports; so assumption A4 gives

$$
\begin{equation*}
A \geq \rho m \tag{3.20}
\end{equation*}
$$

The result follows easily from (3.18)-(3.20).
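
The omitted step can be spelled out as follows. By (3.19) and (3.20), $s \leq A / \beta$ and $m \leq A / \rho$, so (3.18) gives

$$
\frac{5 n}{6} \leq m+s \leq A\left(\frac{1}{\rho}+\frac{1}{\beta}\right)=A\left(\frac{\beta+\rho}{\beta \rho}\right), \quad \text { hence } \quad A \geq \frac{5}{6}\left(\frac{\beta \rho}{\beta+\rho}\right) n=A_{0} n .
$$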
3.3 General Lower Bounds for Multiplier Circuits. Theorems 3.1 and 3.2 are the extreme cases $\alpha=1$ and $\alpha=0$ of the following result.

Theorem 3.3. Under assumptions A1-A8, any n-bit multiplication chip must satisfy

$$
\begin{equation*}
\left(\frac{A}{A_{0}}\right)\left(\frac{T}{T_{0}}\right)^{2 \alpha} \geq n^{1+\alpha} \tag{3.21}
\end{equation*}
$$

for all $\alpha \in[0,1]$. Here $A_{0}$ is given by (3.15),

$$
T_{0}=\left(\frac{K_{1}}{A_{0}}\right)^{1 / 2}
$$

and $K_{1}$ is given by (3.3).
Proof. From Theorem 3.1,

$$
\left(\frac{A}{A_{0}}\right)\left(\frac{T}{T_{0}}\right)^{2} \geq n^{2}
$$

so

$$
\begin{equation*}
\left(\frac{A}{A_{0}}\right)^{\alpha}\left(\frac{T}{T_{0}}\right)^{2 \alpha} \geq n^{2 \alpha} \tag{3.22}
\end{equation*}
$$

From Theorem 3.2, since $\alpha \in[0,1]$,

$$
\begin{equation*}
\left(\frac{A}{A_{0}}\right)^{1-\alpha} \geq n^{1-\alpha} \tag{3.23}
\end{equation*}
$$

Multiplying (3.22) and (3.23) gives the result.
The following corollary of Theorem 3.3 seems worth stating separately, for $A T$ is often used as a complexity measure (see, e.g., [16]).

Corollary 3.1. Under assumptions A1-A8, any $n$-bit multiplication chip must satisfy

$$
A T \geq K_{3} n^{3 / 2}
$$

where

$$
K_{3}=A_{0} T_{0}=\left(A_{0} K_{1}\right)^{1 / 2} .
$$
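
The way (3.21) interpolates between Theorems 3.1 and 3.2 can be seen numerically. The sketch below takes $A_{0}$ and $K_{1}$ as plain inputs (the numbers in the usage line are arbitrary placeholders) and evaluates the lower bound on $A T^{2 \alpha}$; the function names are ours.

```python
import math


def t0(a0: float, k1: float) -> float:
    """T_0 = (K_1 / A_0)^(1/2), as in Theorem 3.3."""
    return math.sqrt(k1 / a0)


def k3(a0: float, k1: float) -> float:
    """K_3 = A_0 T_0 = (A_0 K_1)^(1/2), as in Corollary 3.1."""
    return a0 * t0(a0, k1)


def lower_bound(n: int, alpha: float, a0: float, k1: float) -> float:
    """(3.21) rearranged: A T^(2 alpha) >= A_0 T_0^(2 alpha) n^(1 + alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("Theorem 3.3 is stated for alpha in [0, 1]")
    return a0 * t0(a0, k1) ** (2.0 * alpha) * n ** (1.0 + alpha)


if __name__ == "__main__":
    a0, k1, n = 3.0, 0.5, 100  # placeholder constants
    print(lower_bound(n, 0.0, a0, k1))  # alpha = 0 gives A_0 n       (Theorem 3.2)
    print(lower_bound(n, 0.5, a0, k1))  # alpha = 1/2 gives K_3 n^1.5 (Corollary 3.1)
    print(lower_bound(n, 1.0, a0, k1))  # alpha = 1 gives K_1 n^2     (Theorem 3.1)
```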

## 4. Upper Bound Results for Multiplication

It is easy to design practical $n$-bit multipliers with area $A=O(n)$ and time $T=O(n)$, so

$$
\begin{equation*}
A T^{2 \alpha}=O\left(n^{1+2 \alpha}\right) \tag{4.1}
\end{equation*}
$$

For example, the "serial pipeline multipliers" typically used in the implementation of digital filters and signal processors achieve these area and time bounds (see [9, 14]). In this section we sketch the design of a multiplier with $A=O(n \log n)$ and $T=O\left(n^{1 / 2} \log n\right)$, giving

$$
\begin{equation*}
A T^{2 \alpha}=O\left(n^{1+\alpha} \log ^{1+2 \alpha} n\right), \tag{4.2}
\end{equation*}
$$

which is asymptotically better than (4.1). The design uses the Convolution Theorem to compute the product of two integers in a complex way, and consequently its implementation appears to be difficult. Nevertheless, the design is theoretically interesting because it shows that the exponent $1+\alpha$ of $n$ in Theorem 3.3 is tight. We do not know if there is any practical design having $A T^{2 \alpha}=o\left(n^{1+2 \alpha}\right)$ for $\alpha \in[0,1]$. Straightforward implementations of "fast" algorithms, for example, the Schönhage-Strassen algorithm [22] or the "3-2 reduction" algorithm [17, 25], seem to require area of order at least $n^{2}$.

In the remainder of this section we assume that
(a) $n=k^{2}$ is a perfect square, and
(b) $a_{j}=b_{j}=0$ if $j>n / 2$.
(If not, $n$ may be increased sufficiently without affecting the asymptotic results.) Let $p$ be the smallest prime of the form $n q+1$, $q \geq 1$, and $F_{p}$ the finite field of integers $\bmod p$. It is known that $\log p=O(\log n)$ (see [13, 24]) and that $F_{p}$ has an $n$th root of unity $u$ (see [2]). Let $w=u^{k}$, so $w$ is a $k$th root of unity. Note that in any circuit $n$ is fixed, so we are not concerned with the complexity of finding $p, u, w$, etc.; they will be encoded into the circuit. For facilitating arithmetic in $F_{p}$ we assume that a $2\lceil\log p\rceil$-bit approximation to $1 / p$ is encoded into the circuit.
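
For small $n$ the constants $p$, $u$, and $w$ can be found by direct search. The sketch below is only an illustration of how they might be precomputed off-chip; the function names are ours, the search is brute force, and it assumes $n \geq 2$.

```python
def is_prime(m: int) -> bool:
    """Trial-division primality test, adequate for small m."""
    if m < 2:
        return False
    i = 2
    while i * i <= m:
        if m % i == 0:
            return False
        i += 1
    return True


def smallest_prime_nq_plus_1(n: int) -> int:
    """Smallest prime of the form n * q + 1 with q >= 1."""
    q = 1
    while not is_prime(n * q + 1):
        q += 1
    return n * q + 1


def root_of_unity(n: int, p: int) -> int:
    """Some u in F_p whose multiplicative order is exactly n (needs n | p - 1)."""
    for g in range(2, p):
        u = pow(g, (p - 1) // n, p)  # u^n = 1 by Fermat's little theorem
        if all(pow(u, e, p) != 1 for e in range(1, n)):
            return u
    raise ValueError("no element of order n found")


if __name__ == "__main__":
    n, k = 16, 4
    p = smallest_prime_nq_plus_1(n)
    u = root_of_unity(n, p)
    w = pow(u, k, p)
    print(p, u, w)
```

For $n=16$ this yields $p=17$, and the first element of order 16 found by the search is $u=3$, so $w=3^{4} \bmod 17=13$.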
In steps 1-5 below, all arithmetic is done in $F_{p}$. In steps 1-3 we compute the discrete Fourier transform $a^{\prime}$ of $\left(a_{1}, \ldots, a_{n}\right)$ and $b^{\prime}$ of $\left(b_{1}, \ldots, b_{n}\right)$ over $F_{p}$; that is,

$$
a_{j+1}^{\prime}=\sum_{i=0}^{n-1} a_{i+1} u^{i j}
$$

for $j=0, \ldots, n-1$, etc. In step 4 we multiply the Fourier transforms. In step 5 we take the inverse transform, and in step 6 the final result is computed.

Step 1. Let $A, B, U$, and $W$ be $k$ by $k$ matrices with elements

$$
\begin{array}{ll}
A_{i j}=a_{(i-1) k+j}, & U_{i j}=u^{(i-1)(j-1)}, \\
B_{i j}=b_{(i-1) k+j}, & W_{i j}=w^{(i-1)(j-1)} .
\end{array}
$$

Perform $k$ by $k$ matrix multiplications to compute

$$
A^{\prime}=W A \quad \text { and } \quad B^{\prime}=W B,
$$

using a "systolic array" [11]. All computations are performed in $F_{p}$, so each processing element of the systolic array needs to perform multiplication and addition in $F_{p}$. Using a serial pipeline multiplier and a serial adder, a multiplication and addition step in $F_{p}$ requires no more than area $O(\log p)$ and time $O(\log p)$. Thus, step 1 can be done with area $O(n \log n)$ and time $O\left(n^{1 / 2} \log n\right)$.

Step 2. Compute $A^{\prime \prime}=A^{\prime} \circ U$ and $B^{\prime \prime}=B^{\prime} \circ U$, where $\circ$ denotes componentwise multiplication.

Step 3. Compute $A^{\prime \prime \prime}=A^{\prime \prime} W$ and $B^{\prime \prime \prime}=B^{\prime \prime} W$ using the same method as for step 1. It may be shown that $A^{\prime \prime \prime}$ and $B^{\prime \prime \prime}$ contain the Fourier transforms of ( $a_{1}, \ldots, a_{n}$ ) and ( $b_{1}, \ldots, b_{n}$ ); in fact, for $1 \leq i, j \leq k$,

$$
A_{i j}^{\prime \prime \prime}=a_{(j-1) k+i}^{\prime}, \quad B_{i j}^{\prime \prime \prime}=b_{(j-1) k+i}^{\prime} .
$$

Step 4. Compute $C^{\prime \prime \prime}=A^{\prime \prime \prime} \circ B^{\prime \prime \prime}$.

Step 5. Compute $C=W^{-1}\left(U^{\prime} \circ\left(C^{\prime \prime \prime} W^{-1}\right)\right)$ as in steps 1-3. Here $U_{i j}^{\prime}=u^{-(i-1)(j-1)}$. The matrix $C$ represents the inverse Fourier transform of $C^{\prime \prime \prime}$. Define the $c_{i}$'s by

$$
C_{i j}=c_{(i-1) k+j}
$$

Then by the Convolution Theorem and assumptions (a) and (b) above,

$$
c_{j}=a_{1} b_{j}+a_{2} b_{j-1}+\cdots+a_{j} b_{1} \quad \text { for } \quad 1 \leq j \leq n .
$$

Thus,

$$
\sum_{i=1}^{2 n} p_{i} 2^{i-1}=\sum_{i=1}^{n} c_{i} 2^{i-1}
$$
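
Steps 1-5 can be traced in software on a toy instance. The sketch below uses $n=16$, $k=4$, $p=17$, and $u=3$ (an element of order 16 modulo 17); these values, the 0-based indexing, and all function names are our illustrative choices, and ordinary list-based matrix products stand in for the systolic array. It relies on `pow(x, -1, p)` for modular inverses, which needs Python 3.8 or later.

```python
# Toy parameters (assumed for illustration): n = k^2 = 16, p = 1 * n + 1 = 17,
# and u = 3, which has multiplicative order 16 modulo 17.
N, K, P, U = 16, 4, 17, 3
W = pow(U, K, P)  # w = u^k, a k-th root of unity


def power_matrix(base: int) -> list:
    """K x K matrix whose (0-indexed) entry (i, j) is base^(i * j) mod P."""
    return [[pow(base, i * j, P) for j in range(K)] for i in range(K)]


def matmul(X: list, Y: list) -> list:
    return [[sum(X[i][t] * Y[t][j] for t in range(K)) % P for j in range(K)]
            for i in range(K)]


def hadamard(X: list, Y: list) -> list:
    """Componentwise product, written as a small circle in the text."""
    return [[X[i][j] * Y[i][j] % P for j in range(K)] for i in range(K)]


def transform(bits: list) -> list:
    """Steps 1-3: returns A''' with A'''[i][j] = a'_{j*K + i} (0-indexed)."""
    A = [[bits[i * K + j] for j in range(K)] for i in range(K)]
    Wm, Um = power_matrix(W), power_matrix(U)
    return matmul(hadamard(matmul(Wm, A), Um), Wm)


def convolve(a_bits: list, b_bits: list) -> list:
    """Steps 1-5: the coefficients c, read row by row from the matrix C."""
    C3 = hadamard(transform(a_bits), transform(b_bits))            # step 4
    k_inv = pow(K, -1, P)
    # The matrix inverse of W has entries k^(-1) * w^(-i*j).
    W_inv = [[x * k_inv % P for x in row] for row in power_matrix(pow(W, -1, P))]
    U_neg = power_matrix(pow(U, -1, P))                            # the matrix U'
    C = matmul(W_inv, hadamard(U_neg, matmul(C3, W_inv)))          # step 5
    return [C[i][j] for i in range(K) for j in range(K)]


def to_bits(x: int) -> list:
    return [(x >> i) & 1 for i in range(N)]


if __name__ == "__main__":
    a, b = 181, 255  # both fit in n/2 = 8 bits, as assumption (b) requires
    c = convolve(to_bits(a), to_bits(b))
    print(c, sum(c[j] << j for j in range(N)) == a * b)
```

Because each coefficient is at most $n/2 = 8 < p$, the values computed modulo $p$ coincide with the true convolution coefficients.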

Grouping the terms on the right-hand side into $k=n^{1 / 2}$ groups so that the $c_{i}$'s in each row of the matrix $C$ belong to one group, we obtain

$$
\begin{equation*}
\sum_{i=1}^{2 n} p_{i} 2^{i-1}=\sum_{i=1}^{k} R_{i} 2^{(i-1) k} \tag{4.3}
\end{equation*}
$$

where

$$
R_{i}=\sum_{j=1}^{k} c_{(i-1) k+j} 2^{j-1}
$$

Given that the $c_{i}$'s are outputs of the systolic array that computes the matrix $C$, all the $R_{i}$'s can be formed in area $O(n \log n)$ and time $O\left(n^{1 / 2} \log n\right)$, using the result of Theorem 5.1 of Section 5 regarding addition circuits. Thus the problem of computing $p_{2 n}, \ldots, p_{1}$ has been reduced to the problem of summing $k=n^{1 / 2}$ terms in the right-hand side of eq. (4.3). Hence, the final step in the computation is

Step 6. Compute $p_{2 n}, \ldots, p_{1}$ from the $R_{i}$'s. Note that each $R_{i}$ has at most $n^{1 / 2}+\log n$ bits. Using (4.3), the $p_{i}$'s can be computed, $n^{1 / 2}$ of them at a time, with an $\left(n^{1 / 2}+\log n\right)$-bit adder. This is depicted in Figure 1. At the end of the $i$th addition, the first $n^{1 / 2}$ low-order bits in the output are output as $p_{i k}, p_{i k-1}, \ldots, p_{(i-1) k+1}$, and the remaining bits in the output are fed back to the adder to be added to the arriving $R_{i+1}$ in the $(i+1)$st addition. With the result of Theorem 5.1 one can easily see that all the $p_{i}$'s can be computed in area $O(n \log n)$ and time $O\left(n^{1 / 2} \log n\right)$.

This completes our outline of the multiplier with area $A=O(n \log n)$ and time $T=O\left(n^{1 / 2} \log n\right)$, giving $A T^{2 \alpha}=O\left(n^{1+\alpha} \log ^{1+2 \alpha} n\right)$.
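
The feedback addition of step 6 can be sketched in a few lines. The code below is an illustration with names of our choosing; it obtains the coefficients $c_{i}$ by direct convolution rather than by steps 1-5, forms each $R_{i}$, emits $k$ product bits per addition, and feeds the remaining high-order bits back as a carry.

```python
def convolution_coefficients(a: int, b: int, n: int) -> list:
    """c[m] = sum of a_bit[i] * b_bit[m - i]; c[m] is c_{m+1} in the paper's indexing."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    return [sum(a_bits[i] * b_bits[m - i] for i in range(m + 1)) for m in range(n)]


def accumulate_product(c: list, k: int) -> int:
    """Step 6: add the R_i one at a time, emitting k low-order bits per addition."""
    mask = (1 << k) - 1
    carry = 0
    product = 0
    for i in range(k):
        # R for row i of C (0-indexed): sum over j of c[i*k + j] * 2^j.
        r_i = sum(c[i * k + j] << j for j in range(k))
        total = r_i + carry
        product |= (total & mask) << (i * k)  # these k bits are final
        carry = total >> k                    # fed back into the next addition
    # Under assumption (b) the product fits in n = k*k bits, so carry is 0 here.
    return product | (carry << (k * k))


if __name__ == "__main__":
    n, k = 16, 4
    a, b = 181, 255
    print(accumulate_product(convolution_coefficients(a, b, n), k) == a * b)
```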


FIG. 1. Computing the $p_{i}$'s from the $R_{i}$'s.

For $\alpha \in[0,1]$, the exponent $1+2 \alpha$ of $\log n$ can be reduced by using a more complicated design than the one outlined above, but we do not know what its minimal value is. For $\alpha>1$, a design based on the "3-2 reduction" algorithm gives $A T^{2 \alpha}=O\left(n^{2} \log ^{\delta} n\right)$ for some $\delta>0$, which is a better upper bound than (4.2).

## 5. Concluding Remarks

In [3] we demonstrate a regular layout for look-ahead adders, giving the following result.

Theorem 5.1. Let $1 \leq w \leq n$. Then all the carries in an $n$-bit addition can be computed in time proportional to $(n / w)+\log w$ and in area proportional to $w \log w+1$, and so can the addition.
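To see the trade-off that the parameter $w$ controls, the two expressions of Theorem 5.1 can be evaluated directly. The sketch below is only a calculator for those expressions with all constants of proportionality set to 1; the function name `adder_cost` is hypothetical.

```python
import math


def adder_cost(n, w):
    """Time and area expressions of Theorem 5.1, constants taken as 1.

    Logarithms are to the base 2, as elsewhere in the paper.
    """
    assert 1 <= w <= n
    time = n / w + math.log2(w)
    area = w * math.log2(w) + 1
    return time, area


if __name__ == "__main__":
    n = 1024
    for w in (1, 32, 1024):
        t, a = adder_cost(n, w)
        print(f"w={w:5d}  time~{t:8.1f}  area~{a:8.1f}")
```

At $w=1$ the expressions reduce to time proportional to $n$ and constant area, and at $w=n$ to time proportional to $\log n$ and area proportional to $n \log n$.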

Let $\left(A T^{2 \alpha}\right)_{\mathrm{M}}(n)$ and $\left(A T^{2 \alpha}\right)_{\mathrm{A}}(n)$ be the area-time complexity for $n$-bit integer multiplication and addition, respectively. Note that the serial adder gives $\left(A T^{2 \alpha}\right)_{\mathrm{A}}(n)=O\left(n^{2 \alpha}\right)$, and that for $\alpha>1,\left(A T^{2 \alpha}\right)_{\mathrm{M}}(n)=\Omega\left(n^{2}\right)$, since for multiplication, by (3.1), $A(T / \tau)^{2 \alpha}>A(T / \tau)^{2} \geq K_{1}(n / \tau)^{2}$. These observations together with Theorems 3.3 and 5.1 establish the following result.

Theorem 5.2. Under assumptions A1-A8 of Section 2,

$$
\frac{\left(A T^{2 \alpha}\right)_{\mathrm{M}}(n)}{\left(A T^{2 \alpha}\right)_{\mathrm{A}}(n)}=\left\{\begin{array}{ll}
\Omega\left(n^{1-\alpha}\right) & \text { for } 0 \leq \alpha \leq \frac{1}{2} \\
\Omega\left(\frac{n^{\alpha}}{\log ^{2 \alpha} n}\right) & \text { for } \frac{1}{2}<\alpha \leq 1 \\
\Omega\left(\frac{n}{\log ^{2 \alpha} n}\right) & \text { for } \alpha>1
\end{array}\right\}\left(=\Omega\left(n^{1 / 2}\right) \text { for all } \alpha \geq 0\right) .
$$

Thus for any $\alpha \geq 0$, the area-time product for multiplication is asymptotically larger than that for addition. We can say that multiplication is harder than addition as far as the area-time complexity is concerned.
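The three cases can be traced from the bounds already quoted. The following is a sketch of the omitted arithmetic; the choice $w=n / \log n$ in Theorem 5.1 is an assumption made here for illustration (it gives $T=O(\log n)$ and $A=O(n)$), and the multiplication bound $\Omega\left(n^{1+\alpha}\right)$ for $\alpha \in[0,1]$ is the one stated in the abstract.

$$
\begin{aligned}
0 \leq \alpha \leq \tfrac{1}{2}: &\quad \left(A T^{2 \alpha}\right)_{\mathrm{A}}(n)=O\left(n^{2 \alpha}\right), &\quad \frac{n^{1+\alpha}}{n^{2 \alpha}} &=n^{1-\alpha}, \\
\tfrac{1}{2}<\alpha \leq 1: &\quad \left(A T^{2 \alpha}\right)_{\mathrm{A}}(n)=O\left(n \log ^{2 \alpha} n\right), &\quad \frac{n^{1+\alpha}}{n \log ^{2 \alpha} n} &=\frac{n^{\alpha}}{\log ^{2 \alpha} n}, \\
\alpha>1: &\quad \left(A T^{2 \alpha}\right)_{\mathrm{A}}(n)=O\left(n \log ^{2 \alpha} n\right), &\quad \frac{n^{2}}{n \log ^{2 \alpha} n} &=\frac{n}{\log ^{2 \alpha} n} .
\end{aligned}
$$

In the first case the serial adder is used for the addition bound; in the other two the adder with $w=n / \log n$ is used, and in the last case the numerator is the $\Omega\left(n^{2}\right)$ bound for multiplication noted above.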

For binary division it is easy to deduce a lower bound of the same form as (3.21), using the method of [5], and an upper bound $A T^{2 \alpha}=O\left(n^{1+\alpha} \log ^{1+2 \alpha} n\right)$, using Newton's method.
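The text does not spell out the Newton iteration. A common choice, assumed here, is the reciprocal iteration $x \leftarrow x(2-d x)$, which uses only multiplications and subtractions and squares the relative error $1-d x$ at each step. The sketch below uses exact rational arithmetic to make that behaviour visible; the function name `newton_reciprocal`, the normalization $1 / 2 \leq d<1$, and the starting value $x=1$ are illustrative assumptions, and the sketch says nothing about how precision would be scheduled on a chip.

```python
from fractions import Fraction


def newton_reciprocal(d, n):
    """Approximate 1/d with relative error at most 2**-n.

    Assumes 1/2 <= d < 1. Returns the approximation and the number of
    Newton steps taken. Exact rationals are used so that the error
    recurrence e -> e*e can be checked without rounding effects.
    """
    d = Fraction(d)
    assert Fraction(1, 2) <= d < 1
    x = Fraction(1)                  # initial error 1 - d is at most 1/2
    steps = 0
    while 1 - d * x > Fraction(1, 2 ** n):
        x = x * (2 - d * x)          # one Newton step: two multiplications
        steps += 1
    return x, steps


def divide(a, d, n):
    """Approximate a/d as a * (1/d), reusing the reciprocal above."""
    x, _ = newton_reciprocal(d, n)
    return Fraction(a) * x
```

Because the error starts at no more than $1 / 2$ and is squared each time, about $\log n$ steps suffice for $n$-bit accuracy under these assumptions.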

In Section 3 we derived lower bounds on $A T^{2 \alpha}, \alpha \in[0,1]$, for binary multiplication. Similar lower bounds on $A T^{2}$ have been obtained for computation of the discrete Fourier transform by Thompson [23], and for matrix multiplication by Savage [20]. It seems that area-time complexity is, in general, a useful measure for establishing the complexity hierarchy of many classes of problems because it captures important attributes of a computation such as time and space, as well as communication. One should expect that more results along this line will be obtained in the near future.

## REFERENCES

1. Abelson, H., and Andreae, P. Information transfer and area-time trade-offs for VLSI multiplication. Commun. ACM 23, 1 (Jan. 1980), 20-23.
2. Bonneau, R.J. A class of finite computation structures supporting the fast Fourier transform. Tech. Rep. MAC Tech. Memo 31, Project MAC, Massachusetts Institute of Technology, Cambridge, Mass., March 1973.
3. Brent, R.P., and Kung, H.T. A regular layout for parallel adders. Tech. Rep. CMU-CS-79-131, Dep. of Computer Science, Carnegie-Mellon Univ., Pittsburgh, Pa., June 1979 (to appear in IEEE Trans. Comput.).
4. Brent, R.P. On the addition of binary numbers. IEEE Trans. Comput. C-19 (1970), 758-759.
5. Brent, R.P. The complexity of multiple-precision arithmetic. In The Complexity of Computational Problem Solving, R.S. Anderssen and R.P. Brent, Eds., University of Queensland Press, Brisbane, Australia, 1976, pp. 126-165.
6. Brent, R.P., and Kung, H.T. On the area of binary tree layouts. Inf. Proc. Letters 11 (1980), 46-48.
7. Brent, R.P., and Kung, H.T. The chip complexity of binary arithmetic. Proc. 12th Ann. ACM Symp. on Theory of Computing, Los Angeles, Calif., April 1980, pp. 190-200.
8. Garner, H.L. A survey of some recent contributions to computer arithmetic. IEEE Trans. Comput. C-25 (1976), 1277-1282.
9. Jackson, L.B., Kaiser, S.F., and McDonald, H.S. An approach to the implementation of digital filters. IEEE Trans. Audio Electroacoust. AU-16 (Sept. 1968), 413-421.
10. Kuck, D.J. The Structure of Computers and Computations. John Wiley \& Sons, New York, 1978.
11. Kung, H.T., and Leiserson, C.E. Systolic arrays (for VLSI). Sparse Matrix Proceedings 1978, Knoxville, Tenn., Society for Industrial and Applied Mathematics, 1979, pp. 256-282 (a slightly different version appears in [15, Sec. 8.3]).
12. Leiserson, C.E. Area-efficient graph layouts (for VLSI). Carnegie-Mellon Univ., Pittsburgh, Pa., Feb. 1980.
13. Linnik, U.V. On the least prime in an arithmetic progression. I. The basic theorem. Rec. Math. 15 (1944), 139-178.
14. Lyon, R.F. Two's complement pipeline multipliers. IEEE Trans. Commun. COM-24, 4 (April 1976), 418-425.
15. Mead, C.A., and Conway, L.A. Introduction to VLSI Systems. Addison-Wesley, Reading, Mass., 1980.
16. Mead, C.A., and Rem, M. Cost and performance of VLSI computing structures. IEEE J. Solid State Circuits SC-14, 2 (April 1979), 455-462.
17. Ofman, Y. On the algorithmic complexity of discrete functions. Dokl. Akad. Nauk SSSR 145 (1962), 48-51 (in Russian).
18. Rosser, J.B., and Schoenfeld, L. Approximate formulas for some functions of prime numbers. Illinois J. Math. 6 (1962), 64-94.
19. Savage, J.E. The Complexity of Computing. John Wiley \& Sons, New York, 1976.
20. Savage, J.E. Area-time tradeoffs for matrix multiplication and related problems in VLSI models. Tech. Rep. CS-50, Brown Univ., Providence, R.I., Aug. 1979.
21. Savage, J.E., and Swamy, S. Space-time tradeoffs for oblivious sorting and integer multiplication. Tech. Rep. CS-37, Brown Univ., Providence, R.I., 1978.
22. Schönhage, A., and Strassen, V. Schnelle Multiplikation grosser Zahlen. Comput. 7 (1971), 281-292.
23. Thompson, C.D. Area-time complexity for VLSI. Proc. 11th Ann. ACM Symp. on Theory of Computing, Atlanta, Ga., May 1979, pp. 81-88.
24. Wagstaff, S.S., Jr. Greatest of the least primes in arithmetic progressions having a given modulus. Math. Comp. 33 (1979), 1073-1083.
25. Wallace, C.S. A suggestion for a fast multiplier. IEEE Trans. Elec. Comput. EC-13 (1964), 14-17.
26. Winograd, S. On the time required to perform addition. J. ACM 12, 2 (April 1965), 277-285.
27. Winograd, S. On the time required to perform multiplication. J. ACM 14, 4 (Oct. 1967), 793-802.
28. Yaglom, I.M., and Boltyanski, V.G. Convex Figures. Holt, Rinehart and Winston, New York, 1961 (translated by P.J. Kelly and L.F. Walton).

RECEIVED AUGUST 1979; REVISED MARCH 1980; ACCEPTED APRIL 1980


[^0]:    ${ }^{1} \log$ denotes $\log$ to the base 2 throughout.

