## Area-time Optimal Division

$$
\text { for } T=\Omega\left((\log n)^{1+c}\right)
$$

by

## K. Mehlhorn* and F.P. Preparata**

* Fachbereich 10, Informatik, Universität des Saarlandes, 6600, Saarbrücken, West Germany.
* Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801.

This work was supported by the DFG, SFB 124, VLSI Entwurf und Parallelität, and by NSF Grant ECS-84-10902.

A family of area-time ( $A T^{2}$ ) optimal networks for the computation of the inverse of an n-bit number (referred to here as "dividers") has been proposed some time ago by Mehlhorn [1]. A network of this type can be constructed for each computation time $T$ in the range $\left[\Omega\left(\log ^{2} n\right), 0(\sqrt{ } n)\right]$. Since then considerable progress has been made in the design of faster dividers [2], culminating in the result of Beame-Cook-Hoover [3] illustrating an $O(\log n)$-time divider (i.e. , a time-optimal network in the hypothesis of bounded-fan-in components). However the Beame-Cook- Hoover network (referred to here as the BCH network) does not achieve area optimality. Thus, it is natural to ask the question of the existence of area-time optimal dividers for $T=o\left(\log ^{2} n\right)$. This paper provides an affirmature answer for $T \in\left[\Omega\left((\log n)^{1+\epsilon}\right), O\left(\log ^{2} n\right)\right]$ for any positive constant $\epsilon \leq 1$. It must be pointed out that the proposed networks are so complicated - notwithstanding their area-time-optimality - that they are exclusively of theoretical interest.

The network (see Figure 1) consists of $\lceil 1 / \epsilon\rceil+2$ cascaded modules. (For simplicity we assume that $1 / \epsilon$ is an integer.) The first $1 / \epsilon$ modules are modified dividers of the BCH type, computing a sequence of approximations of the inverse with increasing numbers of bits $l_{1} \leq l_{2} \leq \ldots \leq l_{1 / \epsilon} \leq n$.


## Fig. 1: Block structure of the divider

The last two modules are designed to complete the build-up of the result size from $l_{1 / \epsilon}$ to $n$ bits by implementing the Newton approximation method, which, at each iteration doubles the length of the result. This is carried out in two phases, respectively executed by the "fast" and "slow" approximators. The fast approximator basically consists of a single area-time optimal fastest multiplier, used to execute the initial iterations; the slow approximator is instead a cascade of affordably slow multipliers, each executing one of the final iterations. Both approximators execute $0(\log \log n)$ iteration steps. Note that the cascade of the two Newton approximators structurally coincides with Mehlhorn's divider [1].

The paper is organized as follows. In section 2 we present a more efficient implementation of the BCH method leading to a circuit referred to as "modified BCH divider". In Section 3 we discuss an alternative method for the computation of the inverse, which uses the modified BCH method as a subroutine. Finally, in

Section 4 we illustrate the combination of the previous techniques with the Newton approximation, to yield our proposed network, while Section 5 contains a few closing remarks.

## 2.. An efficient implementation of the BCH method.

In this section we first describe (a variant of) the BCH method [3] and then modify it so as to reduce its area requirement.

The original BCH method computes the inverse of an $n$-bit number $x$ by adding the first $n$ powers of $u=1-x$ and truncating the $n^{2}$-bit result to its leading $n$-bits. Each power of $u$ is computed individually and the $n$ powers are subsequently added together; so we just consider the computation of $u^{n}$. The approach consists of taking the "logarithm" of $u$, multiplying it by $n$, and then taking the "antilogarithm".

Since taking logarithms of large numbers is very hard, the method resorts to a modular representation and works as follows:

## Algorithm INVERSE1 $(x)$

Input: an $n$-bit number $x$ in the range $[1 / 2,1)$. Given are primes $p, \ldots, p_{m}$ such that

$$
\prod_{j=1}^{m} p_{j} \geq 2^{\left(n^{2}\right)} \quad \begin{array}{ll} 
& \text { Note that } \left.m \simeq n^{2} / \log n\right) \\
& (\mathrm{n} \text { is assumed to be a power of two })
\end{array}
$$

Output: an $(n+2)$-bit number $v$ in the range
$(1,2]$, so that $v \times x=1+\delta$ with $\delta<2^{-n-1}$ ( $v$ is given by the first $n+2$ $1 \quad$ bits of $\left.\sum_{i=0}^{n-1}(1-x)^{i}\right)$
(1) begin $u:=(1-x) 2^{n} ;(* u$ is an integer *)
(2) for $j, 1 \leq j \leq m$
(3) pardo $b_{j}:=$ umodp $_{j}$;
(4) compute $r_{j}$ so that $a_{j}^{r_{j}}=b_{j}$, where
$a_{j}$ is a generator of the multiplicative group of $\mathbf{Z}^{*} p_{j}$;
for $l=0$ to $\log n-1$
do $m_{j}^{(l)}:=a_{j}^{r_{j} 2^{l} \bmod \left(p_{j}-1\right)}\left(* m_{j}^{(l)}=u^{2^{l}} \bmod p_{j} *\right)$
od;
$v_{j}:=\prod_{l=0}^{\log n-1}\left(m_{j}^{(l)}+1\right) \bmod p_{j}$ $\left(* v_{j}=\sum_{l=0}^{n-1} u^{l} \quad \bmod p_{j} *\right)$
$V_{j}:=v_{j} M_{j} \bmod \left(p_{1} \ldots p_{m}\right)$
(* Chinese remaindering *)
odpar;
$v:=\sum_{j=1}^{m} V_{j} \bmod \left(p_{1} \ldots p_{m}\right) ;$
$v:=$ truncate $v$ to the first $m+2$ bits and set
point after the second bit from the left
(13) end

Let us next describe the different steps of this algorithm in more detail. In this description we will make frequent use of the following two facts.

1) One can multiply two $k$-bit integers in time $T$ and area $A$ where $A T^{2}=$ $O\left(k^{2}\right)$ and $T \in[\Omega(\log k), O(\sqrt{ } k)]$. This is the result of $[6]$.
2) One can add $m \quad k$-bit integers in time $O(\log m+\log k)$ and area $O(k m$. $\log m$ ). This can be achieved by expressing the $m$ integers in redundant representation (see, e.g. [ $4,5,6]$ ) and then adding them in a tree-like fashion. The tree has depth $O(\log m)$ and requires area $O(m \log m)$ for every bit position. Each level of the tree introduces a delay of just $O(1)$ thanks to the redundant number representation.

We are now ready to describe the circuit in more detail. We start with the parallel loop, lines 2-10.
Line 3: This line is easily executed in time $O(\log n)$ and area $O\left(n(\log n)^{2}\right)$ by expressing $u$ by its binary expansion $u=\sum_{t=0}^{\log n-1} u_{t} 2^{t}, u_{t} \in\{0,1\}$, storing the numbers $2^{t} \bmod p_{j}$ in a table and performing the required additions in redundant number representation. We leave the details of this step to the reader.

Line 4: Step 4 is realized by a table-look-up, i.e. by a loop-up in a table which gives the value of $r_{j}$ for each possible value of $b_{j}$. Since $p_{j}$ can certainly be expressed using $2 \log n$ bits this table has $n^{2}$ extries of $2 \log n$ bits each. We realize this table by $2 \log n$ H-trees each requiring area $O\left(n^{2}\right)$. Thus the total area is $O\left(n^{2} \log n\right)$ and a table-look-up takes time $O(\log n)$.

Note that the $2 \log n$ slices of the table are accessed in parallel. Also note that this circuit is pipelinable, (its period is $O(1)$ in technical terms) and therefore $O(\log n)$ look-ups can also be performed in time $O(\log n)$ using the same area. This observation is important for step 6.
Line 5,6,7: Consider a fixed $l$ first. We first compute

$$
R_{j}^{(l)}=r_{j} 2^{l} \bmod \left(p_{j}-1\right)
$$

as outlined in line 3. Note that the $l$-place shift does not have to be executed explicitly; it only determines which powers of two need to be looked-up. The computation of $R_{j}^{(l)}$ takes time $O(\log n)$ and area $O\left(n(\log n)^{2}\right)$. We perform this computation in parallel for all $l, 0 \leq l \leq \log n-1$.

The integer $m_{j}^{(l)}$ is computed from $R_{j}^{(l)}$ by look-up in a table of "antilogarithms". The $\log n$ look-ups are pipelined and take time $O(\log n)$ and area $O\left(n^{2} \log n\right)$ (refer to the description of line 3 ).

Finally note that $m_{j}^{(l)}=a_{j}^{r_{j} 2^{l} \bmod \left(p_{j}-1\right)}=b_{j}^{2^{l}} \bmod p_{j}=u^{2^{l}} \bmod p_{j}$.
Line 8: We use a tree of multipliers. This tree has depth $O(\log \log n)$ and has $\log n$ nodes. Each node contains a circuit multiplying two $2 \log n$ bit numbers and reducing the result $\bmod p_{j}$ in time $O(\log \log n)$ and area $O\left((\log n)^{2}\right)$. This shows that step 8 takes time $(\log n)$ and area $O(n)$. Both estimates are very generous.

Finally note that

$$
\prod_{l=0}^{\log n-1}\left(1+m_{j}^{(l)}\right)=\prod_{l=0}^{\log n-1}\left(1+u^{\left(2^{l}\right)}\right)=\sum_{l=0}^{n-1} u^{l}
$$

Line 9: Let $M_{j}=\left[\left(p_{1} \ldots p_{m}\right) / p_{j}\right]^{p_{j}-1}\left(\bmod p_{1} \ldots p_{m}\right)$. Then $M_{j}$ is the coefficient of $v_{j}$ required for Chinese remaindering [7]. The number $M_{j}$ is precomputed and stored in a register of length $O\left(n^{2}\right)$. We multiply $v_{j}$ by $M_{j}$ by dividing $M_{j}$ into $n^{2} / \log n$ pieces of length $O(\log n)$, performing $n^{2} / \log n$ multiplications in parallel and then summing the results. This can certainly be done in time $O(\log n)$ and area $O\left(n^{2} \log n\right)$. Also the reduction $\bmod \left(p_{1} \ldots p_{m}\right)$ can be done in that area and time.

Summary: Line (3) to (9) take time $O(\log n)$ and area $O\left(n^{2} \log n\right)$ for each $p_{j}$. Since $u^{n}$ has $n^{2}$ bits we have $m=\Theta\left(n^{2} / \log n\right)$ and each modulus is representable in $2 \log n$ bits. We realize loop (2) to (10) by having a module for each modulus and hence the loop takes time $O(\log n)$ and area $O\left(n^{4}\right)$.
Line 11: In Line 11 we add $m$ numbers of $n^{2}$ bits each. This takes time $O(\log n)$ and area $O\left(m \log m \cdot n^{2}\right)=O\left(n^{4}\right)$.

Lemma 1. There exists a circuit which computes the $n$-bit inverse of an $n$-bit number in time $O(\log n)$ and area $O\left(n^{4}\right)$.
Proof: Immediate from the discussion above.
The enormous space requirement of the method sketched above is essentially due to the fact that the powers of $u$ are computed with $\Theta\left(n^{2}\right)$ bits of precision. However, only the leading $n+\log n$ bits are truly needed for the computation of $v$. This observation is the key to the "modified" BCH method, to be described next. In the modified method we compute the powers of an $l$-bit integer $u$ in $m$ rounds (this $m$ has nothing to do with the $m$ in algorithm INVERSE1), where $m$ is a design parameter to be selected. In each round we compute the sum of $s=(l)^{1 / m}$ consecutive powers using the method of Lemma 1 . We call $s$ the depth of the method. This takes time $O(\log l)$ and area $O\left((l s)^{2}\right)$ and yields a result of $O(l s)$ bits. The space requirement results from the fact that only $l s / \log (l s)$ different prime moduli, each of length $2 \log (l s)$ bits, must be used. We truncate this result to $l+\lceil\log 12 m\rceil$ bits and start the next round. The details are as follows.

## Algorithm INVERSE2( $x$ )

Input: an $l$-bit number $x \in[1 / 2,1)$ and an integer $s=(l)^{1 / m}$.
Output: an $(l+2)$-bit number $v \in(1,2]$
begin $u_{0}:=1-x$;
for $i=0$ to $m-1$ do
begin compute $u_{i}, u_{i}^{2}, \ldots, u_{i}^{s-1}, u_{i}^{s}$;

$$
\sigma_{i}:=\sum_{j=0}^{s-1} u_{i}^{j}
$$

$u_{i+1}:=$ truncate $u_{i}^{s}$ to $q=l+\lceil\log 12 m\rceil$ bits right of point;
end;
$v:=$ truncate $\sigma_{0} \sigma_{1} \ldots \sigma_{m-1}$ to $l$ bits right of point;
end.
To prove the correctness of this algorithm we must show that $v$ gives the $(l+2)$ leading bits of $1 /(1-u)$ (of which the rightmost $l$ bits represent the fractional part). To this end, we must show that the error of the approximation is $<2^{-l}$.

For any variable $a$ used by the above algorithm let $\tilde{a}$ denote the corresponding exact value (note that, since all numbers are nonnegative, the truncation mechanism gives $\tilde{a} \geq a)$, and $\delta(a)$ the absolute error on $a$, such that $a=\tilde{a}-\delta(a)$. Recall also that $\delta(a \cdot b)<\delta(a) \tilde{b}+\delta(b) \tilde{a}$ and that $\delta(a+b)=\delta(a)+\delta(b)$. Using these relationships, we readily have

$$
\delta\left(\sigma_{0} \ldots \sigma_{m-1}\right)<\tilde{\sigma}_{0} \ldots \tilde{\sigma}_{m-1}\left(\frac{\delta\left(\sigma_{0}\right)}{\tilde{\sigma}_{0}}+\ldots+\frac{\delta\left(\sigma_{m-1}\right)}{\sigma_{m-1}}\right)
$$

Since $\tilde{\sigma}_{0} \ldots \tilde{\sigma}_{m-1}<3$ and $\tilde{\sigma}_{i}>1(i=0, \ldots, m-1)$, we obtain

$$
\delta\left(\sigma_{0} \ldots \sigma_{m-1}\right)<3\left(\delta\left(\sigma_{0}\right)+\ldots+\delta\left(\sigma_{m-1}\right)\right)
$$

From $\tilde{\sigma}_{i}=\sum_{j=0}^{s-1} \tilde{u}_{i}^{j}$ we have

$$
\delta\left(\sigma_{i}\right)=\sum_{j=0}^{s-1} \delta\left(u_{i}^{j}\right)<\sum_{j=0}^{s-1} j \tilde{u}_{i}^{j-1} \delta\left(u_{i}\right)<\delta\left(u_{i}\right) /\left(1-\tilde{u}_{i}\right)^{2}<4 \delta\left(u_{i}\right)
$$

since $\tilde{u}_{i}<1 / 2$ for $i=1, \ldots, m-1$. (Obviously $\delta\left(\omega_{0}\right)=0$.)
Thus $\delta\left(\sigma_{0} \ldots \sigma_{m-1}\right)<12 m \max \delta\left(u_{i}\right)$ and the condition

$$
12 m \max \delta\left(u_{i}\right)<2^{-l}
$$

ensures the correctness of the method. We claim that $\delta\left(u_{i}\right)<2^{-q}$ as a result of truncating to $q$ bits right of the point. Indeed $\delta\left(u_{1}\right)<2^{-q}$, trivially. For $i \geq 1$, assuming $\delta\left(u_{i}\right)<2^{-q}$, let $u_{i+1}^{*}=u_{i}^{s}$ (before the truncation). Then

$$
\delta\left(u_{i+1}^{*}\right)<s \tilde{u}_{i}^{s-1} \delta\left(u_{i}\right)<\frac{s}{2^{s-1}} \delta\left(u_{i}\right)
$$

since $u_{i}<1 / 2$. If we assume $s \geq 4$, then $\delta\left(u_{i+1}^{*}\right)<2^{-q-1}$, which shows that its $(q+1)$ bits to the right of the point are correct. Thus, the prescribed truncation yields $\delta\left(u_{i+1}\right)<2^{-q}$, and the induction step is complete. In conclusion, we choose

$$
q \geq l+\log 12 m
$$

(Note that for any choice of $s,\lceil\log 12 m\rceil<4+\log \log l$ by the definition of $m$.)
Noting that $m \cdot O(\log l)=O(\log l / \log s)$, we have:
Lemma 2. For any $2 \leq s \leq l$ there exists a circuit computing the $l$-bit inverse of an $l$-bit number in time $O\left(\log ^{2} l / \log s\right)$ and with area $O\left((l s)^{2}\right)$.
The $A T^{2}$-performance of the above circuit is given by

$$
\begin{equation*}
A T^{2}=O\left(l^{2} \log ^{4} l \cdot \frac{s^{2}}{\log ^{2} s}\right) \tag{1}
\end{equation*}
$$

By choosing the depth $s$ as $s=l^{\epsilon}(\epsilon>0)$, the resulting circuit achieves $T=$ $O((1 / \epsilon) \log l)$ and $A T^{2}=O\left(l^{2(1+\epsilon)}\right)$, i.e. it is a moderately $A T^{2}$ - suboptimal divider still achieving $T=O(\log l)$, for fixed $\epsilon$. We are aware that this result had been previously obtained by F. T. Leighton [8], presumably by a similar argument.

## 3.. An Accelerating Technique

We now describe an alternative approach to the computation of the inverse of an $l$ bit number, which capitalizes on the presence of leading zeros in the representation of the number to be inverted. This method is best described for an $l$-bit integer $x \in[1,2)$.

The number $x \in[1,2)$ can be written as

$$
x=x_{1}+2^{-l_{1}} w
$$

where $x_{1}$ is an $l_{1}$-bit number (the leading $l_{1}$ bits of $x$ ) and $w$ is an ( $l-l_{1}$ )-bit number (the trailing $l-l_{1}$-bits of $x$ ). Then $x_{1} \in[1,2)$ and $2^{-l_{1}} w \in[0,2)$. Let $v_{1}$ be an $l_{1}$-bit approximation to $x_{1}$ (i.e. $x_{1} v_{1}=1+\eta, \eta<2^{-l_{1}}$ ). Then

$$
v_{1} x=v_{1} x_{1}+v_{1} w 2^{-l_{1}}=1+\eta+v_{1} w 2^{-l_{1}}
$$

that is, $v_{1} x$ has at least $l_{1}-1$ consecutive 0 's immediately to the right of the point.
Define

$$
y=1 / v_{1} x
$$

Then, if $v_{2}$ denotes an approximation of $1 / y$, we have $v_{1} v_{2} \simeq 1 / x$. Also, if $v_{1} y=1+\eta^{\prime}$ then $v_{1} v_{2} x=1+\eta^{\prime}$, i.e. $v_{1} v_{2}$ is an approximation of precision $\eta^{\prime}$. The process can be iterated for the computation of the inverse of $y$, thereby obtaining

$$
1 / x=v_{1} v_{2} \ldots v_{k}
$$

This leads to the following algorithm:

## Algorithm INVERSE3( $x$ )

Input: an $l$-bit number $x \in[1,2)$, and an integer sequence $l_{1}<l_{2}<\ldots<l_{k}=l$. Output: an $l$-bit number $v \in(1 / 2,1]$, such that $v x=1+\epsilon, \epsilon<2^{-l}$
(1) begin $z:=x$
(2) for $i=1$ to $k$ do

$$
\begin{equation*}
\text { begin } x_{i}:=\text { leftmost } l_{i} \text { bits of } z_{i} ; \tag{3}
\end{equation*}
$$ $v_{i}:=\left(l_{i}+1\right)$-bit inverse of $x_{i} ;$ $z_{i+1}:=z_{i} v_{i}$

end;
(6)

$$
\begin{equation*}
v:=v_{1} v_{2} \ldots v_{k} \tag{5}
\end{equation*}
$$

end
The correctness of the method follows from the fact that $\delta(v)=\tilde{v}_{1} \ldots \tilde{v}_{k-1}$. $\delta\left(v_{k}\right)<2 \cdot 2^{-l-1}=2^{-l}$.

Step 4 is the crucial action in the above algorithm. To analyze its performance, we need the following result.

Lemma 3. If an $l$-bit number $x \in[1,2)$ has $h-1$ zeros immediately to the right of the point, the $l$-bit inverse of $x$ can be computed in time $T=O\left(\log \left(l / l_{1}\right)\right.$ $\log l / \log s)$ and area $A=O\left((l s)^{2}\right)$, for any $2 \leq s<l / l_{1}$. (Note that this result subsumes Lemma 2 for $l_{1}=1$.)
Proof: Indeed $u=1-x$ is a (negative) number with $l_{1}$ zeros immediately to the right of the point. This implies that $u^{\left\lceil l / l_{1}\right\rceil}<2^{-l}$, so that only the first $\left\lceil l / l_{1}\right\rceil$ consecutive powers of $u$ need to be computed.

The numbers $x_{i}, i=1, \ldots, k$, used in Step 4 meet the conditions of Lemma 3, since $1-z_{i} v_{i}$ is a (negative) number with $l_{i}$ leading zeros. Step 4 is therefore carried out by applying Algorithm INVERSE2 so that the $i$-th iteration is characterized by length $l_{i}$ and depth $s_{i}$. An implementation of this accelerating technique is therefore completely specified by the two sequences:

$$
l_{1}, l_{2}, \ldots, l_{k}
$$

and

$$
s_{1}, s_{2}, \ldots, s_{k}
$$

Before closing this section we note that Step 5 involves a multiplication of ( $l_{i}+1$ )-bit numbers at the $i$-th step; thus this operation is no more complex than the execution of the homologous Step 4, and will not be further mentioned in this discussion.

## 4. The Divider Network

We have all the premises to illustrate in detail the structure of the divider sketched in Figure 1.

The first $1 / \epsilon$ stages are collectively designed to implement the accelerating technique; each module implements the modified BCH algorithm. For $i=1,2, \ldots, 1 / \epsilon$, let $l_{i}$ be the (output) operand length, $s_{i}$ the depth, $A_{1, i}$ the area, and $T_{1, i}$ the time of the $i$-th module. We seek a solution where all such modules have identical area (i.e. $A_{1, i}=A^{\prime}$ for $i=1, \ldots, 1 / \epsilon$ ) and identical computation time, equal to the target time (i.e. , $\left.T_{1, i}=\theta\left((\log n)^{1+\epsilon}\right), i=1, \ldots, 1 / \epsilon\right)$. By the requirement of optimality, we have

$$
\begin{equation*}
\sqrt{ } A_{1, i}=\frac{n}{T_{1, i}}=\frac{n}{(\log n)^{1+\epsilon}} \tag{2}
\end{equation*}
$$

We also choose:

$$
\begin{aligned}
& l_{i}=\frac{n}{(\log n)^{1+\epsilon} s_{i}} \\
& s_{i}=2^{(\log n)^{1-\epsilon} /\left(1+(\log n)^{\epsilon}\right)^{i-1}} \quad(i=1, \ldots, 1 / \epsilon)
\end{aligned}
$$

Since the area of the $i$-th module is $\theta\left(\left(l_{i} s_{i}\right)^{2}\right)$, condition (2) is obviously verified. Next note that $\log l_{i}=O(\log n), i=1, \ldots, 1 / \epsilon$. We therefore infer from Lemma 2:

$$
\begin{aligned}
T_{1,1} & =O\left(\log l_{1} \frac{\log l_{1}}{\log s_{1}}\right) \\
& =O\left((\log n)^{2} /(\log n)^{1-\epsilon}\right) \\
& =O\left((\log n)^{1+\epsilon}\right)
\end{aligned}
$$

and for $i=2, \ldots, 1 / \epsilon$

$$
\begin{aligned}
T_{1, i} & =O\left(\log \frac{l_{i}}{l_{i-1}} \cdot \frac{\log l_{i}}{\log s_{i}}\right) \\
& =O\left(\log \frac{s_{i-1}}{s_{i}} \cdot \frac{\log l_{i}}{\log s_{i}}\right) \quad \text { since } \quad l_{i} s_{i}=l_{i-1} s_{i-1} \\
& =O\left(\log l_{i} \cdot \frac{\log s_{i-1}}{\log s_{i}}\right) \quad \text { since } \quad s_{i} \geq 1 \\
& =O\left(\log n \cdot \frac{(\log n)^{1-\epsilon}\left(1+(\log n)^{\epsilon}\right)^{i-1}}{\left(1+(\log n)^{\epsilon}\right)^{i-2}(\log n)^{1-\epsilon}}\right) \\
& =O\left((\log n)^{1+\epsilon}\right)
\end{aligned}
$$

thus verifying the objective for the computation time.
With these choices, each module of the chain is $A T^{2}$-optimal, and the global computation time is $c_{1}(1 / \epsilon)(\log n)^{1+\epsilon}=\theta\left((\log n)^{1+\epsilon}\right)$, for some constant $c_{1}$. The value of $l_{1 / \epsilon}$, the number of bits of the result, is bounded from below as follows:

$$
l_{1 / \epsilon}=\frac{n}{(\log n)^{1+\epsilon} 2^{(\log n)^{1-\epsilon /\left(1+(\log n)^{\epsilon}\right)^{1 / \epsilon-1}}}>\frac{n}{(\log n)^{1+\epsilon} \cdot 2} . . . . ~ . ~}
$$

This value $l_{1 / \epsilon}$ represents the length of the operand supplied to the cascade of the two Newton approximators, to be described next.

Starting with the downstream approximator, we recall (see figure 2) that this module is in turn the cascade of $p$ submodules ( $p$ is an integer to be defined shortly), where the $i$-th submodule has area and time $A_{3, i}$ and $T_{3, i}$, respectively, and

$$
A_{3, i}=2 A_{3, i-1}, \quad T_{3, i}=\sqrt{2} T_{3, i-1} \quad i=2,3, \ldots, p
$$

With this choice (originally proposed in [1]), the global area and time of the slow approximator are respectively proportional to the area $A_{3, p}$ and time $T_{3, p}$ of


Fig. 2: The module structure of the slow "Newton approximator".
the $p$-th (last) submodule. Since we are aiming for an $A T^{2}$-optimal network with computation time $O(T)$, we must have

$$
A_{3, p} T_{3, p}^{2}=O\left(n^{2}\right)
$$

and

$$
T_{3, p}=T
$$

This condition enables us to specify the parameter $p$. Indeed, the speed of the submodules increases as we proceed upstreams (by decreasing submodule index), and each submodule must satisfy the condition that its multiplication time is at least logarithmic in the operand length. Since the operand length is halved in going from index $i$ to index $i-1$ (due to the mechanism of the Newton approximation), and the most stringent condition occurs for $i=1$, we have

$$
\frac{T}{(\sqrt{2})^{p}} \leq \log \left(\frac{n}{2^{p}}\right)
$$

which is certainly satisfied if we select $p$ as follows:

$$
\begin{equation*}
p=2 \log \left(\frac{T}{\log n}\right)=2 \epsilon \log \log n \tag{3}
\end{equation*}
$$

Finally we turn our attention to the "fast approximator". This module receives an approximation of length $l_{1 / \epsilon}=n /(\log n)^{1+\epsilon} \cdot 2$ and delivers an approximation of length $n /(\log n)^{2 \epsilon}$. Thus, this module must execute $(1-\epsilon) \log \log n+1$ iteration steps, each of them within time $\theta(\log n)$. The module essentially consists of a "fastest" multiplier of numbers of length $n /(\log n)^{2 \epsilon}$, and can be realized with area $A_{2}$ such that $A_{2}(\log n)^{2}=\Theta\left(\left(n / 2^{p}\right)^{2}\right)$, i.e., $A_{2}=\Theta\left(\left(n /(\log n)^{1+2 \epsilon}\right)^{2}\right)$. Thus, the resulting $A T^{2}$-measure for this module is

$$
A_{2} T^{2}=\Theta\left(\left(\frac{n}{(\log n)^{1+2 \epsilon}} \log n \cdot(1-\epsilon) \log \log n\right)^{2}\right)=O\left(n^{2}\right)
$$

and the optimality condition is clearly satisfied.
Since each of the three major units of our divider - the chain of modified BCH dividers, the fast Newton approximator and the slow Newton approximator - has area $O\left(\left(n /(\log n)^{1+\epsilon}\right)^{2}\right)$ and time $O\left((\log n)^{1+\epsilon}\right)$, we conclude with the following result:

Theorem 1 For any fixed $1 \geq \epsilon>0$, the $n$-bit inverse of an $n$-bit number can be calculated with optimal $A T^{2}$-performance for any $T \in\left[\Omega\left((\log n)^{1+\epsilon}\right), O\left((\log n)^{2}\right)\right]$.

## 5. Conclusion.

We constructed an $A T^{2}$-optimal divider with computation time $(\log n)^{1+\epsilon}$ for any $\epsilon>0$. The reader may wonder whether one can choose $\epsilon$ as a decreasing function of n (tending to zero as $n$ goes to infinity). This is indeed the case if the construction is slightly modified. In the construction as it is now we use a chain of modified BCH dividers each with the same area and speed. Thus both area and time grow as $1 / \epsilon$ and hence $A T^{2}$ grows (at least) as $(1 / \epsilon)^{3}$.

If $\epsilon$ is chosen as a function of $n$, then this simple chain of equally sized modules does not suffice. Rather one has to use a chain of increasingly larger (and slower) modules as we did for the Newton iteration. Omitting the tedious and not particular illuminating details we have:
Theorem 2. There is an $A T^{2}$-optimal divider for $n$-bit integers for any $T \in$ $\left[\Omega\left(\log n \cdot 2^{(\log \log n)^{3 / 4}}\right), O\left((\log n)^{2}\right)\right]$.
Note that $2^{(\log \log n)^{3 / 4}}=O\left((\log n)^{\epsilon}\right)$ for any $\epsilon>0$.

## References

[1] K. Mehlhorn: "AT ${ }^{2}$-optimal VLSI Integer Division and Integer Square Rooting", Integration, 2, 163-167, 1984.
[2] J. Reif: "Logarithmic Depth Circuits for Algebraic Functions", 24th FOCS, 138145, 1983.
[3] P.W. Beame, S.A. Cook, H.J. Hoover: "Log Depth Circuits for Division and Related Problems", 24th FOCS, 1-6, 1984.
[4] W.K. Luk, J. Vuillemin: "Recursive Implementation of Optimal Time VLSI Integer Multipliers", VLSI 83, Trondheim, Norway, 1983.
[5] O. Spaniol: "Arithmetik in Rechenanlagen", Teubner Verlag, Stuttgart, 1976.
[6] K. Mehlhorn, F.P. Preparata: "AT2 -optimal VLSI Integer Multiplier with Minimum Computation Time", Information and Control, 58, 1-3, 137-156, 1983.
[7] D.E. Knuth: "The Art of Computer Programming", Vol. 2: Seminumerical Algorithms, Addison-Wesley, Reading, Mass. 1981, 2d ed.
[8] F.T. Leighton, personal communication, May 1985.

