# A scalable RNS Montgomery multiplier over $\mathbb{F}_{2^{m}}$ 

Jingwei Hu$^{1,\mathrm{a})}$, Wei Guo$^{1,2,\mathrm{b})}$, Jizeng Wei$^{1,\mathrm{c})}$, and Ray C.C. Cheung$^{3,\mathrm{d})}$<br>$^{1}$ School of Computer Science and Technology, Tianjin University<br>$^{2}$ Tianjin Key Laboratory of Cognitive Computing and Application<br>$^{3}$ Department of Electronic Engineering, City University of Hong Kong<br>a) jingweihu@tju.edu.cn<br>b) weiguo@tju.edu.cn<br>c) weijizeng@tju.edu.cn<br>d) r.cheung@cityu.edu.hk


#### Abstract

This paper presents a fully parallelized and scalable RNS Montgomery multiplier over binary fields. By generalizing RNS Montgomery multiplication (RNS MM) and devising a highly efficient RNS base selection, our FPGA implementation achieves high speed with acceptable circuit area and a modest critical path delay. Furthermore, the design scales easily to a variety of field sizes and field polynomials.


Keywords: RNS MM, binary field, FPGA, scalable
Classification: Integrated circuits

## References

[1] J.-C. Bajard, L.-S. Didier and P. Kornerup: IEEE Trans. Comput. 47 [7] (1998) 766.
[2] R. C. C. Cheung, S. Duquesne, J. Fan, N. Guillermin, I. Verbauwhede and G. X. Yao: Cryptographic Hardware and Embedded Systems - CHES 2011, LNCS 6917 (2011) 421.
[3] C. K. Koc and T. Acar: Designs, Codes and Cryptography 14 [1] (1998) 57.
[4] P. Kitsos, G. Theodoridis and O. Koufopavlou: Microelectronics Journal 34 [10] (2003) 975.
[5] D. Harris, R. Krishnamurthy, M. Anders, S. Mathew and S. Hsu: Proc. 17th IEEE Symp. Computer Arithmetic (ARITH) (2005) 172.
[6] D. Hankerson, A. J. Menezes and S. Vanstone: Guide to elliptic curve cryptography (Springer, 2004) 101.
[7] C. Grabbe, M. Bednara, J. Teich, J. von zur Gathen and J. Shokrollahi: Proc. 2003 International Symposium on Circuits and Systems 2 (2003) II-268.
[8] J.-P. Deschamps: Hardware implementation of finite-field arithmetic (McGraw-Hill, Inc., 2009).
[9] A. P. Fournaris and O. Koufopavlou: Integration, the VLSI Journal 41 [3] (2008) 371.

## 1 Introduction

Finite field arithmetic is used in many digital communication systems and in theoretical research. Two well-known examples are error-control coding and cryptography. The finite fields in common use are either prime fields $\left(\mathbb{F}_{p}\right)$ or binary extension fields $\left(\mathbb{F}_{2^{m}}\right)$. $\mathbb{F}_{2^{m}}$ is increasingly attractive because its "carry-free" arithmetic is well suited to hardware implementation.

In most cases, field multiplication (modular multiplication) determines overall system performance, so the efficient implementation of $\mathbb{F}_{2^{m}}$ multipliers is receiving extensive attention. The well-known MSB-first, LSB-first and Karatsuba multipliers [6] have been proposed and extended by many researchers. However, one major disadvantage of these multipliers is that they usually work in a specific, fixed field $\mathbb{F}_{2^{m}}$ and cannot easily be extended to other fields. Moreover, to achieve higher speed and lower space complexity, these multipliers adopt special polynomials as moduli, such as all-one polynomials (AOP), trinomials or pentanomials [8], which limits their scalability when an arbitrary field polynomial is required. Among the alternative modular multiplication algorithms found in the literature, Montgomery's method has been extensively analyzed, since it replaces divisions with additions, multiplications and shifts [3]. This method can also be easily adapted to different field sizes and field polynomials, putting the concept of scalable design into practice [4, 9]. In recent years, the residue number system (RNS) has enjoyed renewed scientific interest due to its ability to perform fast, parallel modular arithmetic. RNS splits large numbers into smaller ones (RNS channel reduction) by using a set of small, pairwise co-prime moduli (RNS bases). By combining these two methods, a series of RNS Montgomery multiplier (RNS MM) architectures over prime fields have been proposed [1, 2].

In this paper, the RNS MM algorithm over $\mathbb{F}_{2^{m}}$ is studied and optimized in detail. After a careful mapping onto the FPGA platform, an efficient, scalable RNS MM architecture is proposed. The main contributions of this paper are a binary field version of the RNS MM algorithm, a modified base extension algorithm that needs no parameter approximation, and the adoption of pseudo-Mersenne-like moduli that reduce the time and area complexity of the architecture. Additionally, our design is scalable to different field sizes and field polynomials.

The paper is organized as follows. Section 2 gives a brief introduction to RNS and the Montgomery algorithm. The proposed RNS MM algorithm is detailed in Section 3. Section 4 describes the hardware implementation of our design on the FPGA platform. Analyses, results and comparisons are presented in Section 5.

## 2 Preliminaries

### 2.1 Extended binary field $\mathbb{F}_{2^{m}}$

Let $\beta \in \mathbb{F}_{2^{m}}$ be a root of the irreducible polynomial $f(x)=x^{m}+f_{m-1} x^{m-1}+\ldots+f_{1} x+f_{0}$ over $\mathbb{F}_{2}$. Then the set $\left\{1, \beta, \ldots, \beta^{m-1}\right\}$ constitutes a polynomial basis of $\mathbb{F}_{2^{m}}$. With a polynomial basis, the elements of $\mathbb{F}_{2^{m}}$ can be represented as polynomials of degree at most $m-1$ in the form $\mathbb{F}_{2^{m}}=\left\{a(\beta) \mid a(\beta)=a_{m-1} \beta^{m-1}+\ldots+a_{1} \beta+a_{0}\right\}$, where the coefficients $a_{i} \in \mathbb{F}_{2}$ are the polynomial basis coordinates. The polynomial basis can also be written as the set $\left\{1, x, \ldots, x^{m-1}\right\}$ and, therefore, $\mathbb{F}_{2^{m}}=\left\{a(x) \mid a(x)=a_{m-1} x^{m-1}+\ldots+a_{1} x+a_{0}\right\}$.

Arithmetic operations in $\mathbb{F}_{2^{m}}$ are performed modulo an irreducible polynomial $f(x)$ over $\mathbb{F}_{2}$. Addition of polynomials is carried out under modulo 2 arithmetic. Therefore, the addition of two polynomials becomes the bitwise exclusive-or (XOR) of their binary representations. Subtraction is exactly the same as addition in modulo 2 arithmetic, so $1-x$ equals $1+x$.

Among the $\mathbb{F}_{2^{m}}$ arithmetic operations, multiplication is usually considered the most important, complex, and time-consuming operation. With reference to a degree m irreducible polynomial $f(x)=x^{m}+f_{m-1} x^{m-1}+\ldots+f_{1} x+f_{0}$ over $\mathbb{F}_{2}$, let $a(x)$ and $b(x)$ be two elements in the field and $c(x)$ be their product. Then field multiplication can be defined as:

$$
\begin{equation*}
c(x)=a(x) b(x) \bmod f(x) \tag{1}
\end{equation*}
$$
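Equation (1) can be illustrated with a minimal Python sketch. It assumes polynomials over $\mathbb{F}_{2}$ are stored as integer bit masks (bit $i$ holds the coefficient of $x^{i}$); the helper names `clmul`, `polymod` and `field_mul`, as well as the toy field $\mathbb{F}_{2^{4}}$ with $f(x)=x^{4}+x+1$, are illustrative choices and not part of the original design.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two GF(2)[x] polynomials stored as bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # adding a shifted copy of a is a bitwise XOR
        a <<= 1
        b >>= 1
    return result


def polymod(a: int, f: int) -> int:
    """Remainder of a modulo f in GF(2)[x] (f must be non-zero)."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)   # cancel the leading term
    return a


def field_mul(a: int, b: int, f: int) -> int:
    """c(x) = a(x) b(x) mod f(x), as in Equation (1)."""
    return polymod(clmul(a, b), f)


F = 0b10011                       # x^4 + x + 1, irreducible over GF(2)
print(field_mul(0b0110, 0b0111, F))   # (x^2 + x)(x^2 + x + 1) mod f = 1
```

The same bit-mask convention is reused in the later sketches, so that addition is always a plain XOR.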

### 2.2 Montgomery multiplication on $\mathbb{F}_{\mathbf{2}^{m}}$

In order to reduce the computational complexity of field multiplication over $\mathbb{F}_{2^{m}}$, Koc and Acar [3] proposed Montgomery multiplication on $\mathbb{F}_{2^{m}}$. The basic idea of this method is to perform field multiplication efficiently without trial divisions. Algorithm 1 [3] depicts the word-level algorithm for Montgomery multiplication on $\mathbb{F}_{2^{m}}$. Operand $X$ is partitioned into $\lceil m / w\rceil$ words of length $w$, where $X=\sum_{i=0}^{\lceil m / w\rceil-1} X_{i} x^{i w}$. Let $S_{0}$ and $P_{0}$ be the least significant words of $S$ and $P$, respectively. This algorithm is similar to the one for Montgomery multiplication of integers. The only difference is that the final subtraction step required in the integer case is not necessary in the polynomial case.

```
Algorithm 1: Montgomery Multiplication on \(\mathbb{F}_{2^{m}}\)
    Input: \(X=\sum_{i=0}^{\lceil m / w\rceil-1} X_{i} x^{i w}, Y=\sum_{i=0}^{m-1} y_{i} x^{i}, P=\sum_{i=0}^{\left\lceil\frac{m}{w}\right\rceil} P_{i} x^{i w}, P_{0}^{\prime}=P_{0}^{-1} \bmod x^{w}\)
    Output: \(S=X Y x^{-w\lceil m / w\rceil} \bmod P\)
    \(S \leftarrow 0\)
    for \(i \leftarrow 0\) to \(\left\lceil\frac{m}{w}\right\rceil-1\) do
        \(S \leftarrow S+X_{i} Y\)
        \(M \leftarrow S_{0} P_{0}^{\prime}\left(\bmod x^{w}\right)\)
        \(S \leftarrow S+M P\)
        \(S \leftarrow S / x^{w}\)
    return \(S\)
```
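The word-level loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative software model, not the authors' hardware: polynomials are integer bit masks, the words of $X$ are indexed from $0$, and the helper names (`clmul`, `polymod`, `inv_mod_xw`, `mont_mul`) and the toy parameters ($P=x^{8}+x^{4}+x^{3}+x+1$, $w=4$) are assumptions made for the example. After $\lceil m/w\rceil$ iterations the accumulated Montgomery factor is $x^{-w\lceil m/w\rceil}$.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product in GF(2)[x]; polynomials are bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a: int, f: int) -> int:
    """Remainder of a modulo f in GF(2)[x]."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def inv_mod_xw(p0: int, w: int) -> int:
    """Inverse of p0 modulo x^w; p0 must have constant term 1."""
    inv = 1
    for i in range(1, w):
        # If coefficient i of p0*inv is 1, adding x^i to inv clears it.
        if (clmul(p0, inv) >> i) & 1:
            inv |= 1 << i
    return inv


def mont_mul(x: int, y: int, p: int, w: int) -> int:
    """Word-level Montgomery product X*Y*x^(-w*s) mod P, s = ceil(m/w)."""
    m = p.bit_length() - 1
    s = -(-m // w)                       # ceil(m / w)
    mask = (1 << w) - 1
    p0_inv = inv_mod_xw(p & mask, w)
    acc = 0
    for i in range(s):
        acc ^= clmul((x >> (i * w)) & mask, y)       # S <- S + X_i * Y
        m_word = clmul(acc & mask, p0_inv) & mask    # M <- S_0 * P_0' mod x^w
        acc ^= clmul(m_word, p)                      # S <- S + M * P
        acc >>= w                                    # S <- S / x^w (exact)
    return acc                                       # no final subtraction


P, W = 0x11B, 4
S = mont_mul(0x57, 0x83, P, W)
```

Multiplying `S` by $x^{8}$ modulo `P` recovers the ordinary product of the two operands, which is a convenient way to check the sketch.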


### 2.3 Residue Number System (RNS)

RNS is defined by a set of pairwise co-prime integer constants $\mathfrak{B}=\left\{b_{1}, b_{2}, \ldots, b_{n}\right\}$ with $M_{\mathfrak{B}}=\prod_{i=1}^{n} b_{i}$, $b_{i} \in \mathfrak{B}$. Any integer $X$, $0 \leq X<M_{\mathfrak{B}}$, is uniquely represented by $\{X\}_{\mathfrak{B}}=\left\{x_{1}, x_{2}, \ldots, x_{n}\right\}$, where $x_{i}=X \bmod b_{i}=|X|_{b_{i}}$, $1 \leq i \leq n$. Each $b_{i}$, $i \in[1, n]$, is called an RNS base (RNS channel), $\{X\}_{\mathfrak{B}}$ is called an RNS number, each $x_{i}$, $i \in[1, n]$, is called an RNS element, and the process $X \rightarrow|X|_{b_{i}}$ is called channel reduction.

The arithmetic operations of RNS are similar to those on ordinary integers but offer much greater parallelism. Let $\{X\}_{\mathfrak{B}}=\left\{x_{1}, x_{2}, \ldots, x_{n}\right\}$, $\{Y\}_{\mathfrak{B}}=\left\{y_{1}, y_{2}, \ldots, y_{n}\right\}$ and $\{R\}_{\mathfrak{B}}=\left\{r_{1}, r_{2}, \ldots, r_{n}\right\}$ be the RNS representations of $X, Y, R$ respectively. The arithmetic operations of RNS are then defined as follows:

- $R=|X \pm Y|_{M_{\mathfrak{B}}} \Leftrightarrow\{R\}_{\mathfrak{B}}=\{X\}_{\mathfrak{B}} \pm\{Y\}_{\mathfrak{B}}$, where $r_{i}=\left|x_{i} \pm y_{i}\right|_{b_{i}}$
- $R=|X \cdot Y|_{M_{\mathfrak{B}}} \Leftrightarrow\{R\}_{\mathfrak{B}}=\{X\}_{\mathfrak{B}} \cdot\{Y\}_{\mathfrak{B}}$, where $r_{i}=\left|x_{i} y_{i}\right|_{b_{i}}$
- $R=|X / Y|_{M_{\mathfrak{B}}} \Leftrightarrow\{R\}_{\mathfrak{B}}=\{X\}_{\mathfrak{B}} \cdot\left\{Y^{-1}\right\}_{\mathfrak{B}}$, where $r_{i}=\left|x_{i} y_{i}^{-1}\right|_{b_{i}}$
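The channel-wise operations listed above can be tried out with a small integer example. The sketch below is only an illustration: the base $\{7, 11, 13\}$ and the function names are arbitrary assumptions, and the CRT reconstruction in `from_rns` is included solely to check the results (it relies on `pow(x, -1, m)`, available from Python 3.8).

```python
from math import prod

BASE = (7, 11, 13)          # pairwise co-prime moduli (illustrative choice)
M = prod(BASE)              # dynamic range M_B = 1001


def to_rns(x: int) -> tuple:
    """Channel reduction: X -> (|X|_b1, ..., |X|_bn)."""
    return tuple(x % b for b in BASE)


def rns_add(xs: tuple, ys: tuple) -> tuple:
    """Channel-wise addition; every channel is independent."""
    return tuple((x + y) % b for x, y, b in zip(xs, ys, BASE))


def rns_mul(xs: tuple, ys: tuple) -> tuple:
    """Channel-wise multiplication; every channel is independent."""
    return tuple((x * y) % b for x, y, b in zip(xs, ys, BASE))


def from_rns(residues: tuple) -> int:
    """CRT reconstruction, used here only to verify the results."""
    total = 0
    for r, b in zip(residues, BASE):
        big_b = M // b
        total += r * pow(big_b, -1, b) * big_b
    return total % M


x, y = to_rns(123), to_rns(45)
print(from_rns(rns_add(x, y)), from_rns(rns_mul(x, y)))   # 168 530
```

Each channel works on operands no larger than its own modulus, which is what makes the scheme attractive for parallel hardware.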

With these operations, field arithmetic becomes easier because the operands in each RNS channel are shorter and the channels are independent of one another. RNS is therefore well suited to parallel hardware architectures.

Likewise, RNS can be applied to binary fields because the Chinese Remainder Theorem (CRT) still holds in the binary field. That is, RNS can split a very large element of $\mathbb{F}_{2^{m}}$ into several much smaller elements for the sake of efficient computation.

On top of that, Montgomery multiplication on $\mathbb{F}_{2^{m}}$ can benefit from RNS as well, primarily because the operations in each step of Algorithm 1 can be transformed into RNS form, which opens the way to further improving the efficiency of this algorithm.

In the next section, RNS Montgomery multiplication is presented along these lines.

## 3 Proposed RNS Montgomery multiplication on $\mathbb{F}_{2^{m}}$

### 3.1 RNS Montgomery reduction

Let $X, Y \in \mathbb{F}_{2^{m}}$, and let $\mathfrak{B}=\left\{b_{1}, b_{2}, \ldots, b_{n}\right\}$ and $\mathfrak{C}=\left\{c_{1}, c_{2}, \ldots, c_{n}\right\}$, $n<m$, be two distinct RNS bases with $M_{\mathfrak{B}}=\prod_{i=1}^{n} b_{i}$, $b_{i} \in \mathfrak{B}$, and $M_{\mathfrak{C}}=\prod_{i=1}^{n} c_{i}$, $c_{i} \in \mathfrak{C}$. RNS Montgomery multiplication aims at computing $S=\left|X Y M_{\mathfrak{B}}^{-1}\right|_{p}$ in both $\mathfrak{B}$ and $\mathfrak{C}$ (namely, $\{S\}_{\mathfrak{B}}$ and $\{S\}_{\mathfrak{C}}$), where $p$ is the modulus. This is illustrated in Table I in comparison with the original Montgomery method. The Base Extension shown in this table transforms a representation in RNS base $\mathfrak{B}$ into one in RNS base $\mathfrak{C}$, or vice versa.

Table I. Derivation of RNS Montgomery Multiplication on $\mathbb{F}_{2^{m}}$


Notice that two RNS bases ($\mathfrak{B}$ and $\mathfrak{C}$) are necessary for the correctness of this algorithm. If either of them is left out, the result in step 4 will always be zero. For instance, assume base $\mathfrak{C}$ is removed, so that the base extension in step 3 is discarded. The computation then has to be done in base $\mathfrak{B}$ instead. Since $\{Q\}_{\mathfrak{B}} \cdot\{p\}_{\mathfrak{B}}=\{T\}_{\mathfrak{B}} \cdot\left\{\left|p^{-1}\right|_{M_{\mathfrak{B}}}\right\}_{\mathfrak{B}} \cdot\{p\}_{\mathfrak{B}}=\{T\}_{\mathfrak{B}}$ and addition in characteristic 2 is an XOR, we obtain $\{S\}_{\mathfrak{B}}=\left(\{T\}_{\mathfrak{B}}+\{Q\}_{\mathfrak{B}} \cdot\{p\}_{\mathfrak{B}}\right) \cdot\left\{\left|M_{\mathfrak{B}}{ }^{-1}\right|_{M_{\mathfrak{C}}}\right\}_{\mathfrak{B}}=\left(\{T\}_{\mathfrak{B}}+\{T\}_{\mathfrak{B}}\right) \cdot\left\{\left|M_{\mathfrak{B}}{ }^{-1}\right|_{M_{\mathfrak{C}}}\right\}_{\mathfrak{B}}=0$, which is not the desired result.

The RNS Montgomery multiplication on $\mathbb{F}_{2^{m}}$ is summarized in Algorithm 2. To keep the description simple, the technical details of base extension are deferred to the next subsection. The algorithm offers high parallelism, with each RNS channel working independently. Scalability is obtained by increasing the number of RNS channels, regardless of field size or irreducible polynomial. Additionally, the pre-computation in step 1 is independent of the operands $X$ and $Y$. As a result, all the values in this step can be computed beforehand and stored in memory, which further simplifies the algorithm.

```
Algorithm 2: RNS Montgomery Multiplication
    Input: RNS bases \(\mathfrak{B}, \mathfrak{C}\), multiplication operands \(\{X\}_{\mathfrak{B}},\{X\}_{\mathfrak{C}},\{Y\}_{\mathfrak{B}},\{Y\}_{\mathfrak{C}}\) being
                RNS representations of \(X\) and \(Y\) \(\left(X, Y<x^{k+1}\right)\), and modulus \(p\)
    Output: \(\{S\}_{\mathfrak{B}},\{S\}_{\mathfrak{C}}\) such that \(|S|_{p}=\left|X Y M_{\mathfrak{B}}^{-1}\right|_{p}\)
    Precompute \(\left\{\left|p^{-1}\right|_{M_{\mathfrak{B}}}\right\}_{\mathfrak{B}},\left\{\left|M_{\mathfrak{B}}{ }^{-1}\right|_{M_{\mathfrak{C}}}\right\}_{\mathfrak{C}}\) and \(\{p\}_{\mathfrak{C}}\), where
    \(M_{\mathfrak{B}}=\prod_{i=1}^{n} b_{i}, b_{i} \in \mathfrak{B}, M_{\mathfrak{C}}=\prod_{i=1}^{n} c_{i}, c_{i} \in \mathfrak{C}\)
    \(\{T\}_{\mathfrak{B}} \leftarrow\{X\}_{\mathfrak{B}} \cdot\{Y\}_{\mathfrak{B}}\)
    \(\{Q\}_{\mathfrak{B}} \leftarrow\{T\}_{\mathfrak{B}} \cdot\left\{\left|p^{-1}\right|_{M_{\mathfrak{B}}}\right\}_{\mathfrak{B}}\)
    \(\{Q\}_{\mathfrak{B}} \xrightarrow{\text { BaseExtension }}\{Q\}_{\mathfrak{C}}\)
    \(\{T\}_{\mathfrak{C}} \leftarrow\{X\}_{\mathfrak{C}} \cdot\{Y\}_{\mathfrak{C}}\)
    \(\{S\}_{\mathfrak{C}} \leftarrow\left(\{T\}_{\mathfrak{C}}+\{Q\}_{\mathfrak{C}} \cdot\{p\}_{\mathfrak{C}}\right) \cdot\left\{\left|M_{\mathfrak{B}}{ }^{-1}\right|_{M_{\mathfrak{C}}}\right\}_{\mathfrak{C}}\)
    \(\{S\}_{\mathfrak{C}} \xrightarrow{\text { BaseExtension }}\{S\}_{\mathfrak{B}}\)
    return \(\{S\}_{\mathfrak{B}},\{S\}_{\mathfrak{C}}\)
```
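The data flow of Algorithm 2 can be modelled in software as below. This is a hedged sketch rather than the authors' implementation: polynomials are integer bit masks, the precomputed constants are recomputed on the fly for brevity, and all names (`clmul`, `polydivmod`, `polyinv`, `base_extend`, `rns_mont_mul`) as well as the toy parameters ($p=x^{8}+x^{4}+x^{3}+x+1$ and four degree-5 irreducible moduli) are assumptions chosen only so that the conditions $\gcd(M_{\mathfrak{B}}, p)=1$, $\gcd(M_{\mathfrak{B}}, M_{\mathfrak{C}})=1$ and $\deg M_{\mathfrak{B}}, \deg M_{\mathfrak{C}}>\deg X, \deg Y$ hold.

```python
from functools import reduce


def clmul(a, b):
    """Carry-less product in GF(2)[x]; polynomials are bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polydivmod(a, b):
    """Quotient and remainder of a / b in GF(2)[x] (b != 0)."""
    q, deg_b = 0, b.bit_length() - 1
    while a.bit_length() - 1 >= deg_b:
        shift = a.bit_length() - 1 - deg_b
        q ^= 1 << shift
        a ^= b << shift
    return q, a


def polymod(a, b):
    return polydivmod(a, b)[1]


def polyinv(a, m):
    """Inverse of a modulo m (extended Euclid); requires gcd(a, m) = 1."""
    r0, r1, t0, t1 = m, polymod(a, m), 0, 1
    while r1:
        q, r = polydivmod(r0, r1)
        r0, r1, t0, t1 = r1, r, t1, t0 ^ clmul(q, t1)
    if r0 != 1:
        raise ValueError("operand is not invertible")
    return polymod(t0, m)


def product(polys):
    return reduce(clmul, polys, 1)


def base_extend(residues, src, dst):
    """Base extension from src to dst; no correction term is needed."""
    m_src = product(src)
    out = [0] * len(dst)
    for t_i, b_i in zip(residues, src):
        big_b = polydivmod(m_src, b_i)[0]                    # B_i = M / b_i
        xi = polymod(clmul(t_i, polyinv(big_b, b_i)), b_i)   # xi_i
        for k, c_k in enumerate(dst):
            out[k] ^= clmul(xi, polymod(big_b, c_k))
    return [polymod(z, c_k) for z, c_k in zip(out, dst)]


def rns_mont_mul(x_b, x_c, y_b, y_c, p, base_b, base_c):
    """Returns ({S}_B, {S}_C) with S = X*Y*M_B^(-1) mod p."""
    m_b = product(base_b)
    t_b = [polymod(clmul(x, y), b) for x, y, b in zip(x_b, y_b, base_b)]
    q_b = [polymod(clmul(t, polyinv(p, b)), b) for t, b in zip(t_b, base_b)]
    q_c = base_extend(q_b, base_b, base_c)
    t_c = [polymod(clmul(x, y), c) for x, y, c in zip(x_c, y_c, base_c)]
    s_c = [polymod(clmul(t ^ clmul(q, polymod(p, c)), polyinv(m_b, c)), c)
           for t, q, c in zip(t_c, q_c, base_c)]
    s_b = base_extend(s_c, base_c, base_b)
    return s_b, s_c


P = 0x11B                                  # x^8 + x^4 + x^3 + x + 1
BASE_B = [0b100101, 0b101001]              # x^5+x^2+1, x^5+x^3+1
BASE_C = [0b101111, 0b111101]              # two further degree-5 irreducibles
X, Y = 0x57, 0x83


def to_rns(v, base):
    return [polymod(v, b) for b in base]


S_B, S_C = rns_mont_mul(to_rns(X, BASE_B), to_rns(X, BASE_C),
                        to_rns(Y, BASE_B), to_rns(Y, BASE_C),
                        P, BASE_B, BASE_C)
```

Every list comprehension in `rns_mont_mul` touches one channel at a time, which mirrors the channel-level parallelism that the hardware exploits; only `base_extend` mixes data across channels.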

Nevertheless, the inputs of Algorithm 2 must be bounded to ensure the validity of this method. If the two base extension steps (step 3 and step 5 in Algorithm 2) are error-free, we can specify the conditions that $M_{\mathfrak{B}}$ and $M_{\mathfrak{C}}$ should satisfy for a given $p$. The conditions $\operatorname{gcd}\left(M_{\mathfrak{B}}, p\right)=1$ and $\operatorname{gcd}\left(M_{\mathfrak{B}}, M_{\mathfrak{C}}\right)=1$ are sufficient for the existence of $\left|p^{-1}\right|_{M_{\mathfrak{B}}}$ and $\left|M_{\mathfrak{B}}{ }^{-1}\right|_{M_{\mathfrak{C}}}$, respectively. $M_{\mathfrak{B}} \geq x^{k+1}$ is also sufficient for $S<x^{k+1}$ when $X, Y<x^{k+1}$, where an inequality such as $X<x^{k+1}$ means $\operatorname{deg} X \leq k$. Indeed,

$$
\begin{align*}
S & =\frac{X Y+\left|X Y \times p^{-1}\right|_{M_{\mathfrak{B}}} p}{M_{\mathfrak{B}}}<\frac{X Y+M_{\mathfrak{B}} p}{M_{\mathfrak{B}}}  \tag{2}\\
& =\frac{X Y}{M_{\mathfrak{B}}}+p<\max \left\{x^{k+1}, p\right\}=x^{k+1}
\end{align*}
$$

This equation also shows that $M_{\mathfrak{C}} \geq x^{k+1}$ is sufficient for $S<M_{\mathfrak{C}}$. In summary, the following four conditions are sufficient for the correctness of this algorithm:

$$
\begin{equation*}
\operatorname{gcd}\left(M_{\mathfrak{B}}, p\right)=1, \operatorname{gcd}\left(M_{\mathfrak{B}}, M_{\mathfrak{C}}\right)=1, M_{\mathfrak{B}} \geq x^{k+1}, M_{\mathfrak{C}} \geq x^{k+1} \tag{3}
\end{equation*}
$$

### 3.2 Base extension

The operation that transforms a representation in one RNS base into another base is called Base Extension (BE). It is needed because one cannot obtain the correct value of $S$ in step 4 of Algorithm 2 unless Base Extension is done in step 3 (otherwise one always obtains $S=0$). The final base extension in step 5 of Algorithm 2 converts the value $S$ back into RNS form in $\mathfrak{B}$. To compute $\{T\}_{\mathfrak{C}}=\left\{t_{1}^{\prime}, t_{2}^{\prime}, \ldots, t_{n}^{\prime}\right\}$ from $\{T\}_{\mathfrak{B}}=\left\{t_{1}, t_{2}, \ldots, t_{n}\right\}$, we use the Chinese Remainder Theorem (CRT) to obtain the following equation:

$$
\begin{equation*}
T=\left|\sum_{i=1}^{n}\left|t_{i} B_{i}^{-1}\right|_{b_{i}} B_{i}\right|_{M_{\mathfrak{B}}}=\left|\sum_{i=1}^{n} \xi_{i} B_{i}\right|_{M_{\mathfrak{B}}}=\sum_{i=1}^{n} \xi_{i} B_{i}-\lambda M_{\mathfrak{B}}=\sum_{i=1}^{n} \xi_{i} B_{i} \tag{4}
\end{equation*}
$$

where $\xi_{i}=\left|t_{i} B_{i}^{-1}\right|_{b_{i}}$, $1 \leq i \leq n$, and $B_{i}=M_{\mathfrak{B}} / b_{i}$. $C_{i}$ is defined similarly for base $\mathfrak{C}$. Notice that the approximation parameter is $\lambda=0$ in the binary field: each term $\xi_{i} B_{i}$ has degree below $\operatorname{deg} b_{i}+\operatorname{deg} B_{i}=\operatorname{deg} M_{\mathfrak{B}}$, and addition of polynomials over $\mathbb{F}_{2}$ never raises the degree, so the sum is already reduced modulo $M_{\mathfrak{B}}$. This makes the Base Extension transformation (Algorithm 3) considerably easier, because no additional logic is required to evaluate the approximation and only multiplication and addition on $\mathbb{F}_{2^{m}}$ are needed to perform the base extension procedure. Then $\{T\}_{\mathfrak{C}}=\left\{t_{1}^{\prime}, t_{2}^{\prime}, \ldots, t_{n}^{\prime}\right\}$ can be computed as follows:

$$
\begin{equation*}
t_{j}^{\prime}=\left|\sum_{i=1}^{n} \xi_{i} B_{i}\right|_{c_{j}}=\left|\sum_{i=1}^{n} \xi_{i}\left|B_{i}\right|_{c_{j}}\right|_{c_{j}} \tag{5}
\end{equation*}
$$

$\left|B_{i}\right|_{c_{j}}(1 \leq i, j \leq n)$ can be precomputed once $\mathfrak{B}$ and $\mathfrak{C}$ are fixed.

```
Algorithm 3: Base extension algorithm for \(k\)-th element of \(\{T\}_{\mathfrak{C}}\)
    Input: \(|T|_{b_{i}}\), for \(i \in\{1, \ldots, n\}\)
    Output: \(|T|_{c_{k}}\)
    Precompute \(\left|B_{i}^{-1}\right|_{b_{i}},\left|B_{i}\right|_{c_{k}}\)
    \(z \leftarrow 0\)
    for \(i \leftarrow 1\) to \(n\) do
        \(\xi_{i} \leftarrow\left|t_{i} B_{i}^{-1}\right|_{b_{i}}\)
        \(z \leftarrow z+\xi_{i}\left|B_{i}\right|_{c_{k}}\)
    return \(|z|_{c_{k}}\)
```
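The following Python sketch models Algorithm 3 and checks numerically that the CRT sum of Equation (4) needs no correction term ($\lambda=0$). It is an illustration under stated assumptions: polynomials are integer bit masks, the two-channel base, the target modulus `C_K` and the value `T` are arbitrary toy choices, and the helper names are hypothetical.

```python
from functools import reduce


def clmul(a, b):
    """Carry-less product in GF(2)[x]; polynomials are bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polydivmod(a, b):
    """Quotient and remainder of a / b in GF(2)[x] (b != 0)."""
    q, deg_b = 0, b.bit_length() - 1
    while a.bit_length() - 1 >= deg_b:
        shift = a.bit_length() - 1 - deg_b
        q ^= 1 << shift
        a ^= b << shift
    return q, a


def polyinv(a, m):
    """Inverse of a modulo m (extended Euclid); requires gcd(a, m) = 1."""
    r0, r1, t0, t1 = m, polydivmod(a, m)[1], 0, 1
    while r1:
        q, r = polydivmod(r0, r1)
        r0, r1, t0, t1 = r1, r, t1, t0 ^ clmul(q, t1)
    if r0 != 1:
        raise ValueError("operand is not invertible")
    return polydivmod(t0, m)[1]


def crt_terms(residues, base):
    """Pairs (xi_i, B_i) with xi_i = |t_i * B_i^(-1)|_{b_i}, B_i = M / b_i."""
    m_base = reduce(clmul, base, 1)
    terms = []
    for t_i, b_i in zip(residues, base):
        big_b = polydivmod(m_base, b_i)[0]
        xi = polydivmod(clmul(t_i, polyinv(big_b, b_i)), b_i)[1]
        terms.append((xi, big_b))
    return terms


def base_extend_one(residues, base, c_k):
    """Algorithm 3: residue of T modulo c_k from its residues in base."""
    z = 0
    for xi, big_b in crt_terms(residues, base):
        z ^= clmul(xi, polydivmod(big_b, c_k)[1])    # z <- z + xi_i |B_i|_ck
    return polydivmod(z, c_k)[1]


BASE_B = [0b100101, 0b101001]      # x^5+x^2+1 and x^5+x^3+1 (co-prime)
C_K = 0b101111                     # x^5+x^3+x^2+x+1
T = 0b1011001101                   # any polynomial with deg T < deg M_B = 10
RES = [polydivmod(T, b)[1] for b in BASE_B]

# Equation (4): the plain XOR-sum of xi_i * B_i already equals T.
RECOMBINED = reduce(lambda acc, t: acc ^ clmul(t[0], t[1]),
                    crt_terms(RES, BASE_B), 0)
T_CK = base_extend_one(RES, BASE_B, C_K)
```

In the integer case the analogous sum can exceed $M_{\mathfrak{B}}$ and a multiple $\lambda M_{\mathfrak{B}}$ has to be removed; here the XOR-sum stays below $\operatorname{deg} M_{\mathfrak{B}}$ by construction.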


### 3.3 Base selection

Before RNS Montgomery multiplication (Algorithm 2) can be performed, operands given in the ordinary binary field representation must be converted into RNS elements. This initial conversion is called channel reduction [2]. To speed up channel reduction, it is essential to select appropriate RNS bases. In this paper we propose pseudo-Mersenne-like polynomials $b_{i}=x^{w}+\xi(i)$ $\left(w<m, \xi(i)<x^{w / 2}\right)$ as RNS bases. Consequently, for an operand $X \in \mathbb{F}\left(2^{m}\right)$ and RNS base $\mathfrak{B}=\left\{b_{1}, b_{2}, \ldots, b_{l}\right\}$, the conversion $X \rightarrow\{X\}_{\mathfrak{B}}$ can be written as:

$$
\begin{align*}
x_{i} & =|X|_{b_{i}}=\left(X_{H} x^{w}+X_{L}\right) \bmod \left(x^{w}+\xi(i)\right) \\
& =\left(X_{H}\left(x^{w}+\xi(i)\right)+X_{H} \xi(i)+X_{L}\right) \bmod \left(x^{w}+\xi(i)\right)  \tag{6}\\
& =\left(X_{H} \xi(i)+X_{L}\right) \bmod \left(x^{w}+\xi(i)\right), \quad i \in\{1, \ldots, l\}, x_{i} \in\{X\}_{\mathfrak{B}}
\end{align*}
$$

where $X_{L}$ denotes the least significant $w$ bits of $X$ and $X_{H}$ denotes the most significant $m-w$ bits of $X$. The conversion is particularly efficient when $\xi(i)<x^{w / 2}$: one application of Equation (6) reduces the degree bound $m$ of $X$ to the bound $m-w / 2$ of $X_{H} \xi(i)+X_{L}$. After $\left\lceil\frac{2 m}{w}\right\rceil-2$ iterations of Equation (6), the RNS representation of $X$ is obtained. This base selection also makes the modular multiplication in each RNS channel efficient, as presented in the next subsection.
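The folding step of Equation (6) can be sketched as a short Python loop. The sketch assumes polynomials are stored as integer bit masks and simply repeats the fold until the value fits in $w$ bits; the modulus $x^{33}+x^{6}+x^{3}+x+1$ and the operand are hypothetical test values, and `polymod` is a generic long-division reference used only for checking.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product in GF(2)[x]; polynomials are bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a: int, f: int) -> int:
    """Generic remainder in GF(2)[x], used as a reference."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def channel_reduce(x: int, w: int, xi: int) -> int:
    """Reduce x modulo b = x^w + xi by repeated folding (Equation (6)).

    Each pass replaces X_H * x^w + X_L by X_H * xi + X_L, which lowers the
    degree as long as deg(xi) < w.
    """
    mask = (1 << w) - 1
    while x >> w:                       # a non-zero high part remains
        x = clmul(x >> w, xi) ^ (x & mask)
    return x


W = 33
XI = 0b1001011                          # x^6 + x^3 + x + 1 (assumed example)
X = (1 << 232) | 0x123456789ABCDEF      # some element of F(2^233)
RESIDUE = channel_reduce(X, W, XI)
```

Because $\xi(i)$ has low degree and few terms, the product `clmul(x >> w, xi)` reduces to a handful of shifted XORs in hardware.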

### 3.4 Fast field multiplication on $\mathbb{F}\left(\mathbf{2}^{\boldsymbol{w}}\right)$

In this subsection, we address the efficient implementation of $\{R\}_{\mathfrak{B}}=\{X\}_{\mathfrak{B}} \cdot\{Y\}_{\mathfrak{B}}$, where $\{X\}_{\mathfrak{B}}=\left\{x_{1}, x_{2}, \ldots, x_{n}\right\}$, $\{Y\}_{\mathfrak{B}}=\left\{y_{1}, y_{2}, \ldots, y_{n}\right\}$, $\{R\}_{\mathfrak{B}}=\left\{r_{1}, r_{2}, \ldots, r_{n}\right\}$ and $r_{i}=\left|x_{i} y_{i}\right|_{b_{i}}$ $(1 \leq i \leq n)$, which is the fundamental operation of the proposed RNS Montgomery multiplication algorithm (Algorithm 2).

Notice that the modular multiplication $r_{i}=\left|x_{i} y_{i}\right|_{b_{i}}$ on $\mathbb{F}\left(2^{w}\right)$ is critical because every operation in the RNS context is performed on $\mathbb{F}\left(2^{w}\right)$. The base selection proposed in this paper significantly reduces the computational cost of this operation by exploiting the deliberately chosen RNS bases.

Modular multiplication on $\mathbb{F}\left(2^{w}\right)$ involves two steps: polynomial multiplication and reduction modulo the field polynomial. In the polynomial multiplication stage, multiplying two operands in polynomial basis consists of shift and addition operations, which are easily implemented with XOR logic. Suppose $y_{i}=\sum_{j=0}^{w-1}\left(y_{i}\right)_{j} x^{j}$, then

$$
\begin{equation*}
r_{i}=x_{i} \cdot y_{i}=\sum_{j=0}^{w-1} x_{i}\left(y_{i}\right)_{j} x^{j}=\sum_{j=0}^{w-1}\left(x_{i} \cdot\left(y_{i}\right)_{j}\right) \ll j \tag{7}
\end{equation*}
$$

As far as the reduction stage is concerned, the pseudo-Mersenne-like polynomials used as RNS moduli in Section 3.3 greatly simplify the final reduction. For the best performance we choose pentanomials, that is, $b_{i}=x^{w}+\xi(i)=x^{w}+x^{l}+x^{m}+x^{n}+1$, where $l$, $m$ and $n$ here denote exponents of the modulus; we do not use the simpler trinomials because there are not enough co-prime trinomials available. Unlike the multipliers in $[6,8]$ mentioned in the introduction, these pentanomial RNS bases do not harm the scalability of our work, simply because any field polynomial (trinomial, pentanomial or otherwise) can be mapped onto the RNS moduli via channel reduction. In other words, although the RNS moduli are pentanomials, our design can still handle different types of field polynomials, so hardware scalability is preserved. Algorithm 4 shows how simple field multiplication on $\mathbb{F}\left(2^{w}\right)$ becomes. It follows directly from Equation (6) with $\xi(i)=x^{l}+x^{m}+x^{n}+1$.

```
Algorithm 4: Fast Field Multiplication on \(\mathbb{F}\left(2^{w}\right)\)
    Input: \(x \in \mathbb{F}\left(2^{w}\right), y \in \mathbb{F}\left(2^{w}\right)\) and \(b=x^{w}+x^{l}+x^{m}+x^{n}+1, l<\lfloor w / 2\rfloor\)
    Output: \(|x y|_{b} \in \mathbb{F}\left(2^{w}\right)\)
    \(c \leftarrow x y\)
    for \(i \leftarrow 0\) to 1 do
        \(c_{H} \leftarrow c / x^{w}\)
        \(c_{L} \leftarrow c \bmod x^{w}\)
        \(c \leftarrow x^{l} c_{H}+x^{m} c_{H}+x^{n} c_{H}+c_{H}+c_{L}\)
    return \(c\)
```
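Algorithm 4 translates almost line by line into Python. The sketch below assumes the bit-mask representation of polynomials and uses the hypothetical modulus $b=x^{33}+x^{6}+x^{3}+x+1$ purely as a test value (the reduction identity does not depend on $b$ being irreducible); `polymod` is a generic reference reduction used only to check the result.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product in GF(2)[x]; polynomials are bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a: int, f: int) -> int:
    """Generic remainder in GF(2)[x], used as a reference."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def fast_mul(x: int, y: int, w: int, l: int, m: int, n: int) -> int:
    """|x*y|_b for b = x^w + x^l + x^m + x^n + 1 with l < floor(w/2).

    Two folding passes suffice: the first leaves a high part of degree
    at most l - 2, and the second pushes everything below degree w.
    """
    mask = (1 << w) - 1
    c = clmul(x, y)                              # c <- x * y
    for _ in range(2):
        c_h = c >> w                             # c_H <- c / x^w
        c_l = c & mask                           # c_L <- c mod x^w
        c = (c_h << l) ^ (c_h << m) ^ (c_h << n) ^ c_h ^ c_l
    return c


W, L, M_EXP, N_EXP = 33, 6, 3, 1                 # assumed example exponents
B = (1 << W) | (1 << L) | (1 << M_EXP) | (1 << N_EXP) | 1
R = fast_mul(0x1ABCDEF01, 0x0FEDCBA98, W, L, M_EXP, N_EXP)
```

The loop body contains only fixed shifts and XORs, which is why the reduction stage costs so few gates when the modulus has this form.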


## 4 Hardware implementation

We have implemented the algorithm described above on the Xilinx Virtex-II platform. Figure 1 depicts the architecture for the proposed RNS Montgomery multiplication algorithm on $\mathbb{F}_{2^{m}}$ (digit size $w=33$). In our implementation, $\left\lceil\frac{m}{w}\right\rceil$ dual-mode multipliers (DMMs) are used to achieve the highest parallelism. These multipliers are controlled by a scheduling sequencer that executes Algorithm 2 of Section 3. The sequencer is a finite state machine with 4 states (S_1, BT_A, BT_B, S_2). Step 1 and step 2 of Algorithm 2, which are computed on RNS base $\mathfrak{B}$, are executed in S_1, taking two clock cycles each. Step 5 and step 6, which are computed on RNS base $\mathfrak{C}$, are executed in S_2. By contrast, S_2 takes three cycles, because step 5 takes one clock cycle while step 6 takes two (first a multiply-accumulation, then a multiplication). BT_A and BT_B carry out the Base Extension procedure in step 3 and step 4, respectively. Both of them take two cycles to achieve the task, which is detailed in Algorithm 3. The initial operands are conveyed to the multipliers through the system bus. The outputs of the multipliers are connected to the main MUX component, which controls access to the bus. The multipliers are interconnected because some intermediate results are shared during the Base Extension procedure (Algorithm 3).

Fig. 1. Proposed $\mathbb{F}_{2^{m}}$ RNS Montgomery Multiplier Architecture

Each DMM has two RNS channels (one in $\mathfrak{B}$ and the other in $\mathfrak{C}$) because of the two RNS bases employed by the algorithm. Our DMM is advantageous when the moduli of the RNS channels have to be changed in pursuit of a better time-area tradeoff, because one simply needs to tune the nbitshifter to the desired moduli (as long as they are pseudo-Mersenne-like pentanomials, that is, $f=x^{w}+x^{l}+x^{m}+x^{n}+1$, $l<\lfloor w / 2\rfloor$) at no additional cost. A ROM embedded in the multiplier stores the pre-computed values $\left\{\left|p^{-1}\right|_{M_{\mathfrak{B}}}\right\}_{\mathfrak{B}}$, $\left\{\left|M_{\mathfrak{B}}{ }^{-1}\right|_{M_{\mathfrak{C}}}\right\}_{\mathfrak{C}}$, $\{p\}_{\mathfrak{C}}$, $\left|B_{i}{ }^{-1}\right|_{b_{i}}$, $\left|B_{i}\right|_{c_{k}}$, $\left|C_{i}{ }^{-1}\right|_{c_{i}}$ and $\left|C_{i}\right|_{b_{k}}$. Table II summarizes the memory requirements of the proposed architecture for $\mathbb{F}_{2^{m}}$ RNS Montgomery multiplication; in total, exactly $5\left\lceil\frac{m}{w}\right\rceil w+2\left\lceil\frac{m}{w}\right\rceil^{2} w$ bits of pre-computed data are stored in the ROM. The ALU is the arithmetic core of the multiplier and computes multiplication and multiply-accumulation on $\mathbb{F}_{2^{w}}$ (namely, mul_x_op $\times$ mul_y_op and mac_op + mul_x_op $\times$ mul_y_op, respectively). The intermediate result generated in the Base Extension procedure (Algorithm 3) is stored in tmp_reg, which is shared with all the other dual-mode multipliers.
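As a quick sanity check of the ROM budget, the total $5\left\lceil\frac{m}{w}\right\rceil w+2\left\lceil\frac{m}{w}\right\rceil^{2} w$ can be evaluated for concrete parameters. The short Python sketch below does this; the function name `rom_bits` is a hypothetical helper, and the breakdown in the comments follows the parameter list given in the text.

```python
def rom_bits(m: int, w: int) -> int:
    """Total pre-computed ROM bits for field size m and digit size w."""
    n = -(-m // w)                 # number of RNS channels, ceil(m / w)
    vectors = 5 * n * w            # five length-n tables of w-bit words
    matrices = 2 * n * n * w       # |B_i|_{c_k} and |C_i|_{b_k}
    return vectors + matrices


print(rom_bits(233, 33))           # 5544 bits for m = 233, w = 33
```

The quadratic term comes from the two base-extension matrices, so it dominates as soon as the number of channels exceeds a few.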

The internal structure of the ALU is primarily composed of XOR gates and switching MUXes. The Partial Product Accumulator (PPA), built from $w$ $\mathbb{F}_{2}$ adders, generates the result of the multiplication according to Equation (7). Two reduction stages then produce the final result by the method proposed in Algorithm 4. With only 8 adders employed, the reduction part achieves a dramatic reduction in timing and area complexity. The final adder is active when multiply-accumulation mode is enabled.

Table II. Memory Consumption of the Proposed RNS MM Architecture over $\mathbb{F}_{2^{m}}$

| Operation | Parameters stored | ROM Consumption |
| :---: | :---: | :---: |
| RNS MM in Algorithm 2 | $\left\{\left\|p^{-1}\right\|_{M_{\mathfrak{B}}}\right\}_{\mathfrak{B}}$ | $\left\lceil\frac{m}{w}\right\rceil w$ bits |
|  | $\left\{\left\|M_{\mathfrak{B}}{ }^{-1}\right\|_{M_{\mathfrak{C}}}\right\}_{\mathfrak{C}}$ | $\left\lceil\frac{m}{w}\right\rceil w$ bits |
|  | $\{p\}_{\mathfrak{C}}$ | $\left\lceil\frac{m}{w}\right\rceil w$ bits |
| Base Extension in Algorithm 3 | $\left\|B_{i}^{-1}\right\|_{b_{i}}$ $\left(i \in\left\{1, \ldots,\left\lceil\frac{m}{w}\right\rceil\right\}\right)$ | $\left\lceil\frac{m}{w}\right\rceil w$ bits |
|  | $\left\|B_{i}\right\|_{c_{k}}$ $\left(i, k \in\left\{1, \ldots,\left\lceil\frac{m}{w}\right\rceil\right\}\right)$ | $\left\lceil\frac{m}{w}\right\rceil^{2} w$ bits |
|  | $\left\|C_{i}^{-1}\right\|_{c_{i}}$ $\left(i \in\left\{1, \ldots,\left\lceil\frac{m}{w}\right\rceil\right\}\right)$ | $\left\lceil\frac{m}{w}\right\rceil w$ bits |
|  | $\left\|C_{i}\right\|_{b_{k}}$ $\left(i, k \in\left\{1, \ldots,\left\lceil\frac{m}{w}\right\rceil\right\}\right)$ | $\left\lceil\frac{m}{w}\right\rceil^{2} w$ bits |

Another significant characteristic of our work is that the proposed architecture is easily adapted to a variety of field sizes and field polynomials, which makes the design highly scalable. One merely needs to change the number of multipliers to suit a different field size, without modifying the interior of the multiplier. As for the field polynomial required by a particular application, rewriting the data stored in ROM is sufficient.

## 5 Performance and comparisons

To determine the complexity of the proposed multiplier, Table III analyzes and compares the time and area complexities of the presented RNS MM multiplier over $\mathbb{F}_{2^{233}}$ for different digit sizes. As the digit size grows, the multiplier runs faster but the circuit area increases considerably. It is therefore desirable to restrict the digit size so as to obtain modest area consumption and an acceptable latency. The clock frequency of the proposed pipelined RNS Montgomery multiplier implementations remains almost constant as the chip area increases, which indicates that our design maintains a reasonably high clock frequency over a wide range of digit sizes.

Table III. Time and area complexity comparison of RNS Montgomery multiplier architecture over $\mathbb{F}_{2^{233}}$

| Digit Size | Clock Cycles | Critical Path Delay | Time | Area |
| :---: | :---: | :---: | :---: | :---: |
| $w=17$ | 35 | $10 T_{\text{xor}}+6 T_{\text{mux}}$ ${}^{\text{a}}$ | $350 T_{\text{xor}}+210 T_{\text{mux}}$ | $\leq 5488 S_{\text{xor}}+432 S_{\text{mux}}$ ${}^{\text{b}}$ |
| $w=33$ | 23 | $11 T_{\text{xor}}+6 T_{\text{mux}}$ | $253 T_{\text{xor}}+138 T_{\text{mux}}$ | $\leq 9528 S_{\text{xor}}+344 S_{\text{mux}}$ |
| $w=63$ | 15 | $12 T_{\text{xor}}+6 T_{\text{mux}}$ | $180 T_{\text{xor}}+90 T_{\text{mux}}$ | $\leq 16644 S_{\text{xor}}+292 S_{\text{mux}}$ |
| $w=127$ | 11 | $13 T_{\text{xor}}+6 T_{\text{mux}}$ | $143 T_{\text{xor}}+66 T_{\text{mux}}$ | $\leq 33026 S_{\text{xor}}+274 S_{\text{mux}}$ |

${ }^{\mathrm{a}}$ $T_{\text{xor}}$ = delay of an XOR gate, $T_{\text{mux}}$ = delay of a MUX

${ }^{\mathrm{b}}$ $S_{\text{xor}}$ = area of an XOR gate, $S_{\text{mux}}$ = area of a MUX

A case study of the performance for different field sizes (some of the NIST-recommended binary fields and $\mathbb{F}_{2^{193}}$) is given in Table IV. For a clearer illustration, we assume that each RNS channel is fixed on $\mathbb{F}_{2^{33}}$ and that maximal parallelism is reached ($\left\lceil\frac{m}{33}\right\rceil$ DMMs work simultaneously). In this case, the critical path delay is not affected by the field size, because the DMM structure remains unchanged once the data width of the RNS channel is determined. To cater to different field sizes, one merely has to arrange the available DMMs, precompute the data in the ROMs and slightly alter the implementation of the sequencer, which yields a robust, scalable design.

The proposed RNS MM architecture is captured in Verilog HDL and implemented in hardware using Xilinx ${ }^{\circledR}$ Virtex ${ }^{\text {TM }}$-II Family Series XC2V3000 device as the target FPGA (Virtex-4 and Virtex-5 implementations are also

Table IV. Scalability analysis of RNS Montgomery multiplier architecture over $\mathbb{F}_{2^{163}}, \mathbb{F}_{2^{193}}, \mathbb{F}_{2^{233}}$ and $\mathbb{F}_{2^{283}}$

| Field Size | Clock Cycles ${ }^{\text {a }}$ | Critical Path Delay | Time | Area |
| :---: | :---: | :---: | :---: | :---: |
| $m=163$ | 17 | $11 T_{\text{xor}}+6 T_{\text{mux}}$ | $187 T_{\text{xor}}+102 T_{\text{mux}}$ | $\leq 5955 S_{\text{xor}}+215 S_{\text{mux}}$ |
| $m=193$ | 19 | $11 T_{\text{xor}}+6 T_{\text{mux}}$ | $209 T_{\text{xor}}+114 T_{\text{mux}}$ | $\leq 7146 S_{\text{xor}}+258 S_{\text{mux}}$ |
| $m=233$ | 23 | $11 T_{\text{xor}}+6 T_{\text{mux}}$ | $253 T_{\text{xor}}+138 T_{\text{mux}}$ | $\leq 9528 S_{\text{xor}}+344 S_{\text{mux}}$ |
| $m=283$ | 25 | $11 T_{\text{xor}}+6 T_{\text{mux}}$ | $275 T_{\text{xor}}+150 T_{\text{mux}}$ | $\leq 10719 S_{\text{xor}}+387 S_{\text{mux}}$ |

${ }^{\text {a }}$ Assume the highest parallelism is obtained and digit size $w=33$.

Table V. FPGA Implementations for the Proposed RNS Montgomery Multipliers on $\mathbb{F}_{2^{163}}, \mathbb{F}_{2^{233}}$, and $\mathbb{F}_{2^{283}}$

| Field Size | Platform | LUTs | Critical Path Delay (ns) | Clock Cycles |
| :---: | :---: | :---: | :---: | :---: |
| 163 | Virtex-II | 6,216 | 10.02 | 17 |
|  | Virtex-4 | 5,481 | 7.16 | 17 |
|  | Virtex-5 | 3,863 | 6.19 | 17 |
| 233 | Virtex-II | 9,978 | 10.13 | 23 |
|  | Virtex-4 | 8,785 | 7.21 | 23 |
|  | Virtex-5 | 6,215 | 6.25 | 23 |
| 283 | Virtex-II | 11,325 | 10.16 | 25 |
|  | Virtex-4 | 9,893 | 7.22 | 25 |
|  | Virtex-5 | 6,997 | 6.31 | 25 |

Table VI. Performance Comparison of FPGA Implementations for $\mathbb{F}_{2^{m}}$ Multipliers

| Reference | Field Size | Platform | LUTs | CPD ${ }^{\text{b}}$ (ns) | Clock Cycles | Scal. ${ }^{\text{a}}$ | Speed-Up |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |

${ }^{\text {a }}$ scalability
${ }^{\mathrm{b}}$ critical path delay

included for future comparison). We chose this device as our evaluation platform mainly to allow a fair comparison with previously published work.

To demonstrate the scalability of our work, the synthesis results of the proposed RNS Montgomery multiplier on $\mathbb{F}_{2^{163}}, \mathbb{F}_{2^{233}}$ and $\mathbb{F}_{2^{283}}$ are listed in Table V. A unified 33-bit DMM is adopted in this experiment: 5 such DMMs are instantiated to construct an $\mathbb{F}_{2^{163}}$ RNS Montgomery multiplier, while 8 and 9 DMMs are required for the $\mathbb{F}_{2^{233}}$ and $\mathbb{F}_{2^{283}}$ multipliers, respectively.
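As a rough cross-check of these figures, the sketch below derives the latency in nanoseconds and the average LUT count per DMM from the Table V entries. The helper `summarize` and the dictionary layout are illustrative assumptions; the LUT-per-DMM ratio also absorbs shared logic such as the sequencer and ROMs, so it is only an indicative number.

```python
from math import ceil

# (field size m, platform) -> (LUTs, critical path delay in ns, clock cycles), from Table V.
TABLE_V = {
    (163, "Virtex-II"): (6216, 10.02, 17),
    (163, "Virtex-4"): (5481, 7.16, 17),
    (163, "Virtex-5"): (3863, 6.19, 17),
    (233, "Virtex-II"): (9978, 10.13, 23),
    (233, "Virtex-4"): (8785, 7.21, 23),
    (233, "Virtex-5"): (6215, 6.25, 23),
    (283, "Virtex-II"): (11325, 10.16, 25),
    (283, "Virtex-4"): (9893, 7.22, 25),
    (283, "Virtex-5"): (6997, 6.31, 25),
}


def summarize(m: int, platform: str, w: int = 33) -> dict:
    """Derive DMM count, average LUTs per DMM and latency for one Table V row."""
    luts, cpd_ns, cycles = TABLE_V[(m, platform)]
    n_dmm = ceil(m / w)  # 33-bit channels needed to cover the m-bit field
    return {
        "dmms": n_dmm,
        "luts_per_dmm": luts / n_dmm,    # includes shared control logic
        "latency_ns": cycles * cpd_ns,   # one Montgomery multiplication
    }


for key in TABLE_V:
    print(key, summarize(*key))
```

On Virtex-II the ratio stays close to 1,250 LUTs per DMM for all three field sizes, which is consistent with area growing roughly linearly in the number of channels.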

Table VI compares the synthesis results of the proposed RNS Montgomery multiplier on $\mathbb{F}_{2^{233}}$ with existing work ($\mathbb{F}_{2^{233}}$ is one of the binary fields recommended by NIST for the elliptic curve digital signature algorithm (ECDSA)).

To clarify the timing performance of this work, we introduce the following indicator, shown in the last column of Table VI:

$$
\begin{equation*}
\text{speed-up}=\frac{\mathbf{T}_{\text{benchmark}}}{\mathbf{T}_{\text{ours}}}=\frac{\mathbf{CYCLES}_{\text{benchmark}} \times \mathbf{CPD}_{\text{benchmark}}}{\mathbf{CYCLES}_{\text{ours}} \times \mathbf{CPD}_{\text{ours}}} \tag{8}
\end{equation*}
$$
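Equation (8) can be evaluated directly. In the minimal sketch below, the values for our design are taken from Table V ($\mathbb{F}_{2^{233}}$ on Virtex-II), whereas the benchmark cycle count and critical path delay are hypothetical placeholders chosen only for illustration and do not correspond to any cited design.

```python
def total_time_ns(cycles: int, cpd_ns: float) -> float:
    """Total computing time = clock cycles x critical path delay (ns)."""
    return cycles * cpd_ns


def speed_up(cycles_bench: int, cpd_bench_ns: float,
             cycles_ours: int, cpd_ours_ns: float) -> float:
    """Speed-up as defined in Eq. (8): T_benchmark / T_ours."""
    return (total_time_ns(cycles_bench, cpd_bench_ns)
            / total_time_ns(cycles_ours, cpd_ours_ns))


# Our design on F_{2^233}, Virtex-II (Table V).
OURS_CYCLES, OURS_CPD_NS = 23, 10.13

# HYPOTHETICAL benchmark values, for illustration only.
BENCH_CYCLES, BENCH_CPD_NS = 233, 5.0

t_ours = total_time_ns(OURS_CYCLES, OURS_CPD_NS)  # about 233 ns
ratio = speed_up(BENCH_CYCLES, BENCH_CPD_NS, OURS_CYCLES, OURS_CPD_NS)
print(f"T_ours = {t_ours:.2f} ns, speed-up vs. hypothetical benchmark = {ratio:.2f}")
```

Since the indicator is a ratio of products, a design can trade a longer critical path for fewer cycles (or the reverse) without changing its speed-up.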

Our work performs best among all the scalable designs in terms of total computing time, with speed-ups of 9.09, 4.29 and 5.65 over the benchmarks $[4,5,9]$, respectively. The unscalable design (HybridKara [7]) performs even better than ours (about $57 \%$ higher in terms of speed-up). It should be noted, however, that this design is defined over the fixed-size field $\mathbb{F}_{2^{233}}$ and over irreducible polynomials of fixed special form (AOL, trinomials, pentanomials), so it does not scale to arbitrary fields. The large LUT consumption of our design compared with $[4,5,9]$ is due to the intrinsic property of RNS parallelism: 8 DMMs are employed to reach the maximum computational capability. For some area- or power-constrained applications, such peak throughput is not strictly required. To suit these applications, the number of DMMs can be reduced (for example to 4 or 2 DMMs, so that the LUT count is cut to a half or a quarter of the original design). In summary, the proposed RNS Montgomery multiplier achieves a high computation speed with an acceptable circuit area, and it supports a scalable design methodology.

## 6 Conclusion

In this paper, RNS Montgomery multiplication (RNS MM) has been generalized to binary extension fields and an efficient base selection method has been examined. We have presented a scalable RNS Montgomery multiplier architecture and evaluated its performance over $\mathbb{F}_{2^{233}}$. The architecture has been implemented on FPGA, and area and timing results have been presented. The experimental results show that the design achieves a high speed (for the field $\mathbb{F}_{2^{233}}$, the proposed architecture completes a Montgomery multiplication in $0.233 \mu \mathrm{~s}$, a speed-up of at least 4.29 over other scalable Montgomery designs in the literature) while supporting different field sizes and irreducible polynomials.

## Acknowledgments

This work is supported by the Key Project Foundation of Fundamental and Frontier Technology Research of Tianjin under Grant No.11JCZDJC1580 and the Open Project Program of State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences.

