Keywords

1 Introduction

There is an increasing discord between the convenience provided by cloud services and privacy concerns. Privacy can be achieved by encrypting data before uploading it to the cloud. However, with traditional cryptosystems, one is not able to process encrypted data [14], nullifying the benefits of cloud computing. A solution to this problem is the application of FHE, which allows for the creation of malleable cryptograms [14]. Homomorphic operations can afterwards be applied to these cryptograms. Despite its wide range of applicability, FHE has seldom been applied in practice due to its low computational performance.

Most recent research on FHE has focused on improving its efficiency [23]. LWE [27] and its ring-variant RLWE [21] have been suggested in this context as a framework for improving the complexity of FHE. The benefits brought forth by RLWE are twofold. First, operations are executed in a cyclotomic ring, and therefore benefit from its algebraic structure. Second, the plaintext space is isomorphic to the Cartesian product of multiple smaller spaces, designated batching slots, allowing for multiple messages to be encrypted and processed in a single ciphertext in parallel. Most of the related research has focused on cyclotomic rings of the form \(\mathbb {Z}[X]/(X^N + 1)\), for N a power of two, due to the simple arithmetic associated with the resulting ring. However, they only allow for a single batching slot. While RLWE modulo other cyclotomic polynomials, providing for a large number of batching slots, has been previously considered [12, 28], it remains very much unused much because of its less efficient arithmetic. In [22], Lyubashevsky et al. proposed algorithms using a multivariate/tensored representation of polynomials, natively supporting operations on any cyclotomic rings.

In this work we analyse and improve the arithmetic associated with the univariate representation of polynomials in general cyclotomic rings. Efficient reduction algorithms are proposed, which have the same asymptotic complexities as those of [21]. The Barrett reduction, which is a generic technique to perform polynomial reduction, had been considered in [9] to perform such reductions. However, the algorithms used in [9] do not take into account the characteristics of cyclotomic polynomials, and hence are sub-optimal. Herein, the degree of the polynomials is reduced using cyclotomic properties before applying Barrett’s algorithm, decreasing the complexity of the reduction and leading to theoretical speed-ups up to 2.06. We also show that in this context a Montgomery representation leads to more efficient reductions than generic Barrett algorithms, leading to theoretical speed-ups up to 2.63.

These gains have been confirmed in practice, with experimental speed-ups of up to 1.95 and 2.55 for the improved Barrett and Montgomery reductions, respectively, when compared with a straightforward Barrett algorithm in an i7-5960X processor. Moreover we have tested the applicability of our algorithms to two of the most currently used homomorphic encryption schemes, namely BGV [8] developed in HELib [16] and FV [10] developped in SEAL 2.0 [19] resulting in speed-ups up to 1.37 for homomorphic multiplication.

2 Background

Throughout the paper, \(\varvec{\phi }_m(X)\in \mathbb {Z}[X]\) will denote the m-th cyclotomic polynomial of degree \(n = \varphi (m)\), where \(\varphi \) is Euler’s totient function. The ring \(\mathcal {R}=\mathbb {Z}[X]/(\varvec{\phi }_m(X))\) is the main structure of R-LWE-based schemes such as FV and BGV. An element of \(\mathcal {R}\) can be thought of as a polynomial with integer coefficients and a degree strictly smaller than n. Unless mentioned otherwise, polynomials are represented in the power-basis \(\{1,X,\ldots X^{n-1}\}\). For \(\varvec{a} = \sum _{i=0}^{n-1}a_iX^i\in \mathbb {Z}[X]\), we denote \(\Vert \varvec{a}\Vert _\infty =\text {max}\{|a_i|, 0\leqslant i< n\}\). The underlying space for ciphertexts is \(\mathcal {R}_q = \mathcal {R}/q\mathcal {R}=\mathbb {Z}/q\mathbb {Z}[X]/(\varvec{\phi }_m(X))\), which is composed of elements of \(\mathcal {R}\) with coefficients reduced modulo q. The notation \(|\cdot |_q\) denotes the classical residue modulo q in [0, q), while the centered residue in \([-q/2,q/2)\) is denoted by \([\cdot ]_q\). Moreover \(\lfloor \cdot \rfloor \) denotes flooring while \(\lfloor \cdot \rceil \) denotes rounding to the nearest integer.

2.1 Residue Number System (RNS)

In practice, the value of q underpinning \(\mathcal {R}_q\) is chosen to be the product of small different prime numbers \(q=q_1\cdots q_k\). Therefore, thanks to the Chinese Remainder Theorem (CRT), \(\mathcal {R}_q\) splits into a cartesian product of smaller rings through the following isomorphism:

$$\begin{aligned} \text {RNS}_{q=q_1\ldots q_k}:\left| \begin{array}{rcl} \mathcal {R}_q&{}\rightarrow &{}\mathcal {R}_{q_1}\times \ldots \times \mathcal {R}_{q_k}\\ \varvec{a} &{} \mapsto &{} (\varvec{a}\bmod q_1,\ldots ,\varvec{a}\bmod q_k) \end{array}\right. \end{aligned}$$
(1)

The RNS exploits (1) to transfer the arithmetic modulo q of the coefficients to k smaller arithmetics modulo each \(q_i\). Thus, better performance is achieved due to smaller arithmetics and parallel computations over each \(\mathcal {R}_{q_i}\).

2.2 Product of Elements in \(\mathcal {R}_q\)

Since the degree n of the polynomials is large in practice, the polynomial product is a major bottleneck for the efficiency of R-LWE based schemes. However, when n is close or equal to a power of two, it can be performed efficiently thanks to the Number Theoretic Transform (NTT). Let \({\mathcal {N}_2}\) be the function defined over \(\mathbb {N}\) such that for any \(n\in \mathbb {N} \ {\mathcal {N}_2}(n)\) is the smallest power of two greater than or equal to n. In our context the products have a degree strictly smaller than 2n, therefore we denote \(N={\mathcal {N}_2}(2n)\). For a prime q, a primitive \(N^{\text {th}}\)-root of unity \(\omega \in \mathbb {F}_q\) exists if and only if \(q\equiv 1\bmod N\). For a ring \(\mathbb {Z}/q\mathbb {Z}\) equipped with \(\omega \), the following ring morphism is a bijection:

$$\begin{aligned} \text {NTT}_{q,N,\omega }:\left| \begin{array}{rcl} \mathbb {Z}/q\mathbb {Z}[X]/(X^N-1)&{}\rightarrow &{} \mathbb {Z}/q\mathbb {Z}[X]/(X-1)\times \cdots \times \mathbb {Z}/q\mathbb {Z}[X]/(X-\omega ^{N-1}) \\ \varvec{a} &{} \mapsto &{} \varvec{\widehat{a}} = \left( \varvec{a}(1),\varvec{a}(\omega ),\ldots ,\varvec{a}(\omega ^{N-1})\right) \end{array}\right. \end{aligned}$$
(2)

Once (2) is applied to obtain the \(\text {NTT}\) representation of polynomials, their product can be performed coordinate-wise. The time-complexity of arithmetic becomes linear in N and the bottleneck changes to the evaluation of (2), as well as of its inverse, whose evaluations require \(O(N\log (N))\) multiplications in \(\mathbb {Z}/q\mathbb {Z}\) by state-of-the-art algorithms [17, 20]. When the context is clear, we will denote an NTT transformation of degree N by \(\text {NTT}_{N}\) instead of \(\text {NTT}_{q,N,\omega }\).

In order to efficiently compute the product of elements \(\varvec{a}\) and \(\varvec{b}\) of \(\mathcal {R}_q\), seen as polynomials over \(\mathbb {Z}/q\mathbb {Z}[X]\) of degree at most \(n-1\), one has to compute the NTT representation of \(\varvec{c}=\varvec{a}\times \varvec{b}\) of degree \(2n-2\) through the following formula:

$$\begin{aligned} \text {NTT}_{N}(\varvec{c}) =\text {NTT}_{N}(\varvec{a}) \odot \text {NTT}_{N}(\varvec{b}) \end{aligned}$$
(3)

where \(\odot \) denotes the component-wise multiplication in \(\mathbb {Z}/q\mathbb {Z}\). To obtain \(\varvec{c} = \varvec{a} \times \varvec{b} \in \mathcal {R}_q\), a second step is needed, which consists in reducing the result of (3) modulo \(\varvec{\phi }_m(X)\). Notice that when m is a power of two \((\varvec{\phi }_m(X)=X^{m/2}+1)\), one can use a negatively-wrapped convolution for the evaluation of the \(\text {NTT}\) and its inverse [20]. This allows us to use an \(\text {NTT}\) of size \(\mathcal {N}_2(n)\) instead of \(\mathcal {N}_2(2n)\) for the evaluation of (3) and also to recover the polynomial reduced modulo \(X^{m/2}+1\) at the end of (3) just with an inverse \(\text {NTT}\) of size \(\mathcal {N}_2(n)\). However, for other values of m this method cannot be applied and a more complex reduction has to be carried out after applying (3).

Barrett’s strategy for modular reduction over the integers [5] can be adapted to polynomial modular arithmetic to reduce a polynomial \(\varvec{c}\) of degree \(n+\alpha \) by \(\varvec{\phi }_m\) of degree n. The quotient of the Euclidean division \(\lfloor \varvec{c} / \varvec{\phi }_m \rfloor \) is computed through multiplications by precomputed constants and shifts. We present this strategy in Algorithm 1. The performance of the algorithm is directly related to the size of the polynomial to be reduced: the algorithm is more efficient when \(\alpha \) is small. By denoting \(\tilde{n}=\mathcal {N}_2(n)\) and \(A=\mathcal {N}_2(2\alpha +1)\) the cost of the algorithm is:

$$\begin{aligned} \mathcal {C}_{\text {NTT}}(N)+2\mathcal {C}_{\text {NTT}}(\tilde{n}) + 2\mathcal {C}_{\text {NTT}}(A) + (\tilde{n}+A)\texttt {MMult}_q \end{aligned}$$

where \(\mathcal {C}_{\text {NTT}}(x)\) denotes the cost for evaluating (2) (or its inverse) of size x and \(\texttt {MMult}_q\) the cost of a modular multiplication modulo q. One may also notice that Barrett’s reduction used in [9] uses NTT of size \(N=\mathcal {N}_2(2n)\) to perform the second product while in fact it only requires NTT of size \(\mathcal {N}_2(n)\). For the sake of completeness a proof of the correctness of Algorithm 1 is given in Appendix  A.1.

figure a

2.3 RNS Variant of the FV and BGV Encryption Schemes

Fan and Vercauteren [10] have adapted Brakerski’s scale invariant FHE scheme [7] to the RLWE framework. More recently, Bajard et al. have provided a full RNS variant of FV [3]. We briefly recall how this RNS variant works.

We first need to introduce the two functions \(\xi _q: \mathcal {R}_q\rightarrow \mathcal {R}_{q_1}\times \cdots \times \mathcal {R}_{q_k}\) and \(\mathcal {P}_{\texttt {RNS},q}:\mathcal {R}_q\rightarrow \mathcal {R}_q^k\) such that for any \(\varvec{a}\in \mathcal {R}_q\), \(\xi _q(\varvec{a}) = (|\varvec{a}\cdot q_i/q|_{q_i})_{i\in [1,k]}\) and \(\mathcal {P}_{\texttt {RNS},q}(\varvec{a}) = (|\varvec{a}\cdot q/q_i|_q)_{i\in [1,k]}\). It is straightforward to notice that for any \((\varvec{a},\varvec{b})\in \mathcal {R}_q^2\), \(\left\langle \xi _q(\varvec{a}),\mathcal {P}_{\texttt {RNS},q}(\varvec{b})\right\rangle \equiv \varvec{ab} \bmod q\). This scalar product occurs in \(\mathcal {R}_q\), and so it is composed of polynomial products.

In the context of the FV scheme, the secret-key \(\varvec{s} \in \mathcal {R}\) is defined as a “small” polynomial drawn from a distribution \(\chi _{key}\). An encryption of \(\varvec{m}\in \mathcal {R}_t\) corresponds to a pair of polynomials \({{{\mathbf {\mathtt{{ct}}}}}}= (\varvec{c_0},\varvec{c_1}) \in \mathcal {R}_q^2\) satisfying:

$$\begin{aligned}{}[\varvec{c_0} + \varvec{c_1} \varvec{s}]_q = [\lfloor q/t\rfloor \cdot [\varvec{m}]_t+\varvec{v}]_q \end{aligned}$$
(4)

where \(\varvec{v}\) is a noise term that is originally introduced during encryption (which is related to a distribution \(\chi _{err}\)) and that grows as homomorphic operations are applied. Decryption works correctly so long as this noise is below a certain bound, which limits the amount of operations one can perform.

The homomorphic addition of two ciphertexts corresponds to the pairwise addition of the ciphertexts’ polynomials. Regarding homomorphic multiplication, it is useful to see ciphertexts as polynomials of degree 1 with coefficients in \(\mathcal {R}\). In this context, homomorphic multiplication takes place in two steps. First, \({{{\mathbf {\mathtt{{ct}}}}}}_{mult}\leftarrow \left( \left[ \left\lfloor \tfrac{t}{q} \varvec{c_0}^1 \varvec{c_0}^2 \right\rceil \right] _q,\left[ \left\lfloor \tfrac{t}{q} \left( \varvec{c_0}^1 \varvec{c_1}^2+\varvec{c_1}^1\varvec{c_0}^2\right) \right\rceil \right] _q,\left[ \left\lfloor \tfrac{t}{q} \varvec{c_1}^1\varvec{c_1}^2 \right\rceil \right] _q \right) \) is computed with a Karatsuba like algorithm. During this procedure, the RNS representations of the input polynomials are extended to bases with larger dynamic ranges so as to compute the products over \(\mathcal {R}\) instead of over \(\mathcal {R}_q\). Moreover, the division operation is achievable using [3, Sect. 4.4]. Finally, one has to convert the three-element ciphertext back to a two-element ciphertext, through a process called relinearisation. This is done by multiplying \(\xi _q({{{\mathbf {\mathtt{{ct}}}}}}_{\texttt {mult},2})\) by pseudo-encryptions of \(\mathcal {P}_{\texttt {RNS},q}(\varvec{s}^2)\) (designated \(\overrightarrow{{{{\mathbf {\mathtt{{rlk}}}}}}}\)), and adding the result to the other two elements:

$$\begin{aligned} \small {{{{\mathbf {\mathtt{{ct}}}}}}_{relin} \leftarrow \left( \left[ {{{\mathbf {\mathtt{{ct}}}}}}_{\texttt {mult},0}+\left\langle \xi _{q}({{{\mathbf {\mathtt{{ct}}}}}}_{\texttt {mult},2}), \overrightarrow{{{{\mathbf {\mathtt{{rlk}}}}}}}_0\right\rangle \right] _q, \left[ {{{\mathbf {\mathtt{{ct}}}}}}_{\texttt {mult},1}+\langle \xi _{q}({{{\mathbf {\mathtt{{ct}}}}}}_{\texttt {mult},2}), \overrightarrow{{{{\mathbf {\mathtt{{rlk}}}}}}}_1\rangle \right] _q\right) } \end{aligned}$$
(5)

The scheme introduced by Brakerski, Gentry and Vainkuntanathan [8] shares many features of FV, and can be similarly adapted to the techniques in [3]. A secret-key is also defined to be a “small” polynomial \(\varvec{s} \in \mathcal {R}_q\), and ciphertexts correspond to pairs \((\varvec{c}_0, \varvec{c}_1) \in \mathcal {R}_q^2\), but messages are encrypted in the Least Significant Bits of (6):

$$\begin{aligned}{}[\varvec{c_0} + \varvec{c_1} \varvec{s}]_q = [[\varvec{m}]_t+t\varvec{v}]_q \end{aligned}$$
(6)

The change in the positioning of the message bits leads to simpler homomorphic multiplications. First, we compute the degree 2 ciphertext \({{{\mathbf {\mathtt{{ct}}}}}}_{mult}\leftarrow \left( \left[ \varvec{c_0}^1 \varvec{c_0}^2 \right] _q,\left[ \left( \varvec{c_0}^1 \varvec{c_1}^2+\varvec{c_1}^1\varvec{c_0}^2\right) \right] _q,\left[ \varvec{c_1}^1\varvec{c_1}^2 \right] _q \right) \). Since operations are performed modulo q, no RNS base extension is required. Afterwards, an operation similar to (5) is applied, so as to convert the three-element ciphertext to a classical two-element ciphertext. Finally, a noise management technique is applied to reduce the growth rate of the norm of \(\varvec{v}\) in (6). This technique consists of scaling the ciphertext to a smaller ring \(R_{q'}\) with an appropriate rounding, and is performed in two steps:

$$\begin{aligned} \begin{array}{rcl} \varvec{\delta }_i &{} \leftarrow &{} t\cdot [-{{{\mathbf {\mathtt{{ct}}}}}}_{\texttt {mult},i}/t]_{q/q'} \text { for } i=0,1\\ \texttt {ct} &{} \leftarrow &{} \left( \left[ q'/q\cdot ({{{\mathbf {\mathtt{{ct}}}}}}_{\texttt {mult},0}+\varvec{\delta }_0)\right] _{q'}, \left[ q'/q\cdot ({{{\mathbf {\mathtt{{ct}}}}}}_{\texttt {mult},1}+\varvec{\delta }_1)\right] _{q'}\right) \end{array} \end{aligned}$$

In certain steps of the aforementioned schemes, one needs to add the result of multiple polynomial products. In this case, it is possible to sum the NTT forms of all the products, so that one has to apply only a single polynomial reduction in the end.

2.4 Batching

A common way to improve the efficiency of RLWE schemes is to encrypt several plaintexts in a single ciphertext, through a technique called batching [28]. Under some conditions, \(\varvec{\phi }_m\) splits modulo t into \(\ell \) distinct irreducible polynomials \(f_1,\ldots ,f_\ell \) of degree \(n/\ell \). This leads to the following ring isomorphism of the plaintext space \(R_t\): \(\mathcal {R}_t\cong \mathbb {Z}/t\mathbb {Z}[X]/(\varvec{f_1})\times \cdots \times \mathbb {Z}/t\mathbb {Z}[X]/(\varvec{f_\ell })\). In this manner, \(\ell \) plaintexts \(\varvec{m}_1, \ldots , \varvec{m}_\ell \) can be compactly represented as a single polynomial \(\varvec{m}\in \mathcal {R}_t\). Afterwards \(\varvec{m}\) is encrypted and homomorphic operations applied to this ciphertext operate on each slot individually. This technique is mostly used when evaluating Boolean circuits, i.e. with \(t=2\), to pack \(\ell \)-bits in a single ciphertext. However, since \(X^{m/2}+1 \equiv (X+1)^{m/2}\text { mod }2\), this technique cannot be used when m is a power of two. Thus the efficient arithmetic associated with power-of-two cyclotomic polynomials has limited applicability.

3 Improving Polynomial Reduction Modulo \({\phi }_m\)

In this section, we propose two efficient methods to compute polynomial reductions. The first method takes advantage of the properties of the cyclotomic polynomials to improve the efficiency of the Barrett algorithm. The second reduction rests on an adaptation of the Montgomery modular reduction [25].

3.1 Improving Barrett’s Reduction for Cyclotomic Polynomials

As explained in Sect. 2, Barrett’s algorithm is sensitive to the difference between the degree of the polynomial to be reduced and that of the polynomial we want to reduce by. The smaller the difference, the more efficient the algorithm will be. Herein we propose an efficient method to reduce this difference.

Polynomials to be reduced modulo \(\varvec{\phi }_m\) have a degree of at most \(2n-2\). Let \(\varvec{c}\) be such a polynomial. If \(\varvec{c}\) were reduced by a polynomial \(\varvec{Q}_{sp}\) of degree \(n+\alpha +1\), the difference between the degree of the polynomial and the degree of \(\varvec{\phi }_m\) would be reduced to \(\alpha \). However, in order to obtain the correct value of \(\varvec{c}\bmod \varvec{\phi }_m\) in the end, \(\varvec{\phi }_m\) has to divide \(\varvec{Q}_{sp}\) and for the efficiency of the reduction \(\varvec{Q}_{sp}\) should be sparse enough so that its reduction can be handled through few operations in \(\mathbb {Z}/q\mathbb {Z}\). Thanks to the cyclotomic property \(\prod _{d/m}\varvec{\phi }_d(X) = X^m-1\), \(\varvec{Q}_{sp}\) can be taken as the product of \(\varvec{\phi }_m\) and some \(\varvec{\phi }_d\) for d dividing m. Good candidates can be found by recursively using the fact that if p is a prime not dividing \(m'\) then \(\varvec{\phi }_{m'\cdot p}(X)\cdot \varvec{\phi }_{m'}(X)=\varvec{\phi }_{m'}(X^p)\). If \(\varvec{Q}_{sp}\) is found in this way in less than 2 recursions then, since it will correspond to a cyclotomic with at most two distinct odd prime factors, it will have coefficients in \(\{-1,0,1\}\). In this case, the reduction modulo \(\varvec{Q}_{sp}\) only requires additions in \(\mathbb {Z}/q\mathbb {Z}\) and can be done very efficiently.

In addition, when \(m < 2n-2\), \(\varvec{c}\) can initially be reduced by \(X^m-1\) with \(2n-m-1\) additions in \(\mathbb {Z}/q\mathbb {Z}\). Since \(\varvec{\phi }_m(X) \vert \varvec{Q}_{sp}(X) \vert X^m-1\) the strategy remains correct, and the complexity of the reduction by \(\varvec{Q}_{sp}(X)\) is further reduced. Let \(\texttt {HW}(\varvec{Q}_{\text {sp}})\) be the Hamming weight of \(\varvec{Q}_{sp}\). The cost of the reduction of \(\varvec{c}\) by \(\varvec{Q}_{sp}\) is \((\texttt {HW}(\varvec{Q}_{sp})-1)(m - \deg (\varvec{Q}_{sp}))\) additions in \(\mathbb {Z}/q\mathbb {Z}\). At this point, we obtain \(\varvec{c'}=\varvec{c}\bmod \varvec{Q}_{sp}\) (with \(\deg (\varvec{c'})\le n+\alpha \)) and \(\varvec{c'}\equiv \varvec{c}\bmod \varvec{\phi }_m\).

The final algorithm is depicted in Algorithm 2. It starts by recovering \(\varvec{c}\) in power-basis from the NTT representation output by (3). Then it consecutively reduces \(\varvec{c}\) of degree \(2n-2\) by \(X^m-1\) and by the sparse polynomial \(\varvec{Q}_{sp}\). This allows to recover \(\varvec{c'}=\varvec{c}\bmod \varvec{Q}_{sp}\) of degree \(n+\alpha \) very efficiently. Afterwards, steps 2 to 7 of Algorithm 1 are applied to \(\varvec{c'}\) to get \(\varvec{c''}=\varvec{c}\bmod \varvec{\phi }_m\).

figure b

The impact of this sparse reduction is illustrated in Table 1, where polynomials \(\varvec{Q}_{sp}\) are presented for different cyclotomic polynomials. Cyclotomics have been chosen with a degree \(n=\varphi (m)\) close or equal to a power of two. The number of batching slots \(\ell \) associated with each cyclotomic is also presented. The degree of \(\varvec{Q}_{sp}\) is \(n+\alpha +1\) thus NTTs of size \(A=\mathcal {N}_2(2\alpha +1)\) are required to compute the first polynomial product in Algorithm 2. This is in contrast with \(N=\mathcal {N}_2(2n)\) which would have been the size required for the Barrett algorithm without using the sparse reduction. In order to highlight the sparsity of \(\varvec{Q}_{sp}\) we give \(\texttt {HW}(\varvec{Q}_{sp})\) which is the number of non-zero coefficients of \(\varvec{Q}_{sp}\).

Complexity. Since the complexity of computing multiplications in \(\mathbb {Z}/{q}\mathbb {Z}\) is much higher than additions, the cost of the reduction by the sparse polynomial can be neglected. Moreover, with the RNS, each multiplication in \(\mathcal {R}_q\), with \(q=q_1\ldots q_k\) can be decomposed into k independent and smaller multiplications. Therefore the cost, in terms of modular multiplications, to reduce the polynomial \(\varvec{c}\) output by (3) is essentially k times the cost of Algorithm 2:

$$\begin{aligned} k\cdot \left( \mathcal {C}_{\text {NTT}}(N) + 2\cdot \mathcal {C}_{\text {NTT}}(A) + 2\cdot \mathcal {C}_{\text {NTT}}(\tilde{n}) + A+\tilde{n} \right) . \end{aligned}$$

While the cost of the method by using directly Barrett’s algorithm, i.e. without performing the reduction by the sparse polynomial, is:

$$\begin{aligned} k\cdot \left( 3\cdot \mathcal {C}_{\text {NTT}}(N)+2\cdot \mathcal {C}_{\text {NTT}}(\tilde{n}) + N + \tilde{n} \right) . \end{aligned}$$

Based on this analysis, we also provide in Table 1 the theoretical speed-up obtained with the use of the sparse reduction.

Table 1. Sparse polynomials used for partial reduction with their related parameters.

3.2 NTT-based Montgomery’s Reduction

We propose a Montgomery reduction of a polynomial given in NTT representation, inspired by [4]. The bottleneck of the previous optimized Barrett algorithm is the computation of the inverse NTT of size N of (3). Our Montgomery reduction takes advantage of the presence of the NTT basis of size N / 2 (seen as an RNS basis in [4]) in the basis of size N allowing to perform all the computations, in particular the inverse NTT evaluation, in the basis of size N / 2 instead of N.

The NTT representation of a polynomial of size N was defined in (2) as the set \(\{\varvec{c} \bmod (X-\omega ^j) | 0 \le j < N\}\). This representation can be seen as a polynomial-RNS representation of \(\varvec{c} \bmod (X^N-1)\) since \(X^N-1 = \prod _{0 \le j < N} (X-\omega ^j) \bmod q\), with respect to the following NTT-basis:

$$\begin{aligned} \mathcal {B}_{\omega ,N}=\{|X-1|_{q},|X-\omega |_{q},\ldots ,|X-\omega ^{N-1}|_{q}\} \end{aligned}$$

As \(X^N-1\) splits in \((X^{N/2}-1)(X^{N/2}+1)\) when N is even, half of the \(\text {NTT}_N\) representation of \(\varvec{c}\) corresponds to its \(\text {NTT}_{N/2}\) representation. Hence, the basis \(\mathcal {B}_{\omega ,N}\) is split along even and odd powers of \(\omega \). We can then define two sub-bases defining two polynomials:

$$\begin{aligned} \left\{ \begin{array}{rcllrcl} \mathcal {B}_{\omega ,N}^{(e)} &{}= &{}\{[X-\omega ^{2j}|_{q}, 0\le j \le \frac{N}{2}-1\} &{} \text {and} &{}\varvec{\Psi }^{(e)}&{}=&{}\left| \mathop {\prod }\nolimits _{j=0}^{\frac{N}{2}-1}(X-\omega ^{2j})\right| _{q}\\ \mathcal {B}_{\omega ,N}^{(o)} &{}= &{}\{|X-\omega ^{2j+1}|_{q},0\le j\le \frac{N}{2}-1\} &{} \text {and} &{}\varvec{\Psi }^{(o)}&{}=&{}\left| \mathop {\prod }\nolimits _{j=0}^{\frac{N}{2}-1}(X-\omega ^{2j+1})\right| _{q} \end{array}\right. \end{aligned}$$
(7)

It is straightforward to notice that \(\varvec{\Psi }^{(e)}\equiv |X^{N/2}-1|_q\) and \(\varvec{\Psi }^{(o)}\equiv |X^{N/2}+1|_q\). We also note that since N is a power of two, one has \(\varvec{\Psi }^{(o)}\equiv \varvec{\phi }_N\bmod q\). Thanks to Lemma 1, whose first point is a direct consequence of Lemma 2 in [11], we can choose \(X^{N/2}+1\) as the Montgomery factor.

Lemma 1

Let \(\varvec{\phi }_m\) be the m-th cyclotomic polynomial of degree n and N be the smallest power of two greater than or equal to 2n. If m is not a power of two then:

  • there exists \((\varvec{U},\varvec{V})\in \mathbb {Z}[X]^2\) such that \(\varvec{U}(X)\cdot \varvec{\phi }_m(X)+\varvec{V}(X)\cdot \varvec{\phi }_N(X)=1\);

  • for any prime p, \(\varvec{\phi }_m\) and \((X^N-1)\) are coprime in \(\mathbb {Z}/p\mathbb {Z}\). In particular \(\varvec{\phi }_m\) is a unit in \(\mathbb {Z}/p\mathbb {Z}[X]/(\varvec{\phi }_N)\).

One can extract from the coordinates of \(\varvec{c}\) in \(\mathcal {B}_{\omega ,N}\) the representation \(\varvec{\widehat{c}}^{(e)}\) of \(\varvec{c}\) in \(\mathcal {B}_{\omega ,N}^{(e)}\) (resp. \(\varvec{\widehat{c}}^{(o)}\) in \(\mathcal {B}_{\omega ,N}^{(o)}\)). So, given \(\varvec{\widehat{c}}^{(o)}\) and \(\varvec{\widehat{c}}^{(e)}\), we can use the NTT operator to get:

$$\begin{aligned} \texttt {NTT}_{N/2}^{-1}(\varvec{\widehat{c}}^{(e)})=\varvec{c}\bmod (q,X^{N/2}-1). \end{aligned}$$

Definition 1

We define the following function, which takes in as input the residues of the polynomial \(\varvec{c}\) (\(\deg (\varvec{c})<N\)) modulo a prime p:

$$\begin{aligned} \texttt { modMg}_{\varvec{\phi }_m,\varvec{\Psi }^{(o)},p}(\varvec{c})=\frac{\varvec{c}+\varvec{\phi }_m\times |-\varvec{c}/\varvec{\phi }_m|_{\varvec{\Psi }^{(o)}}}{\varvec{\Psi }^{(o)}}\bmod p. \end{aligned}$$
(8)

The \(\texttt {modMg}_{\varvec{\phi }_m,\varvec{\Psi }^{(o)},p}\) function defined in (8) is a classical Montgomery reduction with factor \(\varvec{\Psi }^{(o)}\) and consisting in an exact polynomial division. It always outputs a polynomial congruent to \(|\varvec{c}/\varvec{\Psi }^{(o)}|_{\varvec{\phi }_m}\) but when \(\deg (\varvec{c})\leqslant N/2+n-1\) the output is exactly \(|\varvec{c}/\varvec{\Psi }^{(o)}|_{\varvec{\phi }_m}\).

Lemma 2

If \(\deg (\varvec{c})\leqslant \frac{N}{2}+n-1\), then \(\texttt {modMg}_{\varvec{\phi }_m,\varvec{\Psi }^{(o)},p}(\varvec{c})=\varvec{c}/\varvec{\Psi }^{(o)}\bmod (p,\varvec{\phi }_m)\).

First the degree of the numerator in (8) is bounded by \(\max (\deg (\varvec{c}), \deg (\varvec{\phi }_m)+\deg (\varvec{\Psi }^{(o)})-1)\leqslant n+N/2-1\). Thus, the degree of the resulting quotient is bounded by \(n-1 < N/2\). Therefore, the output is \(|\varvec{c}/\varvec{\Psi }^{(o)}|_{\varvec{\phi }_m}\) and the computation of (8) can be made modulo \(X^{N/2}-1\), i.e. in an NTT representation of size N / 2 when using primes \(q_i\equiv 1\bmod N\). Algorithm 3 details the computation of (8). The following precomputations are used therein:

$$\begin{aligned} \left\{ \begin{array}{rcl} \varvec{\widehat{W}}^{(o)} &{} : &{} -1/\varvec{\phi }_m\bmod (q,\varvec{\Psi }^{(o)})\text { in base }\mathcal {B}_{\omega ,N}^{(o)}\\ \varvec{\widehat{Y}}^{(e)} &{} : &{} 1/\varvec{\Psi }^{(o)}\equiv 1/2 \bmod (q,\varvec{\Psi }^{(e)})\text { in base }\mathcal {B}_{\omega ,N}^{(e)}\\ \varvec{\widehat{Z}}^{(e)} &{} : &{} \varvec{\phi }_m/\varvec{\Psi }^{(o)}\equiv \varvec{\phi }_m/2 \bmod (q,\varvec{\Psi }^{(e)})\text { in base }\mathcal {B}_{\omega ,N}^{(e)}\\ \end{array}\right. \end{aligned}$$
figure c

In line 3 of Algorithm 3, we require an operator which takes in as input a vector of coefficients in base \(\mathcal {B}_{\omega ,N}^{(o)}\). This vector defines a unique polynomial \(\varvec{Q}\) with \(\deg (\varvec{Q})<N/2\). Then this operator must output the vector of coefficients of \(\varvec{Q}\) in base \(\mathcal {B}_{\omega ,N}^{(e)}\). The function \(\texttt {BaseConv}\) is defined for any \(\varvec{Q}\) with \(\deg (\varvec{Q})<N/2\) by:

$$\begin{aligned} \texttt {BaseConv} : (\varvec{Q}(\omega ),\varvec{Q}(\omega ^3),\ldots ,\varvec{Q}(\omega ^{N-1})) \mapsto (\varvec{Q}(1),\varvec{Q}(\omega ^2),\ldots ,\varvec{Q}(\omega ^{N-2}))\ [q]. \end{aligned}$$
(9)

In [4], (9) is computed with a classical Lagrange interpolation. Our context is more specific, because the points in which polynomials are evaluated are powers of a \(N^{\text {th}}\) root of unity \(\omega \) in \(\mathbb {Z}/q\mathbb {Z}\). With this purpose, Algorithm 4 implements such base conversion by only using NTTs of degree N / 2, with \(\omega ^{2}\) as a primitive \(N/2^{\text {th}}\) root of unity. The proof of correctness of Algorithm 4 is provided in Appendix A.3.

figure d

Complexity. The total cost of Algorithm 3 in terms of modular multiplications is:

$$\begin{aligned} k\cdot \left( 3\cdot \mathcal {C}_{\text {NTT}}(\mathcal {N}_2(n)) + 4\cdot \mathcal {N}_2(n) \right) . \end{aligned}$$

One can find in Table 2 the predicted speed-up of the proposed Montgomery reduction. Despite its lower complexity, the Montgomery algorithm suffers from one main drawback which is the presence of the Montgomery factor in the output.

Table 2. Theoretical speed-up of Algorithm 3 when compared with Algorithm 1

4 Adaptation of FV and BGV to the Montgomery Representation

In this section, we show how the Montgomery representation impacts the BGV and FV schemes and suggest modifications to handle these changes. For simplicity, we denote by \(\varvec{M}\) the Montgomery factor \((X^{N/2}+1)\bmod \varvec{\phi }_m\). Thanks to Lemma 1 we also know that \(\varvec{M}^{-1}\) exists in \(\mathcal {R}\). We assume that ciphertexts \(\widetilde{\texttt {ct}}\) are given in Montgomery representation, i.e. such that \(\widetilde{\texttt {ct}}=(\varvec{\widetilde{c}_0},\varvec{\widetilde{c}_1})=(\varvec{c}_0\varvec{M},\varvec{c}_1\varvec{M})\). The conversion to the Montgomery domain can be integrated in the encryption procedure for increased efficiency, and the \(\varvec{M}\) factor can be removed during decryption by applying a Montgomery reduction to \([\varvec{\widetilde{c}_0} + \varvec{\widetilde{c}_1} \varvec{s}]_q\). The Montgomery reduction only impacts procedures involving multiplications in \(\mathcal {R}_q\) by multiplying the product by \(\varvec{M}^{-1}\). Therefore the Montgomery representation is stable with respect to multiplication. Homomorphic additions are not affected by this different representation. Thus the only impact one has to consider is on homomorphic multiplication. For further discussions, we recall that the expansion factor of the ring \(\mathcal {R}\) is the quantity defined by \(\delta _\mathcal {R}=\sup \{\Vert \varvec{ab}\Vert _\infty /\Vert \varvec{a}\Vert _\infty \Vert \varvec{b}\Vert _\infty \ (\varvec{a},\varvec{b})\in \mathcal {R}-\{0\} \}\).

4.1 Impact of the Montgomery Representation in FV

We note that the first step of the FV homomorphic multiplication corresponds to the extension of polynomials of ciphertexts to a larger RNS basis, so that multiplications are computed over \(\mathcal {R}\) instead of \(\mathcal {R}_q\). In order to improve efficiency, an approximate extension is used [3] and thus the norm of the polynomials is bounded by \(\tfrac{q}{2}(1+\rho )\) for a parameter \(\rho >0\) [3]. A bound on the noise associated to \(\widetilde{{{{\mathbf {\mathtt{{ct}}}}}}}_{mult}\leftarrow \left( \left[ \left\lfloor \tfrac{t}{q} M\varvec{c_0}^1 \varvec{c_0}^2 \right\rceil \right] _q,\left[ \left\lfloor \tfrac{t}{q} M(\varvec{c_0}^1 \varvec{c_1}^2+\varvec{c_1}^1\varvec{c_0}^2) \right\rceil \right] _q,\left[ \left\lfloor \tfrac{t}{q} M\varvec{c_1}^1\varvec{c_1}^2 \right\rceil \right] _q \right) \) is given in Proposition 1 whose proof can directly be derived from the one of [3].

Proposition 1

Let \((\widetilde{\varvec{c}}_0,\widetilde{\varvec{c}}_1,\widetilde{\varvec{c}}_2) = \widetilde{{{{\mathbf {\mathtt{{ct}}}}}}}_{mult}\), \(r_\infty =\tfrac{1+\rho }{2}(1+\delta _\mathcal {R}\Vert \varvec{s}\Vert _\infty )+\delta _R\Vert M\Vert _\infty \) and \(\varvec{v}_i\) be the inherent noise of \(\texttt {ct}_i=(\varvec{c_0}^i,\varvec{c_1}^i)\). Then

$$\begin{aligned} (\widetilde{\varvec{c}}_0+\widetilde{\varvec{c}}_1\varvec{s}+\widetilde{\varvec{c}}_2\varvec{s}^2)\varvec{M^{-1}}\equiv \Delta \left[ \varvec{m}_1\varvec{m}_2\right] _t+\varvec{\widetilde{v}}_{mult}\bmod q \end{aligned}$$

with the following bound on the noise:

$$\begin{aligned} \Vert \varvec{\widetilde{v}}_{mult}\Vert _\infty&<\delta _\mathcal {R}t \left( \delta _\mathcal {R}\left\| \varvec{M}^{-1}\right\| _\infty r_\infty +\tfrac{1}{2}\right) \left( \Vert \varvec{v}_1\Vert _\infty +\Vert \varvec{v}_2\Vert _\infty \right) +\tfrac{\delta }{2}\min \Vert \varvec{v}_i\Vert _\infty \\&+\delta t |q|_t\left( r_\infty +1\right) +\tfrac{1}{2}\delta _\mathcal {R}\left\| \varvec{M}^{-1}\right\| _\infty \left( 1+\delta _\mathcal {R}\Vert \varvec{s}\Vert _\infty \left( 1+\delta _\mathcal {R}\Vert \varvec{s}\Vert _\infty \right) \right) +\tfrac{|q|_t}{2}+1\nonumber . \end{aligned}$$
(10)

Now, we assume that we need to relinearize the ciphertext \(\widetilde{\texttt {ct}}_{mult}=(\widetilde{\varvec{c}}_0,\widetilde{\varvec{c}}_1,\widetilde{\varvec{c}}_2)\). Within the original variant, the following dot products were computed over \(\mathcal {R}_{q}\): \(\langle \xi _{q}(\varvec{c}_2), \texttt {\text {rlk}}_{i} \rangle \), where \(\texttt {\text {evk}}_{0}=\left[ \mathcal {P}_{RNS,q}(\varvec{s}^2)+\overrightarrow{\varvec{a}}\varvec{s}+\overrightarrow{\varvec{e}}\right] _q\) and \(\texttt {\text {evk}}_{1}=\left[ -\overrightarrow{\varvec{a}}\right] _q\). The goal of relinearisation is to obtain \(\left\langle \xi _{q}(\varvec{c}_2),\mathcal {P}_{RNS,q}(\varvec{s}^2) \right\rangle \equiv \varvec{c}_2\varvec{s}^2\bmod q\) with a limited increase of the noise. Indeed, one can write:

$$\begin{aligned} \left\{ \begin{array}{rcl} \left\langle \xi _{q}(\varvec{c}_2), \texttt {evk}_0 \right\rangle &{}\equiv &{} \varvec{c}_2\varvec{s}^2+ \left\langle \xi _{q}(\varvec{c}_2), \overrightarrow{\varvec{a}}\right\rangle \varvec{s}+ \left\langle \xi _{q}(\varvec{c}_2), \overrightarrow{\varvec{e}}\right\rangle \equiv \varvec{c}_2\varvec{s}^2+ \varvec{a'}\varvec{s}+\varvec{e'} \bmod q \\ \left\langle \xi _{q}(\varvec{c}_2), \texttt {evk}_1 \right\rangle &{}\equiv &{} -\left\langle \xi _{q}(\varvec{c}_2), \overrightarrow{\varvec{a}}\right\rangle \equiv -\varvec{a'} \bmod q \end{array}\right. \end{aligned}$$
(11)

Now, we need to obtain the Montgomery representation of the output of this relinearisation, i.e. a cryptogram like \(\left( \left( \varvec{c}_2\varvec{s}^2+ \varvec{a'}\varvec{s}+\varvec{e'} \right) \varvec{M},-\varvec{Ma'}\right) \).

When the Montgomery representation is used, \(\varvec{\widetilde{c}}_2\) replaces \(\varvec{c}_2\) in (11). Hence, the relinearisation key has to be modified as follows:

$$\begin{aligned} \texttt {evkM}_{0}=\left[ \left( \mathcal {P}_{RNS,q}\left( \varvec{s}^2/\varvec{M}\right) +\overrightarrow{\varvec{a}}\varvec{s}+\overrightarrow{\varvec{e}}\right) \varvec{M}^2\right] _q,\ \texttt {evkM}_{1}=\left[ -\overrightarrow{\varvec{a}}\varvec{M}^2\right] _q \end{aligned}$$

In the following equations, we simulate the effect of the Montgomery reduction by introducing a factor \(\varvec{M}^{-1}\) (\(\bmod \varvec{\phi }_m\)):

$$\begin{aligned} \begin{array}{rcl} \left\langle \xi _{q}\left( \varvec{\widetilde{c}}_2\right) , \texttt {evkM}_0 \right\rangle \varvec{M}^{-1} &{}=&{} \left( \varvec{\widetilde{c}}_2\varvec{s}^2\varvec{M}+ \left\langle \xi _{q}\left( \varvec{\widetilde{c}}_2\right) , \overrightarrow{\varvec{a}}\right\rangle \varvec{s}\varvec{M}^2+ \left\langle \xi _{q}\left( \varvec{\widetilde{c}}_2\right) , \overrightarrow{\varvec{e}}\right\rangle \varvec{M}^2 \right) \varvec{M}^{-1}\\ &{}=&{} \varvec{\widetilde{c}}_2\varvec{s}^2+ \left( \varvec{a''}\varvec{s}+\varvec{e''} \right) \varvec{M}\\ &{}=&{} \left( \varvec{c}_2\varvec{s}^2+ \varvec{a''}\varvec{s}+\varvec{e''} \right) \varvec{M} \end{array} \end{aligned}$$

Similarly, we get \(\left\langle \xi _{q}\left( \varvec{\widetilde{c}}_2\right) , \texttt {evkM}_1 \right\rangle \varvec{M}^{-1}=-\varvec{a''}\varvec{M}\). Hence, we have obtained the Montgomery representation of the output of relinearisation step at no extra cost - both computationally and in terms of noise growth.

4.2 Impact of the Montgomery Representation in BGV

For the first step of the BGV homomorphic multiplication, no scaling operation is required. Thus, the noise is not affected by a change in representation. Next, relinearisation is applied. An analysis similar to the one in Sect. 4.1 can be performed, with minor adaptations for the relinearisation key. Similarly, one concludes that the Montgomery reduction introduces no cost neither in terms of computation nor in noise growth.

Finally, one needs to apply scaling so as to manage noise growth. We consider the ciphertext \((\varvec{\widetilde{c}}_0,\varvec{\widetilde{c}}_1)\) encrypting \(\varvec{m}\) and given in Montgomery representation. Let \(q'\mid q\) and \(\varvec{\delta }_i=[-\varvec{\widetilde{c}}_i/t]_{q/q'}\times t\). Then the BGV-scaling function applied to \(\varvec{\widetilde{c}}_i\) outputs \(\varvec{\widehat{c}}_i=(\varvec{\widetilde{c}}_i+\varvec{\delta }_i)\times \frac{q'}{q}\).

Lemma 3

If \(\left\| \left[ \varvec{c}_0+\varvec{c}_1\cdot \varvec{s}\right] _q \right\| _\infty < \frac{q}{2}-\delta _\mathcal {R}\left\| \varvec{M}^{-1}\right\| _\infty \frac{qt}{2q'}\left( 1+\delta _\mathcal {R}\cdot \Vert \varvec{s}\Vert _\infty \right) \) and \(q=q'\bmod t\), then

$$\begin{aligned} \left[ \left( \varvec{\widehat{c}}_0+\varvec{\widehat{c}}_1\cdot \varvec{s}\right) \varvec{M}^{-1}\right] _{q'} = \left[ \varvec{c}_0+\varvec{c}_1\cdot \varvec{s}\right] _{q}\bmod t \end{aligned}$$
(12)

and

$$\begin{aligned} \left\| \left[ \left( \varvec{\widehat{c}}_0+\varvec{\widehat{c}}_1\cdot \varvec{s}\right) \varvec{M}^{-1}\right] _{q'}\right\| _\infty \leqslant \frac{q'}{q} \left\| \left[ \varvec{c}_0+\varvec{c}_1\cdot \varvec{s}\right] _q \right\| _\infty + \delta _\mathcal {R}\left\| \varvec{M}^{-1}\right\| _\infty \frac{t}{2}\left( 1+\delta _\mathcal {R}\Vert \varvec{s}\Vert _\infty \right) \end{aligned}$$
(13)

Proof

It is similar to the proof of lemma 4 in [8]. By definition of \(\varvec{\widetilde{c}}_i\), we have:

$$\begin{aligned} \left[ \left( \varvec{\widetilde{c}}_0+\varvec{\widetilde{c}}_1\cdot \varvec{s}\right) \varvec{M}^{-1}\right] _q=\left[ \varvec{c}_0+\varvec{c}_1\cdot \varvec{s}\right] _q=\varvec{c}_0+\varvec{c}_1\cdot \varvec{s}-q\varvec{u}. \end{aligned}$$

By definition of \(\varvec{\widehat{c}}_i\), we can write:

$$\begin{aligned} \begin{array}{rcl} \left( \varvec{\widehat{c}}_0+\varvec{\widehat{c}}_1\cdot \varvec{s}\right) \varvec{M}^{-1}&{}=&{}\frac{q'}{q}\left( \varvec{\widetilde{c}}_0+\varvec{\widetilde{c}}_1\cdot \varvec{s}+\varvec{\delta }_0+\varvec{\delta }_1\cdot \varvec{s}\right) \varvec{M}^{-1}\\ &{}=&{}\frac{q'}{q}\left( \varvec{c}_0+\varvec{c}_1\cdot \varvec{s}\right) +\frac{q'}{q}\left( \varvec{\delta }_0+\varvec{\delta }_1\cdot \varvec{s}\right) \varvec{M}^{-1}\\ &{}=&{}\frac{q'}{q}\left[ \varvec{c}_0+\varvec{c}_1\cdot \varvec{s}\right] _q+q'\varvec{u}+\frac{q'}{q}\left( \varvec{\delta }_0+\varvec{\delta }_1\cdot \varvec{s}\right) \varvec{M}^{-1}. \end{array} \end{aligned}$$
(14)

Moreover, since \(\left\| \varvec{\delta }_i\right\| _\infty \leqslant \frac{qt}{2q'}\), we get the following bound \(\left\| \left( \varvec{\delta }_0+\varvec{\delta }_1\cdot \varvec{s}\right) \varvec{M}^{-1}\right\| _\infty \leqslant \delta _\mathcal {R}\frac{qt}{2q'}\left\| \varvec{M}^{-1}\right\| _\infty \left( 1+\delta _\mathcal {R}\Vert \varvec{s}\Vert _\infty \right) \). Thus, from the above and by considering the hypothesis on the norm of \(\left[ \left( \varvec{\widetilde{c}}_0+\varvec{\widetilde{c}}_1\cdot \varvec{s}\right) \varvec{M}^{-1}\right] _q =\left[ \varvec{c}_0+\varvec{c}_1\cdot \varvec{s}\right] _q\), we deduce that:

$$\begin{aligned} \left\| \left( \varvec{\widehat{c}}_0+\varvec{\widehat{c}}_1\cdot \varvec{s}\right) \varvec{M}^{-1}-q'\varvec{u} \right\| _\infty <q'/2 \end{aligned}$$

and then that

$$\begin{aligned} \left( \varvec{\widehat{c}}_0+\varvec{\widehat{c}}_1\cdot \varvec{s}\right) \varvec{M}^{-1}-q'\varvec{u}=\left[ \left( \varvec{\widehat{c}}_0+\varvec{\widehat{c}}_1\cdot \varvec{s}\right) \varvec{M}^{-1}\right] _{q'}. \end{aligned}$$

Hence, from this previous equality and by bounding the norm of last member of (14), we obtain (13). Finally, we get (12) by:

$$\begin{aligned} \begin{array}{rcl} \left[ \left( \varvec{\widehat{c}}_0+\varvec{\widehat{c}}_1\cdot \varvec{s}\right) \varvec{M}^{-1}\right] _{q'}&{}=&{} \left( \varvec{\widehat{c}}_0+\varvec{\widehat{c}}_1\cdot \varvec{s}\right) \varvec{M}^{-1}-q'\varvec{u}\\ &{}=&{} \left( \varvec{\widetilde{c}}_0+\varvec{\widetilde{c}}_1\cdot \varvec{s}\right) /\varvec{M}-q\varvec{u}\bmod t\ (\varvec{\widehat{c}}_i=\varvec{\widetilde{c}}_i\bmod t; q=q'\bmod t)\\ &{}=&{}\varvec{c}_0+\varvec{c}_1\cdot \varvec{s}-q\varvec{u}\bmod t\ (\text {def. of }\varvec{\widetilde{c}})\\ &{}=&{}\left[ \varvec{c}_0+\varvec{c}_1\cdot \varvec{s}\right] _q\bmod t. \end{array} \end{aligned}$$

   \(\square \)

From this lemma, we can see that the Montgomery representation of the ciphertext impacts the modulus switching procedure by the addition of an extra factor \(\delta _\mathcal {R}\Vert \varvec{M}^{-1}\Vert _\infty \) to the last term on the bound of the hypothesis and of (13).

4.3 Overall Impact on Noise Growth

For both BGV and FV, the norm \(\Vert \varvec{M}^{-1}\Vert _\infty \) and the expansion factor \(\delta _\mathcal {R}\) are involved in the noise growth due to the scaling steps performed with the Montgomery reduction. We report some observations concerning the size of the coefficients of \(\varvec{M}^{-1}\) in Appendix B, but they seem to remain small for most of the cases. When m is a power of two, \(\delta _\mathcal {R}\) is equal to n, but for other m it can be larger than that. Let us consider \(\mathcal {F}_m:\mathbb {Q}_{2n-2}[X]\rightarrow \mathbb {Q}[X]/(\varvec{\phi }_m(X))\), so that \(\mathcal {F}_m(\varvec{a}) = \varvec{a}\bmod \varvec{\phi }_m\) for every \(\varvec{a}\in \mathbb {Q}[X]\) of degree lesser than or equal to \(2n-2\).

Lemma 4

Let m be a positive integer and let \(\mathcal {R}= \mathbb {Z}[X]/(\varvec{\phi }_m(X))\), with \(\deg (\varvec{\phi }_m)=n\). If \(\delta _\mathcal {R}\) denotes the expansion factor of the ring \(\mathcal {R}\), then \(\delta _\mathcal {R}\le n \cdot \Vert \mathcal {F}_m \Vert _\infty .\)

These three parameters are given in Table 3 for the different cyclotomic polynomials considered in this paper. Assuming that distributions \(\chi _{key}\) and \(\chi _{err}\) output elements whose infinite norms are bounded by \(B_{key}\) and \(B_{err}=6\sigma _{err}\), we analyse which depth can be reached in a multiplicative tree.

FV: The initial noise of a ciphertext is at most \(V_{init}=B_{err}(1+2\delta B_{key})\) [10]. We recall that \(r_\infty =\tfrac{1+\rho }{2}(1+\delta _\mathcal {R}B_{key})+\delta _R\Vert \varvec{M}\Vert _\infty \). The output of a tree of depth L has a noise bounded by \(C_1^L V + L C_1^{L-1}C_2\) (cf. [6], Lemma 9) with:

$$\begin{aligned} \left\{ \begin{array}{rcl} C_{1}&{}=&{}2\delta _\mathcal {R}t \left( \delta _\mathcal {R}\left\| \varvec{M}^{-1}\right\| _\infty r_\infty +\tfrac{1}{2}\right) +\tfrac{\delta _\mathcal {R}}{2}\\ C_{2}&{}=&{}\delta _\mathcal {R}t |q|_t (r_\infty +1)+\tfrac{1}{2}\delta _\mathcal {R}\left\| \varvec{M}^{-1}\right\| _\infty \left( 1+\delta _\mathcal {R}B_{key} (1 + \delta B_{key})\right) + \tfrac{|q|_t}{2}+1\\ &{} &{}+k(1+\delta _\mathcal {R}B_{key} (1+\delta _\mathcal {R}B_{key})) + \delta _\mathcal {R}k B_{err} 2^{\nu +1} \end{array}\right. \end{aligned}$$

We denote by \(L_{max}=\max \{L\in \mathbb {N}\mid C_{1}^LV+LC_{1}^{L-1}C_{2}\leqslant B_{dec}\}\) the depth allowed by the homomorphic multiplication where \(B_{dec}\) corresponds to the decryption bound given by the full RNS version of FV [3].

BGV: As long as a ciphertext satisfies the condition on Lemma 3, one can perform a scaling operation and thus an homomorphic multiplication. Initially, one has \(\Vert \varvec{c}_0+\varvec{c}_1\varvec{s}\Vert _\infty \leqslant V_{init} = t/2+tB_{err}(2\delta _\mathcal {R}B_{key}+1)\). Let \(\ell _{\omega ,q} = \lceil \log _{\omega } q \rceil \), by assuming that after each scaling, the size of q is reduced by \(\omega \) bits, the growth of the size V of \(\varvec{c}_0+\varvec{c}_1\varvec{s}\), after one multiplication can be expressed by:

$$\begin{aligned} \tfrac{q'}{q}(\delta _\mathcal {R}V^2+B_{relin}) + \tfrac{1}{2}\delta _R t \left\| M^{-1}\right\| _\infty \left( 1+\delta _\mathcal {R}B_{key}\right) \end{aligned}$$

with \(B_{relin}=\tfrac{\delta _\mathcal {R}}{2} \ell _{\omega ,q}\omega B_{err}B_{key}\) the bound on the noise due to the relinearisation.

In Table 3 we present the maximal theoretical depths for BGV and FV with and without the use of the Montgomery reduction. For these computations we have taken parameters \(B_{key}=1\), \(\sigma _{err}=2\sqrt{n}\) and a number k of 62-bits moduli to get the largest size for q ensuring at least 80-bits of security according to [2].

Table 3. Theoretical depths with and without Montgomery reduction. Values in parenthesis are the depths observed in practice.

We notice that the depths of BGV are almost unchanged with the usage of the Montgomery reduction. However for FV the depths are far smaller with a Montgomery representation. These behaviours have been confirmed in practice.

4.4 Mixing Optimized Barrett and Montgomery Reductions

Considering the non-negligible impact of the Montgomery representation on the multiplicative depth of FV, a more robust strategy for this cryptosystem corresponds to a mixed Barrett/Montgomery approach. Algorithm 2 is used during the first stage of homomorphic multiplication, with ciphertexts not exploiting a Montgomery representation. This avoids the noise growth caused by the Montgomery factor. Nonetheless, the Montgomery reduction can still be used during the relinearisation stage, since we have seen that this does not cause a larger noise growth. To obtain a valid result, the relinearisation key needs to be modified, by replacing the factor \(\varvec{M}^2\) of the Montgomery approach by \(\varvec{M}\).

5 Experimental Results

The proposed methods for polynomial reduction have been implemented using C++, and compiled with gcc using the optimization flag -O3. All the experimental results presented herein were measured on an i7-5960X, running at 3.0 GHz with 32 GB of main memory. No parallelism was exploited.

In Fig. 1, one can find the execution timings of polynomial reduction, using NFL for power-of-two cyclotomics [1]; Mayer’s implementation of \(CRT^{-1}\) operator exploiting a tensored representation [24]; the unoptimized and optimized Barrett reductions and the Montgomery reduction for non-power-of-two cyclotomics. In order to highlight the gain brought by our algorithms compared to generic ones we also compare with NTL’s reduction using preconditioning [15]. All timings were normalized based on the number of batching slots \(\ell \), and executed for a single modulus of 62-bits. One finds the straightforward application of Barrett reduction, to be more efficient than the preconditioned methods employed in the NTL library. Moreover, speed-ups up to 1.95 and 2.55 were achieved for the optimized Barrett and Montgomery algorithms when compared with the unoptimized Barrett reduction. The figure suggests that using power-of-two cyclotomics is not scalable with respect to the throughput. In contrast, the remaining approaches present very little variation when considering the execution timing per batching slot for the different m given in Table 3. It should be noted that using larger values of m enables FHE parameters with a larger multiplicative depth. Finally, while tensored representations natively support operations on cyclotomic rings, research on its algorithmic efficiency is still in its infancy, and hence a fair comparison is not possible.

Fig. 1.
figure 1

Execution time per batching slot \(T/\ell [\mu s]\) for multiple reduction strategies and mth cyclotomic polynomials. The y-axis is in logarithmic scale

The aforementioned reduction methods were used to implement the homomorphic multiplication operations of the FV and BGV schemes. One can find in Figs. 2 and 3 the execution times of the homomorphic multiplication of two freshly encrypted ciphertexts for FV and BGV, respectively with the parameters given in Table 3. The experimental results for NFL are omitted due to its low performance with respect to the timing per batching slot.

Unlike with Fig. 1, the execution time of homomorphic multiplication increases significantly with increasing m. This trend is explained by the relinearisation procedure, which requires a number of \(\text {NTT}\)s that increases quadratically with \(\log _2 q\). Nevertheless, the employed reduction procedure plays a preponderant role in the efficiency of the homomorphic multiplication. Indeed, one achieves speed-ups up to 1.37 and 1.24 when comparing the homomorphic multiplication exploiting the optimized Barrett reduction with the one exploiting the unoptimized Barrett method for the FV and BGV schemes, respectively. Since with the mixed Barrett/Montgomery approach, required by the FV scheme, one is not able to fully take advantage of the gains brought forth by the Montgomery representation, one achieves speed-ups similar to those of the optimized Barrett reduction. In contrast, for BGV, one can exploit the Montgomery representation throughout the whole procedure, leading to speed-ups up to 1.32.

Fig. 2.
figure 2

Execution time per batching slot \(T/\ell [\mu s]\) for the homomorphic multiplication operation of FV with several reduction strategies and mth cyclotomic polynomials.

Fig. 3.
figure 3

Execution time per batching slot \(T/\ell [\mu s]\) for the homomorphic multiplication operation of BGV with several reduction strategies and mth cyclotomic polynomials.

The speed-up of the proposed methods decreases as the degree n of the cyclotomic, and thus \(\log _2 q\) gets larger due to the increasing complexity of the relinearisation procedure. This suggests that they are most beneficial when one needs to homomorphically evaluate small circuits. Since most of practical applications of FHE [13, 18, 26] have circuits with small depth, the proposed methods can potentially have a wide range of applicability.

6 Conclusion

In this paper, the arithmetic of non-power-of-two cyclotomics has been considered and improved. Two methods for polynomial reduction have been proposed: one based on the Barrett reduction and the other on a Montgomery representation. The optimized Barrett algorithm does not offer a better computational complexity than the Montgomery reduction. However, since it does not require changes in the representation of ciphertexts, it provides for a slower noise growth, making it more amenable to application in the FV homomorphic scheme than the Montgomery approach. In contrast, the Montgomery approach is more suitable for the BGV scheme. Speed-ups up to 1.95 and 2.55 have been obtained on an i7-5960X when comparing the optimized Barrett and Montgomery reductions with the unoptimized Barrett reduction, respectively. Finally, the polynomial reductions have been incorporated into the homomorphic multiplication procedures of FV and BGV, producing speed-ups up to 1.37.