
1 Introduction

Anomalous binary curves, generally referred to as Koblitz curves, are binary elliptic curves satisfying the Weierstrass equation, \(E_a:y^2+xy = x^3 + ax^2+1\), with \(a\in \{0,1\}\). Since their introduction in 1991 by Koblitz [21], these curves have been extensively studied for their additional structure that allows, in principle, a performance speedup in the computation of the elliptic curve point multiplication operation. As of today, the research works dealing with standardized Koblitz curves in commercial use, such as the binary curves standardized by NIST [23] or the suite of elliptic curves supported by the TLS protocol [4, 9], have exclusively analyzed the security and performance of curves defined over binary extension fields \(\mathbb {F}_{2^m}\), with m a prime number (for recent examples see [1, 5, 32, 36]).

Nevertheless, Koblitz curves defined over \(\mathbb {F}_4\) were also proposed in [21]. We find it interesting to explore the cryptographic usage of Koblitz curves defined over \(\mathbb {F}_4\) due to their inherent usage of quadratic field arithmetic. Indeed, it has been recently shown [3, 25] that quadratic field arithmetic is extraordinarily efficient when implemented in software. This is because one can take full advantage of the Single Instruction Multiple Data (SIMD) paradigm, where a vector instruction performs the same operation simultaneously on a set of input data items.

A quadratic extension \(\mathbb {F}_{q^{2}}\) of a binary finite field \(\mathbb {F}_{q}\) can be defined by means of a monic degree-two polynomial h(u) irreducible over \(\mathbb {F}_{q}\). The field \(\mathbb {F}_{q^{2}}\) is isomorphic to \(\mathbb {F}_{q}[u]/(h(u))\) and its elements can be represented as \(a_0 + a_1u\), with \(a_0, a_1 \in \mathbb {F}_{q}\). The addition of two elements \(a, b \in \mathbb {F}_{q^2}\) can be performed as \(c = (a_0 + b_0) + (a_1 + b_1)u.\) By choosing \(h(u) = u^2 + u +1,\) the multiplication of a and b can be computed as \(d = a_0b_0 + a_1b_1 + ((a_0 + a_1)\cdot (b_0 + b_1) + a_0b_0)u.\) By carefully organizing the code associated with these arithmetic operations, one can greatly exploit the pipelines and the inherent instruction-level parallelism available in contemporary high-end processors.
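To make these formulas concrete, the following minimal C sketch (illustrative only, not code from the paper's library) instantiates them in the smallest possible case, q = 2, so that \(\mathbb {F}_{q^2} = \mathbb {F}_4 = \mathbb {F}_2[u]/(u^2+u+1)\); in \(\mathbb {F}_2\) addition is XOR and multiplication is AND, and the same add/mul formulas carry over unchanged when \(a_0, a_1\) live in a larger binary field.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy instantiation of the quadratic-extension formulas above with q = 2.
 * An element a0 + a1*u is stored as {c0, c1}, with c0, c1 in {0, 1}.      */
typedef struct { uint8_t c0, c1; } fq2;

static fq2 fq2_add(fq2 a, fq2 b) {          /* (a0 + b0) + (a1 + b1)u      */
    return (fq2){ (uint8_t)(a.c0 ^ b.c0), (uint8_t)(a.c1 ^ b.c1) };
}

static fq2 fq2_mul(fq2 a, fq2 b) {          /* Karatsuba-style, u^2 = u+1  */
    uint8_t a0b0 = a.c0 & b.c0;
    uint8_t a1b1 = a.c1 & b.c1;
    uint8_t mid  = (a.c0 ^ a.c1) & (b.c0 ^ b.c1);
    return (fq2){ (uint8_t)(a0b0 ^ a1b1), (uint8_t)(mid ^ a0b0) };
}

int main(void) {
    fq2 u = {0, 1};                         /* the element u (gamma)       */
    fq2 s = fq2_mul(u, u);                  /* u^2 must equal u + 1        */
    printf("u^2 = %d + %d*u\n", s.c0, s.c1);/* prints: u^2 = 1 + 1*u       */
    (void)fq2_add;
    return 0;
}
```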

Our Contributions. In this work we designed, for the first time, a 128-bit secure and timing-attack-resistant scalar multiplication on a Koblitz curve defined over \(\mathbb {F}_4,\) as proposed by Koblitz in his seminal 1991 paper [21]. We developed all the algorithms required for performing such a computation. This led us to reconsider the strategy of using redundant trinomials (also known as almost irreducible trinomials), which were proposed more than ten years ago in [6, 10]. We also present what is perhaps the most comprehensive analysis to date of how to efficiently implement arithmetic operations in binary finite fields and their quadratic extensions using the vectorized instructions available in high-end microprocessors. For example, to the best of our knowledge, we report for the first time a 128-bit AVX implementation of the linear pass technique, which is useful against side-channel attacks.

The remainder of this paper is organized as follows. In Sect. 2 we formally introduce the family of Koblitz elliptic curves defined over \(\mathbb {F}_4.\) In Sects. 3 and 4 we give a detailed description of the efficient implementation of the base and quadratic field arithmetic using vectorized instructions. In Sect. 5 we present the scalar multiplication algorithms used in this work, and in Sect. 6 we analyze and discuss the results obtained with our software library. Finally, we draw our concluding remarks and outline future work in Sect. 7.

2 Koblitz Curves over \(\mathbb {F}_4\)

Koblitz curves over \(\mathbb {F}_4\) are defined by the following equation

$$\begin{aligned} E_a:y^2 + xy = x^3 + a \gamma x^2 + \gamma , \end{aligned}$$
(1)

where \(\gamma \in \mathbb {F}_{2^2}\) satisfies \(\gamma ^2 = \gamma + 1\) and \(a \in \{0,1\}\). Note that the numbers of points on the curves \(E_0(\mathbb {F}_4)\) and \(E_1(\mathbb {F}_4)\) are \(\#E_0(\mathbb {F}_{4}) = 4\) and \(\#E_1(\mathbb {F}_{4}) = 6,\) respectively. For cryptographic purposes, one uses Eq. (1) over extension fields of the form \(\mathbb {F}_{q},\) with \(q = 4^m\) and m a prime number. The set of affine points \(P = (x, y) \in \mathbb {F}_{q} \times \mathbb {F}_{q}\) that satisfy Eq. (1), together with a point at infinity represented as \(\mathcal {O}\), forms an abelian group denoted by \(E_a(\mathbb {F}_{4^m}),\) whose group law is defined by the point addition operation.

Since for each proper divisor l of k, \(E(\mathbb {F}_{4^l})\) is a subgroup of \(E(\mathbb {F}_{4^k}),\) one has that \(\#E(\mathbb {F}_{4^l})\) divides \(\#E(\mathbb {F}_{4^k})\). Furthermore, by choosing prime extensions m, it is possible to find \(E_a(\mathbb {F}_{4^m})\) with almost-prime order, for instance, \(E_0(\mathbb {F}_{2^{2 \cdot 163}})\) and \(E_1(\mathbb {F}_{2^{2 \cdot 167}})\). In the remainder of this paper, we will show that the aforementioned strategy can be used for the efficient implementation of a 128-bit secure scalar multiplication on software platforms equipped with native 64-bit carry-less multipliers, such as the ones available in contemporary personal desktops.

The Frobenius map \(\tau : E_a(\mathbb {F}_{q}) \rightarrow E_a(\mathbb {F}_{q})\) defined by \(\tau (\mathcal {O}) = \mathcal {O},\) \(\tau (x,y) = (x^4,y^4),\) is a curve automorphism satisfying \((\tau ^2 + 4)P = \mu \tau (P)\) for \(\mu =(-1)^{a}\) and all \({P \in E_a(\mathbb {F}_{q})}\). By solving the equation \(\tau ^2+4 = \mu \tau \), the Frobenius map can be seen as the complex number \(\tau = (\mu \pm \sqrt{-15})/2\).

2.1 The \(\tau \)-adic Representation

Consider a Koblitz curve \(E_a/\mathbb {F}_{2^{2m}}\) with group order \(\#E_a(\mathbb {F}_{2^{2m}})\) = \(h \cdot \rho \cdot r\), where h is the order \(\#E_a(\mathbb {F}_4)\), r is the prime order of our subgroup of interest, and \(\rho \) is the order of a subgroup of no cryptographic interest. We can express a scalar \(k \in \mathbb {Z}_r\) as an element of \(\mathbb {Z}[\tau ]\) using the now-classical partial reduction introduced by Solinas [31], with a few modifications. The modified version is based on the fact that \(\tau ^2 = \mu \tau - 4\).

Given that the norm of \(\tau \) is \(N(\tau ) = 4\), \(N(\tau -1) = h\), \(N(\tau ^m - 1) = h \cdot \rho \cdot r\) and \(N((\tau ^m - 1)/(\tau - 1)) = \rho \cdot r\), the subscalars \(r_0\) and \(r_1\) resulting from the partial reduction will both be of size approximately \(\sqrt{\rho \cdot r}\). As a consequence, the corresponding scalar multiplication will need more iterations than expected, since it will account for the order \(\rho \) of a subgroup which is not of cryptographic interest.

For that reason, we took the design decision of considering that the input scalar of our point multiplication algorithm is already given in the \(\mathbb {Z}[\tau ]\) domain. As a result, a partial reduction of the scalar k is no longer required, and the number of iterations in the point multiplication will be consistent with the size of the scalar k. If one needs to retrieve the equivalent value of the scalar k in the ring \(\mathbb {Z}_r\), this can be easily computed with one multiplication and one addition in \(\mathbb {Z}_r\). This strategy is in line with the degree-2 scalar decomposition method used in the context of GLS curves, as suggested in [12].

2.2 The Width-w \(\tau \)NAF Form

Assuming that the scalar k is specified in the \(\mathbb {Z}[\tau ]\) domain, one can represent it in the regular width-w \(\tau \)NAF form as shown in Algorithm 1. The length of the width-w \(\tau \)NAF representation of an element \(k \in \mathbb {Z}[\tau ]\) is discussed in [30].

[Algorithm 1 (figure not reproduced)]

Given a width w, after running Algorithm 1, we have \(2^{2(w-1)-1}\) different digits. As a result, it is necessary to be more conservative when choosing the width w than for Koblitz curves defined over \(\mathbb {F}_2\). For widths \(w = 2, 3, 4, 5\) we have to pre- or post-compute 2, 8, 32 and 128 points, respectively. For the 128-bit point multiplication, we estimated that the value of the width w must be at most four; otherwise, the cost of the point pre/post-processing is greater than the addition savings obtained in the main iteration.

Table 1. Representations of \(\alpha _v = v \bmod {\tau ^w}\), for \(w \in \{2, 3, 4\}\) and \(a = 1\), and the operations required for computing \(\alpha _v\). Here we denote by D, FA, MA and T the point doubling, full addition, mixed addition and the Frobenius map, respectively. In addition, we consider that the point \(\alpha _1P\) is represented in affine coordinates. The order for computing the points is given in Roman numerals

In addition, we must find efficient expressions for \(\alpha _v = v \bmod {\tau ^w}\). The method for searching for the best expressions in Koblitz curves over \(\mathbb {F}_2\) [33] cannot be directly applied in the \(\mathbb {F}_4\) case. Because of this, we manually derived \(\alpha _v\) representations for \(w \in \{2, 3, 4\}\) and \(a = 1\), which are our implementation parameters. The main rationale for our representation choices was to minimize the number of field arithmetic operations. In practice, we strive to reduce the required number of full point additions by increasing the number of point doublings and mixed additions, which have a cheaper computational cost. In Table 1 we present the \(\alpha _v\) representatives along with the operations required to generate the multiples of the base point.

Therefore, one point doubling and one full addition are required to generate the points \(\alpha _v \cdot P\) for \(w = 2\); one point doubling, four full additions, three mixed additions and four applications of the Frobenius map for the \(w = 3\) case; and one point doubling, twenty full additions, eleven mixed additions and five applications of the Frobenius map for the \(w = 4\) case.

2.3 Security of the Koblitz Curves Defined over \(\mathbb {F}_4\)

Since the Koblitz curves \(E_a(\mathbb {F}_{4^m})\) are defined over quadratic extension fields, it is conceivable that Weil descent attacks [13, 16] could be applied efficiently to these curves. However, Menezes and Qu showed in [22] that the GHS attack cannot be implemented efficiently for elliptic curves defined over binary extension fields \(\mathbb {F}_{q},\) with \(q=2^m\) and m a prime number in \([160, \dots , 600].\) Further, a specialized analysis for binary curves defined over fields of the form \(\mathbb {F}_{4^m}\), reported in [14], proved that the only vulnerable prime extension in the range \([80, \dots , 256]\) is \(m = 127\). Therefore, the prime extension used in this work, namely \(m = 149,\) is considered safe with respect to the state-of-the-art knowledge of the Weil descent attack classes.

For a comprehensive survey of recent progress in the computation of the elliptic curve discrete logarithm problem in characteristic two, the reader is referred to the paper by Galbraith and Gaudry [11].

3 Base Field Arithmetic

In this section, we present the techniques used in our work to implement the binary field arithmetic. We selected a Koblitz curve with the parameter \(a = 1\) defined over \(\mathbb {F}_{4^m}\) with \(m = 149\). This curve was chosen because its subgroup of interest has an order of approximately \(2^{254}\), which yields a security level of roughly 128 bits for the scalar multiplication.

3.1 Modular Reduction

One can construct a binary extension field \(\mathbb {F}_{2^m}\) by taking a polynomial \(f(x) \in \mathbb {F}_2[x]\) of degree m which is irreducible over \(\mathbb {F}_2\). It is very important that the form of the polynomial f(x) admits an efficient modular reduction. The criteria for selecting f(x) depend on the architecture where the implementation will be executed, as extensively discussed in [29].

For our field extension choice, there are no degree-149 trinomials irreducible over \(\mathbb {F}_2\). An alternative solution is to construct the field through irreducible pentanomials. Given an irreducible pentanomial \(f(x) = x^m + x^a + x^b + x^c + 1\), the efficiency of the shift-and-add reduction method depends mostly on whether the term-degree differences \(m - a\), \(m - b\) and \(m - c\) are all equal to 0 modulo W, where W is the architecture word size in bits.

Using the terminology of [29], lucky irreducible pentanomials are the ones where the three previously mentioned differences are equal to 0 modulo W. Fortunate irreducible pentanomials are the ones where two out of the three above differences are equal to 0 modulo W. The remaining cases are called ordinary irreducible pentanomials. Performing an extensive search with \(W = 8\), we found no lucky pentanomials, 189 fortunate pentanomials and 9491 ordinary pentanomials for the extension \(m = 149\).

The problem is that fortunate pentanomials make the modular reduction too costly when compared with the field multiplication computed with carry-less instructions. This is because we need to perform four shift-and-add operations per reduction step. Moreover, two of those operations require costly shift instructions, since their shift amounts are not divisible by 8.

3.2 Redundant Trinomials

As a consequence of the above analysis, we resorted to the redundant trinomial strategy introduced in [6, 10], also known as almost irreducible trinomials. Given a reducible trinomial g(x) of degree n that has an irreducible factor f(x) of degree \(m < n\), the idea is to perform the field reduction modulo g(x) throughout the scalar multiplication and, at the end of the algorithm, reduce the polynomials so obtained modulo f(x). In a nutshell, throughout the algorithm we represent the base field elements as polynomials in the ring \(\mathbb {F}_2[x]\) reduced modulo g(x). At the end of the algorithm, the elements are reduced modulo f(x) in order to bring them back to the target field \(\mathbb {F}_{2^{149}}\). For the sake of simplicity, throughout this paper, we will refer to these elements as field elements.

Since our target software platform is equipped with a native 64-bit carry-less multiplier, an efficient representation of the field elements must fit in at most 192 bits, i.e., three 64-bit words. For that reason, we searched for redundant trinomials of degree at most 192.

We selected the trinomial \(g(x) = x^{192} + x^{19} + 1\) for two reasons. First, since our target architecture contains 128-bit vectorized registers, the difference between the degrees of its two most significant terms, \(192 - 19 > 128\), allows us to perform the shift-and-add reduction in just two steps. Second, its degree satisfies \(192 \bmod {64} = 0\), which allows us to perform the first part of the shift-and-add reduction efficiently. The steps to perform the modular reduction are described in Algorithm 2. The reduction using 128-bit registers is presented in Sect. 4, where we discuss our strategy for implementing the arithmetic in the quadratic field extension.

[Algorithm 2 (figure not reproduced)]
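For reference, the following plain-C sketch illustrates the folding \(x^{192} \equiv x^{19} + 1\) that the shift-and-add reduction modulo g(x) performs; it is a generic word-by-word version under our own scheduling assumptions, not the exact operation schedule of Algorithm 2.

```c
#include <stdint.h>

/* Reduce a 384-bit polynomial c(x) (the product of two 192-bit operands),
 * stored as six 64-bit words c[0..5] with c[0] the least significant,
 * modulo g(x) = x^192 + x^19 + 1, by folding with x^192 = x^19 + 1.       */
static void reduce_mod_g(uint64_t c[6]) {
    for (int i = 5; i >= 3; i--) {
        uint64_t t = c[i];
        c[i - 3] ^= t;            /* contribution of the "+ 1" term        */
        c[i - 3] ^= t << 19;      /* low part of the "x^19" term           */
        c[i - 2] ^= t >> 45;      /* high part of the "x^19" term (64-19)  */
        c[i] = 0;                 /* word fully folded                     */
    }
    /* The reduced 192-bit polynomial now lives in c[0..2]; note that the
     * spill produced while folding c[5] lands in c[3] and is folded in
     * turn by the last loop iteration.                                    */
}
```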

The overall cost of the modular reduction is ten xors and five bitwise shifts. At the end of the scalar multiplication, we have to reduce the 192-bit polynomial to an element of the field \(\mathbb {F}_{2^{149}}\). Note that the trinomial \(g(x) = x^{192} + x^{19} + 1\) has among its irreducible factors a 69-term polynomial f(x) of degree 149.

The final reduction is performed via the mul-and-add reduction which, experimentally, performed more efficiently than the shift-and-add reduction. Concisely, the mul-and-add technique consists of a series of steps that include shift operations (in order to align the bits in the registers), carry-less multiplications and xor operations for eliminating the extra bits.

The basic mul-and-add step is described in Algorithm 3. Here, besides the usual notation, we represent the 64-bit carry-less multiplication by the symbol \(\times _{ij}\), where \(i,j \in \{L, H\}\), with L and H representing the lowest and highest 64-bit word packed in a 128-bit register, respectively. For example, if one wants to multiply the lowest 64-bit word of the 128-bit register A by the highest 64-bit word of the 128-bit register B, this operation is expressed as \(T \leftarrow A \times _{LH} B\).

[Algorithm 3 (figure not reproduced)]

Algorithm 3 requires four xors, three bitwise shifts and three carry-less multiplications. In our particular case, the difference between the degrees of the two most significant monomials of f(x) is three. Also, note that we need to reduce 43 bits (191 − 148). As a result, \(\lceil \frac{43}{3} \rceil = 15\) applications of Algorithm 3 are required to conclude this reduction.

4 Quadratic Field Arithmetic

In this section, the basic arithmetic operations in the quadratic field are presented. As usual, the quadratic field \(\mathbb {F}_{2^{2 \cdot 149}}\) was constructed by means of the degree-two monic polynomial \(h(u) = u^2 + u + 1\), and its elements are represented as \(a_0 + a_1u\), with \(a_0, a_1 \in \mathbb {F}_{2^{149}}\).

4.1 Register Allocation

The first aspect to be considered is the allocation of the elements into the architecture's available registers. In our case, we have to store two 192-bit polynomials into 128-bit registers in such a way that an efficient modular reduction is possible and, at the same time, minimal overhead is generated in the two main arithmetic operations, namely, multiplication and squaring.

Let us consider an element \(a = (a_0 + a_1u) \in \mathbb {F}_{2^{2 \cdot 149}}\), where \(a_0 = C \cdot x^{128} + B \cdot x^{64} + A\) and \(a_1 = F \cdot x^{128} + E \cdot x^{64} + D\) are 192-bit polynomials, each of them stored into three 64-bit words (A-C, D-F). Also, let us have three 128-bit registers \(R_i\), with \(i \in \{0, 1, 2\}\), which can store two 64-bit words each. In this section, we adopt the following notation: given a 128-bit register R, its most and least significant packed 64-bit words, denoted respectively by S and T, are represented as \(R = S|T\). The first option is to arrange the 384-bit element \(a = (a_0 + a_1u)\) as,

$$R_0 = A | B, \quad R_1 = C | D, \quad R_2 = E | F.$$

The problem with this representation is that a significant overhead is generated in the multiplication function, more specifically, in the pre-computation phase of the Karatsuba procedure (cf. Sect. 4.2 with the computation of \(V_{0,1}\), \(V_{0,2}\) and \(V_{1,2}\)). Besides, in order to efficiently perform the subsequent reduction phase, we must temporarily store the polynomial terms into four 128-bit vectors, which can cause register spilling. A better method for storing the element a is to use the following arrangement,

$$R_0 = D | A, \quad R_1 = E | B, \quad R_2 = F | C.$$

Using this arrangement, there still exists some overhead in the multiplication and squaring operations, even though the penalty on the latter is almost negligible. On the positive side, the terms of the elements \(a_0, a_1\) do not need to be rearranged, and the modular reduction of these two base field elements can be performed in parallel, as discussed next.
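The interleaved layout can be illustrated with a short SSE sketch (an illustrative helper of our own, not the library's actual loading code), where _mm_set_epi64x packs the high and low 64-bit words of each register:

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2: __m128i, _mm_set_epi64x */

/* Pack an element a = a0 + a1*u, with a0 = (A, B, C) and a1 = (D, E, F)
 * given as 64-bit words (least significant first), into the interleaved
 * layout R0 = D|A, R1 = E|B, R2 = F|C used for parallel reduction of a0
 * and a1.                                                                 */
typedef struct { __m128i r0, r1, r2; } fq2_elem;

static fq2_elem fq2_pack(const uint64_t a0[3], const uint64_t a1[3]) {
    fq2_elem e;
    e.r0 = _mm_set_epi64x((int64_t)a1[0], (int64_t)a0[0]);  /* D | A */
    e.r1 = _mm_set_epi64x((int64_t)a1[1], (int64_t)a0[1]);  /* E | B */
    e.r2 = _mm_set_epi64x((int64_t)a1[2], (int64_t)a0[2]);  /* F | C */
    return e;
}
```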

4.2 Multiplication

Given two \(\mathbb {F}_{2^{2 \cdot 149}}\) elements \(a = (a_0 + a_1u)\) and \(b = (b_0 + b_1u)\), with \(a_0, a_1, b_0, b_1\) in \(\mathbb {F}_{2^{149}}\), we perform the multiplication \(c = a \cdot b\) as,

$$\begin{aligned} c = a \cdot b&= (a_0 + a_1u) \cdot (b_0 + b_1u)\\&= (a_0 b_0 \oplus a_1b_1) + (a_0b_0 \oplus (a_0 \oplus a_1)\cdot (b_0 \oplus b_1))u, \end{aligned}$$

where each element \(a_i, b_i \in \mathbb {F}_{2^{149}}\) is composed of three 64-bit words. The analysis of the Karatsuba algorithm cost for different word sizes was presented in [35]. There, it was shown that the most efficient way to multiply two polynomials of three 64-bit words each, \(s(x) = s_2x^2 + s_1x + s_0\) and \(t(x) = t_2x^2 + t_1x + t_0\), as \(v(x) = s(x) \cdot t(x)\), is through the one-level Karatsuba method,

$$V_0 = s_0 \cdot t_0, V_1 = s_1 \cdot t_1, V_2 = s_2 \cdot t_2,$$
$$V_{0,1} = (s_0 \oplus s_1) \cdot (t_0 \oplus t_1), \quad V_{0,2} = (s_0 \oplus s_2) \cdot (t_0 \oplus t_2), \quad V_{1,2} = (s_1 \oplus s_2) \cdot (t_1 \oplus t_2),$$
$$v(x) = V_2 \cdot x^4 + (V_{1,2} \oplus V_1 \oplus V_2) \cdot x^3 + (V_{0,2} \oplus V_0 \oplus V_1 \oplus V_2) \cdot x^2 + (V_{0,1} \oplus V_0 \oplus V_1) \cdot x + V_0,$$

which costs six multiplications and twelve additions. The Karatsuba algorithm as used in this work is presented in Algorithm 4.

[Algorithm 4 (figure not reproduced)]

Algorithm 4 requires six carry-less instructions, six vectorized xors and three bitwise shift instructions. In order to calculate the total multiplication cost, it is necessary to include the Karatsuba pre-computation operations at the base field level (twelve vectorized xors and six byte interleaving instructions) and at the quadratic field level (six vectorized xors). Also, we must consider the reorganization of the registers in order to proceed with the modular reduction (six vectorized xors).
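As a portable cross-check of the one-level Karatsuba formula above, the following C sketch multiplies two 192-bit polynomials with six 64-bit carry-less multiplications. It is illustrative only: it uses a bit-by-bit software carry-less multiply in place of pclmulqdq and none of the vectorization of Algorithm 4.

```c
#include <stdint.h>

/* Software 64x64 -> 128-bit carry-less multiplication: r[1]:r[0] = a (*) b */
static void clmul64(uint64_t a, uint64_t b, uint64_t r[2]) {
    r[0] = 0; r[1] = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1ULL) {
            r[0] ^= a << i;
            if (i > 0) r[1] ^= a >> (64 - i);
        }
    }
}

/* One-level Karatsuba: v(x) = s(x) * t(x), where s and t are 192-bit
 * polynomials stored as three 64-bit words (least significant first)
 * and v is the 384-bit product stored as six words.                       */
static void mul192_karatsuba(const uint64_t s[3], const uint64_t t[3],
                             uint64_t v[6]) {
    uint64_t V0[2], V1[2], V2[2], V01[2], V02[2], V12[2];
    clmul64(s[0], t[0], V0);
    clmul64(s[1], t[1], V1);
    clmul64(s[2], t[2], V2);
    clmul64(s[0] ^ s[1], t[0] ^ t[1], V01);
    clmul64(s[0] ^ s[2], t[0] ^ t[2], V02);
    clmul64(s[1] ^ s[2], t[1] ^ t[2], V12);
    /* Recombination with x = 2^64:
     * v = V2*x^4 + (V12^V1^V2)*x^3 + (V02^V0^V1^V2)*x^2 + (V01^V0^V1)*x + V0 */
    for (int i = 0; i < 6; i++) v[i] = 0;
    v[0] ^= V0[0];
    v[1] ^= V0[1];
    v[1] ^= V01[0] ^ V0[0] ^ V1[0];
    v[2] ^= V01[1] ^ V0[1] ^ V1[1];
    v[2] ^= V02[0] ^ V0[0] ^ V1[0] ^ V2[0];
    v[3] ^= V02[1] ^ V0[1] ^ V1[1] ^ V2[1];
    v[3] ^= V12[0] ^ V1[0] ^ V2[0];
    v[4] ^= V12[1] ^ V1[1] ^ V2[1];
    v[4] ^= V2[0];
    v[5] ^= V2[1];
}
```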

4.3 Modular Reduction

The modular reduction of an element \(a = (a_0 + a_1u)\), where \(a_0\) and \(a_1\) are 384-bit polynomials, takes nine vectorized xors and six bitwise shifts. The computational savings of the previously discussed register configuration can be seen when we compare the reduction of quadratic field elements, presented in Algorithm 5, with the modular reduction of the base field elements (see Algorithm 2). The cost of reducing an element in \(\mathbb {F}_{2^{149}}\) in 64-bit registers is about the same as the cost of reducing an element in \(\mathbb {F}_{2^{2 \cdot 149}}\) stored into 128-bit registers. Thus, we achieved a valuable twofold (100 %) speedup.

[Algorithm 5 (figure not reproduced)]

4.4 Squaring

Squaring is a very important function in the Koblitz curve point multiplication algorithm, since it is the building block for computing the \(\tau \) endomorphism. In our implementation, we computed the squaring operation through carry-less multiplication instructions which, experimentally, proved less expensive than the bit-interleaving method (see [15, Sect. 2.3.4]). The pre-processing phase is straightforward: we just need to rearrange the 32-bit packed words of the 128-bit registers in order to prepare them for the subsequent modular reduction.

The pre- and post-processing phases require three shuffle instructions, three vectorized xors and three bitwise shifts. The complete function is described in Algorithm 6. Given 128-bit registers \(R_i\), we depict the SSE 32-bit shuffle operation as \(R_0 \leftarrow R_1 \between xxxx\). For instance, computing \(R_0 \leftarrow R_1 \between 3210\) maintains the 32-bit word order of the register \(R_1\); in other words, it just copies \(R_1\) to \(R_0\). The operation \(R_0 \leftarrow R_1 \between 2103\) rotates the register \(R_1\) to the left by 32 bits. See [17, 18] for more details.

4.5 Inversion

The inversion operation is computed via the Itoh-Tsujii method [19]. Given an element \(c \in \mathbb {F}_{2^m}\), we compute \(c^{-1} = c^{(2^{m-1}-1) \cdot 2}\) through an addition chain, which in each step computes the terms \((c^{2^i-1})^{2^j} \cdot c^{2^j-1}\) with \(0 \le j \le i \le m-1\). For the case \(m = 149\), the following chain is used,

$$1 \rightarrow 2 \rightarrow 4 \rightarrow 8 \rightarrow 16 \rightarrow 32 \rightarrow 33 \rightarrow 66 \rightarrow 74 \rightarrow 148.$$

This addition chain is optimal and was found through the procedure described in [7]. Note that although we compute the inversion operation over polynomials in \(\mathbb {F}_2[x]\) (reduced modulo \(g(x) = x^{192} + x^{19} + 1\)), we still have to perform the addition chain with \(m = 149\), since we are in fact interested in the embedded \(\mathbb {F}_{2^{149}}\) field element.

As previously discussed, in each step of the addition chain, we must calculate an exponentiation \(c^{2^j}\) followed by a multiplication, where the value j runs over the integers that form the addition chain. Experimentally, we found that when \(j \ge 4\), it is cheaper to compute the exponentiation through table look-ups instead of performing consecutive squarings. Our pre-computed tables process four bits per iteration; therefore, \(\lceil \frac{192}{4} \rceil = 48\) table queries are required to complete the multisquaring function.
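A sketch of an inversion routine following this addition chain is given below. The base-field primitives fb_mul, fb_sqr_n (n successive squarings, or a multisquaring table for large n) and fb_copy are assumed helpers with hypothetical names of our own, not the paper's API.

```c
#include <stdint.h>

typedef uint64_t fb_t[3];                     /* 192-bit polynomial, 3 words */
void fb_mul(fb_t r, const fb_t a, const fb_t b);    /* r = a * b  (assumed) */
void fb_sqr_n(fb_t r, const fb_t a, unsigned n);    /* r = a^(2^n) (assumed)*/
void fb_copy(fb_t r, const fb_t a);                  /* r = a      (assumed) */

/* Itoh-Tsujii inversion for m = 149 following the chain
 * 1 -> 2 -> 4 -> 8 -> 16 -> 32 -> 33 -> 66 -> 74 -> 148.
 * beta_i denotes c^(2^i - 1); the chain uses beta_{i+j} = (beta_i)^(2^j) * beta_j,
 * and finally r = c^(-1) = (beta_148)^2 = c^(2^149 - 2).                   */
void fb_inv_itoh_tsujii(fb_t r, const fb_t c) {
    fb_t b1, b2, b4, b8, b16, b32, b33, b66, b74, b148, t;

    fb_copy(b1, c);                                    /* beta_1            */
    fb_sqr_n(t, b1, 1);   fb_mul(b2,  t, b1);          /* beta_2  = 1 + 1   */
    fb_sqr_n(t, b2, 2);   fb_mul(b4,  t, b2);          /* beta_4  = 2 + 2   */
    fb_sqr_n(t, b4, 4);   fb_mul(b8,  t, b4);          /* beta_8  = 4 + 4   */
    fb_sqr_n(t, b8, 8);   fb_mul(b16, t, b8);          /* beta_16 = 8 + 8   */
    fb_sqr_n(t, b16, 16); fb_mul(b32, t, b16);         /* beta_32 = 16 + 16 */
    fb_sqr_n(t, b32, 1);  fb_mul(b33, t, b1);          /* beta_33 = 32 + 1  */
    fb_sqr_n(t, b33, 33); fb_mul(b66, t, b33);         /* beta_66 = 33 + 33 */
    fb_sqr_n(t, b66, 8);  fb_mul(b74, t, b8);          /* beta_74 = 66 + 8  */
    fb_sqr_n(t, b74, 74); fb_mul(b148, t, b74);        /* beta_148 = 74 + 74*/
    fb_sqr_n(r, b148, 1);                              /* final squaring    */
}
```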

5 \(\tau \)-and-add Scalar Multiplication

In this Section we discuss the single-core algorithms that compute a timing-resistant scalar multiplication through the \(\tau \)-and-add method over Koblitz curves defined over \(\mathbb {F}_4\). There are two basic approaches, the right-to-left and the left-to-right algorithms.

5.1 Left-to-right \(\tau \)-and-add

This algorithm is similar to the traditional left-to-right double-and-add method. Here, the point doubling operation is replaced by the computationally cheaper \(\tau \) endomorphism. In addition, we need to compute the width-w \(\tau \)NAF representation of the scalar k and perform linear passes (cf. Sect. 5.3) on the accumulators in order to avoid cache attacks [26, 34]. The method is shown in Algorithm 7.

[Algorithm 7 (figure not reproduced)]

The main advantage of this method is that the sensitive data is only indirectly present in the points \(P_{v_i}\). However, those points are only read and then added to the single accumulator Q. As a consequence, only one linear pass per iteration is required, before reading \(P_{v_i}\). On the other hand, the operation \(\tau ^{w-1}(Q)\) must be performed by successive squarings, since computing it through look-up tables could leak information about the scalar k.
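The overall flow just described can be sketched as follows. This mirrors only the structure stated in the text (apply \(\tau^{w-1}\) to the accumulator, then add the digit point read through a linear pass); the point type and all primitives are hypothetical placeholders of our own, not a transcription of Algorithm 7.

```c
#include <stdint.h>

/* Illustrative types and primitives (assumed, not the paper's API).       */
typedef struct { uint64_t X[6], L[6], Z[6]; } point_t;   /* lambda-projective */
void point_set_infinity(point_t *Q);
void point_tau(point_t *Q);                          /* Q <- tau(Q), by squarings */
void point_add_mixed(point_t *Q, const point_t *P);  /* Q <- Q + P, P affine      */
void point_neg(point_t *P);                          /* point negation (cheap)    */
void linear_pass_point(point_t *P, const point_t *tbl, int len, int idx);

/* Left-to-right tau-and-add over the regular width-w tauNAF digits of k,
 * stored with sign and magnitude separated, most significant digit at the
 * highest index. tbl holds the precomputed alpha_v * P points (affine).   */
void scalar_mul_left_to_right(point_t *Q, const point_t *tbl, int tbl_len,
                              const int *digits, const int *signs,
                              int num_digits, int w) {
    point_t P;
    point_set_infinity(Q);
    for (int i = num_digits - 1; i >= 0; i--) {
        for (int s = 0; s < w - 1; s++)
            point_tau(Q);                            /* Q <- tau^(w-1)(Q)         */
        linear_pass_point(&P, tbl, tbl_len, digits[i]);  /* constant-time read    */
        if (signs[i] < 0)
            point_neg(&P);
        point_add_mixed(Q, &P);                      /* accumulate digit point    */
    }
}
```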

5.2 Right-to-left \(\tau \)-and-add

This other method processes the scalar k from the least to the most significant digit. Taking advantage of the \(\tau \) endomorphism, the GLV method is brought to its full extent. This approach is presented in Algorithm 8.

[Algorithm 8 (figure not reproduced)]

Here, we have to perform a post-computation on the accumulators instead of precomputing the points \(P_i\) as in the previous approach. Also, the \(\tau \) endomorphism is applied to the point P, which is usually public. For that reason, we can compute \(\tau \) with table look-ups instead of repeated squarings.

The downside of this algorithm is that the accumulators carry sensitive information about the digits of the scalar. Also, the accumulators are read and written. As a result, it is necessary to apply the linear pass algorithm to the accumulators \(Q_i\) twice per iteration.

5.3 Linear Pass

The linear pass is a method designed to protect sensitive information against side-channel attacks associated with the CPU cache access patterns. Let us consider an array A of size l. Before reading a value A[i], with \(i \in [0, l-1]\), the linear pass technique reads the entire array A but only stores, usually into an output register, the requested data A[i]. In that way, the attacker does not know which array index was accessed just by analyzing the location of the cache-miss in his own artificially injected data. However, this method causes a considerable overhead, which depends on the size of the array.

In this work, we implemented the linear pass method using 128-bit SSE vectorized instructions and registers. For each array index j, we first copy j to a register and compare this value with the current \(\tau \)NAF digit of the scalar k. The SSE instruction pcmpeqq compares the packed 64-bit values of two 128-bit registers A and B and sets each 64-bit lane of the resulting register C to all ones if the corresponding lanes of A and B are equal, and to all zeros otherwise. For that reason, we can use the register C as a mask: if j is equal to the scalar k digit, the register C will contain only one bits. Then, by performing logical operations between C and each of the array values A[j], we can retrieve the requested data.

Experimental results show that the implementation of the linear pass technique with SSE registers is more efficient than using 64-bit conditional move instructions [25] by a factor of 2.125. The approach just described is presented in Algorithm 9.

[Algorithm 9 (figure not reproduced)]
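A compact SSE-intrinsics sketch of this technique is shown below; it is illustrative only, and the table layout and digit encoding are our own assumptions rather than those of Algorithm 9.

```c
#include <stdint.h>
#include <smmintrin.h>   /* SSE4.1: _mm_cmpeq_epi64 (and SSE2 basics)      */

/* Constant-time table lookup: scans all 'len' 128-bit entries and keeps
 * only the one whose index matches 'digit', without a data-dependent
 * memory access pattern.                                                  */
static __m128i linear_pass(const __m128i *table, int len, int64_t digit) {
    __m128i result = _mm_setzero_si128();
    __m128i want   = _mm_set1_epi64x(digit);
    for (int64_t j = 0; j < len; j++) {
        __m128i idx  = _mm_set1_epi64x(j);
        __m128i mask = _mm_cmpeq_epi64(idx, want);   /* all-ones lanes iff j == digit */
        __m128i data = _mm_loadu_si128(&table[j]);   /* every entry is read           */
        result = _mm_or_si128(result, _mm_and_si128(mask, data));
    }
    return result;                                   /* the requested entry           */
}
```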

6 Results and Discussion

Our software library can be executed on any Intel platform that supports the SSE 4.1 vector instructions and the 64-bit carry-less multiplication instruction pclmulqdq. The benchmarking was executed on an Intel Core i7-4770K 3.50 GHz machine (Haswell architecture) with the TurboBoost and HyperThreading features disabled. Also, the library was coded in the GNU11 C and assembly languages.

Regarding the compilers, we performed an experimental analysis of the performance of our code compiled with different compilers: GCC (GNU Compiler Collection) versions 5.3 and 6.1, and the clang frontend for the LLVM compiler infrastructure, versions 3.5 and 3.8. All compilations were done with the flags -O3 -march=haswell -fomit-frame-pointer. For the sake of comparison, we report our timings for all of the previously mentioned compilers. However, when comparing our code with the state-of-the-art works, we opted for clang/llvm 3.8, since it gave us the best performance.

6.1 Parameters

Given \(q = 2^m\), with \(m = 149\), we constructed our base binary field \(\mathbb {F}_q \cong \mathbb {F}_2[x]/(f(x))\) with the 69-term irreducible polynomial f(x) described in Sect. 3.2. The quadratic extension \(\mathbb {F}_{q^2} \cong \mathbb {F}_q[u]/(h(u))\) was built through the irreducible quadratic \(h(u) = u^2 + u + 1\). However, our base field arithmetic was computed modulo the redundant trinomial \(g(x) = x^{192} + x^{19} + 1\), which has among its irreducible factors the polynomial f(x).

Our Koblitz curve was defined over \(\mathbb {F}_{q^2}\) as \(E_1/\mathbb {F}_{q^2} : y^2 + xy = x^3 + ux^2 + u,\) and the group \(E_1(\mathbb {F}_{q^2})\) contains a subgroup of interest of order

$$\begin{aligned} r&= \texttt {0x637845F7F8BFAB325B85412FB54061F148B7F6E79AE11CC843ADE1470F7E4E29}, \end{aligned}$$

which corresponds to approximately 255 bits. In addition, throughout our scalar multiplication, we represented the points in \(\lambda \)-affine [20, 28] and \(\lambda \)-projective [25] coordinates. We selected an order-r base point P at random represented in \(\lambda \)-affine coordinates.

6.2 Field and Elliptic Curve Arithmetic Timings

In Table 2, we present the timings for the base and quadratic field arithmetic. The multisquaring operation is used to support the Itoh-Tsujii addition chain; therefore, it is implemented only in the field \(\mathbb {F}_{2^{149}}\) (actually, on a 192-bit polynomial in \(\mathbb {F}_2[x]\)). In addition, we give timings for reducing a 192-bit polynomial in \(\mathbb {F}_2[x]\) modulo f(x). Finally, all timings of operations in the quadratic field include the subsequent modular reduction.

Table 2. Timings (in clock cycles) for the finite field operations in \(\mathbb {F}_{2^{2 \cdot 149}}\) using different compiler families

Applying the techniques presented in [27], we saw that our machine has a margin of error of four cycles. This range is not of significance when considering the timings of the point arithmetic or the scalar multiplication. Nevertheless, for inexpensive functions such as multiplication and squaring, it is recommended to consider it when comparing the timings between different compilers.

Table 3. The ratio between the arithmetic and multiplication in \(\mathbb {F}_{2^{149}}\). The timings were taken from the code compiled with the clang 3.8 compiler

In the following, we compare in Table 3 the base arithmetic operation timings with the multiplication operation, which is the main operation of our library. The squaring/multiplication ratio is relatively high. This is because the polynomial \({g(x) = x^{192} + x^{19} + 1}\) does not admit a reduction specially designed for the squaring operation. Furthermore, the multisquaring and inversion operations are also relatively costly. A possible explanation is that we are measuring timings on a Haswell architecture, which has a computationally cheaper carry-less multiplication than the Sandy Bridge platform [18].

In Table 4 we give the timings of the point arithmetic functions. There, we present the costs of applying the \(\tau \) endomorphism to an affine point (two coordinates) and to a \(\lambda \)-projective point (three coordinates). The reason is that, depending on the scalar multiplication algorithm, one can apply the Frobenius map on the accumulator (projective) or on the base point (affine). In addition, we report in Table 4 the mixed point doubling operation, which is defined as follows. Given a point \(P = (x_P, y_P)\), the mixed-doubling function computes \(R = (X_R, L_R, Z_R) = 2P\). In other words, it performs a point doubling on an affine point and returns the resulting point in projective representation. Such a primitive is useful in the computation of the \(\tau \)NAF representations \(\alpha _v = v \bmod {\tau ^w}\) (see Sect. 2.2).

Table 4. Timings (in clock cycles) for point addition over a Koblitz curve \(E_1/\mathbb {F}_{q^2}\) using different compiler families

Table 4 also shows the superior performance of the clang compiler in the point arithmetic timings, since the only operations where it has a clear disadvantage are the full and mixed point doublings. However, those functions are rarely used throughout a Koblitz curve scalar multiplication. In fact, they are used only in the precomputation phase. Next, in Table 5, we show the ratio of the point arithmetic timings to the field multiplication.

Table 5. The ratio between the timings of point addition and the field multiplication. The timings were taken from the code compiled with the clang 3.8 compiler

6.3 Scalar Multiplication Timings

Here we report the timings for the left-to-right regular width-w \(\tau \)NAF \(\tau \)-and-add scalar multiplication, with \(w = 2, 3, 4\). The setting \(w = 2\) is presented in order to analyze how the balance between the pre-computation and the main iteration costs works in practice. Our main result lies in the setting \(w = 3\). Also, among the scalar multiplication timings, we show, in Table 6, the costs of the regular recoding and the linear pass functions.

Table 6. A comparison of the scalar multiplication and its support functions timings (in clock cycles) between different compiler families

Regarding the regular recoding function, we saw an increase of about \(46\,\%\) in the 3-\(\tau \)NAF timings when compared with the \(w = 2\) case. The reason is that the \(w = 3\) case requires more complicated digit arithmetic. Also, when selecting the digits, we must perform a linear pass over the array that stores them; otherwise, an attacker could learn information about the scalar k by performing a timing attack based on the CPU cache.

The linear pass function also becomes more expensive in the \(w = 3\) case, since we have more points in the array. However, in the \(m = 149\) case, we have to process 64 more iterations with the width \(w = 2\), when compared with the 3-\(\tau \)NAF point multiplication (since the number of iterations depends on m and w: \(\frac{m+2}{w-1}\,+\,2\)). As a result, the linear pass function overhead is mitigated by the savings in mixed additions and applications of \(\tau \) endomorphisms in the main loop. Finally, our scalar multiplication measurements consider that the point \(Q = kP\) is returned in the \(\lambda \)-projective coordinate representation. If the affine representation is required, it is necessary to add about 2,000 cycles to the total scalar multiplication timings.

6.4 Comparisons

In Table 7, we compare our implementation with the state-of-the-art works. Our 3-\(\tau \)NAF left-to-right \(\tau \)-and-add point multiplication outperformed the work in [24], which is considered the fastest protected 128-bit secure Koblitz implementation, by 29.64 %. When compared with prime curves, our work is surpassed by the works in [8] and [2] by \(15.29\,\%\) and \(13.06\,\%\), respectively.

Table 7. Scalar multiplication timings (in clock cycles) on 128-bit secure elliptic curves

Skylake architecture. In addition, we present timings for our scalar multiplication algorithms, also compiled with clang 3.8, on the Skylake architecture (Intel Core i7-6700K 4.00 GHz). The results (in clock cycles) for the cases \(w = 2, 3, 4\) are, respectively, 71,138, 51,788 and 66,286.

7 Conclusion

We have presented a comprehensive study of how to efficiently implement Koblitz elliptic curves defined over quaternary fields \(\mathbb {F}_{4^m}\) using vectorized instructions on the Intel microarchitectures codenamed Haswell and Skylake.

As a future work, we plan to investigate the use of 256-bit AVX2 registers to improve the performance of our code. In addition, we intend to implement the scalar multiplication algorithms in other architectures such as the ARMv8. Finally, we would like to design a version of our point multiplication in the multi-core and known point scenarios.