
1 Introduction

Implementing cryptographic algorithms in constrained embedded devices is a challenging task. In the 1990s, Kocher et al.  [Koc96, KJJ99] showed that one may often use the physical leakage of the underlying device during the algorithm execution (e.g., the running time, the power consumption or the electromagnetic radiation) to recover some secret information. Side-channel analysis is the class of cryptanalytic attacks that exploit such physical emanations to undermine the strength of the underlying cryptography.

One of the most common techniques to protect implementations against side-channel attacks is to mask internal secret variables. This so-called masking technique [GP99, CJRR99] splits every sensitive piece of data manipulated by the algorithm (which depends on the secret key and possibly on other variables known to the attacker) into \(d+1\) shares (where \(d \ge 1\), the masking order, plays the role of a security parameter). The first d shares are usually generated uniformly at random and the last one is computed so that the combination of the \(d+1\) shares for some group law is equal to the initial value. With this technique, the attacker actually needs the whole set of \(d+1\) shares to learn any information on the initial value. Since each share’s observation comes with noise, the higher the order d, the more complex the attack [CJRR99, PR13]. This masking approach makes it possible to achieve provable security in formal security models, notably the probing security model [ISW03] and the noisy leakage model [PR13, DDF14].

Most symmetric cryptographic algorithms manipulate binary data \(x \in \{0,1\}^n\) (for some integer \(n \ge 1\)) and the natural group law used for masking is the Boolean XOR \(\oplus \) over \(\mathbb {F}_2\) (or more generally addition in the finite field \(\mathbb {F}_{2^n}\) or in the vector space \(\mathbb {F}_2^n\)). Using this Boolean masking, each sensitive datum x is thus split into \(d+1\) shares \(x_0,\dots ,x_d\) whose addition returns the initial value (i.e. \(x = x_0 \oplus x_1 \oplus \cdots \oplus x_d\)). One can then securely compute any linear function f of the sensitive value x since a sharing of \(y = f(x)\) is readily obtained by computing \(y_i = f(x_i)\) for \(i \in \{0,\dots ,d\}\) (such that \(y = y_0 \oplus y_1 \oplus \cdots \oplus y_d\)). It is unfortunately not so easy to compute a sharing of f(x) for non-linear functions f, but solutions were proposed for multiplication (e.g. see [ISW03, RP10, BBP+16]). However, while the evaluation cost of a linear function is linear in d, the best known algorithms for multiplication have \(O(d^2)\) computational complexity.
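The following Python sketch illustrates this principle on a single byte: the value is split into \(d+1\) Boolean shares, and an example \(\mathbb {F}_2\)-linear function (here a bit rotation, chosen only for illustration) is evaluated share by share. It is a minimal model and ignores all implementation aspects (constant-time code, quality of the randomness source) that matter for actual side-channel resistance.

```python
import secrets
from functools import reduce

def share(x: int, d: int) -> list[int]:
    """Split the byte x into d+1 Boolean shares whose XOR equals x."""
    shares = [secrets.randbits(8) for _ in range(d)]
    shares.append(reduce(lambda a, b: a ^ b, shares, x))
    return shares

def unshare(shares: list[int]) -> int:
    """Recombine the shares (XOR of all of them)."""
    return reduce(lambda a, b: a ^ b, shares)

def rotl8(x: int) -> int:
    """An example F_2-linear function on bytes: rotation by one bit."""
    return ((x << 1) | (x >> 7)) & 0xFF

def masked_linear(shares: list[int]) -> list[int]:
    """Share-wise evaluation: y_i = f(x_i) yields a valid sharing of f(x)."""
    return [rotl8(s) for s in shares]

x = 0x3A
sh = share(x, d=3)
assert unshare(sh) == x
assert unshare(masked_linear(sh)) == rotl8(x)
```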

In practice, iterative block ciphers (such as the AES) apply a round function several times to an internal state; this round function is usually composed of a linear round-key addition, of linear operations ensuring diffusion and of non-linear operations (usually called s-boxes) ensuring confusion. The main issue in providing secure implementations of block ciphers is thus to find an efficient and secure way to mask the s-box(es). The most widely-used solution is to consider their representation as polynomial functions over finite fields \(\mathbb {F}_{2^n}\) (using Lagrange’s interpolation theorem) and to find an efficient way to evaluate these polynomials using a minimal number of multiplications. In this paper, we present a generalization of known methods and we obtain new interesting constructions offering efficiency/memory trade-offs.

1.1 Related Work

The first generic method to mask any s-box at any masking order d was proposed in 2012 by Carlet, Goubin, Prouff, Quisquater and Rivain [CGP+12] (following prior work by Rivain and Prouff for the AES block cipher [RP10]). The core idea is to split the evaluation of the polynomial representation of the s-box into simple operations over \(\mathbb {F}_{2^n}\) (namely, addition, multiplication by a constant, squaring and regular multiplication of two distinct elements). Among these operations, only the regular multiplication of two distinct elements is non-linear (since squaring over a characteristic 2 finite field is actually linear), and one can use the secure multiplication algorithms mentioned above [ISW03, RP10, BBP+16] to evaluate them. Since these operations have \(O(d^2)\) complexity, it is interesting to propose an evaluation scheme of the polynomial with as few regular multiplications as possible. Carlet et al. [CGP+12] defined the masking complexity (also known as multiplicative complexity and non-linear complexity) of an s-box as the minimal number of such multiplications necessary to evaluate the corresponding polynomial and they adapted known methods for polynomial evaluation based on addition chains (see [CGP+12] for details).

This technique was later improved by Roy and Vivek in [RV13] using cyclotomic coset addition chains. They notably presented a polynomial evaluation method for the DES s-boxes that requires 7 non-linear multiplications (instead of 10 in  [CGP+12]). They also presented a lower bound on the length of such a chain and showed that the multiplicative complexity of the DES s-boxes is lower-bounded by 3. In 2014, Coron, Roy and Vivek [CRV14] proposed a heuristic method which may be viewed as an extension of the ideas developed in [CGP+12] and [RV13]. The so-called CRV method considers the s-box as a polynomial over \(\mathbb {F}_{2^n}\) and has heuristic multiplicative complexity \(O(2^{n/2}/\sqrt{n})\) instead of the \(O(2^{n/2})\) proven multiplicative complexity of the previous methods. They also proved a matching lower bound of \(\varOmega (2^{n/2}/\sqrt{n})\) on the multiplicative complexity of any generic method to evaluate n-bit s-boxes. For all the tested s-boxes their method is at least as efficient as the previous proposals and it often requires fewer non-linear multiplications (e.g. only 4 for the DES s-boxes).

In [GR16], Goudarzi and Rivain introduced a new method to decompose an s-box into a circuit with low multiplicative complexity. Their approach can be seen as a way to model the s-box as a polynomial over \(\mathbb {F}_2^n\) (instead of \(\mathbb {F}_{2^n}\)); it consists in applying masking at the Boolean level by bitslicing the s-boxes within a block cipher round. The proposed decomposition relies on the one proposed by Coron et al. [CRV14] and extends it to efficiently deal with several coordinate functions. The schemes from [ISW03, RP10, BBP+16] can then be used to secure the bitwise multiplications, and the method makes it possible to compute all the s-boxes within a cipher round at the same time.

Finally, in [PV16], Pulkus and Vivek generalized and improved Coron et al.’s technique [CRV14] by working over slightly larger fields than strictly needed (i.e. they considered the s-box as a polynomial over \(\mathbb {F}_{2^t}\) instead of \(\mathbb {F}_{2^n}\), where \(t \ge n\)). Their technique notably makes it possible to evaluate the DES s-boxes with only 3 non-linear multiplications over \(\mathbb {F}_{2^8}\) (compared to 4 over \(\mathbb {F}_{2^6}\) with Coron et al.’s method [CRV14]).

1.2 Our Results

We propose a generalized decomposition method for s-boxes that unifies these previously proposed methods and provides new median case decompositions. More precisely, in our approach any \(n\lambda \)-bit s-box for some integers \(n \ge 1\) and \(\lambda \ge 1\) can be seen as a polynomial (or a vector of \(m \ge 1\) polynomials) over \(\mathbb {F}_{2^{\lambda }}^n\). We first prove a lower bound of \(\varOmega (2^{\lambda n/2}\sqrt{m/\lambda })\) for the complexity of any method to evaluate \(n\lambda \)-bit to \(m\lambda \)-bit s-boxes. We then describe our generalized decomposition method, for which we provide concrete parameters to achieve decomposition for several triplets \((n,m,\lambda )\) and for exemplary s-boxes of popular block ciphers (namely PRESENT [BKL+07], SC2000 [SYY+02], CLEFIA [SSA+07] and KHAZAD [BR00]).

Depending on the s-box, our generalized method allows one to choose the parameters n, m and \(\lambda \) to obtain the best possible s-box decomposition in terms of multiplications over \(\mathbb {F}_{2^{\lambda }}\). In particular, for \(8\times 8\) s-boxes, the CRV decomposition method [CRV14] (\(n=1\), \(m=1\), \(\lambda = 8\)) and the bitslice decomposition method [GR16] (\(n=8\), \(m=8\), \(\lambda = 1\)) are special cases of this generalized decomposition method. The implementation results provided in Sect. 6 (\(8\times 8\) s-boxes on a 32-bit ARM architecture) show that our method is comparable with [CRV14] while being more space efficient. It is therefore a good alternative to prior techniques and can be effectively implemented in software on devices with limited resources.

In the full version of this paper, we generalize the method further by exploring the problem of decomposing arbitrary s-boxes (with any input and output bit-lengths) over an arbitrary field \(\mathbb {F}_{2^{\lambda }}\). Namely, we do not require that \(\lambda \) divide the s-box input and output bit-lengths. This allows us to also integrate, in addition to [CRV14, GR16], the method of [PV16] that considers decomposition when \(\lambda \ge n\).

2 Preliminaries

2.1 Notations and Notions

Let \(\lambda \) be a positive integer. Then \(\mathbb {F}_{2^{\lambda }}\) denotes the finite field with \(2^{\lambda }\) elements. Let \(\mathcal {F}_{\lambda ,n}\) be the set of functions from \(\mathbb {F}_{2^{\lambda }}^{n}\) to \(\mathbb {F}_{2^{\lambda }}\). Using Lagrange’s interpolation theorem, any function \(f \in \mathcal {F}_{\lambda ,n}\) can be seen as a multivariate polynomial over \(\mathbb {F}_{2^{\lambda }}[x_1,x_2,\dots ,x_n]/(x_1^{2^{\lambda }}-x_1,x_2^{2^{\lambda }}-x_2,\dots ,x_n^{2^{\lambda }}-x_n)\):

$$\begin{aligned} f({x}) = \sum _{{u}\in [0,2^{\lambda }-1]^n} a_{{u}} \, {x}^{{u}}~,\end{aligned}$$
(1)

where \({x} = (x_1,x_2,\dots ,x_n)\), \({x}^{{u}} = x_1^{u_1} \cdot x_2^{u_2} \cdot \ldots \cdot x_n^{u_n}\), and \(a_{{u}}\in \mathbb {F}_{2^{\lambda }}\) for every \({u} = (u_1,\dots ,u_n) \in [0,2^{\lambda }-1]^n\).

The multiplicative complexity of a function in \(\mathcal {F}_{\lambda ,n}\) (also called the non-linear complexity) is defined as the minimal number of \(\mathbb {F}_{2^{\lambda }}\)-multiplications required to evaluate it.

2.2 S-box Characterization

In the following, an s-box S is characterized with respect to three parameters: the number of input elements n, the number of output elements m, and the bit-size \(\lambda \) of the elements. In other words, an s-box with \(\lambda n\) input bits and \(\lambda m\) output bits is represented as follows:

$$\begin{aligned} S(x) = (f_1(x),f_2(x),\dots ,f_m(x)), \end{aligned}$$
(2)

where functions \(f_1,f_2,\dots ,f_m \in \mathcal {F}_{\lambda ,n}\) are called the coordinate functions of S.
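For instance, an \(8\times 8\) s-box can be characterized with \((n,m,\lambda ) \in \{(1,1,8), (2,2,4), (4,4,2), (8,8,1)\}\): a single polynomial over \(\mathbb {F}_{2^{8}}\), two coordinate functions of two variables over \(\mathbb {F}_{2^{4}}\), four coordinate functions of four variables over \(\mathbb {F}_{2^{2}}\), or eight Boolean coordinate functions of eight variables.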

As mentioned in the introduction, Roy and Vivek presented in [RV13] a lower bound on the length of cyclotomic-coset addition chains and used it to derive a logarithmic lower bound on the multiplicative complexity of an s-box (i.e. on the minimal number of such multiplications necessary to evaluate the corresponding polynomial). Coron et al. [CRV14] improved this lower bound and showed that the non-linear complexity of any generic method to evaluate n-bit s-boxes, when seen as polynomials defined over \(\mathbb {F}_{2^n}\), is in \(\varOmega (2^{n/2}/\sqrt{n})\).

In the following section, we generalize their approach and provide a new lower bound on the multiplicative complexity of a sequence of n-variate polynomials over \(\mathbb {F}_{2^\lambda }\). Following [RV13], we define the multiplicative complexity notion for such a sequence as follows:

Definition 1

(Polynomial chain). Let \(\lambda \ge 1\), \(n \ge 1\) and \(m \ge 1\) be three integers and let \(f_1,\dots ,f_m \in \mathbb {F}_{2^\lambda }[x_1,\dots ,x_n]\) be a sequence of n-variate polynomials over \(\mathbb {F}_{2^\lambda }\). A polynomial chain \(\mathbf {\pi }\) for \((f_1,\dots ,f_m)\) is a sequence \(\mathbf {\pi } = ({{\pi }_i})_{i \in \{-n,\dots ,\ell \}}\) and a list \((i_1,\dots ,i_m) \in \{-n,\dots ,\ell \}^m\) with

$$ \pi _{-n} = x_n,~ \pi _{1-n} = x_{n-1},~ \dots ,~ \pi _{-1} = x_1,~ \pi _{0} = 1, $$
$$ \pi _{i_j} = f_j(x_1,\dots ,x_n) \bmod (x_1^{2^\lambda }+x_1,\dots ,x_n^{2^\lambda }+x_n), ~\forall j \in \{1,\dots ,m\}, $$

and such that for every \(i \in \{1,\dots ,\ell \}\), one of the following condition holds:

  1. there exist j and k in \(\{-n,\dots ,i-1\}\) such that \(\pi _i = \pi _j \cdot \pi _k\);

  2. there exist j and k in \(\{-n,\dots ,i-1\}\) such that \(\pi _i = \pi _j + \pi _k\);

  3. there exists j in \(\{-n,\dots ,i-1\}\) such that \(\pi _i = \pi _j^2\);

  4. there exists j in \(\{-n,\dots ,i-1\}\) and \(\alpha \in \mathbb {F}_{2^\lambda }\) such that \(\pi _i = \alpha \cdot \pi _j\).

Given such a polynomial chain \(\mathbf {\pi }\) for \((f_1,\dots ,f_m)\), the multiplicative complexity of \(\mathbf {\pi }\) is the number of times the first condition holds in the whole chain \(\mathbf {\pi }\). The multiplicative complexity of \((f_1,\dots ,f_m)\) over \(\mathbb {F}_{2^\lambda }\), denoted \(\mathcal {M}(f_1,\dots ,f_m)\), is the minimal multiplicative complexity over all polynomial chains for \((f_1,\dots ,f_m)\).
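For instance, take \(n = m = 1\) and \(\lambda \ge 2\), and consider \(f(x) = x^3\) over \(\mathbb {F}_{2^\lambda }\). The chain \(\pi _{-1} = x\), \(\pi _0 = 1\), \(\pi _1 = \pi _{-1}^2 = x^2\), \(\pi _2 = \pi _1 \cdot \pi _{-1} = x^3\) evaluates f with a single non-linear multiplication (the squaring is not counted), hence \(\mathcal {M}(x^3) \le 1\); since \(x^3\) does not lie in the \(\mathbb {F}_{2^\lambda }\)-linear span of \(\{1, x, x^2, x^4, \dots \}\), we indeed have \(\mathcal {M}(x^3) = 1\).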

Remark 1

The multiplicative complexity is similar to the classical circuit complexity notion, except that we do not count the linear operations over \(\mathbb {F}_{2^\lambda }\) (namely additions, scalar multiplications and squarings). For any sequence of n-variate polynomials \(f_1,\dots ,f_m \in \mathbb {F}_{2^\lambda }[x_1,\dots ,x_n]\) we obviously have:

$$\mathcal {M}(f_1,\dots ,f_m) \le \mathcal {M}(f_1) + \cdots + \mathcal {M}(f_m)$$

3 Multiplicative Complexity Lower Bound

In the next section, we will provide a heuristic method which, given a sequence of n-variate polynomials over \(\mathbb {F}_{2^\lambda }\), provides an evaluation scheme (or a circuit) with “small” multiplicative complexity. Following Coron et al. [CRV14], Proposition 1 provides an \(\varOmega (2^{n\lambda /2}\sqrt{m/\lambda })\) lower bound on this multiplicative complexity. As in [CRV14], the proof is a simple combinatorial argument inspired by [PS73].

Proposition 1

Let \(\lambda \ge 1\), \(n \ge 1\) and \(m \ge 1\) be three integers. There exists a sequence of n-variate polynomials \(f_1,\dots ,f_m \in \mathbb {F}_{2^\lambda }[x_1,\dots ,x_n]\) such that \(\mathcal {M}(f_1,\dots ,f_m) \ge \sqrt{\frac{m 2^{n\lambda }}{\lambda }} - (2n+m-1)\).

Proof

We consider a sequence of n-variate polynomials \(f_1,\dots ,f_m\) in the algebra \(\mathbb {F}_{2^\lambda }[x_1,\dots ,x_n]\) with multiplicative complexity \(\mathcal {M}(f_1,\dots ,f_m) = r\) for some integer \(r \ge 1\). If we consider only the non-linear operations in a polynomial chain of minimal multiplicative complexity \(\mathbf {\pi } = ({{\pi }_i})_{i \in \{-n,\dots ,\ell \}}\), we can see that there exist indices \(m_0,m_1,\dots ,m_{n+r+(m-1)}\) with \(m_i \in \{-n,\dots ,\ell \}\) for \(i \in \{0,\dots ,n+r+(m-1)\}\) such that

  • \(m_{j} = -j-1\) for \(j \in \{0,\dots ,n-1\}\) (i.e. \(\pi _{m_{j}} = \pi _{-j-1} = x_{j+1}\) for \(j \in \{0,\dots ,n-1\}\));

  • for \(k \in \{n,\dots ,n+r-1\}\), there exist field elements \(\beta _{k},\beta '_{k} \in \mathbb {F}_{2^\lambda }\) and \(\beta _{k,i,j},\beta '_{k,i,j} \in \mathbb {F}_{2^\lambda }\) for \((i,j) \in \{0,\dots ,k-1\} \times \{0,\dots ,\lambda -1\}\) such that

    $$\begin{aligned} \pi _{m_k}&= \left( \beta _{k} + \sum _{i=0}^{k-1} \sum _{j = 0}^{\lambda -1} \beta _{k,i,j} \pi _{m_i}^{2^j} \right) \cdot \left( \beta '_{k} + \sum _{i=0}^{k-1} \sum _{j = 0}^{\lambda -1} \beta '_{k,i,j} \pi _{m_i}^{2^j}\right) \\&\quad \quad \quad \bmod (x_1^{2^\lambda }+x_1,\dots ,x_n^{2^\lambda }+x_n); \end{aligned}$$
  • for \(k \in \{n+r,\dots ,n+r+(m-1)\}\) there exist field elements \(\beta _{k} \in \mathbb {F}_{2^\lambda }\) and \(\beta _{k,i,j} \in \mathbb {F}_{2^\lambda }\) for \((i,j) \in \{0,\dots ,n+r-1\} \times \{0,\dots ,\lambda -1\}\) such that

    $$\begin{aligned} f_{k+1-(n+r)} = \pi _{m_{k}}&= \beta _{k} + \sum _{i=0}^{n+r-1} \sum _{j = 0}^{\lambda -1} \beta _{k,i,j} \pi _{m_i}^{2^j} \bmod (x_1^{2^\lambda }+x_1,\dots ,x_n^{2^\lambda }+x_n). \end{aligned}$$

The total number of parameters \(\beta \) in this evaluation scheme of \((f_1,\dots ,f_m)\) is simply equal to:

$$ \sum _{k = n}^{n+r-1} 2 \cdot (1 + k \cdot \lambda ) + m(1 + (n+r) \cdot \lambda ) = r^2 \lambda + r (\lambda m + 2 \lambda n - \lambda + 2) + \lambda m n + m $$

and each parameter can take any value in \(\mathbb {F}_{2^\lambda }\). The number of sequences of n-variate polynomials \(f_1,\dots ,f_m \in \mathbb {F}_{2^\lambda }[x_1,\dots ,x_n]\) with multiplicative complexity \(\mathcal {M}(f_1,\dots ,f_m ) = r\) is thus upper-bounded by \( 2^{\lambda (r^2 \lambda + r (\lambda m + 2 \lambda n - \lambda + 2) + \lambda m n + m )} \).

Since the total number of sequences of n-variate polynomials \(f_1,\dots ,f_m \in \mathbb {F}_{2^\lambda }[x_1,\dots ,x_n]\) defined \(\bmod (x_1^{2^\lambda }+x_1,\dots ,x_n^{2^\lambda }+x_n)\) is \(((2^\lambda )^{2^{n \lambda }})^m\), in order to be able to evaluate all such polynomials with at most r non-linear multiplications, a necessary condition is to have

$$ r^2 \lambda + r (\lambda m + 2 \lambda n - \lambda + 2) + \lambda m n + m \ge m 2^{n \lambda } $$

and therefore

$$ r^2 \lambda + r (\lambda m + 2 \lambda n - \lambda + 2) - (m 2^{n \lambda } - \lambda m n - m) \ge 0. $$

Eventually, we obtain

$$\begin{aligned} r \ge \frac{\sqrt{4 \lambda m 2^{n\lambda } + (\lambda m + 2 \lambda n - \lambda + 2)^2}-(\lambda m + 2 \lambda n - \lambda + 2)}{2\lambda } \end{aligned}$$
(3)

and

$$r \ge \frac{\sqrt{4\lambda m 2^{n\lambda }}-2(\lambda m + 2 \lambda n - \lambda + 2) }{2\lambda } \ge \sqrt{\frac{m 2^{n\lambda }}{\lambda }} - (2n+m-1).$$

    \(\square \)
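As a quick numerical illustration of this bound, the following Python sketch computes the smallest integer r satisfying the quadratic condition above, for a few ways of viewing an 8-bit to 8-bit s-box; it is only a sanity check of the formula, not part of the decomposition method.

```python
import math

def lower_bound(n: int, m: int, lam: int) -> int:
    """Smallest integer r satisfying the quadratic condition of the proof:
    lam*r^2 + r*(lam*m + 2*lam*n - lam + 2) >= m*2^(n*lam) - lam*m*n - m."""
    b = lam * m + 2 * lam * n - lam + 2
    c = m * 2 ** (n * lam) - lam * m * n - m
    return math.ceil((math.sqrt(b * b + 4 * lam * c) - b) / (2 * lam))

# Four ways of viewing an 8-bit to 8-bit s-box:
for n, m, lam in [(1, 1, 8), (2, 2, 4), (4, 4, 2), (8, 8, 1)]:
    print((n, m, lam), lower_bound(n, m, lam))
```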

4 Generalized Decomposition Method

In this section, we propose a generalized decomposition method for s-boxes that aims at encapsulating previously proposed methods and at providing new median case decompositions. Depending on the s-box, we can then choose the parameters n, m and \(\lambda \) in order to obtain the best possible s-box decomposition in terms of multiplications over \(\mathbb {F}_{2^{\lambda }}\). In particular, for \(8\times 8\) s-boxes, the CRV decomposition method [CRV14] (\(n=1\), \(m=1\), \(\lambda = 8\)) and the bitslice decomposition method [GR16] (\(n=8\), \(m=8\), \(\lambda = 1\)) are special cases of this generalized decomposition method.

4.1 Decomposition of a Single Coordinate Function

Let us define the linear power class of a function \(\phi \in \mathcal {F}_{\lambda ,n}\), denoted by \(\mathcal {C}_{\phi }\), as the set

$$\begin{aligned} \mathcal {C}_{\phi } = \{\phi ^{2^i}: i=0,\dots , \lambda -1\}. \end{aligned}$$
(4)

Intuitively, \(\mathcal {C}_{\phi }\) corresponds to the set of functions in \(\mathcal {F}_{\lambda ,n}\) that can be computed from \(\phi \) using only the squaring operation. It is not hard to see that \(\{\mathcal {C}_{\phi }\}_{\phi }\) are equivalence classes partitioning \(\mathcal {F}_{\lambda ,n}\). For any set \(\mathcal {B}\subseteq \mathcal {F}_{\lambda ,n}\), let us define the linear power closure of \(\mathcal {B}\) as the set

$$\overline{\mathcal {B}}= \bigcup _{\phi \in \mathcal {B}}\,\mathcal {C}_{\phi }$$

and the linear span of \(\mathcal {B}\) as the set

$$\langle \mathcal {B}\rangle = \Big \{~\sum _{\phi \in \mathcal {B}}a_{\phi }\phi ~\big |~ a_{\phi }\in \mathbb {F}_{2^{\lambda }}\Big \}.$$
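For instance, with \(n=1\) and \(\lambda =4\), the function \(\phi : x \mapsto x^3\) has linear power class \(\mathcal {C}_{\phi } = \{x^3, x^6, x^{12}, x^{9}\}\) (exponents are doubled modulo \(2^{\lambda }-1 = 15\)), and the basis \(\mathcal {B}= \{1, x\}\) has linear power closure \(\overline{\mathcal {B}}= \{1, x, x^2, x^4, x^8\}\), whose linear span \(\langle \overline{\mathcal {B}}\rangle \) consists of all sums of a constant and a linearized polynomial in x.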

Let f be a function in \(\mathcal {F}_{\lambda ,n}\). The proposed decomposition makes use of a basis of functions \(\mathcal {B}\subseteq \mathcal {F}_{\lambda ,n}\) and consists in writing f as:

$$\begin{aligned} f(x) = \sum _{i=0}^{t-1}g_i(x)\cdot h_i(x) + h_t(x), \end{aligned}$$
(5)

where \(g_i,h_i\in \langle \overline{\mathcal {B}}\rangle \) and \(t\in \mathbb {N}\). By definition, the functions \(g_i\) and \(h_i\) can be written as

$$\begin{aligned} g_i(x) = \sum _{j=1}^{|\mathcal {B}|}\ell _{j}(\varphi _j(x)) ~\text { and }~ h_i(x) = \sum _{j=1}^{|\mathcal {B}|}\ell '_{j}(\varphi _j(x)), \end{aligned}$$

where the \(\ell _{j},\ell '_{j}\) are linearized polynomials over \(\mathbb {F}_{2^{\lambda }}\) (i.e. polynomials for which the exponents of all the constituent monomials are powers of 2) and where \(\{\varphi _j\}_{1\le j\le |\mathcal {B}|}=\mathcal {B}\). We now explain how to find such a decomposition by solving a linear system.

Solving a Linear System. In the following, we shall consider a basis \(\mathcal {B}\) such that \(1 \in \mathcal {B}\) and we will denote \(\mathcal {B}^* = \mathcal {B}\setminus \{1\} = \{\phi _1,\phi _2,\dots , \phi _{|\mathcal {B}|-1}\}\). We will further heuristically assume \(|\mathcal {C}_{\phi _i}| = \lambda \) for every \(i \in \{1,2,\ldots ,|\mathcal {B}|-1\}\). We then get \(|\overline{\mathcal {B}}| = 1 + \lambda |\mathcal {B}^*| = 1 + \lambda (|\mathcal {B}| - 1)\).

We first sample t random functions \(g_i\) from \(\langle \overline{\mathcal {B}}\rangle \). This is simply done by picking \(t\cdot |\overline{\mathcal {B}}|\) random coefficients \(a_{i,0}, a_{i,j,k}\) in \(\mathbb {F}_{2^{\lambda }}\) and setting \(g_i = a_{i,0} + \sum _{j,k} a_{i,j,k}\phi _j^{2^k}\) for every \(i\in [0,t-1]\), where \(0 \le k \le \lambda -1\) and \(1 \le j \le |\mathcal {B}|-1\). Then we search for a family of \(t+1\) functions \(\{h_i\}_i\) satisfying (5). This is done by solving the following system of linear equations over \(\mathbb {F}_{2^{\lambda }}\):

$$\begin{aligned} A\cdot c = b \end{aligned}$$
(6)

where \(b = (f(e_1),f(e_2),\dots ,f(e_{2^{n\lambda }}))^\mathsf {T}\) with \(\{e_1,\dots ,e_{2^{n\lambda }}\} = \mathbb {F}_{2^{\lambda }}^n\) and where A is a block matrix defined as

$$\begin{aligned} A = (\mathbf {1}|A_{0}|A_1|~\cdots ~|A_{t}), \end{aligned}$$
(7)

where \(\mathbf {1}\) is the all-one column vector and where

$$\begin{aligned} A_i = (A_{i,0}|A_{i,1}|~\cdots ~|A_{i,|\mathcal {B}|-1}) \end{aligned}$$
(8)

with

$$\begin{aligned} A_{i,0} = (g_{i}(e_1), g_{i}(e_2), \ldots , g_{i}(e_{2^{n\lambda }}))^{\mathsf {T}} \end{aligned}$$
(9)

for every \(i \in [0,t]\), with

$$\begin{aligned} A_{i,j} = \begin{pmatrix} \phi _j(e_1)\cdot g_{i}(e_1)&{}\phi _j^2(e_1)\cdot g_{i}(e_1)&{}...&{}\phi _j^{2^{\lambda -1}}(e_1)\cdot g_{i}(e_1)\\ \phi _j(e_2)\cdot g_{i}(e_2)&{}\phi _j^2(e_2)\cdot g_{i}(e_2)&{}...&{}\phi _j^{2^{\lambda -1}}(e_2)\cdot g_{i}(e_2)\\ \vdots &{}\vdots &{}\ddots &{}\vdots \\ \phi _j(e_{2^{n\lambda }})\cdot g_{i}(e_{2^{n\lambda }})&{}\phi _j^2(e_{2^{n\lambda }})\cdot g_{i}(e_{2^{n\lambda }})&{}...&{}\phi _j^{2^{\lambda -1}}(e_{2^{n\lambda }})\cdot g_{i}(e_{2^{n\lambda }}) \end{pmatrix} , \end{aligned}$$
(10)

for every \(i\in [0,t-1]\) and \(j\in [1,|\mathcal {B}|-1]\), and with

$$\begin{aligned} A_{t,j} = \begin{pmatrix} \phi _j(e_1)&{}\phi _j^2(e_1)&{}...&{}\phi _j^{2^{\lambda -1}}(e_1)\\ \phi _j(e_2)&{}\phi _j^2(e_2)&{}...&{}\phi _j^{2^{\lambda -1}}(e_2)\\ \vdots &{}\vdots &{}\ddots &{}\vdots \\ \phi _j(e_{2^{n\lambda }})&{}\phi _j^2(e_{2^{n\lambda }})&{}...&{}\phi _j^{2^{\lambda -1}}(e_{2^{n\lambda }}) \end{pmatrix} , \end{aligned}$$
(11)

for every \(j\in [1,|\mathcal {B}|-1]\).

It can be checked that the vector c, solution of the system, gives the coefficients of the \(h_i\)’s over the basis \(\overline{\mathcal {B}}\) (plus the constant term in first position). A necessary condition for this system to have a solution for any target vector b (i.e. for any coordinate function f) is that the matrix A have full rank. In particular, the following inequality must hold:

$$\begin{aligned} (t+1)|\overline{\mathcal {B}}| + 1 \ge 2^{n\lambda } ~. \end{aligned}$$
(12)
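For instance, with \(n = m = 2\), \(\lambda = 4\) and a basis of cardinality \(|\mathcal {B}| = 10\), we get \(|\overline{\mathcal {B}}| = 1 + 4 \cdot 9 = 37\) and condition (12) becomes \((t+1)\cdot 37 + 1 \ge 256\), i.e. \(t \ge 6\).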

Another necessary condition to get a full-rank matrix is that the squared linear power closure \(\overline{\mathcal {B}}\times \overline{\mathcal {B}}\) spans the entire space \(\mathcal {F}_{\lambda ,n}\). More details about the choice of such a basis are discussed in the following.
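To make the system (6) concrete, the following Python sketch instantiates it in the simplest Boolean setting (\(\lambda = 1\), \(n = 3\), \(t = 1\), with a small hardcoded basis whose pairwise products span \(\mathcal {F}_{1,3}\)). The coefficients are found here by brute force over the coefficient vector, whereas a real implementation would solve the system by Gaussian elimination over \(\mathbb {F}_{2^{\lambda }}\); all names and parameter values are illustrative only.

```python
import random

N_VARS = 3                       # toy Boolean case: lambda = 1, n = 3
N_PTS = 2 ** N_VARS              # truth tables are N_PTS-bit integers

def tt(func):
    """Truth table of a Boolean function on N_VARS bits, packed in an int."""
    return sum(func(p) << p for p in range(N_PTS))

ONE = tt(lambda p: 1)
X = [tt(lambda p, k=k: (p >> k) & 1) for k in range(N_VARS)]
# Hardcoded basis {1, x1, x2, x3, x1*x2}; its pairwise products span F_{1,3}.
BASIS = [ONE, X[0], X[1], X[2], X[0] & X[1]]

def rand_comb(rng):
    """Random F_2-linear combination of the basis functions."""
    v = 0
    for phi in BASIS:
        if rng.randrange(2):
            v ^= phi
    return v

def decompose(f, rng, t=1, max_tries=100):
    """Look for f = sum_{i<t} g_i*h_i + h_t with the g_i random in <B> and the
    h_i obtained by solving the system (6) (here by brute force)."""
    for _ in range(max_tries):
        gs = [rand_comb(rng) for _ in range(t)]
        # Columns of A: all-one column, then the g_i*phi_j blocks, then the phi_j block.
        cols = [ONE]
        for g in gs:
            cols += [g & phi for phi in BASIS]   # bitwise AND = pointwise product
        cols += list(BASIS)
        # Brute-force search over the coefficient vector (fine at this toy size).
        for c in range(2 ** len(cols)):
            acc = 0
            for j, col in enumerate(cols):
                if (c >> j) & 1:
                    acc ^= col
            if acc == f:
                return gs, c, cols
    return None

rng = random.Random(1)
f = rng.randrange(2 ** N_PTS)    # a random 3-bit Boolean coordinate function
assert decompose(f, rng) is not None
```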

4.2 S-box Decomposition

Let \(\mathcal {S}:x\mapsto (f_1(x),f_2(x),\dots ,f_m(x))\) be an s-box. We could apply the above decomposition method to each of the m coordinate functions \(f_i\), which would roughly multiply the multiplicative complexity of a single function in \(\mathcal {F}_{\lambda ,n}\) by m. As suggested in [BMP13, GR16], we can actually do better: the products involved in the decomposition of a coordinate function can be added to the basis for the subsequent decompositions. Specifically, we start with some basis \(\mathcal {B}_1\) and, for every \(i \ge 1\), we look for a decomposition

$$\begin{aligned} f_i(x) = \sum _{j=0}^{t_i - 1}g_{i,j}(x)\cdot h_{i,j}(x) + h_{i,t_i}(x), \end{aligned}$$
(13)

where \(t_i \in \mathbb {N}\) and \(g_{i,j}, h_{i,j} \in \langle \overline{\mathcal {B}}_i \rangle \). Once such a decomposition has been found, we carry on with the new basis \(\mathcal {B}_{i+1}\) defined as:

$$\begin{aligned} \mathcal {B}_{i+1} = \mathcal {B}_i \cup \{g_{i,j}\cdot h_{i,j}\}_{j=0}^{t_i-1}. \end{aligned}$$
(14)

This update process implies that, for each decomposition, the basis grows and hence the number \(t_i\) of multiplicative terms in the decomposition of \(f_i\) might decrease. In this context, the necessary condition on the matrix rank (see (12)) is different for every i. In particular, the number \(t_i\) of multiplications at step i satisfies:

$$\begin{aligned} t_i\ge \frac{2^{n\lambda } - 1 }{\lambda |\mathcal {B}_i^*| +1}-1 ~,\end{aligned}$$
(15)

where as above \(\mathcal {B}_i^*\) stands for \(\mathcal {B}_i \setminus \{1\}\).

4.3 Basis Selection

Let us recall that the basis \(\mathcal {B}_1\) needs to be such that the squared basis \(\overline{\mathcal {B}}_1\times \overline{\mathcal {B}}_1\) spans the entire space \(\mathcal {F}_{\lambda ,n}\), i.e. \(\langle \overline{\mathcal {B}}_1\times \overline{\mathcal {B}}_1\rangle = \mathcal {F}_{\lambda ,n}\) in order to have a solvable linear system. This is called the spanning property in the following. This property can be rewritten in terms of linear algebra. For every \(\mathcal {S}\subseteq \mathcal {F}_{\lambda ,n}\), let us define \(\text {Mat}(\mathcal {S})\) as the \((2^{n\lambda } \times |\mathcal {S}|)\)-matrix for which each column corresponds to the evaluation of one function of \(\mathcal {S}\) in every point of \(\mathbb {F}_{2^{\lambda }}^n\), that is

$$\begin{aligned} \text {Mat}(\mathcal {S}) = \begin{pmatrix} \varphi _1(e_1)&{}\varphi _2(e_1)&{}...&{}\varphi _{|\mathcal {S}|}(e_1)\\ \varphi _1(e_2)&{}\varphi _2(e_2)&{}...&{}\varphi _{|\mathcal {S}|}(e_2)\\ \vdots &{}\vdots &{}\ddots &{}\vdots \\ \varphi _1(e_{2^{n\lambda }})&{}\varphi _2(e_{2^{n\lambda }})&{}...&{}\varphi _{|\mathcal {S}|}(e_{2^{n\lambda }}) \end{pmatrix} , \end{aligned}$$
(16)

where \(\{\varphi _1,\varphi _2,\dots , \varphi _{|\mathcal {S}|}\} = \mathcal {S}\). Then, we have

$$\begin{aligned} \langle \mathcal {S}\rangle = \mathcal {F}_{\lambda ,n} ~\Longleftrightarrow ~ \text {rank}\left( \text {Mat}(\mathcal {S})\right) = 2^{n\lambda } ~. \end{aligned}$$
(17)

To construct the basis \({\mathcal {B}}_1\), we proceed as follows. We start with the basis composed of all monomials of degree 1 plus unity, i.e. the following basis:

$$\begin{aligned} \mathcal {B}_1 = \{1, x_1,x_2,\dots , x_{n}\}. \end{aligned}$$
(18)

Then, we iterate \(\mathcal {B}_1 \leftarrow \mathcal {B}_1\cup \{\phi \cdot \psi \}\), where \(\phi \) and \(\psi \) are randomly sampled from \(\langle \overline{\mathcal {B}}_1\rangle \), until reaching a basis with the desired cardinality that satisfies \(\text {rank}(\text {Mat}(\overline{\mathcal {B}}_1\times \overline{\mathcal {B}}_1)) \ge 2^{\lambda n}\). We add the constraint that, at each iteration, a certain number of candidate products is tried and only the best product is added to the basis, namely the one inducing the greatest increase in the rank of \(\text {Mat}(\overline{\mathcal {B}}_1\times \overline{\mathcal {B}}_1)\). To summarize, the construction of the basis \(\mathcal {B}_1\) is given in Algorithm 1:

Algorithm 1. Randomised construction of the basis \(\mathcal {B}_1\).
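The following Python sketch mimics this randomized construction in the Boolean case (\(\lambda = 1\), where the linear power closure of \(\mathcal {B}\) is \(\mathcal {B}\) itself and squaring is the identity); the parameters (number of variables, target size, number of tries) are illustrative only.

```python
import random

def rank_f2(vectors):
    """Rank over F_2 of truth tables packed as integers (one bit per point)."""
    pivots = {}                       # highest set bit -> reduced vector
    for v in vectors:
        while v:
            top = v.bit_length() - 1
            if top not in pivots:
                pivots[top] = v
                break
            v ^= pivots[top]
    return len(pivots)

def build_basis(n, target_size, tries=20, seed=0):
    """Start from {1, x_1, ..., x_n} and repeatedly add the candidate product
    of two random linear combinations of B that best increases
    rank(Mat(B x B)), until B has target_size elements and B x B spans
    F_{1,n}, i.e. the rank reaches 2^n."""
    rng = random.Random(seed)
    n_pts = 2 ** n
    one = (1 << n_pts) - 1
    coords = [sum(((p >> k) & 1) << p for p in range(n_pts)) for k in range(n)]
    basis = [one] + coords

    def product_rank(b):
        return rank_f2([f & g for f in b for g in b])   # AND = pointwise product

    def rand_comb(b):
        v = 0
        for phi in b:
            if rng.randrange(2):
                v ^= phi
        return v

    while len(basis) < target_size or product_rank(basis) < n_pts:
        best, best_rank = None, -1
        for _ in range(tries):
            cand = rand_comb(basis) & rand_comb(basis)
            r = product_rank(basis + [cand])
            if r > best_rank:
                best, best_rank = cand, r
        basis.append(best)
    return basis

B1 = build_basis(n=4, target_size=7)
assert rank_f2([f & g for f in B1 for g in B1]) == 2 ** 4
```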

Remark 2

In [GR16], the starting basis \(\mathcal {B}_1\) is constructed from a basis \(\mathcal {B}_0\) that has the spanning property by design. In practice, the optimal parameters in [GR16] are always obtained by taking \(\mathcal {B}_1 = \mathcal {B}_0\), and they could be improved with a smaller basis. Our experiments showed that we can achieve a smaller basis \(\mathcal {B}_1\) with the spanning property (hence slightly improving the optimal parameters from [GR16]) with a random generation as described above. For instance, in the Boolean case applied to an 8-bit s-box, Algorithm 1 easily finds a basis \(\mathcal {B}_1\) of 26 elements (involving 17 multiplications) instead of the 31 elements obtained by taking \(\mathcal {B}_1 = \mathcal {B}_0\) as in [GR16].

4.4 Optimal Parameters

Assuming that satisfying the lower bound on \(t_i\) (see (15)) is sufficient to get a full-rank system, we can deduce optimal parameters for our generalized decomposition method. Specifically, if we denote \(s_i = |\mathcal {B}_i^*|\), we get a sequence \((s_i)_i\) that satisfies

$$\begin{aligned} {\left\{ \begin{array}{ll} s_1 = r + n \\ s_{i+1} = s_i + t_i ~~~ \text {with} ~~~ t_i = \left\lceil \frac{2^{n\lambda } - 1 }{\lambda s_i +1} \right\rceil - 1 \end{array}\right. } \end{aligned}$$
(19)

for \(i=1\) to \(m-1\), where r denotes the number of multiplications involved in the construction of the first basis \(\mathcal {B}_1\) (the n free elements of \(\mathcal {B}_1\) being the monomials \(x_1\), \(x_2\), ..., \(x_n\)). From this sequence, we can determine the optimal multiplicative complexity of the method \(C^*\) which then satisfies

$$\begin{aligned} C^* = \min _{r \ge r_0} (r + t_1 + t_2 + \cdots + t_m) ~, \end{aligned}$$
(20)

where \(r_0\) denotes the minimal value of r for which we can get an initial basis \(\mathcal {B}_1\) satisfying the spanning property (that is \(\langle \overline{\mathcal {B}}_1\times \overline{\mathcal {B}}_1\rangle = \mathcal {F}_{\lambda ,n}\)) and where the \(t_i\)’s are viewed as functions of r according to the sequence (19).
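The recurrence (19) and the minimization (20) are straightforward to evaluate; the following Python sketch does so, with an assumed placeholder value for \(r_0\) (in practice \(r_0\) comes from the spanning-property constraint, cf. Table 2).

```python
def t_sequence(n: int, m: int, lam: int, r: int) -> list[int]:
    """Sequence (19): s_1 = r + n and t_i = ceil((2^(n*lam) - 1)/(lam*s_i + 1)) - 1."""
    num = 2 ** (n * lam) - 1
    s, ts = r + n, []
    for _ in range(m):
        t = (num + lam * s) // (lam * s + 1) - 1   # exact integer ceiling
        ts.append(t)
        s += t
    return ts

def optimal_cost(n: int, m: int, lam: int, r0: int, r_max: int = 40):
    """C* from (20): minimize r + t_1 + ... + t_m over r >= r0."""
    return min((r + sum(t_sequence(n, m, lam, r)), r, t_sequence(n, m, lam, r))
               for r in range(r0, r_max))

# Example: an 8-bit s-box viewed over F_16 (n = m = 2, lambda = 4); the value
# r0 = 4 is only an assumed placeholder for the spanning-property threshold.
print(optimal_cost(2, 2, 4, r0=4))
```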

Table 1 provides a set of optimal parameters r, \(t_1\), \(t_2\), ..., \(t_m\) and corresponding \(C^*\) for several s-box sizes and several parameters \(\lambda \) and \(n=m\) (as is the case for bijective s-boxes). For the sake of completeness, we included the extreme cases \(n=1\), i.e. the standard CRV method [CRV14], and \(\lambda =1\), i.e. the Boolean case [GR16]. We obtain the same results as in [CRV14] for the standard CRV method. For the Boolean case, our results slightly differ from [GR16]. This is due to our improved generation of \(\mathcal {B}_1\) (see Remark 2) and to our bound on the \(t_i\)’s (see (15)), which is slightly more accurate than in [GR16].

Table 1. Theoretical optimal parameters for our decomposition method.

Table 2 gives the size of the smallest randomised basis we could achieve using Algorithm 1 for various parameters. At each iteration, \(N=1000\) candidate products of random linear combinations were tried before the best one was added to the current basis.

Table 2. Achievable smallest randomised basis computed according to Algorithm 1.

5 Experimental Results

In this section, we report the concrete parameters for random and specific s-boxes achieved using our generalized decomposition method. Table 3 compares the achievable parameters vs. the optimal estimate for random s-boxes. Note that in the table, the parameters \(|\mathcal {B}_1|\), r, \(t_1,t_2,\ldots ,t_n\) correspond to the parameters in the achievable decomposition for randomly chosen s-boxes. The last column gives the probability of obtaining a successful decomposition for random S-boxes and for randomly chosen coefficients in the basis computation as well as the decomposition step. In all the cases 10 trials each were made to compute the probabilities except for the decomposition of 8-bit S-boxes over \(\mathbb {F}_{2^2}\) where 100 trials each were made.

Table 3. Optimal and achievable parameters for random s-boxes.

In the experiments, successive basis elements were added as products of random linear combinations of elements from the current basis. The basis \(\mathcal {B}_1\) was chosen such that the corresponding matrix for the first coordinate function had full rank (implying that the spanning property of the basis \(\mathcal {B}_1\) was satisfied). The basis was successively updated with the \(t_i\) products formed in the decomposition step of the ith coordinate function. While the parameter \(t_1\) is independent of the chosen s-box, the other \(t_i\) do depend on it. As we see from Table 3, the probabilities increase with the size of the field used for the decomposition.

Table 4 gives the concrete parameters to achieve decomposition for s-boxes of popular block ciphers (namely PRESENT [BKL+07], DES S1 and S8 [DES77], SC2000 S6 [SYY+02], CLEFIA S0 and S1 [SSA+07] and KHAZAD [BR00]). Note that for all the cases considered the parameters from Table 4 yield a decomposition. As above, the last column of the table gives the success probability over the random choice of the coefficients in the basis computation as well as the decomposition step. In all the cases 10 trials each were made to compute the probabilities except for the decomposition of 8-bit S-boxes over \(\mathbb {F}_{2^2}\) where 100 trials each were made.

Table 4. Achievable parameters to decompose specific s-boxes.

6 Implementation

Based on our generic decomposition method, we now describe our implementation of an s-box layer protected with higher-order masking in ARM v7. We focused our study on the common scenario of a layer applying 16 8-bit s-boxes to a 128-bit state. We apply our generalized decomposition with parameters \(n=m=2\) and \(\lambda =4\) (medium case) to compare the obtained implementation to the ones for the two extreme cases:

  • Plain field case (\(\lambda =8\), \(n=1\)): standard CRV decomposition [CRV14];

  • Boolean case (\(\lambda =1\), \(n=8\)): Boolean decomposition from [GR16].

Our implementation is based on the decomposition obtained for the CLEFIA S0 s-box with parameters \((r,t_1,t_2)=(7,7,4)\). Note that it would have the same performance with any other 8-bit s-box with the same decomposition parameters (which we validated on all our tested random 8-bit s-boxes). The input \((x_1,x_2)\) of each s-box is shared as \(([x_1],[x_2])\) where

$$\begin{aligned}{}[x_i] = (x_{i,1},x_{i,2},\dots ,x_{i,d}) ~~~\text {such that}~~~ \sum _{j=1}^dx_{i,j} =x_i . \end{aligned}$$
(21)

Note that for the chosen parameters \((n,m,\lambda )\), the inputs \(x_1\) and \(x_2\) are 4-bit elements, i.e. each 8-bit s-box input is split into two 4-bit halves. The output of the computation is a pair \(([y_1],[y_2])\) where \(y_1\) and \(y_2\) are the two 4-bit coordinates of the s-box output.

We start from a basis that contains the input sharings \(\{[z_1],[z_2]\} = \{[x_1],[x_2]\}\). Then, for \(i = 3\) to 20, each of the 18 multiplications is performed between two linear combinations of the elements of the basis, that is

$$\begin{aligned}{}[z_i] = [u_i] \odot [v_i] ~, \end{aligned}$$
(22)

where \(\odot \) denotes the ISW multiplication with refreshing of one of the operands (see [GR17] for details) and where

$$\begin{aligned} u_{i,j} = \sum _{k< i} \ell _{i,k} (z_{k,j}) ~~\text {and}~~ v_{i,j} = \sum _{k < i} \ell '_{i,k} (z_{k,j}) ~\text {for every}~j \in [1,d], \end{aligned}$$
(23)

for some linearized polynomials \(\ell _{i,k}\) and \(\ell '_{i,k}\) obtained from the s-box decomposition. Once all the products have been computed, the output sharings \([y_1]\) and \([y_2]\) are simple linear combinations of the computed \([z_i]\).

To make the most of the 32-bit architecture, the s-box evaluations are done eight-by-eight since we can fill a register with eight 4-bit elements. The ISW-based multiplications can then be parallelized as suggested in [GR17], except for the field multiplications between two shares. To perform those multiplications, we simply need to unpack the eight 4-bit elements in each 32-bit operand, and then to sequentially perform the 8 field multiplications. These field multiplications are fully tabulated, which only takes 0.25 KB of ROM over \(\mathbb {F}_{16}\) (following the results of [GR17]). Using such an eightfold parallelized ISW multiplication yields an asymptotic gain of \(58\%\) compared to 8 serial ISW multiplications [GR17].

Table 5. Performances in clock cycles.

We compare our results with the bitslice implementation from [GR16] and the CRV-based optimized implementation from [GR17]. The former evaluates 16 s-boxes in parallel (based on bitslicing), whereas the latter performs 4 times 4 s-boxes in parallel (by filling 32-bit registers with four 8-bit elements). Table 5 summarizes the obtained performances in terms of clock cycles, RAM consumption and randomness usage (needed by both the ISW multiplication and the refreshing procedure) with respect to the masking order d, as well as in terms of code size (including the look-up tables).

These results show that our implementation is slightly less efficient in terms of timings (Fig. 1). However, it provides an interesting tradeoff in terms of memory consumption. Indeed, the bitslice implementation has the drawback of being quite consuming in terms of RAM (with 644d bytes needed) and the CRV-based implementation has the drawback of having an important code size (27.5 KB), which is mainly due to the half-table multiplication and the tabulation of linearized polynomials over \(\mathbb {F}_{256}\). Our implementation offers a nice alternative when both RAM and code size are constrained. It also needs the same amount of randomness as the CRV decomposition and less than half as much as the bitslice decomposition.

Additionally, implementations based on our medium case decomposition might provide further interesting tradeoffs on smaller (8-bit or 16-bit) architectures, where bitslicing would be slowed down and where the optimized CRV-based implementation from [GR17] might be too costly in terms of code size.

Fig. 1. Timings for \(n=8\).