1 Introduction

The masking countermeasure is among the most investigated solutions to improve the security of cryptographic implementations against side-channel analysis. Concretely, masking amounts to performing cryptographic operations on secret shared data, say with d shares. In short, it allows amplifying the noise in the physical measurements (hence the security level) exponentially in d, at the cost of quadratic (in d) performance overheads [27, 38]. As discussed in [25], these performance overheads may become a bottleneck for the deployment of secure software implementations, especially as the number of shares increases – which is however needed if high security levels are targeted [15].

In this respect, two recent works from Eurocrypt 2017 tackled the challenge of improving the performances of masked implementations. In the first one, Goudarzi and Rivain leveraged the intuition that bitslice implementations are generally well suited to improve software performances, and described optimizations leading to fast masked implementations of the AES (and PRESENT), beating all state-of-the-art implementations based on polynomial representations [22]. In the second one, Barthe et al. introduced new masking algorithms that are perfectly suited for parallel (bitslice) implementations and analyzed the formal security guarantees that can be expected from them [5].

Building on these two recent works, our contributions are in four parts:

First, since the new masking algorithms of Barthe et al. are natural candidates for bitslice implementations, we analyze their performance on a 32-bit ARM Cortex M4 processor. Our results confirm that they allow competing with the performances of Goudarzi and Rivain with limited optimization efforts.

Second, we put forward the additional performance gains that can be obtained when applying the algorithms of Barthe et al. to bitslice ciphers with limited non-linear gates, such as the LS-design Fantomas from FSE 2014 [23].

Third, and since our implementations can run with a very high number of shares (we focus on the case with \(d=32\)), we question their security evaluation. For this purpose, we start from the observation that current evaluation methodologies (e.g., based on leakage detection [10, 16, 21, 33, 44] or on launching high order attacks [35, 39, 49]) are not sufficient to gain quantitative insights about the security level of these implementations (and the risks of errors in these evaluations). Hence, we introduce a new “multi-model” methodology that mitigates these limitations. This methodology essentially builds on the fact that by investigating the security of the masked implementations in different security models, starting from the most abstract “probing model” of Ishai et al. [27], following with the intermediate “bounded moment model” of Barthe et al. [5] and ending with the most concrete “noisy leakage model” of Prouff and Rivain [38], one can gradually build a confident assessment of the security level.

Finally, we apply our new multi-model methodology to our implementations of the AES and Fantomas, and discuss its limitations. Its application allows us to claim so far unreported security levels (e.g., against adversaries exploiting more than \(2^{64}\) measurements) and to conclude that, in front of worst-case adversaries taking advantage of all the exploitable leakage samples in an implementation, performance improvements naturally lead to security improvements.

2 Background

In this section, we recall the parallel masking scheme we aim to study, and the two block ciphers we choose to work with, namely the AES and Fantomas.

2.1 Barthe et al.’s Parallel Masking Algorithm

Masking is a popular side-channel countermeasure formalized by the seminal work of Ishai et al. [27]. Its main idea is to split all the key dependent data (often called sensitive variables) in different pieces which are randomly generated. More formally, masking consists in sharing a sensitive value s such that:

$$\begin{aligned} s = s_1 \oplus s_2 \oplus \cdots \oplus s_d. \end{aligned}$$

In the case of Boolean masking we will consider next, \(\oplus \) is the XOR operation, each share \(s_i\) is a random bit and d is the number of shares. In order to apply masking to a block cipher, one essentially needs a way to perform secure multiplications and to refresh the shares. In the case of the bitslice implementations we will consider next, this amounts to performing secure AND gates and XORing with fresh random values. For this purpose, we will use the algorithms proposed by Barthe et al. at Eurocrypt 2017 [5]. Namely, and following their notations, we denote by \(\varvec{a}= (a_1, a_2, \cdots , a_d)\) a vector of d shares and by \(\textsf {rot}(\varvec{a},n)\) the rotation of vector \(\varvec{a}\) by n positions. Moreover, the bitwise addition and multiplication operations (i.e., the XOR and AND gates) between two vectors \(\varvec{a}\) and \(\varvec{b}\) are denoted as \(\varvec{a} \oplus \varvec{b}\) and \(\varvec{a} \cdot \varvec{b}\), respectively. Based on these notations, the refreshing algorithm is given by Algorithm 1 for any number of shares d. Its time complexity is constant in the number of shares d and it requires d bits of fresh uniform randomness.

Algorithm 1. Parallel refreshing algorithm of Barthe et al. [5].
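As a minimal illustration, the refreshing can be sketched in Python on a d-bit word. The sketch assumes the form \(\varvec{b} = \varvec{a} \oplus \varvec{r} \oplus \textsf {rot}(\varvec{r},1)\) described in [5]; the function names and the rotation direction are illustrative choices, not taken from the implementation discussed in this paper.

```python
import secrets

def rot(word, n, d=32):
    """Rotate a d-bit word left by n positions (direction is an assumption)."""
    n %= d
    mask = (1 << d) - 1
    return ((word << n) | (word >> (d - n))) & mask

def parity(word):
    """XOR of all bits of the word, i.e. the value encoded by the shares."""
    return bin(word).count("1") & 1

def refresh(a, r, d=32):
    """Assumed form of the parallel refresh: b = a XOR r XOR rot(r, 1)."""
    return a ^ r ^ rot(r, 1, d)

a = secrets.randbits(32)   # 32 shares of one sensitive bit
r = secrets.randbits(32)   # d = 32 bits of fresh randomness
b = refresh(a, r)
assert parity(b) == parity(a)   # the encoded bit is unchanged
```

The encoded bit is preserved because \(\varvec{r}\) and \(\textsf {rot}(\varvec{r},1)\) always have the same parity, so their contributions cancel when all shares are XORed together.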

For readability, we next give the multiplication algorithm for the case \(d=4\) in Algorithm 2. Its description for any d can be found in [5]. The time complexity of the algorithm is linear in the number of shares d and it requires \(d\cdot \lceil \frac{d-1}{4}\rceil \) bits of randomness. Intuitively, this algorithm can be viewed as a combination of different steps: (1) the loading (and possible rotation) of the input share(s), (2) a partial product phase between the shares, (3) the loading and rotation of the fresh randomness, and (4) a compression phase where partial products are XORed together, interleaved with the addition of fresh randomness.

Algorithm 2. Parallel multiplication algorithm of Barthe et al. [5] for \(d=4\).
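The four steps listed above can be illustrated with a functional Python sketch for \(d=4\). It is a sketch under assumptions: the order in which partial products and randomness are combined is our reading of the construction in [5] and matters for probing security, so it should be checked against that reference; the code only verifies functional correctness, using one 4-bit random vector, i.e. \(d\cdot \lceil \frac{d-1}{4}\rceil = 4\) random bits.

```python
D = 4
MASK = (1 << D) - 1

def rot(word, n):
    """Rotate a D-bit word left by n positions (direction is an assumption)."""
    n %= D
    return ((word << n) | (word >> (D - n))) & MASK

def parity(word):
    """XOR of all bits of the word, i.e. the value encoded by the shares."""
    return bin(word).count("1") & 1

def mult_d4(a, b, r):
    """Functional sketch of a parallel masked AND for d = 4."""
    c = a & b               # partial products a_i * b_i
    c ^= r                  # fresh randomness
    c ^= a & rot(b, 1)      # partial products a_i * b_(i-1)
    c ^= rot(a, 1) & b      # partial products a_(i-1) * b_i
    c ^= rot(r, 1)          # rotated randomness, cancels r in the parity
    c ^= a & rot(b, 2)      # partial products a_i * b_(i+2)
    return c

# Exhaustive functional check over all sharings and random vectors.
for a in range(16):
    for b in range(16):
        for r in range(16):
            assert parity(mult_d4(a, b, r)) == parity(a) & parity(b)
```

The four rotations cover all 16 products \(a_i\cdot b_j\) exactly once, which is why the parity of the output equals the AND of the two encoded bits.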

2.2 Target Algorithms

The AES Rijndael [13] is a 128-bit block cipher operating on bytes and allowing three different key sizes (128, 192 and 256 bits). We will focus on the 128-bit variant that has 10 rounds. Each round is composed of the succession of 4 operations: SubBytes (which is the non-linear part), ShiftRows, MixColumns and AddRoundKey (except for the last round where MixColumns is removed). Each round key is generated by a key schedule algorithm. Operations will be detailed in the implementation section. The AES’ robustness over the years and widespread use make it a natural benchmark to compare implementations.

Fantomas is an instance of LS-Design [23], of which the main goal is to make Boolean masking easy to apply. It is a 128-bit cipher iterating 12 rounds based on the application of an 8-bit bitslice S-box followed by a 16-bit linear layer (usually stored in a table and called the L-box), together with a partial round constant addition and a key addition. The internal state of Fantomas can be seen as an \(8\times 16\)-bit matrix where the S-box is applied on the columns and the L-box is applied on the rows. The precise descriptions of the S-box and L-box are provided in the extended version of this work available on the IACR ePrint.

We note that another instance of LS-design (namely Robin) has been recently cryptanalyzed by Leander et al. [29] and Todo et al. [47]: both attacks highlight a dense set of weak keys in the algorithm and can be thwarted by adding full round constants in each round [28]. Although there is no public indication that a similar attack can be applied to Fantomas, we considered a similar tweak as an additional security margin (and denote this variant as Fantomas \(^*\)).

2.3 Target Device and Measurement Setups

Our implementations are optimized for a 32-bit ARM Cortex-M4 processor clocked at 100 MHz and embedded in the SAM4C-EK evaluation board [1]. Of particular interest for our experiments, this device has an embedded True Random Number Generator (TRNG) which provides 32 bits of randomness every 80 clock cycles. We recall the description of the ARM processor and instruction set given in [22]. The processor is composed of sixteen 32-bit registers labeled from R0 to R15. Registers R0 to R12 are the variable registers (available for computations), R13 contains the stack pointer, R14 contains the link register and R15 is the program counter. The ARM instructions can be classified in three distinct sets: the data instructions such as AND, XOR, OR, LSR, MOV, ..., which cost 1 clock cycle; the memory instructions such as STR, LDR, ..., which cost 2 clock cycles (with the Thumb extension); and the branching instructions such as B, BL, BX, ..., which cost from 2 to 4 clock cycles. A useful property of the ARM assembly is the barrel shifter. It allows applying one of the following instructions on one of the operands of any data instruction for free: the logical shift (right LSR and left LSL), the arithmetic shift right ASR and the rotate-right ROR.

As for our security evaluations, we performed power analysis attacks using a standard setup measuring voltage variations across a resistor inserted in the supply circuit, with acquisitions performed using a Lecroy WaveRunner HRO 66 oscilloscope running at 625 Msamples/second and providing 8-bit samples.

3 Efficient Implementations

We designed our implementations in a modular manner, starting with building blocks such as refreshing and multiplication algorithms, and then building more complex components such as the S-boxes, rounds, and full cipher upon the previous ones. This adds flexibility to the implementation (i.e., we can easily change one of the building blocks, for example the random number generator) and enables simple cycle counts for various settings. Following this strategy, we first describe the implementation of cipher independent operations, and then discuss optimizations that specifically relate to the AES and Fantomas \(^*\).

3.1 Cipher Independent Components

We start by setting up the parameters of our parallel masking scheme and then depict the implementation of the refreshing and multiplication algorithms.

Given the register size r of a processor, parallel masking offers different tradeoffs to store the shares of a masked implementation. In the following, we opted for the extreme solution where the number of shares d equals r (which minimizes the additional control overheads needed to store the shares of several intermediate values in a single register). In our 32-bit ARM processor example, this implies that we consider a masked implementation with 32 shares.

More precisely, let \(\varvec{s} = (s_1, \cdots , s_{32})\) be a 32-bit word where each \(s_i\) for \(1\le i \le 32\) is a bit and s be a sensitive bit. We have that \(s = \bigoplus _{i=1}^{32} s_i\). Concretely, our implementations will store such vectors of 32 shares corresponding to a single bit of sensitive data in single registers. This allows us to take advantage of the parallelization offered by bitwise operations such as XOR, AND, OR, ... That is, let \(\perp \) be such a bitwise operator and \(\varvec{s}^a\), \(\varvec{s}^b\) two 32-bit words, we have:

$$\begin{aligned} \varvec{s}^a \perp \varvec{s}^b = (s^a_1 \perp s^b_1, \cdots , s^a_{32} \perp s^b_{32}). \end{aligned}$$

In practice, for a block cipher of size n with key size k, its internal state will therefore be represented and stored as \(n+k\) 32-bit words in our parallel masking setting. The initial key sharing (performed once in a leak-free environment) is done as usual by ensuring that the \(s_i\)’s are random bits for \(2\le i \le d\) and \(s_1 = s \oplus s_2 \oplus \cdots \oplus s_d\). These shares are then refreshed with Algorithm 1 before each execution. And the un-sharing can finally be done by computing the value \(\bigoplus _{i=1}^{32} s_i\), or equivalently by computing the Hamming weight modulo 2 of \(\varvec{s}\).
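The sharing and un-sharing steps just described can be sketched as follows. This is a minimal sketch; the mapping of share \(s_1\) to the least significant bit of the word and the function names are assumptions made for illustration.

```python
import secrets

D = 32  # number of shares, equal to the register size

def share(s):
    """Encode bit s as a 32-bit word whose bits XOR to s."""
    word = secrets.randbits(D - 1) << 1      # shares s_2..s_32 are random bits
    s1 = s ^ (bin(word).count("1") & 1)      # s_1 = s XOR s_2 XOR ... XOR s_32
    return word | s1                         # s_1 stored in bit 0 (assumption)

def unshare(word):
    """Recover s as the Hamming weight of the word modulo 2."""
    return bin(word).count("1") & 1

sa, sb = share(1), share(0)
assert unshare(sa) == 1 and unshare(sb) == 0
assert unshare(sa ^ sb) == 1   # XOR acts share-wise on the encoded bits
```

The last assertion only holds because XOR is linear: a plain bitwise AND of two share words does not encode the AND of the two bits, which is why a dedicated secure multiplication is needed.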

One natural consequence of this data representation is that it requires the block cipher description to be decomposed based on Boolean operations. Bitslice ciphers such as Fantomas \(^*\) are therefore very suitable in this context, since they are directly optimized to minimize the complexity of such a decomposition.

Refreshing and Multiplication Algorithms. Since only requiring simple AND, XOR and rotation operations, these algorithms have naturally efficient implementations on our target device. The only particular optimization we considered is to keep all intermediate values in registers whenever possible, in order to minimize the overheads due to memory transfers. (An ARM pseudo-code for the multiplication with \(d=4\) is given in the ePrint version). The random values needed for the refreshings are first loaded and kept in registers. We then compute the \(c_i\)’s and \(d_i\)’s together instead of successively as in Algorithm 2, which saves costly load and store instructions. Finally, the randomness was produced according to two different settings. In the first one, we generated it on-the-fly thanks to the embedded TRNG of our board which costs \(RC=80\) clock cycles per 32-bit word. In the second one, we considered a cheaper PRG following the setting of [22], which costs \(RC=10\) cycles per 32-bit word. Based on these figures, the refreshing algorithm is implemented in 28 clock cycles with the PRG (resp. 98 with the TRNG) and the multiplication algorithm in 197 clock cycles with the PRG (resp. 757 with the TRNG).
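These four cycle counts are consistent with a simple cost model: a fixed cost plus \(RC\) cycles per 32-bit random word, with one word per refreshing and \(\lceil \frac{d-1}{4}\rceil = 8\) words per multiplication for \(d=32\). The constants 18 and 117 in the sketch below are inferred from the reported figures and are not stated elsewhere in the text.

```python
from math import ceil

def random_words_mult(d):
    """Number of d-bit random words used by one multiplication."""
    return ceil((d - 1) / 4)

def refresh_cycles(rc, base=18):
    """Assumed model: inferred fixed cost plus one random word."""
    return base + rc

def mult_cycles(rc, d=32, base=117):
    """Assumed model: inferred fixed cost plus ceil((d-1)/4) random words."""
    return base + random_words_mult(d) * rc

for rc in (10, 80):   # PRG setting, then TRNG setting
    print(rc, refresh_cycles(rc), mult_cycles(rc))
# prints "10 28 197" then "80 98 757"
```

Under this reading, the randomness accounts for 640 of the 757 cycles of a multiplication in the TRNG setting, which is in line with the later observation that randomness generation dominates the cost.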

3.2 Cipher Dependent Components

We now describe how we implemented the AES Rijndael and Fantomas \(^*\) in bitslice mode rather than based on their (more) usual byte representation.

AES Components. The AES S-box is an 8-bit permutation which can be viewed as the composition of an inverse in \(\mathbb {F}_{2^8}\) and an affine function. A well-known method to mask this S-box, first proposed by Rivain and Prouff in [42], is to decompose the inversion into a chain of squarings and multiplications. Yet, this decomposition is not convenient in our parallel masking setting since it is not based on binary operations. Hence, a better starting point for our purposes is the binary circuit put forward by Boyar and Peralta in 2010 [8]. It requires 83 XOR, 32 AND and 4 NOT gates. Recently, Goudarzi and Rivain re-arranged some operations of this circuit in order to improve their implementation of a masked bitsliced AES [22]. We therefore implemented the AES S-box thanks to the latter representation, with each AND replaced by a secure multiplication and the XORs transposed using the corresponding ARM assembly instructions.

Next, thanks to our internal state representation, the ShiftRows operation is easy to implement: it just consists in a re-ordering of the data which is achieved by a succession of load and store instructions.

The AES MixColumns operation is slightly more involved. The usual representation of MixColumns is based on a matrix product in \(\mathbb {F}_{2^8}\), as depicted in the following, where \(c_i\) and \(d_i\) for \(1\le i \le 4\) are bytes:

$$\begin{aligned} \begin{pmatrix} 02&{}03&{}01&{}01\\ 01&{}02&{}03&{}01\\ 01&{}01&{}02&{}03\\ 03&{}01&{}01&{}02 \end{pmatrix} \times \begin{pmatrix} c_1\\ c_2\\ c_3\\ c_4 \end{pmatrix} = \begin{pmatrix} d_1\\ d_2\\ d_3\\ d_4 \end{pmatrix}. \end{aligned}$$

The multiplication by 01 is trivial and the one by 03 can be split into \(02 \oplus 01\), which only leaves the need of a good multiplication by 02 (sometimes called the xtimes function). This function is usually performed thanks to pre-computed tables [13], but it can also be achieved solely with binary instructions. Let \(b = (b_0, \cdots , b_7)\) be a byte with \(b_i \in \{0,1\}\) for \(0\le i \le 7\). We recall that the AES field is defined as \(\mathbb {F}_{2^8} \equiv \mathbb {F}_2[x]/(x^8+x^4+x^3+x+1)\). Using this polynomial, the xtimes can be turned into the following Boolean expression:

$$\begin{aligned} \textsf {xtimes}(b) = \textsf {xtimes}(b_0, \cdots , b_7) = (b_1, b_2, b_3, b_4 \oplus b_0, b_5 \oplus b_0, b_6, b_7\oplus b_0, b_0). \end{aligned} \tag{1}$$

For the parallel masking scheme, each bit \(b_i\) is again replaced by a 32-bit word. So in practice, we simply implement MixColumns by small pieces: for each byte \(c_i\) we load the eight 32-bit words, compute all the products by 02 thanks to Eq. (1), and store the results in a temporary memory slot. Eventually, we recombine the temporary values by XORing them to obtain the right output.
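Eq. (1) can be checked against the usual byte-wise multiplication by 02 with a few lines of Python. The sketch assumes that \(b_0\) denotes the most significant bit of the byte, which is the convention under which Eq. (1) matches the reduction polynomial given above; the helper names are illustrative.

```python
def xtimes_bitsliced(b):
    """Eq. (1) on a list of 8 words, b[0] holding the most significant bit."""
    b0, b1, b2, b3, b4, b5, b6, b7 = b
    return [b1, b2, b3, b4 ^ b0, b5 ^ b0, b6, b7 ^ b0, b0]

def xtimes_byte(x):
    """Reference multiplication by 02 modulo x^8 + x^4 + x^3 + x + 1."""
    x <<= 1
    return (x ^ 0x11B) & 0xFF if x & 0x100 else x

def to_bits(x):
    """Split a byte into 8 bits, most significant bit first."""
    return [(x >> (7 - i)) & 1 for i in range(8)]

def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << (7 - i) for i, bit in enumerate(bits))

assert all(from_bits(xtimes_bitsliced(to_bits(x))) == xtimes_byte(x)
           for x in range(256))
```

Since `xtimes_bitsliced` only uses XORs, it applies unchanged when each entry of `b` is a 32-bit word of shares rather than a single bit.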

Fantomas \(^{\varvec{*}}\) components. Fantomas \(^*\)’s 8-bit S-box is an unbalanced Feistel network built from 3- and 5-bit S-boxes originally proposed in the MISTY block cipher (see [34], Sect. 2.1 and [23]). It can be decomposed into 11 AND gates, 25 XOR gates and 5 NOT gates. Since the S-box is bitsliced, the implementation of the parallel scheme is straightforward. Namely, each \(W_i\) in the algorithm is a 32-bit word encoding one secret bit in 32 shares. As for the AES S-box, ANDs are replaced by secure multiplications and XORs are applied directly.

The Fantomas \(^*\) linear layer (the so-called L-box) can be represented as a \(16 \times 16\) binary matrix M (given in the ePrint version). Let V be a \(16\times 1\)-bit vector of the internal state of Fantomas \(^*\). Applying the L-box consists in doing the product \(M*V\), which corresponds to executing XOR gates between the bits of V, defined by the entries of the matrix M. Since the XOR is a bitwise and linear operation, the L-box can again be computed directly in the parallel masking context (where a bit in the vector V simply becomes a 32-bit word of shares). In practice, as in the original publication of Fantomas \(^*\) [23], we split M in two \(16\times 8\) matrices: a left one and a right one. This allows us to work independently with the first 8 bits and the last 8 bits of V. For this purpose, we load eight 32-bit words and compute the XORs between them corresponding to the left/right parts of M, and store these intermediate values in a temporary memory slot. Eventually, one has just to XOR the results of these two products to recover the output.
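The split L-box computation can be sketched as follows. The matrix used here is a random placeholder, not the Fantomas \(^*\) L-box (which is only given in the ePrint version), and the function names are illustrative; the sketch only shows that the linear layer commutes with the sharing.

```python
import secrets

def parity(word):
    """XOR of all bits of the word, i.e. the value encoded by the shares."""
    return bin(word).count("1") & 1

def lbox(M, V):
    """Multiply a 16x16 binary matrix M by a vector V of 16 words over GF(2)."""
    out = []
    for row in M:
        left = right = 0
        for j in range(8):            # left 16x8 half, first 8 entries of V
            if row[j]:
                left ^= V[j]
        for j in range(8, 16):        # right 16x8 half, last 8 entries of V
            if row[j]:
                right ^= V[j]
        out.append(left ^ right)      # recombine the two partial products
    return out

# Placeholder matrix for illustration only, NOT the Fantomas L-box.
M = [[secrets.randbits(1) for _ in range(16)] for _ in range(16)]
V = [secrets.randbits(32) for _ in range(16)]   # 16 words of 32 shares each
plain = [parity(v) for v in V]                  # the 16 unshared bits
assert [parity(w) for w in lbox(M, V)] == lbox(M, plain)
```

The final assertion states that un-sharing the output words gives the same result as applying the matrix to the unshared bits, so no refreshing or secure multiplication is needed in this layer.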

3.3 Performance Evaluation

Table 1 provides the total number of clock cycles for both the AES and Fantomas \(^*\) in our two settings for the randomness generation. The S-box column reports the percentage of clock cycles spent in the evaluation of the S-box (excluding the randomness generation and refreshings). The linear layer column reports the percentage of clock cycles spent in the evaluation of the linear parts (i.e., ShiftRows, MixColumns and AddRoundKey for the AES; the L-boxes, key and round constant additions for Fantomas \(^*\)). The rand. column reports the percentage of clock cycles spent in the generation of fresh random numbers (including the refresh operations and random values needed in the multiplication). Note that in order to make our results comparable with the ones of Goudarzi and Rivain, we did not consider the evaluation of the AES key schedule and simply assumed that the round keys (or the master key for Fantomas \(^*\)) were pre-computed, stored in a shared manner and refreshed before each execution of the ciphers. Besides, and as in this previous work (Sect. 6.2), we systematically refreshed one of the inputs of each multiplication in order to avoid flaws related to the multiplication of linearly-related inputs. The masked AES implementation in [22] is evaluated on a device similar to ours with up to 10 shares. Using their cost formulas, we can extrapolate the number of clock cycles of their implementation for \(d=32\) shares as approximately 3,821,312 cycles (considering \(RC=10\)), which highlights that the linear complexity of our multiplication algorithm indeed translates into excellent concrete performances. The further comparison of our (share-based) bitslicing approach with the (algorithm-based) one in [22] is an interesting scope for further research. In this respect, we note the focus of our codes was on regularity and simplicity, which allowed fast development times while also leaving room for further optimizations.

Table 1. Performance evaluation results for \(d=32\).

As expected, using the bitslice cipher Fantomas \(^*\) rather than the standard AES Rijndael allows reducing the cycle counts by an approximate factor 2. This is essentially due to the fact that the overall number of secure multiplications of the former is roughly half that of the latter (2112 against 5120 multiplications).
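The two multiplication counts follow from the gate counts and round numbers given earlier, assuming one S-box evaluation per byte of the AES state and one per column of the \(8\times 16\) Fantomas \(^*\) state in each round:

```python
def secure_mults(and_gates_per_sbox, sboxes_per_round, rounds):
    """Total number of secure multiplications for one cipher execution."""
    return and_gates_per_sbox * sboxes_per_round * rounds

aes = secure_mults(32, 16, 10)        # 16 bytes in the 128-bit state
fantomas = secure_mults(11, 16, 12)   # 16 columns in the 8x16 state
print(aes, fantomas, round(aes / fantomas, 2))   # 5120 2112 2.42
```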

This benchmarking highlights that the time spent in the linear layers in very high order (parallel) masked implementations is negligible: efforts are spent in the S-box executions and (mostly) the randomness generation. It suggests various tracks for improved designs, ranging from the minimization of the non-linear components thanks to powerful linear layers, the reduction of the randomness requirements in secure multiplications or the better composition of linear & non-linear gadgets (see Sects. 4.1 and 4.3), and the design of efficient RNGs.

4 Side-Channel Security Evaluation

The previous section showed that bitslice implementations of masking schemes lead to excellent performances, as already hinted by Goudarzi and Rivain [22], and that the parallel refreshing and multiplication algorithms of Barthe et al. in [5] are perfectly suited to them. Thanks to these advances, we are able to obtain realistic timings for very high order masked implementations.

Quite naturally, such very high order implementations raise the complementary challenge that they are not trivial to evaluate. In particular, since one can expect that they lead to very high security levels (if their shares’ leakages are independent and sufficiently noisy), an approach based on “launching attacks” is unlikely to provide any meaningful conclusion. That is, unsuccessful attacks under limited evaluation time and cost do not give any indication of the actual security level (say \(2^x\)) other than that the evaluator was unable to attack in complexity \(2^y\), with potentially \(2^x\gg 2^y\). In the following, we introduce a new methodology for this purpose, based on recent progress in the formal analysis of masking exploiting different proof techniques and leakage models.

4.1 Rationale: A Multi-model Approach

The core idea of our following security evaluation is to exploit a good separation of duties between the different leakage models and metrics that have been introduced in the literature. More precisely, we will use the probing model of Ishai et al. to guarantee an “algorithmic security order” [27], the bounded moment model of Barthe et al. to guarantee a “physical security order” [5], and the noisy leakage model of Prouff and Rivain to evaluate concrete security levels [38].

Step 1. The probing model, composability and formal methods. In general, the first important step when evaluating a masked implementation is to study its security against (abstract) t-probing attacks. In this model, the adversary is able to observe t wires within the implementation (usually modeled as a sequence of operations). From a theoretical point of view, it has been shown in [14] that (under conditions of noise and independence considered in the following steps), probing security is a necessary condition for concrete (noisy leakage) security against (e.g., power or electromagnetic) side-channel attacks. It has also been shown in [5] that it is equally relevant in the case of the parallel implementations we study here (i.e., that it is also a necessary condition in this context).

From a practical point of view, the probing security of simple gadgets such as given by Algorithms 1 and 2 is given in their original papers, and the main challenge for their application to complete ciphers is their composability. That is, secure implementations must take into account the fact that using the output of a computational gadget (e.g., an addition or multiplication) as the input of another computational gadget may provide additional information to the adversary. Such an additional source of leakage is essentially prevented by adding refreshing gadgets. There exist two strategies to ensure that the refreshings in an implementation are sufficient. First, one can use probing-secure computational gadgets, test implementations with formal methods such as [3], and add refreshing gadgets whenever a composition issue is spotted by the tool. This solution theoretically leads to the most efficient implementations, but is limited by the complexity of analyzing full implementations at high orders. Second, one can impose stronger (local) requirements to the computational gadgets, such as the Strong Non Interference (SNI) property introduced in [4]. Those gadgets are generally more expensive in randomness, but save the designers/evaluators from the task of analyzing their implementation globally. As mentioned in Sect. 3.3 we exploited a rough version of this second strategy, by applying an SNI refreshing to one input of every multiplication. As discussed in [7] (e.g., when masking the AES S-box based on a polynomial representation in Sect. 7.2), it is actually possible to obtain SNI circuits with less randomness thanks to a clever combination of SNI and NI gadgets. The investigation of such optimizations in the case of bitslice implementations is an interesting open problem.

Step 2. The bounded moment model and Welch’s T-test. Given that probing security is guaranteed for an implementation, the next problem is to guarantee the physical independence of the shares’ leakages. In other words, the evaluator needs to test whether the leakage function does “re-combine” the shares in some way that is not detectable by abstract probing attacks. From a theoretical viewpoint, this recombination can be captured by a reduction of the security order in the bounded moment model [5]. Concretely, it may be due to physical defects such as computational glitches [31, 32] and memory transitions [2, 11].

From a practical point of view, the security order in the bounded moment leakage model can be estimated thanks to so-called “moment-based security evaluations”. One option for this purpose is to launch high order attacks such as [35, 39, 49]. In recent years, an alternative and increasingly popular solution for this purpose has been to exploit simple(r) leakage-detection tests [10, 16, 21, 33, 44]. We will next rely on the recent discussion and tools from [46].

Step 3. The noisy leakage model and concrete evaluations. Eventually, once a designer/evaluator is convinced that his target implementation guarantees a certain security order, it remains to evaluate the amount of noise in the implementation. Indeed, from a theoretical point of view, a secure masking scheme is expected to amplify the impact of the noise in any side-channel attack (and therefore the worst-case measurement complexity) exponentially in the number of shares. This concrete security is reflected by the noisy leakage model [38].

From a practical point of view, the noise condition for secure masking (and the resulting noisy leakage security) can be captured by an information theoretic or security analysis [45]. In this respect, it is important to note that this condition depends on both the physical noise of the operations in the target implementation and the number of such operations. When restricting the evaluation to divide-and-conquer attacks, which is the standard strategy to exploit physical leakages [30], this number of operations drops to the number of exploitable operations (i.e., the operations that depend on an enumerable part of the key). We will next consider this standard adversarial setting.

Besides, as mentioned at the beginning of the section, one may expect that the security level of a very high order masked implementation is beyond the evaluator’s measurement (and time, memory) capacities. In this context, rather than trying to launch actual attacks we will rely on the (standard cryptographic) strategy of bounding the attack complexity based on the adversary’s power. For this purpose, we will use the tools recently introduced in [15, 26] which show that such bounds can be obtained from the information theoretic analysis of the leakage function (i.e., a characterization of the individual shares’ leakages).

Wrapping up. The main observation motivating our rationale is that security against side-channel attacks can be gradually built by exploiting existing leakage models, starting from the most abstract probing model, following with the intermediate bounded moment model, and finishing with the most physical noisy leakage model. In this respect, one great achievement of recent research in side-channel analysis is that each of those theoretical leakage models has a concrete counterpart allowing its practical evaluation. Namely, the probing security of an algorithm (represented as a sequence of operations) is challenged by formal methods or guaranteed by composable gadgets, bounded moment security is tested thanks to moment-based distinguishers or leakage-detection tools, and noisy leakage security is quantified thanks to information theoretic metrics which eventually bound standard security metrics such as the success rate.

Cautionary note. Because of space constraints, the following sections will not recall the technical details of the tools used in our evaluations (i.e., Welch’s T-test, linear regression and the mutual information metric). We rather specify all the parameters used and link to references for the description of the tools.

4.2 Bounded Moment Security and Security Order

Noise-Efficient Leakage Detection Test. As we rely on SNI refreshings to ensure the composability of our masked implementations, the first step in our evaluation is to evaluate the extent to which the shares’ physical leakages are independent. As mentioned in the previous subsection, this independence is reflected by a security order in the bounded moment model, which can be estimated thanks to leakage detection. For this purpose, we used a variant of the leakage detection test recently introduced in [46], Sect. 3.2. As with the standard detection tools, its main idea is to consider two leakage classes: one corresponding to a fixed plaintext and key, the other corresponding to random (or fixed [16]) plaintext(s) and a fixed key. The test then tries to detect a difference between these classes at different orders (i.e., after raising the leakage samples to different powers). The only specificity of this “noise-efficient” variation is that it mitigates the exponential amplification of the noise due to masking by averaging the traces before raising them to some power, thus reducing the evaluation time and storage. Such an averaging is possible because of our evaluation setting where masks are known. It admittedly makes the test completely qualitative (i.e., the number of traces needed to detect is not correlated with the security level that we discuss in the next subsection). Yet, in view of the noise level of our implementation, it was the only way to detect high order leakages somewhat efficiently.
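The principle of such a detection can be illustrated on simulated data. The sketch below is a toy model and not the tool of [46]: it assumes a \(d=2\) sharing of a single bit, a leakage equal to the sum of the two shares plus Gaussian noise, a fixed-versus-random sensitive bit instead of fixed-versus-random plaintexts, and arbitrary values for the noise level, the number of traces and the number of repetitions.

```python
import random
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic between two lists of samples."""
    return (mean(x) - mean(y)) / (variance(x) / len(x) + variance(y) / len(y)) ** 0.5

def preprocess(samples, order):
    """Centre the samples and raise them to `order` (order 1: unchanged)."""
    if order == 1:
        return list(samples)
    mu = mean(samples)
    return [(v - mu) ** order for v in samples]

def averaged_leak(s, rng, sigma=1.5, reps=10):
    """Toy d = 2 leakage: sum of both shares plus noise, averaged over
    `reps` repetitions acquired with the same (known) mask."""
    s1 = rng.getrandbits(1)
    s2 = s ^ s1
    return sum(s1 + s2 + rng.gauss(0, sigma) for _ in range(reps)) / reps

rng = random.Random(0)
n = 5000
fixed = [averaged_leak(0, rng) for _ in range(n)]
rand = [averaged_leak(rng.getrandbits(1), rng) for _ in range(n)]
t1 = welch_t(preprocess(fixed, 1), preprocess(rand, 1))
t2 = welch_t(preprocess(fixed, 2), preprocess(rand, 2))
print(abs(t1) > 4.5, abs(t2) > 4.5)
```

In this toy model the two classes have the same mean but different variances, so one expects the statistic to stay below the usual 4.5 threshold at order 1 and to exceed it at order 2. Averaging before squaring divides the noise variance by `reps`, which is what makes the second-order difference visible with few traces.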

Unfortunately, and even using this tweak, the complexity of the leakage detection is still exponential in the number of shares and therefore hardly achievable at order 32 (see again [46]). As a result, we studied reduced-order implementations with a limited number of shares/randomness. Similarly to reduced-round versions in block cipher cryptanalysis, the goal of such implementations is to extrapolate the attacks’ behavior based on empirically verifiable but weakened versions of our implementations. In particular, we used such implementations to verify the extent to which the shares are recombined by the physical leakages. Since the implementations considered for this purpose are similar to the one using 32 shares (see next), the hope is that they give the evaluator an estimate of the “security order reduction factor” f caused by physical defects (e.g., [2] showed that transition-based leakages reduce this order by a factor two).

Concretely, we analyzed both tweaked implementations with \(d=2\) and \(d=4\) shares (thanks to adapted software) and the implementation with 32 shares where only 2 (resp. 4) bits of the random numbers generated were actually random, the other 30 (resp. 28) bits being kept constant. All tests gave consistent results and no leakage of order below the expected 2 (resp. 4) was detected. For illustration, the result of a leakage detection test for the Fantomas \(^*\) S-box with \(d=4\) shares (tweaked implementation) is given in Fig. 1. We used 120,000 different traces, each of them repeated 50 times, for a total of 6,000,000 measurements. The top of the figure shows the average trace; the bottom shows the result of the detection test at order 4, where we see that the standard threshold of 4.5 is passed for a couple of samples. We additionally checked that those samples correspond to the multiplications performed during the S-box execution. By contrast, we could not spot evidence of lower order leakages (for which detection plots are given in the ePrint version). We insist that testing such reduced-order implementations does not offer formal guarantees that no flaw may happen for the full version with 32 random shares (Footnote 5). Nevertheless, (i) the fact that we observed consistent results for the \(d=2\) and \(d=4\) cases is reassuring; (ii) we may expect that some physical defaults (such as couplings [9]) become less critical with a larger number of shares, since the shares will be more physically separated in this case; and (iii) most importantly, we will use the factor f as a parameter of our security evaluations, allowing a good risk assessment.

Fig. 1. Noise-efficient leakage detection with 6M traces (50x averaging).

Robustness Against Transition-Based Leakages. The results of the previous detection tests are (positively) surprising since one would typically expect that the transition-based leakages discussed in [2] reduce the security order in the bounded moment model from the optimal \(o = d-1\) to \(o = \left\lceil d/2 \right\rceil -1\). For example, assuming a sharing \(s=s_1\oplus s_2\), observing the Hamming distance between the shares \(s_1\) and \(s_2\) would provide the adversary with leakages of the form \(\mathsf {HD}(s_1,s_2)=\mathsf {HW}(s_1\oplus s_2)=\mathsf {HW}(s)\). By contrast, in our parallel implementation setting, no such transitions could be detected. While we leave the full analysis of this phenomenon (e.g., with formal methods) as an open problem, we next provide preliminary explanations of why this positive result is at least plausible. For this purpose, we first observe that the multiplication Algorithm 2 essentially iterates three types of operations: partial products, compressions and refreshings; and it ensures that any pair of partial products (\(a_i\cdot b_j\), \(a_j\cdot b_i\)) is separated from the other pairs (and the \(a_i\cdot b_i\) partial products) by a refreshing. As already hinted in [5], the distances between such pairs of intermediate results do not lead to additional information to the adversary. So the main source of transition-based leakages would be based on intermediate results separated by refreshings. In this respect, we note that our implementation was designed so that intermediate results are produced progressively according to the previous “compute partial products – compress – refresh” structure, which additionally limits the risk that many unrefreshed intermediates remain in the registers. Eventually, we checked thanks to simulations that intermediate results in successive clock cycles do not lead to detectable transition-based leakages in the bounded moment model. So intuitively, we can explain the absence of such transition-based leakages by the fact that our parallel manipulation of the shares mitigates them (Footnote 6).
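The worst case discussed in [2] can be checked numerically. The sketch below is an illustration only (8-bit values and all helper names are assumptions): it verifies that, for \(d=2\), the Hamming distance between the two shares equals the Hamming weight of the sensitive value, and it tabulates the optimal and reduced security orders for a few values of d.

```python
import random

def hamming_weight(x):
    return bin(x).count("1")

def hamming_distance(a, b):
    return hamming_weight(a ^ b)

def share(secret, d, n_bits, rng):
    """Boolean sharing: d shares of n_bits whose XOR equals `secret`."""
    shares = [rng.getrandbits(n_bits) for _ in range(d - 1)]
    last = secret
    for s in shares:
        last ^= s
    return shares + [last]

def reduced_order(d):
    """Security order ceil(d/2) - 1 expected under transition-based leakages."""
    return (d + 1) // 2 - 1

rng = random.Random(0)
for _ in range(1000):
    secret = rng.getrandbits(8)
    s1, s2 = share(secret, 2, 8, rng)
    # A register transition s1 -> s2 leaks HD(s1, s2) = HW(s1 ^ s2) = HW(secret).
    assert hamming_distance(s1, s2) == hamming_weight(secret)

# (d, optimal order d - 1, order under transitions)
print([(d, d - 1, reduced_order(d)) for d in (2, 4, 32)])
# prints [(2, 1, 0), (4, 3, 1), (32, 31, 15)]
```

For \(d=32\) this gives the orders 31 and 15 that are used as the two evaluation settings in the rest of this section.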

Summarizing, as with any hypothesis test, leakage detection offers limited theoretical guarantees that no lower-order leakages could be detected with more measurements. Yet, our experiments do not provide any evidence of strong re-combinations of the shares’ leakages via transitions or other physical defaults, which can be explained by algorithmic features. Hence, in the following, we will consider two possible settings for our evaluations: the empirically observed one, assuming a security order 31 in the bounded moment model, and a more conservative one, assuming a security order 15 in the bounded moment model.

4.3 Noisy Leakage Security and Information Theoretic Analysis

Assuming the security order of our implementations to be 31 (as observed experimentally) or 15 (taking a security margin due to a risk of physical defaults that we could not spot), we now want to evaluate the security level of these implementations in the noisy leakage model, based on an information theoretic and security analysis. For this purpose, our next investigations will follow three main steps. First we will estimate the deterministic and noisy parts of the leakage function corresponding to our measurements, thanks to linear regression [43]. This will additionally lead to an estimation of our implementations’ Signal to Noise Ratio (SNR). Second, we will use this estimation of the leakage function to quantify the information leakage of our Boolean encodings (assuming security orders 31 and 15, as just motivated), using the numerical integration techniques from [15]. Finally, we will take advantage of the tightness of masking security proofs recently put forward in [26], in order to bound the complexity of multivariate (aka horizontal) attacks taking advantage of all the leakage samples computationally exploitable by a divide-and-conquer side-channel adversary.

Linear Regression and Noise Level. For this first step, we again considered a simplified setting where the evaluator has access to the masks during his profiling phase. Doing so, he is able to efficiently predict the 32 bits of the bus in our ARM Cortex device, and therefore to estimate the leakage function for various target operations thanks to linear regression. More precisely, and given a sensitive value s and its shares vector \(\varvec{s}\) considered in our masking scheme, linear regression allows estimating the true leakage function as \(\hat{\mathsf {L}}(\varvec{s})\approx \hat{\mathsf {D}}(\varvec{s})+\hat{N}\), with \(\hat{\mathsf {D}}(\varvec{s})\) the deterministic part of the leakages and \(\hat{N}\) a noise random variable. As frequently considered in the literature, we used a linear basis (made of the 32 bits of the bus and a constant element) for this purpose. Such a model rapidly converged towards leakages close to Hamming weight ones, with an estimated SNR (defined as the variance of \(\hat{\mathsf {D}}(\varvec{s})\) divided by the variance of \(\hat{N}\)) of 0.05 for the best sample.
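As an illustration of this profiling step, the sketch below fits a 33-element linear basis (32 bus bits plus a constant) by ordinary least squares and derives an SNR estimate. It runs on simulated data: the Hamming weight leakage, the noise level (set so that the true SNR is 0.05) and the number of traces are assumptions, and the small normal-equation solver merely stands in for the numerical library one would normally use.

```python
import math
import random
import statistics

def bit_basis(word, n_bits=32):
    """Linear basis: the n_bits bus bits plus a constant element."""
    return [(word >> i) & 1 for i in range(n_bits)] + [1]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        acc = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = acc / m[i][i]
    return x

def fit_leakage(words, leakages, n_bits=32):
    """Least-squares fit via the normal equations (X^T X) beta = X^T l."""
    k = n_bits + 1
    xtx = [[0.0] * k for _ in range(k)]
    xtl = [0.0] * k
    for w, l in zip(words, leakages):
        row = bit_basis(w, n_bits)
        for i in range(k):
            if row[i]:  # basis entries are 0 or 1, so zero rows contribute nothing
                xtl[i] += l
                for j in range(k):
                    xtx[i][j] += row[j]
    return solve(xtx, xtl)

def estimate_snr(words, leakages, beta, n_bits=32):
    """SNR = variance of the deterministic part / variance of the residual noise."""
    det = [sum(x * c for x, c in zip(bit_basis(w, n_bits), beta)) for w in words]
    noise = [l - d for l, d in zip(leakages, det)]
    return statistics.variance(det) / statistics.variance(noise)

rng = random.Random(1)
sigma = math.sqrt(8 / 0.05)  # var(HW of 32 uniform bits) = 8, so the true SNR is 0.05
words = [rng.getrandbits(32) for _ in range(5000)]
leakages = [bin(w).count("1") + rng.gauss(0.0, sigma) for w in words]
beta = fit_leakage(words, leakages)
snr = estimate_snr(words, leakages, beta)
print(f"estimated SNR: {snr:.3f}")
```

With few profiling traces, the in-sample estimate tends to be slightly above the true value, since estimation errors on the coefficients inflate the variance of the deterministic part.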

Fig. 2. Information theoretic analysis of the encoding.

Encoding Leakage. Given the previous sensitive value s, its shares vector \(\varvec{s}\) considered in our masking scheme, and a leakage function \(\mathsf {L}\) leading to samples \(l=\mathsf {L}(\varvec{s})\), a standard metric to capture the informativeness of these leakages is the Mutual Information [45], defined as follows:

$$ \mathrm {MI}(S;\mathsf {L}(\varvec{S}))=\mathrm {H}[S]+\sum _{s\in \mathcal {S}}\Pr [s]\cdot \sum _{l \leftarrow \mathsf {L}} \mathsf {f}(l|s)\cdot \log _2 \Pr [s|l]. $$

In this equation, \(\mathrm {H}[S]\) is the entropy of the sensitive variable S and \(\mathsf {f}(l|s)\) the conditional Probability Density Function (PDF) of the leakages \(\mathsf {L}(\varvec{s})\) given the secret s. Assuming Gaussian noise, it can be written as a mixture model:

$$ \mathsf {f}(l|s)=\sum _{\varvec{s}\in \mathcal {S}^{d-1}} \mathcal {N}\left( l|(s,\varvec{s}),\sigma _n^2\right) . $$

The conditional probability \(\Pr [s|l]\) is then computed thanks to Bayes’ theorem as:

$$ \Pr [s|l]=\frac{\mathsf {f}(l|s)}{\sum _{s^*\in \mathcal {S}} \mathsf {f}(l|s^*)}. $$

Unfortunately, what we obtained thanks to linear regression is not the true leakage function \(\mathsf {L}(\varvec{s})\) but only its estimate \(\hat{\mathsf {L}}(\varvec{s})\). Hence, what we will compute in the following is rather the Hypothetical Information (HI), defined as:

$$ \mathrm {HI}(S;\hat{\mathsf {L}}(\varvec{S}))=\mathrm {H}[S]+\sum _{s\in \mathcal {S}}\Pr [s]\cdot \sum _{l \leftarrow \hat{\mathsf {L}}} \hat{\mathsf {f}}(l|s)\cdot \log _2 \hat{\Pr }[s|l]. $$

Formally, it corresponds to the amount of information that would be leaked from an implementation whose leakages would be exactly predicted by \(\hat{\mathsf {L}}(\varvec{s})\). Admittedly, we cannot expect that \(\mathrm {HI}(S;\hat{\mathsf {L}}(\varvec{S}))=\mathrm {MI}(S;\mathsf {L}(\varvec{S}))\) in practice (e.g., since we used a linear basis rather than a full one in our regression) (Footnote 7). However, we note that the information leakages of a masked implementation depend only on its security order and SNR, not on variations of the leakage function’s shape. So small errors on \(\hat{\mathsf {L}}\) should not affect our conclusions. Furthermore, in our parallel setting the addition of significant non-linear terms in the regression basis would also directly decrease the security order because it would re-combine the shares in a non-linear manner (see [5]). Since the previous moment-based evaluation did not detect such re-combinations, a linear leakage model is also well motivated from this side. We finally note that adding quadratic terms in our basis could be a way to capture the reduction of the security order from 31 to 15. Yet, for efficiency, we next reflect such reductions of the security order by simply (and pessimistically) reducing the number of random shares in \(\varvec{s}\).
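The definitions above can be evaluated directly for small parameters. The following sketch is a toy instance rather than the numerical integration technique of [15]: it assumes a one-bit sensitive value, d one-bit shares manipulated in parallel, a Hamming weight leakage with Gaussian noise, and a plain midpoint rule. With double-precision arithmetic such a direct approach cannot reach values as small as those reported for 32 shares, so it is only meant to show how the mixture, Bayes’ theorem and the information metric fit together.

```python
import math

def mixture_pdf(l, secret, d, sigma):
    """f(l|s): Gaussian mixture over the d-bit share vectors whose XOR is `secret`.

    With Hamming weight leakage, only the weight of the share vector matters;
    its parity equals the secret bit, and comb(d, w) vectors have weight w.
    """
    total = 0.0
    for weight in range(secret, d + 1, 2):
        total += math.comb(d, weight) * math.exp(-0.5 * ((l - weight) / sigma) ** 2)
    return total / (2 ** (d - 1) * sigma * math.sqrt(2.0 * math.pi))

def hypothetical_information(d, snr, steps=4000):
    """HI(S; L) in bits for a uniform secret bit, by midpoint-rule integration."""
    sigma = math.sqrt((d / 4.0) / snr)  # var(HW of d uniform bits) = d / 4
    lo, hi = -8.0 * sigma, d + 8.0 * sigma
    dl = (hi - lo) / steps
    info = 1.0  # H[S] for a uniform bit
    for k in range(steps):
        l = lo + (k + 0.5) * dl
        f = [mixture_pdf(l, s, d, sigma) for s in (0, 1)]
        norm = f[0] + f[1]
        for s in (0, 1):
            if f[s] > 0.0:
                # Pr[s] * f(l|s) * log2 Pr[s|l], with Pr[s|l] from Bayes' theorem
                info += 0.5 * f[s] * math.log2(f[s] / norm) * dl
    return info

for d in (1, 2, 3, 4):
    print(d, hypothetical_information(d, snr=0.5))
```

In this toy setting the information decreases quickly as d grows at a fixed SNR, which is the qualitative behavior that the curves for the different security orders are meant to capture.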

The result of such an information theoretic evaluation for our Boolean encoding is given in Fig. 2, where we plot the HI in log scale, for various SNRs. Of particular interest are the measured SNR and the SNRs with (2, 4 and 6\(\times \)) averaging, which would correspond to the noise level of sensitive shares vectors appearing multiple times in the implementation, therefore allowing the adversary to reduce the noise of these leakage samples by averaging (which we will discuss next). We also plotted the curves corresponding to the security orders 31, 15 and 7 (i.e., corresponding to a flaw parameter \(f=1,2\) and 4). Remarkably, we see that for the measured SNR, the leakage of a single encoding secure at order 31 would lead to an HI below \(2^{-128}\). Since the masking proofs in [15] show that the measurement complexity of any side-channel attack is inversely proportional to (and bounded by) this information leakage, it implies that a simple attack exploiting a single leakage sample corresponding to a 32-tuple of parallel shares would not be successful even with the full AES/Fantomas \(^*\) codebook. Similarly, a 15th-order secure implementation would be secure with up to a comfortable \(10^{26}\approx 2^{86}\) measurements. Table 2 provides an alternative view of these findings and lists experimental HI values for different levels of averaging.

Table 2. Experimental bounds on \(\log _{10}(\mathrm {HI})\) for the encoding.

Worst-Case Security Level. While the previous figure and table show that an adversary exploiting a single 32-tuple of parallel shares, assuming security order 31 (or 15) and the SNR estimated in the previous section, will not be able to perform efficient key recoveries, it has been recently put forward in [6] and more formally discussed in [26] that optimal side-channel adversaries are actually much more powerful. Namely, such adversaries can theoretically exploit all the 32-tuples in the implementation, and if some of these tuples are manipulated multiple times, average their leakages in order to improve their SNR.

In order to take such a possibility into account in our security evaluations, we therefore started by inspecting the code of our implementations in order to determine (1) the number of linear and non-linear operations that can be targeted by a divide-and-conquer attack (for illustration, we considered an adversary targeting a single S-box), and (2) the number of such operations for which one of the operands is repeated x times in the code. The result of such a code inspection is given in Table 3. Note that the table includes the count of the SNI refreshings added to one input of each multiplication, which we reported as 32 (resp. 11) additional linear operations for the AES (resp. Fantomas \(^*\)) (Footnote 8).

Table 3. S-box code inspection for the AES and Fantomas \(^*\).

Thanks to the tools in [26], we then bounded the measurement complexity of adversaries taking advantage of a single tuple (considered in the previous section), all the tuples, and all the tuples with averaging in Fig. 3. Concretely, the second adversary is simply captured by relying on an “Independent Operation Leakage” assumption which considers (pessimistically for the designer) that the information of all the 32-tuples of shares in the implementation is independent and therefore can be summed. Taking the example of the Fantomas \(^*\) S-box, it means that this adversary can exploit the information of 41 encodings for the linear operations, and 11*32 encodings for the non-linear ones (where the factor 32 comes from the linear cost of the parallel multiplication algorithm, of which the leakage was bounded in [38]). And the third adversary is captured by adapting the encoding leakages depending on the number of repetitions allowed by the code. Taking the example of the linear operations in Fantomas \(^*\), it means that this adversary can exploit the information of 13 encodings with double SNR, 18 encodings with triple SNR, ... The latter is admittedly pessimistic too since it considers an averaging based on the most repeated operand only. Besides, it assumes that sensitive values manipulated multiple times will leak according to the same model (which is not always the case in practice [19]). The main observations of this worst-case security evaluation are threefold:

Fig. 3. Measurement complexity bounds for different attacks.

First, the security levels reached for the first two adversaries are significantly higher than previously reported thanks to “attack-based evaluations”. In particular, we reach the full codebook (measurement) security if the security order was 31 (as empirically estimated) and maintain \(>\!2^{64}\) measurement security if this order was only 15. In this respect, we insist that this order is the only parameter which could lead to an overstated security level (i.e., all the other assumptions in our evaluations are pessimistic for the designer). Quite naturally, the figure also exhibits that masked implementations with lower orders (e.g., 8 or 4) cannot offer strong security guarantees in case of SNRs in the 0.01 range.

Second, the impact of averaging is much more critical, since the adversary then essentially cancels the exponential increase of the noise that is the expected payoff of the masking countermeasure. Roughly, for an implementation secure at order d, doubling the SNR thanks to 2-averaging reduces the security by an approximate factor \(2^d\). By contrast, multiplying the number of target d-tuples (without averaging) by \(\alpha \) only reduces the security by a factor \(\alpha \).

Third, in front of these optimal adversaries, Fantomas \(^*\) offers (slightly) more security than the AES even though we assume the same information leakages for their encodings. This gain is essentially due to the fact that Fantomas \(^*\) implementations are slightly more efficient, effectively reducing the opportunities for the adversary to exploit many leakage samples and to average them.
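The contrast between the second and third adversaries can be made explicit with a back-of-the-envelope computation. The sketch below is not the tool of [26]. It assumes that the measurement bound is exactly the inverse of the summed information (proportionality constant set to 1), that x-fold averaging multiplies the information of an encoding by about \(x^d\) for security order d, and it uses a purely hypothetical per-encoding information of \(2^{-100}\) as a placeholder. Only the tuple count for the Fantomas \(^*\) S-box (41 linear plus \(11\times 32\) non-linear encodings) is taken from the text.

```python
import math

def log2_sum(log2_values):
    """log2 of a sum of terms given by their log2 (avoids floating-point underflow)."""
    top = max(log2_values)
    return top + math.log2(sum(2.0 ** (v - top) for v in log2_values))

def log2_info_averaged(log2_info, order, repetitions):
    """Rough scaling (assumption): x-fold averaging raises the information by x**order."""
    return log2_info + order * math.log2(repetitions)

def log2_measurement_bound(log2_infos):
    """Independent Operation Leakage: informations are summed, bound ~ 1 / sum."""
    return -log2_sum(log2_infos)

LOG2_INFO = -100.0  # hypothetical per-encoding information, placeholder only
ORDER = 15          # conservative security order considered in this section

single = log2_measurement_bound([LOG2_INFO])
all_tuples = log2_measurement_bound([LOG2_INFO] * (41 + 11 * 32))
one_tuple_avg2 = log2_measurement_bound([log2_info_averaged(LOG2_INFO, ORDER, 2)])

print(f"single tuple:              2^{single:.1f}")
print(f"all 393 tuples:            2^{all_tuples:.1f}")
print(f"one tuple, 2x averaging:   2^{one_tuple_avg2:.1f}")
```

Under these assumptions, exploiting all 393 tuples costs about 8.6 bits of measurement security, whereas a single 2-fold averaging already costs 15 bits, which illustrates why repetitions of the same operand matter more than the number of target tuples.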

Towards Mitigating Averaging Attacks. As a conclusion of this paper, we first observe that our experiments raise interesting optimization problems for finding new representations of block cipher S-boxes, minimizing the number of non-linear operations and the multiple manipulation of the same intermediate values during their execution. Besides, and quite fundamentally, Fig. 3 recalls that the security of the masking countermeasure is the result of a tradeoff between an amount of physical noise (reflected by the SNR) and an amount of digital noise (reflected by the shares’ randomness) in the implementations. In this respect, there is a simple way to mitigate the previous “averaging attacks”, namely to add refreshing gadgets to prevent the repetition of the same sensitive values multiple times in an implementation. Remarkably, the systematic refreshing that we add to one input of each multiplication already helps in this respect. For example, we show in the ePrint version that the number of repetitions in our code increases if one removes these refreshings. By extending this approach brutally (i.e., by refreshing all the intermediate tuples in an implementation so that they are never used more than twice: once when generated, once when used), one can therefore mitigate the “all tuples + avg.” adversary of Fig. 3. But most interestingly, the latter observations suggest the search for good tradeoffs between physical and digital noise as a fundamental challenge for sound masking. That is, how to efficiently ensure composability as mentioned in Sect. 4.1 (first step) and prevent the averaging attacks in this section?