1 Introduction

Physical attacks are known to pose a major threat to the cryptographic components and security services in many embedded devices. An attacker obtaining side-channel leakages such as the power consumption or electromagnetic emissions of a cryptographic implementation can extract the secret key by applying suitable statistical tools to the collected data. A number of reports have demonstrated that such attacks are not just a theoretical concern, but that real-world devices can also be compromised [18, 28, 38, 51]. As a consequence, the seminal Differential Power Analysis (DPA) paper by Kocher et al. [21] has been followed by a vast literature on solutions to mitigate these attacks in a wide range of contexts. For example, the insertion of random delays [10] or shuffling [49] are frequently used heuristics to improve the physical protection of software implementations. In contrast to this, re-keying strategies, formalized under the name of leakage-resilient cryptography, provide theoretical tools that enable reducing the security of multiple iterations to that of a single one (cf. [17] for an early result and [47] for a recent one). In this context, one of the most investigated and best understood protections against side-channel attacks is masking [7, 13, 41], which bridges theory and practice. Its underlying principle is to encode any sensitive variable in an implementation into d shares, and to perform the computations on these shares only. Provided that the leakages of all the shares are independent and that the measurements are sufficiently noisy, it ensures that the smallest key-dependent (mixed) moment in the leakage distribution is of order d. Therefore, any adversary trying to extract information from a masked implementation should (ideally) estimate this (mixed) higher-order moment, a task whose complexity increases exponentially in d.

A drawback shared by all these solutions is their significant performance overhead. As a result, methodologies enabling a fair assessment of their security level have evolved in parallel with the countermeasures themselves, so that designers can discuss the security and performance implications of their implementations on a sound basis [46]. Since side-channel analysis is essentially based on the comparison of key-dependent leakage models with actual measurements, these methodological developments have led to a central division between profiled and non-profiled evaluation tools and attacks [50]. In the first case, the adversary/evaluator is allowed to build an accurate (yet not perfect [16]) model for his target device, which generally corresponds to an estimation of the leakage Probability Density Function (PDF). As depicted in the upper left part of Fig. 1, Gaussian Template Attacks (TA) are the most common tool for this purpose [8]. In this (here: exhaustive) approach, one builds a Gaussian model for the leakage of every target intermediate value in the implementation. The main limitation of Gaussian templates is that they are bound to the analysis of the first two moments of a leakage distribution (i.e., unprotected implementations and masking with \(d=2\)). According to the state-of-the-art, the canonical way to analyze higher-order masked implementations would be to switch to non-parametric PDF estimation, e.g., based on histograms or kernels. But this comes at the cost of two important drawbacks. First, these tools imply a more complex (hence measurement-intensive) estimation problem. Second, they estimate all the statistical moments at once, meaning that one loses the detailed intuition that could be obtained from examining each moment separately. Alternatively, one could use the Moments-Correlating Profiled DPA (MCP-DPA) introduced in [31], which suffers from the complementary drawback. Namely, since MCP-DPA is essentially a “per moment” approach, the extracted intuitions only correspond to moments taken separately, and it is unclear how these attacks could be extended towards the joint exploitation of multiple moments.

Fig. 1. Summary of side-channel evaluation tools and attacks.

A comprehensive understanding of how the information leakage of a masked cryptographic implementation is spread among different statistical moments is essential to interpret the results of its security evaluation. In general, a \((d-1)\)th-order secure implementation is defined as an implementation for which the smallest key-dependent moment in the leakage distribution is the dth one, and this is ideally expected to occur for d shares. In practice, however, it frequently happens that glitches (i.e., non-independent leakages) contradict this expectation, leading to informative moments of order smaller than d, both in hardware and software case studies [9, 26]. Significant research efforts have been dedicated to the design of glitch-free implementations, e.g., based on multiparty computation [42] or threshold implementations [30, 32], although in the latter case the number of shares is larger than the claimed order. This highlights the need to determine the exact moments that actually leak [3]. Simple leakage detection tests (e.g., the t-test [44]) can be used for this purpose, but they provide only limited information and merely show the existence of leakage (for a more detailed discussion of the limitations of t-test based leakage detection see [15]). Finally, the recent results in [14] showed that by quantifying the informativeness of each statistical moment in a side-channel attack, one can extrapolate the security level of an implementation as a function of the noise in its measurements (i.e., a parameter that is typically easier for hardware engineers to adapt).

Contribution. Based on this state-of-the-art, our contribution is threefold.

First, we extend the evaluation toolbox for profiled side-channel analysis with three new PDF estimation tools, based on Exponentially Modified Gaussian (EMG) distributions, the Pearson distribution system and Shifted Generalized Lognormal (SGL) distributions. As illustrated in the upper left part of Fig. 1, they allow characterizing statistical moments up to the fourth one, which covers the most relevant masked implementations published so far.

Second, we show that these tools enable the computation of the information leakage in each statistical moment of a leakage distribution (up to the fourth one). We further illustrate that based on such computations, we can design efficient attacks that are able to exploit the information in all the leaking moments jointly, and that the efficiency of these attacks is proportional to the sum of the information provided by each moment.

Finally, we observe that our tools also have applications in the context of non-profiled side-channel analysis, where the adversary assumes some a priori model for his target implementation (e.g., typically Hamming weights or Hamming distances). In this context as well, one can divide existing solutions between “per moment” and “PDF-based” distinguishers (see the middle right part of Fig. 1). Usual representatives of the first category include Correlation Power Analysis (CPA) [6] or its equivalents [25] for first-order moments, and higher-order DPA [37], Correlation-Enhanced Power Analysis Collision Attacks (CEPACA) [27] or Moments-Correlating Collision DPA (MCC-DPA) [31] for higher-order moments. The most common representative of the second category is Mutual Information Analysis (MIA) [19], which usually relies on (non-parametric) histograms or kernels [2], although any PDF estimation tool is in principle eligible. We show that MIA based on the previously mentioned PDF estimation tools (EMG, Pearson, SGL) leads to interesting efficiency tradeoffs for implementations leaking in moments up to the fourth.

The combination of these tools and methods is a valuable input for the evaluation of the masking countermeasure, since it allows a more accurate understanding of implementation weaknesses due to glitches (or any other physical default). Furthermore, these tools are not limited to evaluation: they also lead to new attacks exploiting a (practically relevant) combination of moments. Finally, we remark that our results raise relevant questions regarding the so-called simplifying distinguishers at the bottom of Fig. 1. In this context, the adversary/evaluator does not build a model for every target intermediate value but for a combination of them (or of their bits). All the published simplifying distinguishers (e.g., linear regression in the profiled case [43], its on-the-fly extension [12] and stepwise regression [50] in the non-profiled case) mix a “per moment” approach [11] with simple (typically Gaussian) PDF estimations. Hence, whether one could combine a simplifying distinguisher (which provides useful intuitions regarding the parts of the computation that leak most) with more complex PDF estimation tools as in this paper (which provide similarly useful intuitions regarding which moments are leaking) remains an interesting open problem.

2 Background

Density estimation – a well-studied field in statistics – generally falls into two major categories, namely non-parametric and parametric methods. Histograms and kernels are among the well-known non-parametric ones, which do not make any assumption about the form of the distribution and use only the sampled data to estimate it. By contrast, Gaussian density estimation, which is the most popular parametric PDF estimator, assumes a symmetric form for the distribution and characterizes it based on its (sample) mean and standard deviation only. As mentioned in the introduction, our focus in this paper is side-channel evaluation, which is commonly based on PDF estimation for building the leakage models. In this section, we shortly recall some frequently applied PDF estimation techniques in the field of side-channel analysis. We only consider a univariate scenario, which is motivated by our experimental case study in Sect. 5: a threshold implementation in which all the shares are manipulated in parallel.

Notations. The parametric PDF estimators make use of statistical moments that we specify as follows. Let X be a (univariate) random variable. The dth-order raw statistical moments are defined as \(\mathsf {E}(X^d)\), with \(\mu =\mathsf {E}(X)\) the mean of the distribution and \(\mathsf {E}(.)\) the expectation operator. The dth-order central moments are defined as \(\mathsf {E}\left( \left( X-\mu \right) ^d\right) \), with \(\sigma ^2=\mathsf {E}\left( \left( X-\mu \right) ^2\right) \) the variance of the distribution. The dth-order standardized moments are defined as \(\mathsf {E}\left( \left( \frac{X-\mu }{\sigma }\right) ^d\right) \), with \(\gamma _1=\mathsf {E}\left( \left( \frac{X-\mu }{\sigma }\right) ^3\right) \) the skewness (a measure of the asymmetry of the distribution, also known as the first shape parameter), and \(\beta _2=\mathsf {E}\left( \left( \frac{X-\mu }{\sigma }\right) ^4\right) \) the kurtosis (a measure of the peakedness of the distribution, also known as the second shape parameter). Unless otherwise stated, for simplicity we denote first raw, second central, third (and fourth) standardized moments by first, second, third (and fourth) moments respectively.
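As an illustration (not taken from the paper's code base), the following sketch computes these four moments for a vector of leakage samples with NumPy/SciPy; note that `kurtosis` must be called with `fisher=False` to obtain \(\beta _2\) rather than the excess kurtosis.

```python
# Illustrative sketch: estimating the first four moments used throughout the paper.
import numpy as np
from scipy.stats import skew, kurtosis

def first_four_moments(x):
    """Return (mean, variance, skewness, kurtosis) of the samples in x."""
    mu = np.mean(x)                        # first (raw) moment
    var = np.var(x)                        # second central moment, sigma^2
    gamma1 = skew(x)                       # third standardized moment
    beta2 = kurtosis(x, fisher=False)      # fourth standardized moment (non-excess)
    return mu, var, gamma1, beta2
```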

Gaussian Density Estimation. In this case, it is assumed that the leakages follow a Gaussian (normal) distribution, and the PDF is given by:

$$ F(x) = \frac{1}{\sigma \sqrt{2\pi }}e^{-\frac{(x-\mu )^2}{2\sigma ^2}}, $$

with \(\mu \) and \(\sigma \) the estimated mean and standard deviation of the samples. Since a Gaussian distribution considers only the first two moments, it generally leads to a more efficient estimation than non-parametric histograms or kernels (as long as the actual distribution is close enough to a Gaussian one). In other words, if the higher (>2nd) statistical moments of the underlying distribution of the samples are negligible, Gaussian density estimation is extremely efficient. Gaussian templates and regression-based models are among the widely used tools exploiting this assumption [16].
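As a minimal illustration of the contrast with non-parametric estimation (using synthetic samples in place of real traces, with arbitrary parameters), the sketch below builds a Gaussian model from the sample mean and standard deviation, and a kernel estimate from the same data.

```python
# Illustrative sketch: parametric (Gaussian) vs. non-parametric (kernel) estimation.
import numpy as np
from scipy.stats import norm, gaussian_kde

samples = np.random.normal(loc=8.0, scale=1.5, size=10_000)  # stand-in leakage samples

gaussian_model = norm(np.mean(samples), np.std(samples))     # first two moments only
kernel_model = gaussian_kde(samples)                         # all moments, needs more data

x = np.linspace(samples.min(), samples.max(), 5)
print(gaussian_model.pdf(x))
print(kernel_model(x))
```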

Gaussian Mixtures. We mention that yet another approach to PDF estimation for masked implementations would be to consider mixture distributions. As demonstrated in [48], this solution is especially efficient when the profiling phase assumes knowledge of the shares. By contrast, it becomes heuristic – since it then relies on the Expectation Maximization (EM) algorithm – if the shares are unknown [23], which is our running scenario in this work. In particular, we will consider contexts where the different modes of the mixture distributions are well interleaved (i.e., when the noise is large enough for masking to enforce good security guarantees), which makes the EM algorithm hard(er) to apply and stands in contrast with contexts where the modes can be trivially identified by the adversary (see, for example, [29]). That is, our goal is to investigate simple(r) tools that apply to masking when it delivers its promises, and that are guaranteed to converge without any need to guess the number of shares in the target device.

3 New Proposals

We now describe three alternative parametric distributions that can cover moments up to the fourth one. We discuss their advantages as well as the challenges one may face when setting their parameters.

3.1 Exponentially Modified Gaussian

Since the Gaussian distribution is symmetric, its skewness is always zero. The Exponentially Modified Gaussian (EMG) is another parametric distribution which additionally includes this first shape parameter. The PDF of this distribution, which covers the first three moments, is defined as [20]:

$$\begin{aligned} F(x) = \frac{\lambda _3}{2}e^{\frac{\lambda _3}{2}(2\lambda _1 + \lambda _3\lambda _2^2 - 2x)}erfc\left( \frac{\lambda _1 + \lambda _3\lambda _2^2 - x}{\sqrt{2}\lambda _2}\right) , \end{aligned}$$
(1)

where \(\lambda _1\), \(\lambda _2\), \(\lambda _3\) are the parameters of the distribution and erfc(.) refers to the complementary error function defined as:

$$\begin{aligned} erfc(x)=\frac{2}{\sqrt{\pi }} \int _x^\infty e^{-t^2}\,dt. \end{aligned}$$

By means of the sample mean \(\mu \), standard deviation \(\sigma \) and skewness \(\gamma _1\) of the data, these three parameters can be estimated as:

$$ \lambda _1 = \mu - \sigma \left( \frac{\gamma _1}{2}\right) ^{1/3}, \quad \lambda _2^2 = \sigma ^2\left( 1- \left( \frac{\gamma _1}{2}\right) ^{2/3}\right) , \quad \lambda _3 = \frac{1}{\sigma \left( \frac{\gamma _1}{2}\right) ^{1/3}}\cdot $$

It should be noted that the EMG does not cover symmetric distributions, i.e., \(\gamma _1=0\). In practice (and in particular for side-channel attacks) this usually causes no issue, as the estimated skewness is never exactly zero. Nevertheless, if the underlying skewness is zero, the estimated skewness may be very small. Such cases can lead to numerical problems, which can be solved by using libraries for higher-precision computations or by switching to a distribution that covers zero skewness (Gaussian, Pearson). Besides, note that for a negative skewness \(\gamma _1<0\), the distribution is parametrized with the absolute value \(|\gamma _1|\) and then mirrored around the mean.
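A minimal sketch of this parametrization (an illustrative implementation, assuming a strictly positive estimated skewness as discussed above) simply applies the moment-matching formulas and evaluates Eq. (1).

```python
# Illustrative sketch of the EMG parametrization and PDF of Eq. (1); assumes gamma1 > 0.
import numpy as np
from scipy.special import erfc

def emg_params(mu, sigma, gamma1):
    c = (gamma1 / 2.0) ** (1.0 / 3.0)
    lam1 = mu - sigma * c
    lam2 = sigma * np.sqrt(1.0 - c ** 2)   # from lambda_2^2 = sigma^2 (1 - (gamma1/2)^(2/3))
    lam3 = 1.0 / (sigma * c)
    return lam1, lam2, lam3

def emg_pdf(x, lam1, lam2, lam3):
    expo = (lam3 / 2.0) * (2.0 * lam1 + lam3 * lam2 ** 2 - 2.0 * x)
    tail = erfc((lam1 + lam3 * lam2 ** 2 - x) / (np.sqrt(2.0) * lam2))
    return (lam3 / 2.0) * np.exp(expo) * tail
```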

3.2 Pearson Distribution System

The Pearson distribution system is a collection of probability distributions that can be parametrized using the first four moments. In total, twelve different distributions are defined (cf. [33,34,35]), and depending on the estimated moments one type is selected and the corresponding PDF estimation technique is applied. In our experiments we noticed that types I, IV and VI are the only necessary ones. For descriptions of the other types, the interested reader is referred to the original articles [33,34,35].

Cautionary Note. Distribution systems like Pearson’s are in general very flexible, as they allow characterizing a broad range of combinations of moments. However, they require the estimation of several PDFs, and may face stability problems at the transitions between the different types of distributions (which may occur, e.g., when increasing the number of side-channel samples). In such cases, it is preferable to rely on a single distribution.
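One standard way to select among these types is Pearson's criterion \(\kappa \), computed from \(\beta _1=\gamma _1^2\) and \(\beta _2\); the sketch below is an illustration of the resulting decision rule for the three types used in our experiments, with the boundary cases collapsed for brevity.

```python
# Illustrative sketch: choosing a Pearson type from the sample skewness and kurtosis.
def pearson_type(gamma1, beta2):
    beta1 = gamma1 ** 2
    denom = 4.0 * (4.0 * beta2 - 3.0 * beta1) * (2.0 * beta2 - 3.0 * beta1 - 6.0)
    if denom == 0.0:
        return "boundary"                 # e.g. type III; not needed in our experiments
    kappa = beta1 * (beta2 + 3.0) ** 2 / denom
    if kappa < 0.0:
        return "I"
    if kappa < 1.0:
        return "IV"
    return "VI"
```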

3.3 Shifted Generalized Lognormal

In [24], Low introduced the Shifted Generalized Lognormal (SGL) distribution. It can be parametrized with the first four moments and covers a large range of possible combinations of skewness and kurtosis. Both of these properties are desirable in side-channel evaluations, which makes this distribution an interesting alternative to the Pearson distribution system. The region of skewness/kurtosis combinations covered by the SGL is vast, and we found it to be sufficient for all our practical experiments.

Concretely, the PDF of the SGL is given by:

$$\begin{aligned} F(x) = \frac{1}{2\lambda _3^{1/\lambda _3}\lambda _4\varGamma (1+1/\lambda _3)(x-\lambda _1)}e^{-\frac{1}{\lambda _3\lambda _4^{\lambda _3}}\left| ln\left( \frac{x-\lambda _1}{\lambda _2} \right) \right| ^{\lambda _3}}, \end{aligned}$$
(2)

for \(\lambda _1\,<\,x\,<\,\infty \), where \(\lambda _1\), \(\lambda _2\), \(\lambda _3\), and \(\lambda _4\) are the distribution parameters and \(\varGamma (.)\) denotes the gamma function. These parameters can be estimated using the first four moments. For conciseness, we give only a brief overview of the resulting estimation problem in the full version of the paper [45], and refer the interested reader to [24] for the details.
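For illustration, Eq. (2) can be evaluated directly once these parameters are known; the sketch below (an illustrative implementation that deliberately omits the moment-matching step) returns zero outside the support \(x>\lambda _1\).

```python
# Illustrative sketch: evaluating the SGL density of Eq. (2) for given parameters.
import numpy as np
from scipy.special import gamma as gamma_fn

def sgl_pdf(x, lam1, lam2, lam3, lam4):
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    ok = x > lam1                                    # support: lambda_1 < x < infinity
    z = np.abs(np.log((x[ok] - lam1) / lam2)) ** lam3
    norm_const = 2.0 * lam3 ** (1.0 / lam3) * lam4 * gamma_fn(1.0 + 1.0 / lam3) * (x[ok] - lam1)
    out[ok] = np.exp(-z / (lam3 * lam4 ** lam3)) / norm_const
    return out
```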

3.4 Computational Complexity

The presented parametric methods all have different PDFs with different computational complexities. For SGL, the computation of the parameters from the first four moments takes considerably longer than for all other discussed distributions. To give some intuition on the run time of the different PDFs, we performed experiments using 100 randomly generated sets of moments and ran each PDF 100 times for each of these sets. Then we computed the average over all executions of each PDF. The Gaussian distribution is used as a reference and has an average run time of 0.0034 s on an Intel i5-4200M CPU. The averages increase with the number of moments considered in the distribution: 0.0082 s (EMG), 0.029 s (Pearson), 1.70 s (SGL).

4 Simulated Experiments

In order to better understand the usefulness of the tools proposed in Sect. 3 in the context of side-channel analysis, we present a couple of simulated experiments. In the following, we use mathematically-generated leakages derived from:

$$\begin{aligned} l = \mathrm {HW}(s\oplus c_1\oplus c_2) + \mathrm {HW}(c_1) + \mathrm {HW}(c_2), \end{aligned}$$
(3)

where \(\mathrm {HW}(.)\) denotes the Hamming weight function, s a sensitive (secret) 4-bit variable, and \(c_1\) and \(c_2\) uniformly distributed random masks in \(\{0,1\}^4\). Note that this example applies to any nibble-oriented cipher, e.g., PRESENT [4], and the basic evaluation procedure presented in this paper does not change for larger bit sizes: the only adjustment is the number of possible classifications, i.e., \(2^n\) instead of \(2^4\) for n-bit variables. In this simulation we assume that the target is a hardware design where the shares are processed at the same time. This scenario essentially emulates a second-order Boolean masking scheme, where we only focus on the encoding of a single variable s in a noise-free situation. In this context, the first and second moments of the leakage distribution are expected to be independent of s. For each \(s\in \{0,1\}^4\), we estimate the PDF using both non-parametric (kernels) and parametric (Gaussian, EMG, Pearson, SGL) tools. The first four moments for each s, plotted in Fig. 2(a), reveal that there is indeed no dependency between s and the first two moments (i.e., they remain constant for all s). Hence, the only way s can be distinguished is by observing the third moment. Since kernel-based density estimation considers all possible moments, it can be used to distinguish s, as shown in Fig. 2(b).
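A compact simulation of this setting (an illustrative sketch; the sample count is arbitrary) generates the leakages of Eq. (3) for every value of s and prints their first four moments, confirming that only the third moment varies with s.

```python
# Illustrative sketch: noise-free leakages of Eq. (3) and their per-s moments.
import numpy as np
from scipy.stats import skew, kurtosis

HW = np.array([bin(v).count("1") for v in range(16)])   # 4-bit Hamming weight table
rng = np.random.default_rng(0)
n = 200_000                                             # samples per value of s

for s in range(16):
    c1 = rng.integers(0, 16, n)                         # uniform masks
    c2 = rng.integers(0, 16, n)
    l = HW[s ^ c1 ^ c2] + HW[c1] + HW[c2]               # Eq. (3)
    print(s, np.mean(l), np.var(l),
          round(skew(l), 4), round(kurtosis(l, fisher=False), 4))
# The mean and variance are constant over s; only the skewness depends on s (cf. Fig. 2(a)).
```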

Fig. 2. The estimated moments for each possible \(s\in \{0,1\}^4\) (a) and kernel-estimated PDFs (b) for mathematically-generated leakages corresponding to a 2nd-order masking.

Fig. 3. The estimated PDFs for mathematically-generated leakages corresponding to a 2nd-order masking, obtained with various parametric tools from Sects. 2 and 3.

By contrast, the third moment is not used to parametrize the Gaussian distribution, and thus each s results in the same distribution in this case (as per Fig. 3(a)). This example shows why Gaussian density estimation cannot be used to analyze leakages residing in moments of order higher than two. Finally, our newly proposed estimators consider moments up to the fourth one, and therefore they can be used to quantify the information leakage of our simulated masking experiment (as can be seen in the remaining parts of Fig. 3).

5 Practical Case Studies

To examine the applicability and efficiency of the above-mentioned solutions, we consider a threshold implementation of the PRESENT cipher [4] on an FPGA platform. More precisely, the target design is Profile 2 presented in [36], which follows a serialized architecture, i.e., using one instance of the S-box for the whole SLayer. This masked hardware implementation has been selected for the practical investigations because of its second- and third-order univariate leakages, which allow us to examine our proposed tools. If there were no leakage at order three or higher, the difference between our tools and the Gaussian estimator could not be examined.

In the target implementation, the data state is represented by \(d=3\) Boolean shares, and the SLayer is based on the 2-stage masked S-box described in [32]. In other words, each S-box operates on 4-bit data in a pipelined fashion and needs two clock cycles to be computed. For more details on the design architecture, we refer the interested reader to [31, 36].

The leakage traces are collected from a Xilinx Virtex-II Pro FPGA embedded on SASEBO [1]. The sampling rate was set to \(1\,\mathrm {GS/s}\) and the target FPGA clock was driven at a frequency of \(3\,\mathrm {MHz}\).

We collected 100,000,000 traces to be used in our experiments. During the measurements, the PRNG that provides random data (masks) for the sharing of the plaintext was kept active. We also examined and confirmed the uniform distribution of the masks.

A previous MCP-DPA analysis by Moradi and Standaert [31] on the same implementation revealed that the first pipeline stage of the target S-box exhibits the most informative leakages. It confirms that no first-order leakage can be exploited from this implementation, whereas the second and third moments are indeed informative. It also suggests that second-order leakages are more informative than third-order ones. By contrast, and as exhaustively discussed in the introduction, two important questions remain open. First, can we quantify the informativeness of the different moments on a (somewhat) more formal basis? Second, given that more than a single moment provides information, can we design an attack that jointly exploits these moments (in contrast with MCP-DPA, which only exploits moments one by one)?

Both questions can be answered in the affirmative by the following discussion. In order to make our results comparable with [31], we focus on the same parts of the leakage traces. Namely, we analyze the most informative clock cycle of the S-box execution. Based on this case study, we show that the newly introduced PDF estimation tools are powerful ingredients for the information theoretic analysis of a threshold implementation. First, they are able to extract an amount of information from the traces comparable to kernel density estimation. Second, they are useful to estimate the informativeness of each moment, and to perform attacks based on the best combination of moments carrying significant information. Finally, they can naturally and efficiently be embedded in PDF-based non-profiled attacks such as MIA.

5.1 Profiled Evaluations and Attacks

First, we examine the information leakage of the target device using an information theoretic approach. The idea to use Mutual Information (MI) as an evaluation metric was introduced in [46]. It was later refined in [39] to incorporate the fact that the leakage distribution is only estimated, which can potentially bias the estimation of the MI. The so-called Perceived Information (PI) is used to reflect this bias and can be computed as:

$$\begin{aligned} \hat{PI}(S;L) = H[S] + \sum _{s\in \mathcal {S}}Pr[s]\sum _{l\in \mathcal {L}}Pr_{\mathsf {chip}}[l|s]\cdot log_2\hat{Pr}_{\mathsf {model}}[s|l], \end{aligned}$$
(4)

where \(Pr_{\mathsf {chip}}\) denotes the chip’s true distribution (which is unknown but can be sampled) and \(\hat{Pr}_{\mathsf {model}}\) refers to the adversary’s estimated model (for which we have an analytical formula). Computing the PI essentially requires an estimated \(\hat{Pr}_{\mathsf {model}}\), which is exactly what our PDF estimation tools provide. In our experiments, we followed the procedure presented in [16] for computing this metric. In particular, we used 10-fold cross-validation and report the mean of the resulting PI estimates. We start by looking at the information extracted using all the moments enabled by each PDF estimation tool. We then analyze (subsets of) these moments separately.
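As an illustration of Eq. (4), the following sketch estimates the PI with Gaussian templates for a uniform 4-bit target, with a hold-out set playing the role of samples from the chip's true distribution (the data layout and function names are illustrative and do not reproduce our evaluation code).

```python
# Illustrative sketch: PI estimation with Gaussian templates and a hold-out test set.
import numpy as np
from scipy.stats import norm

def perceived_information(train, test, n_values=16):
    """train/test: dicts mapping each value s to a 1-D array of leakage samples."""
    models = {s: norm(np.mean(train[s]), np.std(train[s])) for s in range(n_values)}
    pi = np.log2(n_values)                                   # H[S] for a uniform S
    for s in range(n_values):
        l = test[s]                                          # samples of Pr_chip[l|s]
        p_l_given_t = np.array([models[t].pdf(l) for t in range(n_values)])
        p_s_given_l = p_l_given_t[s] / p_l_given_t.sum(axis=0)   # Bayes, uniform prior
        pi += np.mean(np.log2(p_s_given_l)) / n_values       # Pr[s] = 1/16
    return pi
```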

Combined Moments. In order to compare our proposed solutions (EMG, Pearson, SGL) with the established ones (kernels, Gaussian), we first compute the PI using all the covered moments. We estimate \(\hat{Pr}_{\mathsf {model}}\) using the different estimators and compare the results. As previously mentioned, this experiment only covers 100 sample points corresponding to the power peak of the targeted clock cycle. The 100,000,000 traces are divided into 10 sets. For each of the 10 runs we use one of these 10 sets (each with 10,000,000 measurements) as samples of the chip’s true distributions, and the remaining 9 sets (90,000,000 measurements) to estimate the model distribution (\(\hat{Pr}_{\mathsf {model}}\)). Figure 4(a) contains the results.

Fig. 4. Kernel-, Gaussian-, EMG-, Pearson- and SGL-based PI estimation with all covered moments (a) using 100,000,000 meas., (b) over the number of meas.

At first glance, it can be observed that the PI achieved when using the Gaussian distribution to estimate \(\hat{Pr}_{\mathsf {model}}\) is the lowest. This implies that not all available information is contained in the first two moments (the only ones captured by a Gaussian distribution). More interestingly, kernel-based density estimation is non-parametric and is therefore expected to provide the highest PI if its bandwidth is well adapted and enough samples are available. Yet, we observe that this is not exactly the case in our experiments. As depicted in Fig. 4(b) (where we focus on the most informative sample 719), this is most likely due to an estimation issue (i.e., a lack of samples). As expected, the non-parametric kernel density estimation is the slowest to converge in this case. This suggests an interesting feature of our new parametric tools. Namely, whereas Gaussian estimation is very fast but limited to the exploitation of two moments (hence leading to less efficient attacks, as will be discussed next), EMG-, Pearson- and SGL-based estimations combine a faster convergence than kernels with a similar informativeness.

Summarizing, we can conclude that PDFs covering the right combination of moments lead to the best tradeoff between a fast convergence towards a well-estimated model and a well-informative model once properly estimated (i.e., a model for which the PI should be close to the MI [16]). By contrast, the previous results do not allow us to draw conclusions about the relative informativeness of each moment (which could possibly be used to further speed up the model estimation and the attacks), which motivates the following analysis.

Separate Moments. An interesting property of the parametric estimators is the ability to consider only selected moments instead of trying to characterize any possible moment (as in non-parametric estimations). Using the Gaussian distribution as an example, we can compute the information contained exclusively in the first two moments, as this distribution only considers the mean and variance. Similarly, it is also possible to compute the PI for the first three moments (with EMG distributions) and the first four moments (with Pearson’s distribution system and SGL distributions). In the following, we present an approach that enables us to compute the PI both for each moment taken separately and for any combination of those.

For this purpose, and taking the case where we focus on a single moment, we simply have to set all but one of the moments to a fixed value. For example, suppose that we want to consider the information contained in the first moments of a Gaussian distribution only. We achieve this by considering a Gaussian model where the means are estimated as in the previous section, but the variances are set to a fixed value, which essentially removes from the templates any secret-dependent information carried by the second moments. Since changing the variances affects the shape of the distributions, the fixed value can be chosen as the average of the variances (over the 16 templates) to minimize the distance between the original distributions and the ones with a fixed variance. A similar technique works for any of our parametric estimators, and for any (combination of) moments.
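For the Gaussian example above, this idea can be sketched as follows (an illustrative snippet, not the code used in our evaluations): the per-value means are kept, while every variance is replaced by the average variance, so that only the first moment can carry secret-dependent information. Evaluation through variances proceeds analogously by fixing the means instead, and the same principle extends to the skewness and kurtosis parameters of the EMG, Pearson and SGL models.

```python
# Illustrative sketch: Gaussian templates in which only the first moment is kept variable.
import numpy as np
from scipy.stats import norm

def templates_first_moment_only(train, n_values=16):
    means = np.array([np.mean(train[s]) for s in range(n_values)])
    variances = np.array([np.var(train[s]) for s in range(n_values)])
    common_sigma = np.sqrt(np.mean(variances))        # second moment fixed to its average
    return {s: norm(means[s], common_sigma) for s in range(n_values)}
```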

As an illustration, let us first recall the influence of the first four statistical moments on the shape of the resulting distribution. The third moment, called skewness, measures the asymmetry of the distribution: distributions with positive skewness have their mass leaning to the left with a longer right tail, while distributions with negative skewness lean to the right with a longer left tail. The fourth moment, called kurtosis, measures the “peakedness” (sharpness) of the distribution; consequently, the higher the kurtosis, the sharper the distribution. Note that we consider only the first four moments in our analysis, hence we omit definitions for moments of higher orders.

When we set specific higher-order moments to fixed values (as in our approach), we actually fix the width of the distributions (i.e., their variance), their left-right tendency (i.e., their skewness) or their sharpness (i.e., their kurtosis). Hence, information sitting in the corresponding moments does not contribute to the information-theoretic evaluation, e.g., the mutual information. We would like to emphasize that, in real side-channel measurements, the estimated higher-order moments (categorized, for example, based on the processed data) differ only very slightly. Consider for example the PDFs of the four exemplary distributions shown in Fig. 5(a), taken from the most leaking point of the measurements of our case study (see Fig. 4(a)). The first four moments of each distribution are given in Table 1. All moments of the four distributions are very similar to each other; e.g., the skewnesses of these four distributions differ only slightly. Hence, fixing the skewness of all of them to a specific value (e.g., the average of all skewnesses, given by 0.0064627) does not significantly change the shape of the distributions.

Fig. 5. The PDFs of the six distributions from Tables 1 and 2.

Table 1. The first four statistical moments of four distributions at sample point 719.

Here we consider four different cases:

1. All moments except the first are fixed to their average (evaluation through means).
2. All moments except the second are fixed to their average (evaluation through variances).
3. All moments except the third are fixed to their average (evaluation through skewnesses).
4. All moments except the fourth are fixed to their average (evaluation through kurtoses).

For each case, the shape of the resulting distributions is very close to the original shape in Fig. 5(a). The resulting PDFs of the modified distributions for each case are provided in the full version of the paper [45].

It should be noted that, in the case of simulated data with significantly different moments for each distribution, the resulting shapes of the distributions would also differ dramatically from each other. In such a case, setting the corresponding moments to a fixed (average) value does not make the distributions roughly follow the same shape. If such a large difference between the moments of the (categorized) distributions existed in practice (which rarely happens), the corresponding implementation would be significantly vulnerable to certain attacks, which makes the necessity of performing per-moment evaluations questionable in the first place. As an example, Fig. 5(b) shows two simulated distributions formed by the moments from Table 2. The shape of the distribution with fixed moments is clearly considerably different from that of the two original distributions. In this case, a per-moment approach would not be easily possible with an information-theoretic evaluation tool.

Table 2. The first four statistical moments of two simulated distributions.

We apply this moment-based investigation to the same case study as the previous information theoretic analysis. Hence, we repeat the previous experiments (of Fig. 4(a)) with the same parametric estimators (Gaussian, EMG, Pearson, SGL), but this time we consider each possible moment separately. The results are depicted in Fig. 6, where the PI curves are grouped by the employed estimator. Each part of the figure contains the PI curves obtained for each moment separately. For example, in Fig. 6(a) the curve labeled Gaussian (1st) shows the PI achieved for the first moments (and the curve Gaussian (2nd) depicts the same for the second moments, etc.). Further, we include the PI curve of the combined moments (taken from Fig. 4(a)) and the sum of the PI curves of the separate moments (e.g., Gaussian Sum as the sum of the PI curves of Gaussian (1st) and Gaussian (2nd)).

Fig. 6. PI estimates for the separate moments.

As expected, the first moment does not contain any exploitable information as the implementation is first-order secure. It is also noticeable that the chosen estimator does not affect the PI for the first moment. The second moment leads to the highest PI, and therefore is the most informative moment. As similarly indicated by MCP-DPA, the third moment is informative but not as much as the second one. Furthermore, using two estimators (Pearson, SGL) that also cover the fourth moment, we are not able to detect any significant information leakage in the fourth moment. Therefore, a combination of the second and third moments should suffice to capture most of the available information in the underlying measurements.

Most interestingly, we observe that the sum of the PI values obtained for the separate moments is actually close to the PI estimated with the combined moments. Although informal, this observation is particularly interesting in view of the recent results by Duc et al. [14], where the PI values are connected with the success rate of a (worst-case) template attack using the same model. Indeed, since the sum of the PI values obtained per moment is essentially the same as the PI value obtained with the non-parametric kernel method, it implies that in our case study the separation between moments did not lead to any significant information loss. This suggests that a (simple and intuitive) moment-based side-channel evaluation could be well-founded, at least in certain contexts that would be worth formalizing. Very concretely, it also means that an attack exploiting our two informative (i.e., second and third) moments will be close to optimal in our case.

Profiled Attacks. The results in [14] prove that (under sufficiently noisy leakages) the measurement complexity of a profiled template attack is inversely proportional to the PI value estimated with the same model. In view of the previous discussions, this means that our proposed estimation tools (EMG, Pearson, SGL) should lead to more effective profiled attacks than their counterparts based on Gaussian estimation (because of assumption errors) and kernels (because of estimation errors). Furthermore, attacks exploiting the second moment should lead to a higher success rate than attacks exploiting the other three moments. Finally, the best attack should exploit the combination of the second and third moments. For completeness, we ran experiments to confirm these expectations. We built univariate templates (for the most informative sample point 719) from 90,000,000 measurements and, for each given number of measurements, repeated an attack 1000 times on different measurements (excluding those used for profiling) to compute a subkey recovery success rate. The results of this last experiment are depicted in Fig. 7 and are well in line with the theoretical predictions. In this respect, the most interesting curves are the ones corresponding to the combination of the second and third moments, since they correspond to the best tradeoff between model complexity and attack efficiency, and could not have been reached with existing side-channel evaluation tools.
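For concreteness, such a success-rate experiment can be sketched as follows (an illustrative maximum-likelihood template attack on a 4-bit subkey, where the simple value prediction \(p\oplus k\) stands in for the actual targeted intermediate value of our implementation).

```python
# Illustrative sketch: success rate of a univariate maximum-likelihood template attack.
import numpy as np

def success_rate(models, attack_sets, true_key):
    """models: dict value -> frozen distribution; attack_sets: list of (plaintexts, leakages)."""
    hits = 0
    for plaintexts, leakages in attack_sets:
        scores = np.zeros(16)
        for k in range(16):
            v = plaintexts ^ k                        # predicted intermediate values
            scores[k] = sum(np.log(models[int(vi)].pdf(li) + 1e-300)
                            for vi, li in zip(v, leakages))
        hits += int(np.argmax(scores) == true_key)
    return hits / len(attack_sets)
```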

Fig. 7. Success rate of several univariate template attacks exploiting separate and combined moments, for the most informative sample point 719 in our traces.

Non-profiled Attacks. A detailed discussion with experimental results is provided in the full version of the paper [45].

5.2 Selection of Tools

We have discussed multiple parametric tools, each with its own advantages and disadvantages. Compared to the traditional non-parametric tools, they offer a higher flexibility and a faster convergence. Therefore, they should be preferred if the number of samples is small or if a specific subset of moments (e.g., only two) is to be evaluated. The PDF of the EMG can be computed very efficiently compared to SGL and Pearson. However, it considers only the first three moments instead of four. The Pearson distribution system includes the kurtosis and its PDF is still relatively efficient compared to SGL. Nevertheless, it is made up of multiple different distributions, which can be problematic in certain cases, as pointed out in Sect. 3.2. Therefore, in scenarios where the computation time of the PDF can be ignored and the leakages are covered by the SGL, it is the preferable tool.

However, the computation time is often a limiting factor, and it can be significantly reduced in certain cases by choosing a more limited distribution which is still sufficient to capture all relevant leakage. If the type of implementation and leakage is known, this choice is easy. Gaussian (resp. EMG) is the preferred choice for leakage that is exclusively contained in the first two (resp. three) moments, due to its very efficient PDF. Leakage in the fourth moment can also be efficiently captured with the Pearson distribution system, assuming that the aforementioned problems do not arise. If the type of masked implementation, i.e., the order of masking, is unknown, then this choice of distribution cannot be made as easily; SGL is then the best approach, provided the distribution lies inside the plane of existence of the SGL.

6 Conclusions

This paper introduced a variety of PDF estimation tools to improve the evaluation of leaking devices, both in the profiled and non-profiled settings. Their main interest is their flexibility: our proposals can indeed capture information lying in different moments of the leakage PDF. As a result, we can easily analyze masked implementations and extract useful feedback for hardware designers, i.e., in terms of how much information lies in each moment and how to combine it. This brings a concrete and better-founded counterpart to the recent evaluations of implementations with non-independent leakages in [14], where this quantity of information “per moment” is required. More generally, our findings provide efficient tradeoffs between the cost of profiling and the efficiency of the resulting attacks, since they allow adversaries and evaluators to build models that are tailored to their implementations. These results naturally open various interesting research challenges for future work. As mentioned in the introduction, combining an analysis of moments as in this work with simplifying approaches to leakage modeling (e.g., based on linear regression) would be even more convenient for evaluators. Besides, investigating the “summing rule” of Sect. 5.1 more formally is certainly worth further efforts as well. Finally, our current tools are limited to univariate leakages. While this was sufficient to analyze our hardware case study, it naturally suggests the extension to multivariate case studies as yet another important question. This is especially interesting given that even hardware designs with univariate dth-order security may include a multivariate vulnerability for which fewer than d points are combined [40]. A starting point for this purpose would be to exploit some popular “combining” functions from the side-channel literature (which would allow us to exploit our univariate tools directly).