1 Introduction

The security of a stream cipher relies on its ability to mimic the properties of the perfectly secure One-Time Pad (OTP): predicting future keystream bits (e.g., by recovering the inner state) must be computationally infeasible. Indeed, as highlighted by algebraic and correlation attacks, any statistical correlation between output bits and linear combinations of input bits is a potential security breach. Cryptographers are therefore caught between implementation requirements, which suggest the use of efficient primitives such as Feedback Shift Registers (FSRs) or Finite State Machines (FSMs), and security requirements, which demand solutions able to disguise the dependence of keystream bits on the inner state of the registers. Many recent stream ciphers therefore rely upon irregular clocks, mutual clock control, non-linear and/or mutual feedback among different registers, or combinations of these solutions.

The cube attack, proposed by Dinur and Shamir [10], can be classified as an algebraic known-plaintext attack. Assuming that a chunk of keystream can be recovered from a known plaintext-ciphertext pair, the attack allows determining a set of linear equations binding key bits. However, cube attacks significantly deviate from traditional algebraic attacks in that the equations are not recovered symbolically, but rather extracted through exhaustive summations over the vertices of boolean cubes spanned by selected public/IV bits – hence the attack's name. Whether a cube yields a linear equation depends both on its size and on the algebraic properties of the cipher. Since the Algebraic Normal Form (ANF) of the cipher (that is, its representation as a binary polynomial) is generally unknown beforehand, in practice the attack usually runs without clear prior insights into a convenient strategy for selecting the cubes – an approach made possible by the fact that the attack only requires black-box access to the attacked cipher. Exploring cubes of different (possibly large) sizes, trying many different sets of indices, and varying the binary assignment of the public bits not belonging to the tested cube are all promising solutions, but they all come at an exponential cost. In a sense, cube attacks can therefore be likened to Time-Memory-Data Trade-Off (TMDTO) attacks, as their success rate strongly depends on the extensiveness of the pre-computation stage, on the memory available to store the results of that stage, and on the amount of data usable to implement it. Consequently, identifying the most favourable design choices is the cornerstone of a successful cube attack.

Contributions. The present paper motivates and discusses in depth a Graphics Processing Unit (GPU) implementation of the cube attack. The target cipher is Trivium [8, 22], already considered in the literature to test the viability of the cube attack [10, 14]. Our contributions can be summarized as follows: (i) We tailor the design and implementation of the cube attack to the characteristics of GPUs, in order to fully exploit parallelization while coping with limited memory. Our framework is extremely flexible and can be adapted to any other cipher at no more cost than some fine performance tuning, mostly related to memory allocation. (ii) We show the performance gain with respect to a CPU implementation, including results obtained on latest-generation GPU cards. (iii) Our implementation allows for exhaustively assigning values to (subsets of) public variables at negligible additional cost. This means extending the quest for superpolys to a dimension never explored in previous works, and, by not being tied to a very small set of IV combinations, potentially weakening one of the basic requirements of the cube attack, namely the assumption of a completely tweakable IV. (iv) Even though we run the attack with only a few preliminary sets of cubes – specifically selected to both validate our code and compare our results with the literature – our findings improve on the state-of-the-art for attacks against reduced-round versions of Trivium.

Roadmap. This paper is organized as follows: Sect. 2 introduces the cube attack and the target cipher Trivium; our implementation of the attack is described in Sect. 3, whereas experimental results are reported and discussed in Sect. 4; Sect. 5 gives an overview of related work; finally, Sect. 6 draws conclusions and suggests possible directions for future work.

2 Preliminaries

In this section, we first describe the theoretical foundations of the cube attack, and then briefly introduce Trivium. More details about Trivium are reported in Appendix A.

The Cube Attack. Let z denote a generic keystream bit produced by a stream cipher \(\mathcal {E}\). z is the result of a function \(E:\mathbb {F} _2^{n+k}\rightarrow \mathbb {F} _2\), computed over the \(n+k\) input bits obtained from an Initial Vector IV of length n and a secret key K of length k. It is well known that z can be expressed as \(z=p(\mathbf {x},\mathbf {y})\), where p is the polynomial representation of E, \(\mathbf {x} =(x_1,\ldots ,x_n)\) is the vector of public variables (IV), \(\mathbf {y} =(y_1,\ldots ,y_k)\) is the vector of secret variables (K), and every variable appears in p with degree at most 1. The cube attack relies on extracting from p a set of linear equations binding the private variables in \(\mathbf {y} \), through a suitable offline pre-computation phase involving the public variables in \(\mathbf {x} \).

Let \(I=\{i_1,\ldots ,i_m\}\subset \{1,\ldots ,n\}\) and let us introduce the complement \(\overline{I} =\{1,\ldots ,n\}\setminus I\) of the set I. With a slight abuse of notation, let us consider variables in \(\mathbf {x} \) as partitioned by I: \(\mathbf {x} =({\mathbf {x}_{I}},\mathbf {x}_{\overline{I}})\), i.e., we tell apart the variables \({\mathbf {x}_{I}} \) indexed by I from those \(\mathbf {x}_{\overline{I}} \) indexed by its complement \(\overline{I}\). Let \(t_I =x_{i_1}\cdots x_{i_m}\) be the monomial induced by I, that is, the product of all variables in \({\mathbf {x}_{I}} \). By writing \(t_I({\mathbf {x}_{I}})\) we want to stress that \(t_{I}\) contains only variables in \({\mathbf {x}_{I}} \). If we factor \(t_I({\mathbf {x}_{I}})\) out of \(p(\mathbf {x},\mathbf {y})\) we obtain

$$\begin{aligned} p(\mathbf {x},\mathbf {y})=t_I({\mathbf {x}_{I}})\cdot p_{S(I)}(\mathbf {x},\mathbf {y})+q(\mathbf {x},\mathbf {y}) \end{aligned}$$

where the quotient \(p_{S(I)}(\mathbf {x},\mathbf {y})\) of the division is called the superpoly of I in p, whereas \(q(\mathbf {x},\mathbf {y})\) is the remainder of the division.

Now, for any binary vector \(\mathbf {v}_{\overline{I}} \), we consider a fixed assignment for the variables \(\mathbf {x}_{\overline{I}} \), and let \(C_I(\mathbf {v}_{\overline{I}})\) denote the cube induced by I and \(\mathbf {v}_{\overline{I}} \), that is, the set of all \(2^m\) possible binary assignments to \(\mathbf {x} \) in which the variables \(\mathbf {x}_{\overline{I}} \) assume the values specified by \(\mathbf {v}_{\overline{I}} \) and the remaining variables \({\mathbf {x}_{I}} \) take all possible combinations. It is easy to verify that no monomial in \(p_{S(I)}\) contains any of the variables \({\mathbf {x}_{I}} \) (i.e., \(p_{S(I)}(\mathbf {x},\mathbf {y})=p_{S(I)}(\mathbf {x}_{\overline{I}},\mathbf {y})\)), whereas every monomial in q misses at least one of the variables in \({\mathbf {x}_{I}} \). For this reason, regardless of \(\mathbf {y} \), the sum of \(p(\mathbf {x},\mathbf {y})\) over all elements \(\mathbf {v} \) of \( C_I(\mathbf {v}_{\overline{I}})\) yields [10]

$$\begin{aligned} \sum _{\mathbf {v} \in C_I(\mathbf {v}_{\overline{I}})} p(\mathbf {v},\mathbf {y})=p_{S(I)}(\mathbf {v}_{\overline{I}},\mathbf {y}) \end{aligned}$$
(1)

which obviously does not depend on variables \({\mathbf {x}_{I}} \) anymore.

If \(p_{S(I)}(\mathbf {v}_{\overline{I}},\mathbf {y})\) is linear, the monomial \(t_I({\mathbf {x}_{I}})\) is called a maxterm for p with the assignment \(\mathbf {v}_{\overline{I}} \). If we can identify maxterms and find the symbolic expression of their superpolys, we obtain a system of linear equations that can be used to recover the secret key.
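A toy example (ours, purely for illustration) may help. Consider \(n=2\) public and \(k=3\) private variables, and

$$\begin{aligned} p(\mathbf {x},\mathbf {y})=x_1x_2y_1+x_1x_2y_2+x_1y_3+y_1y_2. \end{aligned}$$

With \(I=\{1,2\}\) we have \(t_I=x_1x_2\), \(p_{S(I)}(\mathbf {y})=y_1+y_2\), and \(q(\mathbf {x},\mathbf {y})=x_1y_3+y_1y_2\). Summing p over the four vertices of \(C_I\), the monomial \(x_1y_3\) is counted twice and \(y_1y_2\) four times, so both cancel in \(\mathbb {F} _2\), leaving exactly \(p_{S(I)}(\mathbf {y})=y_1+y_2\): \(t_I\) is a maxterm yielding one linear equation in the key bits.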

As \(\mathbf {v}_{\overline{I}} \) is always clear from the context, to improve readability in the following we simply denote \(C_I(\mathbf {v}_{\overline{I}})\) and \(p_{S(I)}(\mathbf {v}_{\overline{I}},\mathbf {y})\) as \(C_I\) and \(p_{S(I)}(\mathbf {y})\), respectively.
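Before moving on, it may help to fix the basic operation of the whole attack in code: by Eq. (1), evaluating a superpoly numerically amounts to \(2^m\) black-box queries to the cipher. A minimal host-side sketch follows (entirely ours; the signature of E is an assumption, not part of any actual implementation):

```cuda
// Host-side sketch of the cube sum of Eq. (1) (ours). E is the cipher
// queried as a black box; iv must already hold the fixed assignment
// v_Ibar on the positions outside I.
#include <cstdint>
#include <vector>

using BitFn = uint8_t (*)(const uint8_t iv[80], const uint8_t key[80]);

uint8_t cube_sum(BitFn E, const std::vector<int> &I,
                 uint8_t iv[80], const uint8_t key[80]) {
  const uint64_t m = I.size();
  uint8_t acc = 0;
  for (uint64_t v = 0; v < (1ULL << m); ++v) {   // all 2^m vertices of C_I
    for (uint64_t b = 0; b < m; ++b)
      iv[I[b]] = (v >> b) & 1;
    acc ^= E(iv, key);                           // sum in F_2 is XOR
  }
  return acc;                                    // = p_S(I)(v_Ibar, y)
}
```

The attack never needs more than such black-box sums: both linearity testing and superpoly reconstruction (Sect. 3.1) reduce to calls of this kind with different key assignments.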

Trivium. Trivium [8] is a stream cipher conceived by Christophe De Cannière and Bart Preneel, part of the eSTREAM portfolio. It generates up to \(2^{64}\) bits of output from an 80-bit key K and an 80-bit Initial Vector IV, and it shows remarkable resistance to cryptanalysis despite its simplicity and its excellent performance. Trivium has a 288-bit internal state consisting of three shift registers of length 93, 84 and 111, respectively. The feedback to each of these registers and the output bit of the cipher are obtained through non-linear combinations involving in total 15 of the 288 internal state bits. To initialize the cipher, K and IV are written into two of the shift registers, with a fixed pattern filling the remaining bits. 1152 initialization rounds guarantee that output is produced only after all key bits and IV bits have been sufficiently mixed together to define the internal state of the registers.
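For reference, the following is a minimal bit-per-byte rendition of the Trivium specification [8] (layout and naming are ours, and a production implementation would typically be bit-sliced; the sketch compiles as host code under nvcc or any C++ compiler):

```cuda
// A minimal bit-per-byte rendition of Trivium (ours). s[1..288] mirrors
// the indexing of the specification; init() runs the blank rounds.
#include <cstdint>
#include <cstring>

struct Trivium {
  uint8_t s[289];                       // s[1..288], one bit per byte

  void init(const uint8_t key[80], const uint8_t iv[80],
            int rounds = 1152) {        // fewer rounds = reduced variants
    memset(s, 0, sizeof(s));
    for (int i = 0; i < 80; ++i) s[1 + i]  = key[i];  // register A
    for (int i = 0; i < 80; ++i) s[94 + i] = iv[i];   // register B
    s[286] = s[287] = s[288] = 1;                     // register C
    for (int i = 0; i < rounds; ++i) step();          // discard output
  }

  uint8_t step() {                      // one round, returns the output bit
    uint8_t t1 = s[66] ^ s[93], t2 = s[162] ^ s[177], t3 = s[243] ^ s[288];
    uint8_t z = t1 ^ t2 ^ t3;
    t1 ^= (s[91]  & s[92])  ^ s[171];
    t2 ^= (s[175] & s[176]) ^ s[264];
    t3 ^= (s[286] & s[287]) ^ s[69];
    memmove(s + 2,   s + 1,   92);      // shift register A (s1..s93)
    memmove(s + 95,  s + 94,  83);      // shift register B (s94..s177)
    memmove(s + 179, s + 178, 110);     // shift register C (s178..s288)
    s[1] = t3; s[94] = t1; s[178] = t2;
    return z;
  }
};
```

Reduced-round variants, such as the Trivium-768 considered later, simply stop the loop in init() earlier.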

3 The Proposed GPU Implementation of the Attack

In this section, we present, detail, and discuss our attack, designed to run on a cluster equipped with Graphics Processing Units (GPUs). As previously mentioned, the success of a cube attack is highly dependent on suitable implementation choices. In order to better explain our own approach, we start with an analysis of the cube attack from an implementation-oriented perspective.

3.1 Practical Cube Attack

At a high level, any practical implementation of the cube attack requires performing the following steps:  

S1: Find as many maxterms as possible;

S2: For each maxterm, find the corresponding linear equation(s);

S3: Solve the obtained linear system.

Step S1. This is the core of the attack, where cubes that yield linear equations are identified. Choosing candidate maxterms (i.e., cubes) is non-trivial. Intuitively, the degree of most maxterms lies in a specific range that depends on the (unknown) degree distribution of the monomials of the polynomial p. If the degree of \(t_I\) is too small, then \(p_{S(I)}\) is most likely non-linear; if the degree of \(t_I\) is too large, then \(p_{S(I)}\) will probably be constant (e.g., identically zero). Moreover, since the complexity of the offline phase scales exponentially with |I|, the degree of the potential maxterms that can be tested is strongly influenced by practical limitations.

In [10], the authors propose a random walk to explore a maximal cube \(C_{I_{\max }} \): starting from a random subset \(I\subset {I_{\max }} \), the superpoly \(p_{S(I)}\) is iteratively tested to decide whether the degree of \(t_I\) should be increased or decreased. The underlying idea is to use a probabilistic approach to identify the optimal size |I|. In [14], the authors evaluate the cipher on all vertices of a maximal cube \(C_{I_{\max }} \), store the results in a table T of size \(|T| = 2^{|{I_{\max }} |}\), and then apply the Moebius transform to the entire table T, thus computing at once the sums over the \(\binom{|{I_{\max }}|}{d}\) sub-cubes of \(C_{I_{\max }} \) of degree d, for \(d=0,\ldots ,|{I_{\max }} |\). These cubes are all possible sub-cubes of \(C_{I_{\max }} \) in which the variables outside the cube have been set to 0. In this case the rationale is minimizing processing cost by reusing partial computations as much as possible. Interestingly, the authors of [14] show that specific cubes perform better than others, at least for reduced-round variants of Trivium, and use their findings to select the most promising maximal set \({I_{\max }} \).

Neither of these strategies is suitable for GPUs. The stochastic nature of the random walk prevents the sequence of steps from being determined a priori, since each computation is performed only when (and if) needed. The Moebius transform, on the other hand, requires a rigid schema of calculations and a large number of alternating read and write operations in memory that must be synchronized. Both approaches are conceived for settings in which computational power is a constraint (while memory is not), and all advantages of the Moebius transform are lost in a parallel setting. We rather perform an exhaustive search over a portion of a maximal cube, a solution that is highly parallelizable and feasible with our computational resources.
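For reference, the Moebius-transform pass of [14] essentially boils down to the following in-place folding (a host-side sketch in our notation, code entirely ours):

```cuda
// Host-side sketch of the Moebius-transform trick of [14] (code is ours).
// T initially holds E evaluated on every vertex of C_{I_max}; after the
// pass, T[mask] holds the sum over the sub-cube whose dimensions are the
// set bits of mask (all other cube variables at 0).
#include <cstdint>
#include <vector>

void moebius_transform(std::vector<uint8_t> &T) {    // |T| must be 2^n
  for (size_t stride = 1; stride < T.size(); stride <<= 1)
    for (size_t x = 0; x < T.size(); ++x)
      if (x & stride)
        T[x] ^= T[x ^ stride];                       // fold one dimension
}
```

The tight alternation of reads and writes across the whole table in each folding pass is precisely the access pattern that is hard to schedule efficiently on a GPU.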

For each candidate maxterm \(t_I\), we need to verify whether the superpoly \(p_{S(I)}\) is linear. Since the goal is recovering key bits, any fixed assignment of the variables \(\mathbf {x}_{\overline{I}} \) via the bit vector \(\mathbf {v}_{\overline{I}} \) can be used to eliminate them. To keep the degree of each superpoly to the bare minimum, the assignment \(\mathbf {v}_{\overline{I}} =\mathbf {0}\) is usually preferred, but we argue that this is not necessarily the best choice, as motivated later in Sect. 4.2. In any case, at this stage the superpoly \(p_{S(I)}\) only depends on \(\mathbf {y} \). In principle, assessing the linearity of \(p_{S(I)}(\mathbf {y})\) requires finding all of its coefficients, but efficient probabilistic linearity tests [7, 21] can safely replace deterministic ones in most practical settings. Probabilistic tests involve verifying whether

$$\begin{aligned} p_{S(I)}(\mathbf {u} _1+ \mathbf {u} _2)=p_{S(I)}(\mathbf {u} _1)+ p_{S(I)}(\mathbf {u} _2)+ p_{S(I)}(\mathbf {0}) \end{aligned}$$
(2)

holds for random pairs of vectors \(\mathbf {u} _1,\mathbf {u} _2\). Practically, this means evaluating numerically four sums: \(\sum _{\mathbf {v} \in C_I} E(\mathbf {v},\mathbf {0})\), \(\sum _{\mathbf {v} \in C_I} E(\mathbf {v},\mathbf {u} _1)\), \(\sum _{\mathbf {v} \in C_I} E(\mathbf {v},\mathbf {u} _2)\), and \(\sum _{\mathbf {v} \in C_I} E(\mathbf {v},\mathbf {u} _1+\mathbf {u} _2)\).

Probabilistic tests rely on the fact that (2) must hold for all \(\mathbf {u} _1,\mathbf {u} _2\) if \(p_{S(I)}\) is linear, whereas, in general, it holds with probability \(\frac{1}{2}\). In particular, as done in previous cube attacks [10, 14], we resort to a complete-graph test [21], which guarantees slightly lower accuracy than the (truly random) BLR test [7] with far fewer evaluations of \(p_{S(I)}\). Let us remark that what ultimately matters in the envisaged scenario is identifying “far-from-linear” superpolys [20]. To clarify, let us consider the superpoly \(p_{S(I)}\)

$$\begin{aligned} p_{S(I)}(\mathbf {y})=l(\mathbf {y})+\prod _{i=1}^k y_i \end{aligned}$$

formed by a sum \(l(\mathbf {y})\) of linear terms, plus one nonlinear term given by the product of all variables in \(\mathbf {y} \). Although the equality \(p_{S(I)}(\mathbf {y})=l(\mathbf {y})\) is formally wrong (the degree of \(p_{S(I)}\) is as large as k), \(p_{S(I)}(\mathbf {u})=l(\mathbf {u})\) is numerically correct for all \(\mathbf {u} \in \mathbb {F} _2^k\) except \(\mathbf {u} =(1,1,\ldots ,1)\). In other words, mistaking \(p_{S(I)}\) for linear has practical consequences only if the key is \(\mathbf {u} =(1,1,\ldots ,1)\).
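In code, the complete-graph test amounts to the following sketch (ours; f stands for the numerical evaluation of \(p_{S(I)}\) at a given key, i.e., one cube sum per call):

```cuda
// Host-side sketch of the complete-graph linearity test (ours). f(u)
// stands for the numerical value p_S(I)(u), i.e., one cube sum with the
// key fixed to u; M keys yield M(M-1)/2 pairwise checks of Eq. (2).
#include <cstdint>
#include <random>
#include <vector>

using SuperpolyFn = uint8_t (*)(const std::vector<uint8_t> &key);

bool complete_graph_test(SuperpolyFn f, int k, int M, std::mt19937 &rng) {
  std::vector<std::vector<uint8_t>> u(M, std::vector<uint8_t>(k));
  std::vector<uint8_t> fu(M);
  for (int i = 0; i < M; ++i) {
    for (int b = 0; b < k; ++b) u[i][b] = rng() & 1;  // random key u_i
    fu[i] = f(u[i]);
  }
  uint8_t f0 = f(std::vector<uint8_t>(k, 0));
  for (int i = 0; i < M; ++i)
    for (int j = i + 1; j < M; ++j) {
      std::vector<uint8_t> uij(k);
      for (int b = 0; b < k; ++b) uij[b] = u[i][b] ^ u[j][b];
      if (f(uij) != (uint8_t)(fu[i] ^ fu[j] ^ f0))
        return false;                   // Eq. (2) violated: not linear
    }
  return true;                          // linear with high confidence
}
```

With M sampled keys, the test checks Eq. (2) on all \(M(M-1)/2\) pairs while evaluating \(p_{S(I)}\) only \(M(M-1)/2+M+1\) times, which is the efficiency advantage over BLR mentioned above.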

Steps S2 and S3. Step S2 consists in finding the symbolic expression of the superpoly of each identified maxterm, together with the free term of the corresponding equation. Again, this turns into a set of numerical evaluations: the free term of \(p_{S(I)}(\mathbf {y})\) is

$$\begin{aligned} p_{S(I)}(\mathbf {0})=\sum _{\mathbf {v} \in C_I} E(\mathbf {v},\mathbf {0}) \end{aligned}$$

whereas the coefficient of each variable \(y_i\) is

$$\begin{aligned} p_{S(I)}(\mathbf {e} _i)+p_{S(I)}(\mathbf {0}) = \sum _{\mathbf {v} \in C_I} E(\mathbf {v},\mathbf {e} _i) + \sum _{\mathbf {v} \in C_I} E(\mathbf {v},\mathbf {0})\end{aligned}$$

where \(\mathbf {e} _i\) is the unit vector whose coordinates are all null except \(y_i=1\). Once the polynomial \(p_{S(I)}(\mathbf {y})\) is found, the attack assumes the availability of the \(2^m\) keystream bits produced under a fixed (unknown) assignment to the variables \(\mathbf {y} \), as the variables \(\mathbf {x} \) take all possible assignments in \(C_I\). This produces the linear equation

$$\begin{aligned} p_{S(I)}(\mathbf {y})=\sum _{\mathbf {v} \in C_I} E(\mathbf {v},\mathbf {y}) \end{aligned}$$

whose left side is a linear combination of the key variables \(\mathbf {y} \) with coefficients found offline, whereas the right side is a number found online, and whose solution is the sought unknown assignment to \(\mathbf {y} \).
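A compact sketch of Step S2 follows (ours; cube_sum_with_key is an assumed helper returning \(\sum _{\mathbf {v} \in C_I} E(\mathbf {v},\mathbf {y})\) for a given key assignment, not a function of the actual implementation):

```cuda
// Host-side sketch of Step S2 (ours). cube_sum_with_key is an assumed
// helper returning the sum of E over C_I with the key fixed as given.
#include <cstdint>
#include <vector>

extern uint8_t cube_sum_with_key(const std::vector<uint8_t> &key);

// Returns (c_0, c_1, ..., c_k) with p_S(I)(y) = c_0 + c_1 y_1 + ... + c_k y_k.
std::vector<uint8_t> recover_superpoly(int k) {
  std::vector<uint8_t> coeff(k + 1);
  std::vector<uint8_t> key(k, 0);
  coeff[0] = cube_sum_with_key(key);            // free term p_S(I)(0)
  for (int i = 0; i < k; ++i) {
    key[i] = 1;                                 // unit vector e_i
    coeff[i + 1] = cube_sum_with_key(key) ^ coeff[0];
    key[i] = 0;
  }
  return coeff;
}
```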

Finally, Step S3 just requires solving the obtained linear system with any suitable technique described in the literature.
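For concreteness, a minimal Gaussian-elimination sketch over \(\mathbb {F} _2\), with coefficient rows packed into 64-bit words (entirely ours; any equivalent solver works):

```cuda
// Gaussian elimination over F_2 (ours): each row packs the k coefficients
// of one superpoly into 64-bit words; rhs holds the online cube sums.
// Returns false if the system is inconsistent; free variables are set to 0.
#include <cstdint>
#include <utility>
#include <vector>

bool solve_gf2(std::vector<std::vector<uint64_t>> &A,
               std::vector<uint8_t> &rhs, int k, std::vector<uint8_t> &y) {
  const size_t W = (k + 63) / 64;
  std::vector<int> pivot_row(k, -1);
  size_t row = 0;
  for (int col = 0; col < k && row < A.size(); ++col) {
    size_t p = row;
    while (p < A.size() && !((A[p][col / 64] >> (col % 64)) & 1)) ++p;
    if (p == A.size()) continue;                // no pivot in this column
    std::swap(A[p], A[row]); std::swap(rhs[p], rhs[row]);
    for (size_t r = 0; r < A.size(); ++r)       // eliminate everywhere else
      if (r != row && ((A[r][col / 64] >> (col % 64)) & 1)) {
        for (size_t w = 0; w < W; ++w) A[r][w] ^= A[row][w];
        rhs[r] ^= rhs[row];
      }
    pivot_row[col] = (int)row++;
  }
  for (size_t r = row; r < A.size(); ++r)       // leftover 0 = 1 rows?
    if (rhs[r]) return false;
  y.assign(k, 0);
  for (int col = 0; col < k; ++col)
    if (pivot_row[col] >= 0) y[col] = rhs[pivot_row[col]];
  return true;
}
```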

3.2 The Setting

Generally speaking, GPUs are processing units characterized by the following advantages and limitations:  

Computing::

Each unit features a large number (i.e., thousands) of simple cores, making it possible to run a much higher number of parallel threads than on a standard CPU. More precisely, the GPU's basic scheduling unit is the warp, a group of 32 threads. Threads are designed to work on 32-bit words, and performance is maximized when all threads belonging to the same warp execute exactly the same operation at the same time on different but contiguous data.

Memory::

The so-called global memory available on a GPU is limited, typically between 4 and 12 GB. Each thread can independently access data (random access is fully supported, but costly performance-wise). However, when the threads in a warp access consecutive 32-bit words, the cost is equivalent to that of a single memory operation. Concurrent reads and writes by different threads to the same resources, which require some level of synchronization, should be avoided to prevent serialization that defeats parallelism.

 

The basic step of the attack is the sum of \(E(\mathbf {v},\mathbf {y})\) over all elements \(\mathbf {v} \) of a cube \(C_I\). Each time we sum over a cube, the key variables \(\mathbf {y} \) are fixed, either to a random \(\mathbf {u} _j\) for the linearity tests, or to \(\mathbf {0}\) and to the unit vectors \(\mathbf {e} _i\) for determining the superpoly. In both cases exactly the same sum \(\sum _{\mathbf {v} \in C_I} E(\mathbf {v},\mathbf {u} _j)\) must be performed for all elements of a set of keys \(\{\mathbf {u} _1,\ldots ,\mathbf {u} _M\}\).

We define the following strategy for carrying out the sums over a cube, with the goal of maximizing parallelization and fully exploiting the computational power offered by GPUs:

  • Assigning to all the threads within a warp the computation of the same cube \(C_I\) but with a different key \(\mathbf {u} _j\). This choice guarantees that all threads perform the same operation at the same time for the entire computation.

  • Leveraging the GPU computational power to calculate all the elements of a cube \(C_I\) on the fly, providing the threads with just a bit-mask representing the set I. With this approach we can devote all available GPU memory to storing the cube evaluations while minimizing, at the same time, the number of memory access operations.

  • Defining a keystream generator function \(E(\mathbf {x},\mathbf {y})\) which outputs a 32-bit word, and letting each thread work on the whole word (see the sketch after this list). This approach offers two remarkable benefits: (i) considering 32 keystream bits altogether is equivalent to concurrently attacking 32 different polynomials, and (ii) working on 32-bit integers fits the GPU computing model much better, whereas forcing the threads to work on single bits would critically affect the performance of the attack. The price is that attacking 32 keystream bits altogether reduces (by a factor of 32) the number of cube evaluations that fit in memory, thus imposing some limitations on the size of the cubes to be tested, as we will clarify later.

  • Choosing the number M of keys to be a multiple of the warp size in order to perform the probabilistic linearity test on 32 keystream bits at the same time and for all M keys.
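As anticipated in the third point above, the 32-bit output convention boils down to the following sketch (ours; trivium_step is an assumed helper returning one keystream bit per call):

```cuda
// Sketch of the 32-bit output convention (ours): packing 32 consecutive
// keystream bits into one word lets each thread attack 32 polynomials at
// once. trivium_step() is an assumed helper returning one output bit.
#include <cstdint>

extern uint8_t trivium_step();

uint32_t keystream32() {
  uint32_t w = 0;
  for (int j = 0; j < 32; ++j)
    w |= (uint32_t)trivium_step() << j; // bit j = j-th keystream bit
  return w;
}
```

XOR-ing such words over a cube then computes 32 cube sums – one per output polynomial – in a single pass.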

3.3 The Attack

A severe constraint in any GPU implementation is the amount of global memory available on current GPUs, which bounds the size |T| of the table of cube evaluations. Moreover, for each cube, we need to consider M different keys in order to run the linearity test, thus further reducing the usable amount to |T|/M. Storing single evaluations of the cipher in T means testing only the sub-cubes of a maximal cube of size \(|{I_{\max }} |=\log _2(|T|/M)\) (for instance, storing one 32-bit evaluation per entry in 12 GB of memory with M = 32 keys gives \(|{I_{\max }}|\approx 26\)). With the memory available on current GPUs, \(\log _2(|T|/M)\) is not large enough for any reasonably strong cipher. The new approach we propose is highly parallelizable, fully exploits the computational resources offered by GPUs, and is able to leverage GPU memory to test high-order maximal cubes.

The proposed design of the attack relies on the following rationale: we explore only a portion of the maximal cube \(C_{I_{\max }} \), considering only subsets \(I\subseteq {I_{\max }} \) characterized by a non-empty minimal intersection \(I_{\min }\). Quite naturally, such a design leads to two distinct CUDA kernels, respectively responsible for: (1) computing many variants of the cube \(C_{I_{\min }}\), one for each of the possible combinations of the indices in \({I_{\max }} \setminus I_{\min }\), and writing the results in memory; (2) combining the stored results to test all cubes \(C_I\) such that \(I_{\min }\subseteq I\subseteq {I_{\max }} \). Following this approach, the size of the explored \({I_{\max }} \) can be raised to \(|{I_{\max }} |=|I_{\min }|+ \log _2(|T|/M)\), with read and write memory operations carried out by different kernels.

According to the notation introduced in Sect. 2, the public variables are \(\mathbf {x} =(x_1,\ldots ,x_n)\). Now, let us partition these n public variables into three sets \(\mathbf {x}_{\mathrm {fix}} =(x_{i_1},\ldots ,x_{i_{d_{\mathrm {fix}}}})\), \(\mathbf {x}_{\mathrm {free}} =(x_{j_1},\ldots ,x_{j_{d_{\mathrm {free}}}})\), and \(\mathbf {x} ^*\), of size \({d_{\mathrm {fix}}} \), \({d_{\mathrm {free}}} \), and \(n-d\), respectively, where \(d={d_{\mathrm {fix}}}+{d_{\mathrm {free}}} \). The variables \(\mathbf {x}_{\mathrm {fix}} \) correspond to the fixed components of \(C_{I_{\max }} \) identified by \(I_{\min }\), i.e., \(I_{\min }=\{i_1,\ldots ,i_{d_{\mathrm {fix}}} \}\), whereas the variables \(\mathbf {x}_{\mathrm {free}} \) correspond to the remaining free components of \(C_{I_{\max }} \), i.e., \({I_{\max }} \setminus I_{\min } = \{j_1,\ldots ,j_{d_{\mathrm {free}}} \}\) and \(|{I_{\max }} |=d\). The variables \(\mathbf {x} ^*\) are the remaining public variables that fall outside \({I_{\max }} \).

The two kernels of our attack can be described as follows:

 

Kernel 1::

It uses \(2^{{d_{\mathrm {free}}}}\) warps. Since, as described before, the 32 threads belonging to the same warp perform exactly the same operations but for different keys, in the following we simply consider a representative thread per warp and ignore the private variables \(\mathbf {y} \). For \(s=0,\ldots ,2^{{d_{\mathrm {free}}}}-1\), thread (i.e., warp) s sums \(E(\mathbf {v},\mathbf {y})\) over each vertex \(\mathbf {v} \) of the cube \(C^s_{I_{\min }}\) of size \({d_{\mathrm {fix}}} \) determined by assigning the \({d_{\mathrm {free}}} \)-bit binary representation of the integer s to the variables \(\mathbf {x}_{\mathrm {free}} \) and \(\mathbf {0}\) to the variables \(\mathbf {x} ^*\). Finally, thread s writes the sum in the \(s^{th}\) entry of table T, so that, at the end of the execution of the kernel, each entry of T contains the sum over a cube of size \({d_{\mathrm {fix}}} \). These evaluations allow for testing the monomial \(t_{I_{\min }}\) with all the aforementioned assignments to the other \(n-{d_{\mathrm {fix}}} \) variables.

Kernel 2::

By simply combining the values stored in T at the end of Kernel 1, it is now possible to explore cubes of potentially any size \({d_{\mathrm {fix}}} +\delta \), with \(0\le \delta \le {d_{\mathrm {free}}} \). Although the exploration could follow many other approaches (e.g., a random walk as in [10]), the large computing power of our platform suggests testing cubes exhaustively. Moreover, we extend the exhaustive search to an area never reached, to the best of our knowledge, in the literature. For all I such that \(I_{\min }\subseteq I\subseteq {I_{\max }} \), this kernel considers all variants of the cube \(C_I\) obtained by assigning all possible combinations of values to the variables in \({I_{\max }} \setminus I\). More precisely, for each possible choice of \(\delta \in [0,{d_{\mathrm {free}}} ]\), there are exactly \(\binom{d_{\mathrm {free}}}{\delta }2^{{d_{\mathrm {free}}}-\delta }\) distinct cubes of size \({d_{\mathrm {fix}}} +\delta \) available. In fact, we can choose the \(\delta \) free variables (the additional dimensions of the cube) in \(\binom{d_{\mathrm {free}}}{\delta }\) different ways, and we can choose the fixed assignment to the remaining \({d_{\mathrm {free}}}-\delta \) variables in any of the \(2^{{d_{\mathrm {free}}}-\delta }\) possible combinations.

 

As a matter of fact, the number of cubes considered in [14] is \(\sum _{\delta =0}^{{d_{\mathrm {free}}}} \binom{d_{\mathrm {free}}}{\delta } = 2^{{d_{\mathrm {free}}}}\), whereas the number of cubes tested by our approach is significantly larger, namely, \(\sum _{\delta =0}^{{d_{\mathrm {free}}}} {2^{{d_{\mathrm {free}}}-\delta }} \binom{d_{\mathrm {free}}}{\delta } = 3^{{d_{\mathrm {free}}}}\) by the binomial theorem. We would like to highlight that Kernel 2 is computationally dominated by Kernel 1, so the cost of our exhaustive search is negligible. Our design thus entails considering every possible assignment to the variables outside the cube, finally addressing the common conjecture (never proved in the literature) that assigning \(\mathbf {0}\) is the best possible choice.
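To make the two-kernel organization concrete, the following CUDA sketch (ours – a simplification of the actual code, assuming \(M=32\) keys, i.e., one warp lane per key, and an assumed __device__ black-box device_E returning a 32-bit keystream word) illustrates the division of labour:

```cuda
// Sketch of the two kernels (ours). Assumptions: M = 32 keys (one per
// warp lane), each key stored as 3 x 32-bit words, and an assumed
// __device__ black-box device_E returning a 32-bit keystream word for
// an 80-bit IV (packed in two 64-bit halves) and an 80-bit key.
#include <cstdint>

__device__ uint32_t device_E(uint64_t iv_lo, uint64_t iv_hi,
                             const uint32_t *key);

// Kernel 1 (launch with 2^{d_free} warps): warp w sums device_E over the
// 2^{d_fix} vertices of C_{I_min}, with x_free set to the bits of w and
// the variables x* set to 0; lane j of the warp uses key u_j.
__global__ void kernel1(uint32_t *T, const uint32_t *keys,
                        const int *I_min, int d_fix,
                        const int *I_free, int d_free) {
  uint64_t w = (blockIdx.x * (uint64_t)blockDim.x + threadIdx.x) / 32;
  int lane = threadIdx.x % 32;
  uint64_t base_lo = 0, base_hi = 0;
  for (int b = 0; b < d_free; ++b)              // fixed assignment to x_free
    if ((w >> b) & 1) {
      int i = I_free[b];
      if (i < 64) base_lo |= 1ULL << i; else base_hi |= 1ULL << (i - 64);
    }
  uint32_t acc = 0;
  for (uint64_t v = 0; v < (1ULL << d_fix); ++v) {  // vertices of C_{I_min}
    uint64_t lo = base_lo, hi = base_hi;
    for (int b = 0; b < d_fix; ++b)
      if ((v >> b) & 1) {
        int i = I_min[b];
        if (i < 64) lo |= 1ULL << i; else hi |= 1ULL << (i - 64);
      }
    acc ^= device_E(lo, hi, keys + 3 * lane);   // 32 output bits at once
  }
  T[w * 32 + lane] = acc;                       // one coalesced write per warp
}

// Kernel 2 (one call per choice D of delta extra cube dimensions, given
// as a bit-mask over the free positions; launch with 2^{d_free - delta}
// warps): each warp handles one assignment to the free variables outside
// D and XORs the 2^delta matching entries of T.
__global__ void kernel2(const uint32_t *T, uint32_t *sums,
                        uint64_t D, int d_free) {
  uint64_t warp = (blockIdx.x * (uint64_t)blockDim.x + threadIdx.x) / 32;
  int lane = threadIdx.x % 32;
  uint64_t base = 0, w = warp;
  for (int b = 0; b < d_free; ++b)              // scatter w outside D
    if (!((D >> b) & 1)) { base |= (w & 1ULL) << b; w >>= 1; }
  uint32_t acc = 0;
  uint64_t sub = D;                             // enumerate subsets of D
  for (;;) {
    acc ^= T[(base | sub) * 32 + lane];
    if (sub == 0) break;
    sub = (sub - 1) & D;
  }
  sums[warp * 32 + lane] = acc;
}
```

Kernel 1 performs the \(2^{d_{\mathrm {fix}}}\)-step summations with a single coalesced write per warp, whereas Kernel 2 only XORs previously stored words – which is why it is computationally dominated by Kernel 1.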

Let us underline that, in order to validate our implementation of the cube attack, we symbolically evaluated the polynomial p of Trivium up to 400 initialization rounds, and used p to identify all possible maxterms and their superpolys. We then ran the attack to find all maxterms whose variables belonged to selected sets I. Our experimental findings matched the symbolic ones. Further experimental validation of our code is reported in Sect. 4.

3.4 Performance Analysis

To evaluate the performance of our GPU-based solution, we developed both a CPU and a GPU version of the cube attack. The cluster we used for the experiments is composed of 3 nodes, each equipped with 4 Tesla K80 boards with 12 GB of global memory and 4 Intel Xeon E5-2640 CPUs with 128 GB of RAM. The CPU experiments were conducted on a parallel version based on OpenMP that exploits 32 cores of the four Intel Xeon E5-2640 CPUs. Each performance test was executed 5 times and the average time is reported. It is worth noticing that all versions rely on the same base functions to implement Trivium.

In Fig. 1a, we report the speed-up gained by the GPU version with respect to the parallel CPU version. We evaluated the two solutions over maximal cubes \(C_{I_{\max }}\) of growing size, in which we fix the size of \(I_{\min }\), so that the number of sub-cubes induced by \(I_{\max } \setminus I_{\min }\) increases exponentially. Overall, the experiments show that the benefit of using the GPU version grows with the number of free variables \({d_{\mathrm {free}}} \) considered, reaching a speed-up of up to 70\(\times \) when \({d_{\mathrm {free}}} = 13\). The rationale is that the execution time of the CPU version increases with the problem size from the very beginning, whereas a similar trend can be observed for the GPU version only when the number of blocks in use gets larger than the number of Streaming Multiprocessors (SMs) of the GPU, which happens when \({d_{\mathrm {free}}} \ge 9\) in our case. Of course, slight fluctuations are possible, mostly due to the complex interactions among the multiple cache levels of a modern CPU. Moreover, we evaluated how the GPU solution scales when \({d_{\mathrm {free}}} \) increases. As reported in Fig. 1b, our solution scales linearly with the size of the problem, i.e., exponentially with \({d_{\mathrm {free}}} \), thus paving the way for future work in the area.

Fig. 1. Performance experiments: (a) speed-up of the GPU version over the parallel CPU version; (b) scalability of the GPU version as \({d_{\mathrm {free}}} \) grows.

Finally, we ran the attack under the control of the Nvidia profiler in order to measure the ALU occupancy achieved by our kernels. Kernel 1 is invoked just once per run to fill the whole table T, with an occupancy consistently over 95% when \({d_{\mathrm {free}}} \ge 10\). Kernel 2 is instead invoked once for each \(\delta \in [0,{d_{\mathrm {free}}} ]\), to compute all available cubes of size \({d_{\mathrm {fix}}} +\delta \). Its maximum occupancy exceeds 95% as soon as \({d_{\mathrm {free}}} \ge 12\), with an average of approximately 50%. In either case the impact of \({d_{\mathrm {fix}}} \), which determines the load of each thread, is negligible. Considering that \({d_{\mathrm {free}}} \) should be maximized to improve the attack success rate, our kernels guarantee an excellent use of resources in any realistic application. For instance, in the experiments discussed in Sect. 4 we set \({d_{\mathrm {free}}} =16\), which guarantees an occupancy above 99% for Kernel 1, and a maximum occupancy above 98% for Kernel 2.

4 Results

This section reports the results obtained by our GPU implementation of the cube attack against reduced-round Trivium. We recall that the attack ran on a cluster composed of 3 nodes, each equipped with 4 Tesla K80 boards with 12 GB of global memory and 4 Intel Xeon E5-2640 CPUs with 128 GB of RAM.

As mentioned in Sect. 3.3, we performed a formal validation of our implementation by checking our experimental results against Trivium's polynomials, explicitly computed up to 400 initialization rounds. In the following, the number of initialization rounds instead matches (and slightly exceeds) the best results from the literature, thus reaching a point where a symbolic evaluation would be prohibitive. Still, the results we exhibit are obtained from experiments specifically designed to reproduce tests carried out in the recent past [14], so as to provide, at the same time: (i) a direct comparison of our results with the state-of-the-art; (ii) an immediate means of assessing the advantages of our approach; and (iii) a further validation of the correctness of our code.

Experimental Setting. In our attack, we consider two reduced-round variants of Trivium, corresponding to 768 and 800 initialization rounds, respectively. As explained and motivated in Sect. 3.2, in our scheme each call to Trivium produces 32 keystream bits, which we use in our concurrent search for superpolys. The most significant practical consequence of this construction is the ability to devise attacks on Trivium reduced to any number of initialization rounds ranging from 768 to 831, at the cost of just two attacks, although the number of available superpolys decreases with the number of rounds. As a matter of fact, the \(j^{th}\) output bit after 768 rounds can also be interpreted as the \((j-i)^{th}\) output bit after \(768+i\) initialization rounds, for any \(j\ge i\). In other words, an attack on Trivium reduced to \(768+i\) initialization rounds can count upon all superpolys found for the \(j^{th}\) output bit after 768 rounds, for all \(j\ge i\).

For each of the two attacks (768 and 800 initialization rounds), we ran a set of independent runs, each using a different choice for the pair of sets of variables \(I_{\min },{I_{\max }} \) (with \(I_{\min } \subset {I_{\max }} \)) that define the minimal and maximal tested cubes \(C_{I_{\min }}\) and \(C_{{I_{\max }}}\). The sizes of \(I_{\min }\) and \({I_{\max }} \setminus I_{\min }\) are \({d_{\mathrm {fix}}} =25 \) and \({d_{\mathrm {free}}} =16 \), respectively, for all runs, so that all maximal cubes have size \(d={d_{\mathrm {fix}}} +{d_{\mathrm {free}}} = 41 \). Peculiar to our implementation, when we test the monomial composed of all variables in some set \(I_{\min }\subseteq I \subseteq {I_{\max }} \), we exhaustively assign values to all public variables in \({I_{\max }} \setminus I\), thus concurrently testing the linearity of \(2^{41-|I|}\) possibly different superpolys. This feature of our attack – a possibility overlooked in the literature, but almost free of charge in our framework – provides primary benefits, as described in Sect. 4.2.

In all the reported experiments, we use a complete-graph linearity test based on combining 10 randomly sampled keys.

4.1 Summary of Results

As mentioned before, we implemented two attacks, against Trivium reduced to 768 (Trivium-768 in the following) and 800 (Trivium-800) initialization rounds, respectively. In both cases, our setting allows obtaining superpolys corresponding to 32 output bits altogether, at the cost of a single attack.

Results Against Trivium-768. For the attack against Trivium-768, we took inspiration from [14]: we launched 12 runs based on 12 different pairs \(I_{\min },{I_{\max }} \), chosen so as to guarantee that each of the 12 linearly independent superpolys found in [14] after 799 initialization rounds would be found by one of our runs. The rationale for reproducing results from [14] was both to test the correctness of our implementation and to provide a better understanding of the advantages of our implementation with respect to the state-of-the-art. In this sense, let us highlight that a single run of ours cannot be directly compared with all results presented in [14], because each of our runs only explores the limited portion of the maximal cube \(C_{{I_{\max }}}\) composed of all super-cubes of \(C_{I_{\min }}\).

To better describe our results, let us introduce the binary matrix A whose element A(i, j) is the coefficient of variable \(y_j\) in the \(i^{th}\) available superpoly. The rank of A, denoted \(\mathrm {rk} (A)\), clearly determines the number of key bits that can be recovered in the online phase of the attack based on the available superpolys, before resorting to brute force.

As described before, the superpolys yielded by the \(i^{th}\) output bit after round 768 are usable to attack Trivium for any number of initialization rounds between 768 and \(768+i\). It is possible to define 32 different matrices \(A_{768},\ldots ,A_{799}\): \(A_{768}\) includes all superpolys found, while each matrix \(A_{768+i}\) is obtained by incrementally removing the superpolys yielded by output bits \(0,\ldots ,i-1\). Figure 2a shows \(\mathrm {rk} (A_i)\) as a function of i, comparing our findings with those of [14].

Overall, our results extend the state-of-the-art in a remarkable way, especially if we consider that our quest for maxterms was circumscribed to multiples of 12 base monomials of degree 25. In particular, let us highlight a few aspects that emerge from Fig. 2a:

  • Since our runs were designed to include all 12 maxterms found in [14] after 799 initialization rounds, it is not surprising that \(\mathrm {rk} (A_{799})\) is at least 12. Yet, it is indeed larger: we found 3 more linearly independent superpolys, reaching \(\mathrm {rk} (A_{799})=15\).

  • Although we did not force our tested cubes to include the maxterms found in [14] after 784 rounds, we have \(\mathrm {rk} (A_{784})=59\), compared with the rank of 42 found in [14].

  • Finally, and probably most importantly, our attack allows a full key recovery up to 781 initialization rounds.

Selected superpolys that guarantee the above ranks are reported in Appendix B, together with the corresponding maxterms. Also of interest is how the novel superpolys were found, a point better described in the following.

Results Against Trivium-800. To further test the quality of our attack, we launched a preliminary attack against Trivium-800. We kept all the parameters of the attack unvaried (\({d_{\mathrm {fix}}} =25\), \({d_{\mathrm {free}}} =16\), 32 output bits attacked altogether), but this time we only launched 4 runs, and we chose the sets \(I_{\min },{I_{\max }} \) at random. In total, we were able to find a single maxterm corresponding to 800 rounds, and no maxterms afterwards. This maxterm and the corresponding superpoly are also reported in Appendix B. Although our findings only allow cutting in half the complexity of a brute-force attack, this is the first superpoly ever found considering more than 799 initialization rounds. We recall that our limited results should not be surprising: as previous work suggests [10, 14], when the number of initialization rounds grows, a cube attack should increase the average degree of candidate maxterms and/or implement specific strategies for the selection of the index sets [14].

Fig. 2. Our results: (a) \(\mathrm {rk} (A_i)\) compared with [14]; (b) full vs. filtered results; (c) \(\mathrm {rk} (B_{768}^j)\) as a function of j; (d) full results vs. no exhaustive search.

4.2 Further Discussion

Hereafter, we provide a more detailed analysis and a further discussion of our findings, considering two aspects in particular: the reliability of commonly used linearity tests, and the peculiar advantages of our attack design. Unless otherwise specified, in the following we always focus on Trivium-768.

On Probabilistic Linearity. A common practice in the cube-attack literature consists in using a probabilistic linearity test, meaning that a (small) chance exists that the superpolys found by an attack are not actually linear. In particular, the best results obtained with the cube attack against Trivium use a complete-graph test, which, with respect to the standard BLR test, trades accuracy for efficiency. The viability of such a choice is supported by previous work [12, 21], showing that the complete-graph test behaves essentially as a BLR test in testing a randomly chosen function f, with the quality of the former being especially high if the nonlinearity (minimum distance from any affine function) of f is large, that is, when the result of the test is particularly relevant.

Following this trend, we chose to implement a complete-graph test based on a set of 10 randomly chosen keys, exactly as done in [14]. However, while increasing the number of tests performed during the attack was costly for us (it impacts memory usage), running further tests on the superpolys found at the end of the attack was not. We therefore decided to put our superpolys through additional tests involving 15 more keys chosen uniformly at random. Figure 2b compares \(\mathrm {rk} (A_i)\) as a function of i for our full results and our filtered results, in which all superpolys that failed at least one of the additional tests have been removed. Let us stress once more that these two sets of results cannot be labelled as wrong and correct; rather, they correspond to two different levels of trust in the superpolys found. In a sense, choosing between the two sets is equivalent to selecting the desired trade-off between efficiency and reliability of the attack: our full results permit a faster attack, which however may fail for a subset of all possible keys. Of course, many intermediate approaches are possible. Investigating whether the reason for these failing tests is related to any of our design choices is left to future work.

On Using 32 Output Bits. A significant novelty of our implementation is the ability to concurrently attack 32 different polynomials, which describe 32 consecutive output bits of the target cipher. This choice is induced by GPU features – as discussed in Sect. 3.2 – yet it is natural to assess what benefits it introduces. In Sect. 4.1 we showed that looking at 32 output bits altogether can be considered a way to concurrently attack 32 different reduced-round variants of Trivium. However, aiming to extend the attack to the full version of the cipher, our implementation can be used to check whether the same set of monomials yields different superpolys, hopefully involving different key variables, when we focus on different output bits. To this end, let us introduce a new set of matrices \(B_{768}^0,\ldots ,B_{768}^{31}\), where each \(B_{768}^{j}\) is obtained by considering only the superpolys yielded by output bits \(0,\ldots ,j\) after 768 initialization rounds (i.e., \(A_{768}=B_{768}^{31}\)). Figure 2c shows \(\mathrm {rk} (B_{768}^j)\) as a function of j, for both our full results and our filtered results. What the figure highlights is that considering several output bits altogether for the same version of the cipher, albeit possibly causing issues related to memory usage, does introduce the expected benefit – indeed a remarkable benefit if the matrix rank is initially (i.e., when \(j=0\)) low. This is the first result ever showing that considering a larger set of output bits is a viable alternative to exploring a larger cube.

On the Advantages of the Exhaustive Search. As described before, our implementation allows finding significantly more linearly independent superpolys than previous attempts in the literature. One of the reasons is the parallelization that makes it possible to carry out, at a negligible cost, an exhaustive search over all public variables in \({I_{\max }} \setminus I\) when the cube \(C_I\) is under test. Figure 2d, again focusing on \(\mathrm {rk} (A_i)\), compares our full results with results obtained without the exhaustive search (shortened “no ex.”), i.e., setting all variables in \({I_{\max }} \setminus I\) to 0, as usually done in related work. What emerges is that the exhaustive search does remarkably increase \(\mathrm {rk} (A_i)\). Significantly, the exhaustive search is what allows us to improve on the state-of-the-art for \(i=799\), which, among other things, suggests that the benefits of the exhaustive search are particularly relevant when increasing the number of tested cubes would otherwise be difficult (e.g., by considering other monomials).

Another consequence of implementing an exhaustive search is that we found many redundant superpolys, i.e., superpolys that are identical or linearly dependent on the ones composing the maximal-rank matrix \(\tilde{A}\). This finding is extremely interesting, because we expect it to provide a wide choice of different IV combinations yielding superpolys that compose a maximal-rank sub-matrix \(\tilde{A}\), thus weakening the standard assumption that cube attacks require a completely tweakable IV.

5 Related Work

The cube attack is a widely applicable method of cryptanalysis introduced by Dinur and Shamir [10]. The underlying idea, similar to Vielhaber's AIDA [24], can be extended, e.g., by assigning a dynamic value to the IV bits not belonging to the tested cube [3, 11], or by replacing cubes with generic subspaces of the IV space [25], and it is used in so-called cube testers to detect nonrandom behaviour rather than to perform key extraction [4, 5]. Although the cube attack and its variants have shown promising results against several ciphers (e.g., Trivium [10], Grain [11], Hummingbird-2 [13], Katan and Simon [3], Quavium [26]), Bernstein [6] expressed harsh criticism of the feasibility and convenience of cube attacks. Indeed, a general trend for cube attacks is to focus on reduced-round variants of a cipher, without any evidence that the full version can be equally attacked. However, while Bernstein suggests that the cube attack only works if the ANF of the cipher has low degree, Fouque and Vannet [14] argue (and, to some extent, experimentally show) that effective cube attacks can be run not by aiming at the maximum degree of the ANF, but rather by exploiting a nonrandom ANF and searching for maxterms of significantly lower degree. Along this line, O'Neil [18] suggests that even the full version of Trivium exhibits limited randomness, thus indicating the potential vulnerability of this cipher to cube attacks.

In recent years, several implementations of the cube attack have attempted to break Trivium, our target cipher described in Appendix A. Quedenfeld and Wolf [19] found cubes for Trivium up to round 446. Srinivasan et al. [23] introduce a sufficient condition for testing a superpoly for linearity in \(\mathbb {F} _2\) with a time complexity of \(O(2^{c + 1} (k^2 + k))\), yielding 69 extremely sparse linearly independent superpolys for Trivium reduced to 576 rounds. In their seminal paper [10], Dinur and Shamir found 63, 53, and 35 linearly independent superpolys after, respectively, 672, 735, and 767 rounds. Fouque and Vannet [14] improve over Dinur and Shamir, obtaining 42 linearly independent superpolys after 784 rounds, and 12 linearly independent superpolys (plus 6 quadratic superpolys) after 799 rounds. To the best of our knowledge, these are the best results against Trivium to date, making our attack comparable to (or better than) the state-of-the-art.

Distributed computing and/or parallel processing have been explored in the literature to render attacks on cryptosystems computationally or storage-wise feasible. Smart et al. [15] develop a new methodology to assess cryptographic key strength using cloud computing. Marks et al. [16] provide numerical evidence of the potential of mixed GPU (AMD, Nvidia) and CPU technology for data encryption and decryption algorithms. Focusing on GPUs, Milo et al. [17] leverage GPUs to quickly test passphrases used to protect private keyrings of OpenPGP cryptosystems, showing that the time complexity of the attack can be reduced by up to three orders of magnitude with respect to a standard procedure, and by up to ten times with respect to a highly tuned CPU implementation. A relevant result is obtained by Agostini [2], who leverages GPUs to speed up dictionary attacks against the BitLocker technology commonly used in Windows OSes to encrypt disks. Finally, and most closely related to the present work, Fan and Gong [13] make use of GPUs to perform side-channel cube attacks on Hummingbird-2. They describe an efficient term-by-term quadraticity test for extracting simple quadratic equations, leveraging the cube attack. Just like us, Fan and Gong speed up the proposed term-by-term quadraticity test by leveraging GPUs, finally recovering 48 out of 128 key bits of Hummingbird-2 with a data complexity of about \(2^{18}\) chosen plaintexts. In contrast, we present a complete implementation of the cube attack thoroughly designed and optimized for GPUs. Our flexible construction allows an exhaustive exploration of subsets of IV bits, thus overcoming the limitations of dynamic cube attacks, which try to find the most suitable assignment to those bits by analyzing the target cipher.

6 Conclusions and Future Work

This work has discussed in depth an advanced GPU implementation of the cube attack aimed at breaking reduced-round versions of Trivium. The implemented attack extends the quest for superpolys to a dimension never explored in previous works, and weakens the usual cube-attack assumption of a completely tweakable IV. An extensive experimental campaign is discussed, whose results validate the approach and improve on state-of-the-art attacks against reduced-round versions of Trivium.

The tool, which we expect to release into the public domain, opens new perspectives by allowing a more comprehensive and hopefully exhaustive analysis of stream-cipher security. For instance, along the line proposed in [1], we envisage extending our implementation to test the effectiveness of the generalized cube attack over \(\mathbb {F} _n\).