1 Introduction

The supersingular isogeny key encapsulation (SIKE) proposal [7] – the actively secure version of Jao and De Feo’s SIDH key exchange [8] – is one of 17 second-round candidate public key encryption or key establishment proposals submitted to the post-quantum cryptography standardization process initiated by the U.S. National Institute of Standards and Technology (NIST). It is the only proposal whose security is based on the computational supersingular isogeny (CSSI) problem. Currently, the best known classical and quantum attacks on the CSSI problem are generic claw finding attacks: given two functions \(f :A \rightarrow C\) and \(g :B \rightarrow C\) with domains of equal size, the claw finding problem is to find a pair \((a,b)\) such that \(f(a) = g(b)\). The original security analysis by Jao and De Feo [8, §5.2] estimates the complexity of the CSSI problem by assuming the optimal black-box asymptotic complexities for the claw finding problem: classically, it can be solved in \(O(|A|+|B|)\) time using \(O(|A|)\) space. On a quantum computer, Tani’s algorithm [22] relies on a generalization of Grover’s search algorithm by Szegedy [21] and uses quantum walks on Johnson graphs to solve the claw finding problem in \(O(\sqrt[3]{|A||B|})\) time. Following Jao and De Feo, the SIKE team used these asymptotics to specify three Round-1 parametrizations that were intended to meet the requirements for the NIST security categories 1, 3 and 5 defined in terms of resources needed for AES key search [14, p. 18].

Prior to 2018, the literature on SIDH (starting with Jao and De Feo’s original paper [8]) had consistently cited a meet-in-the-middle algorithm for claw finding as the best known classical algorithm for solving the CSSI problem. In 2018, Adj, Cervantes-Vázquez, Chi-Domínguez, Menezes and Rodríguez-Henríquez [1] made a significant step towards a better understanding of the problem’s concrete classical complexity. They show that, while the meet-in-the-middle algorithm has the lowest known classical runtime, its storage requirements are so large (for instances of cryptographic size) that its application is not meaningful in any reasonable model of cryptanalytic computation. Indeed, the best classical AES key search algorithms only require a modest amount of storage, so a fair and correct analysis must take into account the available time/memory trade-offs. Consequently, Adj et al. fix a conservative upper bound on storage capacity that is considered “prohibitively costly for the foreseeable future” [1, §5], i.e., \(2^{80}\) units of storage, and analyze the runtime of relevant algorithms subject to this capacity. They conclude that despite its higher running time, the van Oorschot-Wiener (vOW) parallel collision finding algorithm [23] has significantly lower space requirements and is the best classical algorithm for the CSSI problem. Thus, its concrete complexity should instead be used to assess the security of SIDH/SIKE against (known) classical attacks. Their analysis ultimately shows that the SIKE team used rather conservative classical security estimates and that significantly smaller parameters can be used to achieve the requisite level of classical security.

Jaques and Schanck [9] provide an in-depth analysis of quantum algorithms for claw finding applied to the CSSI problem. In particular, they analyze the complexity of implementing and querying quantum memory, which is needed in Tani’s algorithm and which previously had not been taken into account in the quantum security estimates for SIDH/SIKE. Along with Tani’s algorithm, they also consider a direct application of Grover search [5] to claw finding. Similar to the classical analysis of Adj et al., they conclude that the SIKE proposal’s quantum security estimates were too conservative. In fact, Jaques and Schanck’s analysis shows that the best known quantum algorithms do not achieve a significant advantage over the classical vOW algorithm. In some attack scenarios, it is the classical security that is the limiting factor for achieving a specified security level. While quantum algorithms promise to be more efficient for attackers with limited memory, classical vOW outperforms quantum algorithms for attackers with limited time. Thus, the precise, real-world complexity of the vOW parallel collision search algorithm is paramount in the discussion of (current and future) parameters for SIDH/SIKE.

Based on the above cryptanalytic results, the parameter sets in the SIKE specification were adjusted in Round 2 of the NIST standardization process. The specification now contains the parameter sets SIKEp434, SIKEp503, SIKEp610 and SIKEp751 targeting the NIST security categories 1, 2, 3 and 5, respectively.

Contributions. We present an implementation of the van Oorschot-Wiener algorithm that is intended to be a step towards a real-world, large-scale cryptanalytic effort. Our work extends that of Adj et al. by introducing novel improvements to implementations of the generic vOW collision finding algorithm and improving the instantiations specific to the contexts of SIDH and SIKE. Besides significantly optimizing the efficiency of the underlying finite-field and elliptic-curve arithmetic by incorporating the state-of-the-art formulas, we present several optimizations related to the structure of the isogeny graph.

The source code will be released under a free license. Beyond being able to reproduce our results, we hope that our C/C++ implementations can function as the basis for further experiments to assess the security of isogeny-based cryptography, and that they can be used for other applications of the collision finding algorithm. In fact, we provide two implementations: an optimized C code base for both generic collision finding as well as solving the CSSI problem, and a C++ version designed for modularity, and to allow easy porting to alternative collision finding settings at little cost to efficiency (e.g. for the hybrid attack on lattice-based schemes [6], symmetric cryptography, or highly distributed setups).

Our extensions and improvements to the vOW implementation and analysis in [1] include:

  • Faster collision checking. One of the main steps in the vOW algorithm is to check whether a given collision is the golden collision (see Sect. 2). Experimentally, we found that in our optimized version of generic vOW this step accounts for close to 20% of the entire algorithm (aligning with van Oorschot and Wiener’s analysis [23, §4.2]). We give a novel, much more efficient method, which is based on a cycle-finding technique by Sedgewick, Szymanski and Yao [18]. It temporarily uses a small amount of local storage (which can be input dynamically as a parameter) during the random walks to accelerate collision checking – see Sect. 3.4.

  • SIKE-specific optimizations. Although the best algorithm for the general CSSI problem is generic (i.e. there are no better known algorithms that exploit its underlying mathematical structure), we take advantage of multiple optimizations that apply to the concrete instantiations in the SIKE specification [7]. Firstly, we show how to exploit the choice of the starting curve as a subfield curve, by defining random walks on (conjugate) classes of j-invariants; such a modified walk is analogous to the walk that exploits the negation map in Pollard’s rho algorithm for the ECDLP [27] – see Sect. 3.1. Secondly, we show how to exploit that, in SIKE, the isomorphism class of the output curve is not randomized (this possibility was already pointed out by De Feo, Jao and Plût [3]), by using the leakage of the dual of the final isogeny – see Sect. 3.1. We quantify the precise security loss suffered by these choices.

  • Precomputation. Generic collision finding algorithms like vOW are often implemented to target high-speed symmetric primitives. In contrast to those applications, for the CSSI problem, the computation of large-degree isogenies is the overwhelming bottleneck of the random walks. Therefore, speeding up the isogeny computations translates directly to a similar speedup of the entire collision finding process. We show how to exhaust any available local memory to achieve such speedups via the precomputation of parts of the isogeny tree – see Sect. 3.3.

  • Experimental results. For all of the improvements mentioned above, we demonstrate their feasibility by analyzing the runtime of the implementation. In doing so, we re-confirm the analyses of van Oorschot and Wiener [23] and Adj et al. [1] in the context of SIDH (with a factor \(2\) improvement) and extend them to SIKE – see Table 1. Furthermore, we go beyond the setting of small parameters and propose an alternative way of predicting the vOW runtime for actual Round 2 parameters, in particular SIKEp434, giving an upper bound on their security level – see Sect. 5.1.

2 Preliminaries: van Oorschot-Wiener’s Collision Search

After defining the CSSI problem in Sect. 2.1, we describe the classical meet-in-the-middle claw finding algorithm in Sect. 2.2. It is both simpler than, and helps motivate, the description of the vOW parallel collision finding algorithm in Sect. 2.3. The complexity analysis of the generic vOW algorithm is given in Sect. 2.4.

2.1 The CSSI Problem

Herein, we restrict to the popular scenario whereby an instance of SIDH/SIKE is parameterized by a prime \(p=2^{e_2}3^{e_3}-1\) with \(2^{e_2} \approx 3^{e_3}\) and \(e_3 \gg 1\); all known implementations, including those in the SIKE submission, specify a prime of this form. Since \(p \equiv 3 \bmod {4}\), we fix \(\mathbb {F}_{p^2}=\mathbb {F}_p(i)\) with \(i^2+1=0\) throughout. We work with the set of isomorphism classes of supersingular elliptic curves in characteristic p. There are roughly p/12 such classes, and these are identified by their \(\mathbb {F}_{p^2}\)-rational j-invariants [20, p. 146]. All supersingular curves in characteristic p belong to the same isogeny class [11].

In this paper, isogenies are non-constant rational maps between two elliptic curves that are also group homomorphisms. We work only with separable isogenies, meaning that the degree of any given isogeny is equal to the size of its kernel. Any subgroup \(G \subset E\) determines a unique isogeny (up to isomorphism) whose kernel is G; this isogeny can be computed using Vélu’s formulas [25].

For a prime \(\ell \ne p\), there are precisely \(\ell +1\) isogenies of degree \(\ell \) that emanate from a given supersingular curve. This induces a graph \(\mathcal {G}_\ell \) – called a supersingular isogeny graph – whose nodes are the supersingular isomorphism classes and whose edges are the degree-\(\ell \) isogenies (up to isomorphism) between them. The graph \(\mathcal {G}_\ell \) is connected and, with the exception of the nodes corresponding to \(j\)-invariants \(0\) and \(1728\), an \((\ell +1)\)-regular multigraph which satisfies the Ramanujan expansion property (see [3, §2.1]). Since every isogeny \(\phi :E \rightarrow E'\) has a unique (up to isomorphism) dual isogeny \(\hat{\phi } :E' \rightarrow E\), we can view \(\mathcal {G}_\ell \) as an undirected graph (excluding \(j=0,1728\)). We discuss the special node with \(j\)-invariant \(1728\) in Sect. 3.1.

For any n with \(p \not \mid n\), the set of n-torsion points, \(E[n] = \{P \in E(\bar{\mathbb {F}}_p) : [n]P = 0_E\}\), satisfies \(E[n] \cong \mathbb {Z}_n \oplus \mathbb {Z}_n\). Let \((\ell ,e) \in \{(2,e_2), (3,e_3)\}\). Following [3, Problem 5.2] (see also [1, §2.4]), we define a simplified version of the CSSI problem that underlies the SIDH and SIKE protocols within the above context as follows.

Definition 1

(CSSI). Given two supersingular elliptic curves \(E\) and \(E/G\) defined over \(\mathbb {F}_{p^2}\) such that up to isomorphism there exists a unique isogeny \(\phi : E \rightarrow E/G\) of degree \(\ell ^e\) with (cyclic) kernel \(\ker \phi = G\), the computational supersingular isogeny (CSSI) problem is to compute \(\phi \) or, equivalently, to determine a generator for \(G\).

2.2 The Meet-in-the-middle Claw Finding Algorithm

The most naive approach to solving CSSI is to perform a brute force search for G. Since the number of cyclic subgroups of order \(\ell ^e\) in \(E(\mathbb {F}_{p^2})\) is \((\ell +1)\ell ^{e-1}\), this takes \(O(\ell ^e)\) time. The claw finding algorithm uses the fact that we can view \(\mathcal {G}_\ell \) as an undirected graph, so that we can instead meet in the middle. Following [8] (and assuming for simplicity that e is even), we can build two trees of curves: the leaves of the first determine the set of all isomorphism classes \(\ell ^{e/2}\)-isogenous to that of E, those of the second the set of all classes \(\ell ^{e/2}\)-isogenous to that of E/G. While there are \((\ell +1)\ell ^{e/2-1}\) classes in each set, with overwhelmingly high probability there is only one class that lies in both sets [8, §5.1]. It corresponds to the node in the middle of the path from E to E/G, and once it is found, the CSSI problem is solved by composing the \(\ell ^{e/2}\)-isogeny emanating from E with the dual of that emanating from E/G. Assuming that all \((\ell +1)\ell ^{e/2-1}\) classes emanating from one of the sides can be computed and stored, solving the CSSI problem this way takes \(O(\ell ^{e/2})\) time.
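As a concrete illustration, the meet-in-the-middle strategy above can be sketched for arbitrary black-box functions; in the CSSI setting, \(f\) and \(g\) would map the half-length kernels on \(E\) and \(E/G\) to \(j\)-invariants. A minimal Python sketch (the toy functions in the test below are ours, purely for illustration):

```python
def claw(f, g, A, B):
    """Generic meet-in-the-middle claw finding: store f(a) for all a in A,
    then scan g(b) for all b in B -- O(|A| + |B|) time, O(|A|) memory."""
    table = {f(a): a for a in A}
    for b in B:
        if g(b) in table:
            return table[g(b)], b
    return None  # no claw exists
```

For CSSI, the table would hold all \((\ell +1)\ell ^{e/2-1}\) classes emanating from one side; this is exactly the \(O(p^{1/4})\) storage that Adj et al. deem infeasible for cryptographic sizes.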

It was not until the work of Adj et al. [1] that the classical complexity of this claw finding algorithm in the context of CSSI analysis was scrutinized. Given that \(\ell ^{e/2}\approx p^{1/4}\), and that the smallest prime p used to instantiate SIDH/SIKE prior to [1] was larger than \(2^{500}\), Adj et al. argue that the \(O(p^{1/4})\) storage required to solve the problem as described above is infeasible. Instead, they fix \(2^{80}\) as an upper bound on the number of units that can be stored, and analyze the runtime of the claw finding algorithm subject to this storage capacity. At any given time, an attacker can now only afford to store a small fraction of the \(O(\ell ^{e/2})\) nodes emanating from one side, try all nodes from the other side, and repeat this process until the CSSI problem is solved. Adj et al. therefore conclude that, for CSSI instances of cryptographic relevance, the meet-in-the-middle algorithm is more costly than the vOW algorithm described in the sequel.

2.3 Solving CSSI with van Oorschot-Wiener

Let \(S = \left\{ 0,1\right\} \times \left\{ 0,\ldots ,(\ell +1)\ell ^{e/2-1}-1\right\} \), \(E_0 = E\) and \(E_1 = E/G\). Each \((i,y)\in S\) represents a kernel subgroup on the elliptic curve \(E_i\). For example, for \(\ell = 2\), Adj et al. [1, §4.4] define a correspondence between \((i,y) = (i,(b,k))\in \left\{ 0,1\right\} \times \left( \left\{ 0,1,2\right\} \times \left\{ 0,\ldots ,2^{e/2-1}-1\right\} \right) \) and the cyclic subgroup \(\langle R_i\rangle \subset E_i\) with

$$ R_i = {\left\{ \begin{array}{ll} P_i + \big [b2^{e/2-1} + k\big ]Q_i &{}\text {if } b = 0,1\,, \\ \big [2k\big ]P_i + Q_i &{}\text {if } b = 2\,, \end{array}\right. }\text {where }~ \langle P_i,Q_i \rangle = E_i[2^{e/2}] \,. $$

Let \(h: S\rightarrow E_0(\mathbb {F}_{p^2}) \cup E_1(\mathbb {F}_{p^2}), (i,y)\mapsto R_i\) and let \(f : S \rightarrow S\) be the function that, on input of \((i,y)\), computes the isogeny of degree \(\ell ^{e/2}\) with kernel subgroup \(\langle R_i \rangle \) emanating from \(E_i\), evaluates the \(j\)-invariant \(j(E_i/\langle R_i \rangle )\), and maps it back to \(S\) using a function \(g\). In order to make \(f\) behave like a (pseudo-)random function on \(S\), the function \(g: \mathbb {F}_{p^2} \rightarrow S \) is chosen to be (pseudo-)random.
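The correspondence above is easy to make concrete. The following sketch decodes the second component \(y\) of \((i,y)\in S\) into coefficients \((s,r)\) such that \(R_i = [s]P_i + [r]Q_i\) (a basis of \(E_i[2^{e/2}]\) is assumed; \(\ell = 2\), \(e\) even; the function name is ours):

```python
def kernel_coeffs(y, e):
    """Decode y in {0, ..., 3 * 2^(e/2-1) - 1} into coefficients (s, r) such
    that R = [s]P + [r]Q generates the corresponding cyclic subgroup,
    where (P, Q) is a basis of E[2^(e/2)]."""
    half = 1 << (e // 2 - 1)
    b, k = divmod(y, half)        # b in {0, 1, 2}, k in [0, 2^(e/2-1))
    if b < 2:
        return (1, b * half + k)  # R = P + [b * 2^(e/2-1) + k]Q
    return (2 * k, 1)             # R = [2k]P + Q
```

All \(3\cdot 2^{e/2-1}\) outputs are distinct, matching the size of the second component of \(S\).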

A collision for f is a pair \(x, x' \in S\) with \(f(x)=f(x')\) and \(x\ne x'\). If f is modeled as a random function, the expected number of collisions (over the set of random functions) is around |S|/2 [23, §4.2]. For SIDH, we rely on the function \(h\) described above, while for SIKE, h is defined in Sect. 3.2 (in both cases for \(\ell = 2\)). Note that necessarily there exists one special collision, namely the one between the two subgroups (one on E and one on E/G) that map to the same j-invariant and solve the CSSI problem. Since this is the only useful collision, we follow convention [1, 23] and refer to it as the golden collision. For the remainder of this section we abstract away from the setting of isogenies, since it is not necessary to understand the van Oorschot-Wiener algorithm. That is, we assume that \(f\) is a truly random function on \(S\) for which we aim to find a single golden collision.

The vOW algorithm requires a proportion \(\theta \) of the points in |S| to be distinguished points. Whether or not a point is distinguished can be decided by any efficiently computable function \(S\rightarrow \{0,1\}\), so long as it ensures that close to \(\theta \cdot |S|\) of the |S| points are deemed distinguished. The algorithm searches for collisions of f by performing many iterative walks in parallel as follows. Each walk starts at a random point \(x_0 \in S\) and produces a trail of points \(x_i = f(x_{i-1})\) for \(i=1, 2, \dots \) until a distinguished point \(x_d\) is reached. The triple \((x_0,x_d,d)\) is then added to a single common list and the processor chooses a new starting point at random to produce a new trail.

Let w denote the number of triples of the form \((x_0,x_d,d)\) that can be stored in the list. To simplify memory access, van Oorschot and Wiener suggest making the memory address for a given triple a function of its distinguished point. Optimized parametrizations geared towards real-world CSSI instantiations will have \(w \ll \theta \cdot |S|\), i.e. one cannot store enough triples to account for all of the distinguished points. This gives rise to three scenarios when we attempt to store a given triple in memory. The first is that the memory at the given address is empty, in which case we write the triple there and continue; the second is that the memory is occupied by a triple with a different distinguished point, in which case we overwrite it with the new triple and continue; the third scenario is that the two triples contain the same distinguished point, in which case we have a collision and we must now check whether or not it is the golden collision. Let these two triples be \((x_0, x_d, d)\) and \((x_0',x_{d'}',d')\) with \(x_d=x_{d'}'\), and assume \(d'>d\). To check the collision, we walk \(x_0'\) forward by iterating \((x_0',d') \leftarrow (f(x_0'), d'-1)\) until \(d'=d\), so that both walks are the same number of steps from the distinguished point. We then step both walks forward in unison iterating \((x_0,x_0')\leftarrow (f(x_0),f(x_0'))\) until we find \(x_0 \ne x_0'\) such that \(f(x_0)=f(x_0')\). If this is the golden collision, we are done. Otherwise, we replace the old triple with the new triple and continue. Note that d, i.e. the length of a trail, is geometrically distributed with mean \(1/\theta \).

Van Oorschot and Wiener note that two undesirable occurrences can arise during their algorithm. First, a trail can collide with the starting point of another trail, which is called a Robin Hood. In practice, they note that \(\theta \) is small enough that this occurs rarely. If it does, we replace the triple in memory by the triple found last. Second, a walk can enter into a cycle that does not contain a distinguished point. In [23], the suggested workaround is to set a maximum trail length (e.g. \(20/\theta \)), and to abandon trails beyond this point.

Perhaps the most subtle aspect of the algorithm is that we are essentially forced to restart the above process many times, for many different instantiations of the random function f. As explained in [23, §4.2], there exist roughly |S|/2 collisions for f, and on average we have to find this many collisions before we encounter the golden collision. However, not all collisions are equally likely to occur; for any given f, the golden collision may have a very low probability of detection. For example, one or both of the two points that constitute the golden collision could have very few trails leading into them, or in the extreme case, none at all; if so we would have to be extremely lucky to find the collision, i.e. by randomly choosing the two points as starting points. Thus, van Oorschot and Wiener explain that the best average runtime is achieved by trying a function f until a requisite number of distinguished points have been found (how many will be discussed in the next subsection), and then restarting with a new function until the golden collision is found. Henceforth, we use \(f_n\) with \(n\in \mathbb {Z}\) instead of f, where the subscript indicates the different function versions.
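The full procedure of this subsection can be sketched in a few dozen lines. The following toy Python implementation uses a hash-based stand-in for \(f_n\) with a planted golden collision (a real CSSI instantiation would replace the hash by the isogeny-graph walk of Sect. 3.2; all parameters, names and the planting trick are ours, for illustration only):

```python
import hashlib
import random

def make_version(n, N, golden):
    """Hash-based stand-in for the function version f_n on {0, ..., N-1}.
    The golden collision (a, b) is planted so that f_n(a) == f_n(b) for all n,
    mimicking the fixed golden collision of a CSSI instance."""
    a, b = golden
    def f(x):
        if x == b:
            x = a  # plant f(b) = f(a)
        h = hashlib.sha256(f"{n}:{x}".encode()).digest()
        return int.from_bytes(h[:8], "big") % N
    return f

def locate(f, new, old):
    """Given two trails ending in the same distinguished point, walk the
    longer one forward, then step both in unison to the colliding pair."""
    (x, dx), (y, dy) = new, old
    if dx < dy:
        x, y, dx, dy = y, x, dy, dx
    for _ in range(dx - dy):
        x = f(x)
    if x == y:
        return None  # Robin Hood: one trail starts on the other
    while True:
        fx, fy = f(x), f(y)
        if fx == fy:
            return x, y
        x, y = fx, fy

def vow(N, w, golden, dist_bits, rng, max_versions=5000):
    """Toy vOW golden-collision search: theta = 2^-dist_bits, beta = 10."""
    mask = (1 << dist_bits) - 1
    for n in range(1, max_versions + 1):
        f, mem, produced = make_version(n, N, golden), {}, 0
        while produced < 10 * w:          # switch f_n after beta * w points
            x0 = rng.randrange(N)
            x, d = x0, 0
            while x & mask:               # walk until a distinguished point
                x, d = f(x), d + 1
                if d > 20 << dist_bits:
                    break                 # abandon trails stuck in cycles
            else:
                addr = (x >> dist_bits) % w
                if addr in mem and mem[addr][1] == x and mem[addr][0] != x0:
                    pair = locate(f, (x0, d), (mem[addr][0], mem[addr][2]))
                    if pair is not None and set(pair) == set(golden):
                        return n, pair
                mem[addr] = (x0, x, d)    # write or overwrite the triple
                produced += 1
    return None
```

For instance, `vow(1024, 32, (123, 457), 2, random.Random(7))` recovers the planted pair after a moderate number of function versions.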

2.4 Complexity Analysis of van Oorschot-Wiener

Van Oorschot and Wiener give a complexity analysis for finding a golden collision [23, §4.2], but note that it is “flawed”, giving multiple reasons as to why a precise closed formula for the runtime is difficult to achieve. Instead, after obtaining a general form for the runtime formula, they choose to determine several of the constants experimentally. We reproduce this flawed analysis, since we refer back to it throughout.

Recall that w triples \((x_0, x_d, d)\) can be stored in memory. Whenever the memory is full, the average number of points on trails leading to those w distinguished points is \(w/\theta \). Writing \(N = |S|\) and given any element of \(S\), (uniformly) randomly generated as output of the random function \(f_n\), the probability of it being on the pre-existing trails is therefore \(w / (N\theta )\). Thus, on average we compute \(N\theta /w\) points per collision. Checking a collision using the method described above requires \(2/\theta \) steps on average, which gives the total average cost per collision as \(N\theta /w+2/\theta \). Taking \(\theta =\sqrt{2w/N}\) minimizes this cost to \(\sqrt{8N/w}\). As \(N/2\) collisions are required (on average) to find the golden collision, we require (on average) \(\sqrt{2N^3/w}\) function iterations to solve the CSSI problem.
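This cost model is easy to evaluate numerically. The following sketch reproduces the derivation above for hypothetical inputs (the function name and defaults are ours):

```python
import math

def vow_runtime_estimate(N, w, m=1, t=1.0):
    """Evaluate the (flawed) vOW cost model: optimal theta, total function
    evaluations and runtime, for state-set size N, memory w, m processors
    and per-evaluation cost t."""
    theta = math.sqrt(2 * w / N)               # minimizes the cost below
    per_collision = N * theta / w + 2 / theta  # trail cost + locate cost
    total_evals = (N / 2) * per_collision      # ~N/2 collisions needed
    return theta, total_evals, total_evals * t / m
```

At \(\theta = \sqrt{2w/N}\) the per-collision cost is \(\sqrt{8N/w}\) and the total number of evaluations is \(\sqrt{2N^3/w}\), matching the text.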

Let \(m\) be the number of processors run in parallel and \(t\) the time taken to evaluate the function \(f_n\). Since the algorithm parallelizes perfectly [23, §3] (in theory), the total runtime \(T\) required to find the golden collision is

$$\begin{aligned} T = \frac{2.5}{m}\sqrt{N^3 / w}\cdot t\,, \end{aligned}$$
(1)

where \(2.5\) is one of the constants determined experimentally in [23]. Some adjustments need to be made to the parameters because the phase where the memory is being filled with distinguished points is not accurately captured in the analysis. To describe the true performance of the algorithm, the fraction of distinguished points is set to \(\theta = \alpha \sqrt{w/N}\) and the optimal constant \(\alpha \) is determined experimentally. The heuristic analysis by van Oorschot and Wiener suggests \(\alpha = 2.25\), which is verified by Adj et al. for SIDH.

Equation (1) shows that the memory size of w distinguished points has a crucial influence on the runtime of the vOW algorithm. It is therefore important to store distinguished points as compactly as possible. If the property for a point to be distinguished is a number of leading or trailing zeroes in its bit representation, these zeroes do not have to be stored, shortening the bit length of \(x_d\) in the triple \((x_0, x_d, d)\). Given a distinguished point rate \(\theta \), the number of zeroes would be \(\lfloor -\log \theta \rfloor \). The counter d must be large enough to store the number of steps in the longest trail, for example d must have \(\lceil \log (20/\theta ) \rceil \) bits. A distinguished point can thus be stored with about \(2\log N + \log 20\) bits as most of the counter can be stored in the space of the omitted zero bits.
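A small helper illustrating this storage estimate (the constants follow the text above; the inputs in the example are hypothetical):

```python
import math

def triple_storage_bits(N, theta):
    """Approximate storage for one triple (x0, xd, d): the floor(-log2(theta))
    forced zero bits of the distinguished point xd are omitted, and the
    counter d must hold the maximum trail length 20/theta."""
    z = math.floor(-math.log2(theta))         # omitted zero bits of xd
    x0_bits = math.ceil(math.log2(N))
    xd_bits = x0_bits - z
    d_bits = math.ceil(math.log2(20 / theta))
    return x0_bits + xd_bits + d_bits
```

For hypothetical \(N = 2^{64}\) and \(\theta = 2^{-20}\) this gives 133 bits, close to the estimate \(2\log N + \log 20 \approx 132.3\).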

This deduction of the total runtime assumes that \(f_n\) behaves like an average random function. The average behavior can be achieved by using a number of different function versions \(f_n\) as explained above. To decide how long one such function \(f_n\) should be run before moving on, van Oorschot and Wiener introduce the constant \(\beta \). The function version needs to be changed and distinguished points in memory discarded after \(\beta \cdot w\) distinguished points have been produced. This constant is determined heuristically, analogously to the determination of \(\alpha \). For that purpose, a single function version \(f_n\) is fixed and run until \(\beta \cdot w\) distinguished points are produced. In the meantime, the number of function iterations (\(i\)) and distinct collisions (\(c\)) are counted. The number of function versions required can then be approximated as \(N / (2c)\), while the expected runtime can be estimated as \(iN/(2c)\). It is concluded that the latter is minimal for \(\beta = 10\).

We note that this experiment is extremely useful. Namely, it provides a very close estimate on the runtime without having to complete the full algorithm. For that reason, we run the same experiment to estimate the impact of improved collision checking (see Fig. 3 in Sect. 3.4).

3 Parallel Collision Search for Supersingular Isogenies

In this section we describe optimizations that we employ when specializing the van Oorschot-Wiener algorithm to SIKE. We discuss improvements based on the SIKE design in Sect. 3.1 and explain the specific instantiation of the vOW algorithm in Sect. 3.2. Finally, we show how to use local memory for precomputation in Sect. 3.3 and to improve collision locating in Sect. 3.4.

3.1 Solving SIKE Instances

Although the problem underlying SIKE is closely related to the original SIDH problem, there are slight differences. In this section, we discuss their impact on the vOW algorithm and show how to reduce the search space from size \(3\cdot 2^{e_2-1}\) (resp. \(4\cdot 3^{e_3-1}\)) to \(2^{e_2-4}\) (resp. \(3^{e_3-1}\)).

As usual, let \(\{\ell ,m\} = \{2,3\}\) and let \(\phi :E\rightarrow E_A\) be an isogeny of degree \(\ell ^{e_\ell }\) for which the goal is to retrieve the (cyclic) kernel \(\ker \phi \). We opt to represent curves in Montgomery form [13] \(E_A: y^2 = x^3 + Ax^2 + x\) with constant \(A\in \mathbb {F}_{p^2}\). The Montgomery form allows for very efficient arithmetic, which is why it has been used in the SIKE proposal. Further note that, if \(\{U,V\}\) is a basis of \(E[m^{e_m}]\), then the points \(\phi (U),\phi (V)\) are given as well. But as we do not use these points on \(E_A\) and assume the simplified version of the CSSI problem as presented in Definition 1, we simply think of a challenge as being given by the curve \(E_A \).

Since isogenies of degree \(\ell ^{e_{\ell }}\) are determined by cyclic subgroups of size \(\ell ^{e_\ell }\), there are exactly \((\ell +1)\ell ^{e_\ell -1}\) of them. This forms the basis for the general algorithm specified for SIDH by Adj et al. [1], essentially defining a random function on the set of cyclic subgroups.

Moving to SIKE, we observe that an important public parameter of its specification is the starting curve \(E_0\). Since \(p = 2^{e_2}\cdot 3^{e_3}-1\) is congruent to \(3\) modulo \(4\) for \(e_2>1\), the curve \(y^2 = x^3 + x\) is supersingular for any choice of (large) \(e_2\) and \(e_3\), and this curve was chosen as the starting curve in the Round-1 SIKE specification. In Round 2, the starting curve has been changed to \(y^2 = x^3 + 6x^2 + x\).

Choice of Secret Keys. Any point \(R\) of order \(\ell ^{e_\ell }\) on \(E_0\) satisfies \(R = [s]P + [r]Q\) for \(r,s\in \mathbb {Z}_{\ell ^{e_\ell }}\), where \(s\) and \(r\) do not both vanish modulo \(\ell \). The SIKE specification [7, §1.3.8] assumes \(s\) to be invertible and simply sets \(s= 1\). This choice simplifies implementations by making the secret key a sequence of random bits that is easy to sample. When \(\ell =2\), an appropriate choice of \(P,Q\) allows to avoid exceptional cases in the isogeny arithmetic [17, Lemma 2]. The main consequence of this is that the key space has size \(\ell ^{e_\ell }\) as opposed to \((\ell +1)\ell ^{e_\ell -1}\).

The Initial Step. Our first observation is that although nodes in the isogeny graph generally have in-degree \(\ell +1\), this is not true for nodes adjacent or equal to \(j = 0\) or \(j= 1728\). In particular, the curve \(E_0 : y^2 = x^3 + x\) has \(j\)-invariant \(j = 1728\) which in the case of \(\ell =2\) has in-degree 2, while its (only) adjacent node has in-degree 4. This is shown in Fig. 1a. For \(\ell = 3\) the curve has in-degree 2, while its adjacent nodes have in-degree 5; see Fig. 1b. This illustrates that although the number of distinct kernels is \(\ell ^{e_{\ell }}\), the number of distinct walks (say, as a sequence of \(j\)-invariants) in the isogeny graph is only \(2^{e_2-1}\) (resp. \(2\cdot 3^{e_3-1}\)) for \(\ell =2\) (resp. \(\ell =3\)). We align the two (without loss of precision) by starting our walks from the curve \(E_6 : y^2 = x^3 + 6x^2 + x\) when \(\ell =2\). If \(\ell = 3\), we can define the kernel on a curve in the class of the left or right adjacent node to \(j = 1728\) (the choice indicated by a single bit).

The reason for this behavior is that \(E_0\) has a non-trivial automorphism group containing the distortion map \(\psi \) that maps \((x,y)\mapsto (-x,iy)\) (with inverse \(-\psi \)). For any kernel \(\langle R\rangle \) of size \(\ell ^{e_\ell }\) we have \( E_0/\langle R\rangle \cong E_0/\langle \psi (R)\rangle \) while \(\langle R\rangle \ne \langle \psi (R)\rangle \), essentially collapsing the two kernels into a single walk in the graph.

Fig. 1.

Isogeny graphs starting from curves \(y^2 = x^3 + Ax^2 + x\) where nodes are labeled by their \(A\)-coefficient.

Remark 1

The presence of the distortion map on the node with \(j = 1728\) thus leads to loops and double edges in the graph, which reduces the entropy of the private and public keys. This security reduction for SIDH or SIKE can be easily circumvented by moving the starting node from \(E_0\) to \(E_6\) (with \(j(E_6) = 287496\)), which avoids the loop and double edge for \(\ell = 2\). More concretely, setting up a torsion basis \(\left\{ P,Q\right\} \) of \(E_6[2^e]\) such that \([2^{e-1}]Q = (0,0)\) and choosing private keys \(r\in \mathbb {Z}_{2^e}\) corresponding to kernels \(\langle P + [r]Q\rangle \) achieves this. This suggestion has indeed been included in the Round-2 update to the SIKE specification. Note that the Round-1 SIKE specification set up \(Q\) as a point of order \(2^e\) defined over \(\mathbb {F}_p\) [7, §1.3.3]. Such a point does not exist on \(E_6\), as \(E_6[2^e](\mathbb {F}_p)\cong \mathbb {Z}_{2^{e-1}}\times \mathbb {Z}_2\). The only consequence is that the representation of \(Q\) is longer, as it lies in \(E_6(\mathbb {F}_{p^2})\setminus E_6(\mathbb {F}_p)\).

It is not obvious how the nodes corresponding to \(E_6\) and \(E_0\) are connected in the \(3\)-isogeny graph, and there is no reason to believe that they are close. Therefore, we believe that moving to \(E_6\) alleviates issues with double edges in the \(3\)-isogeny graph as well.

The Final Step. Recall that our elliptic curves are represented in Montgomery form and that isogenies of degree \(2^{e_2}\) are computed as a sequence of \(4\)-isogenies. As already noted in [3, §4.3.2], the choice of arithmetic in SIKE implies that the points \((1,\pm \sqrt{A+2}) \in E_A\) lie in the kernel of the dual of the secret isogeny. Hence, the final step can be immediately recomputed from the public key. Consequently, \(E_A/\langle (1,\pm \sqrt{A+2})\rangle \) is isogenous to \(E_0\) by an isogeny of degree \(2^{e_2-2}\), and to \(E_6\) by an isogeny of degree \(2^{e_2-3}\). Therefore, replacing \(E_A\) by \(E_A/\langle (1,\pm \sqrt{A+2})\rangle \) reduces the number of distinct walks to \(2^{e_2-3}\) for \(\ell =2\).
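The fact that the points with \(x\)-coordinate \(1\) generate a kernel of order \(4\) can be checked with x-only Montgomery arithmetic: the affine doubling map sends \(x=1\) to \(x=0\), i.e. to the 2-torsion point \((0,0)\). A sketch over the toy SIKE-shaped prime \(p = 2^4\cdot 3^3-1 = 431\) (the prime and curve coefficient are our illustrative choices; the doubling formula is the standard one for Montgomery curves):

```python
def xdbl(x, A, p):
    """Affine x-only point doubling on the Montgomery curve
    y^2 = x^3 + A*x^2 + x over GF(p)."""
    num = (x * x - 1) ** 2 % p
    den = 4 * x * (x * x + A * x + 1) % p
    return num * pow(den, -1, p) % p

p, A = 431, 6                 # toy prime p = 2^4 * 3^3 - 1, curve E_6
assert xdbl(1, A, p) == 0     # x = 1 doubles to (0, 0), which is 2-torsion
```

Since \((0,0)\) has order 2, the points \((1,\pm \sqrt{A+2})\) have order 4, consistent with the kernel of the dual of the final \(4\)-isogeny.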

For \(\ell = 3\), the representative \(E_A\) of its isomorphism class can be obtained as the co-domain curve of a \(3\)-isogeny starting from any of its adjacent nodes. As far as we know, this does not leak any information about the final \(3\)-isogeny.

Remark 2

To address the issue of leaking the final kernel, we notice that for any \(\bar{A}\in \mathbb {F}_{p^2}\) with \(j(E_{\bar{A}}) = j(E_A)\) we have

$$\begin{aligned} \bar{A} \in \left\{ \pm A, \pm (3x_2 + A)/\sqrt{x_2^2-1}, \pm (3z_2 + A)/\sqrt{z_2^2-1} \right\} \,, \end{aligned}$$
(2)

where \(x_2,z_2\in \mathbb {F}_{p^2}\) are chosen such that \(x^3 + Ax^2 + x = x(x-x_2)(x-z_2)\). That is, the isomorphism class contains exactly six Montgomery curves. One can show that each of the 6 distinct 4-isogenies emanating from \(j(E_A)\) can be computed by selecting \(\bar{A}\) as above and using a kernel point (of order 4) with \(x\)-coordinate 1. Therefore, randomly choosing \(\bar{A}\) from any of the options in (2) is equivalent to randomizing the kernel of the final isogeny. Unfortunately, selecting \(\bar{A}\) to be anything other than \(\pm A\) seems to require an expensive square root. For this reason, we do not suggest full randomization, but emphasize that the random selection of one of \(\pm A\) leads to a single bit of randomization at essentially no computational effort. As a result, one would only leak the kernel of the final \(2\)-isogeny (with kernel \((0,0)\)) instead of the last \(4\)-isogeny.

The Frobenius Endomorphism. Every isomorphism class can be represented by an elliptic curve \(E\) defined over \(\mathbb {F}_{p^2}\) and has an associated Frobenius map \(\pi : E \rightarrow E^{(p)}, (x,y)\mapsto (x^p,y^p)\). For any kernel \(\langle R\rangle \subset E\), we have

$$ j(E/\langle R\rangle )^p = j(E^{(p)}/\langle \pi (R)\rangle ) \,. $$

As a result, it suffices to search for a path to a curve with \(j\)-invariant equal to \(j(E_A)\) or \(j(E_A)^p\). In other words, we define an equivalence relation on the set of \(j\)-invariants by \(j_0\sim j_1\) if and only if \(j_1\in \{j_0,j_0^p\}\). Finding a path to \(E_A\) reduces to finding a path to any representative of the class \([j(E_A)]\). In Fig. 2 we show how the classes propagate through the \(2\)-isogeny graph starting at \(E_6\). A very similar structure appears in the \(3\)-isogeny graph. Note that we assume that the isogeny degree is approximately \(\sqrt{p}\), making it unlikely for endomorphisms of that degree to exist. As such, the leaves of trees such as in Fig. 2 are most probably all distinct.

Fig. 2.

Part of the \(2\)-isogeny graph for any large \(p = 2^{e_2}\cdot 3^{e_3}-1\) starting at \(E_6 : y^2 = x^3 + 6x^2 + x\). Black dots represent curves defined over \(\mathbb {F}_p\), \(j\)-invariants in the same equivalence class are denoted by equal numbers. All edges represent \(2\)-isogenies. In particular, there are exactly \(2^3 + 1 = 9\) classes at distance 4 from \(E_6\).

Although the number of classes is approximately half the number of \(j\)-invariants, it is perhaps not obvious how to translate this into a computational advantage. First assume that \(\ell = 2\), and that the optimizations specified above are taken into consideration. That is, we start on the curve \(E_6\) and look for an isogeny of degree \(2^{e_2-3}\) to the curve \(E_A\). As usual, kernels are of the form \(P + [r]Q\) for some basis \(\{P,Q\}\). Note that there is no reason to choose \(P\) and \(Q\) exactly as (multiples of) those in the SIKE specification, so we expand on a particularly simple choice here.

Recall first that \(\#E_6(\mathbb {F}_p) = 2^{e_2}\cdot 3^{e_3}\) [20, Exercise V.5.10]. Since the \(\mathbb {F}_p\)-rational endomorphism ring of \(E_6\) is isomorphic to one of \(\mathbb {Z}[\pi ]\) or \(\mathbb {Z}[(1+\pi )/2]\) [4, Proposition 2.4], a result by Lenstra [10, Theorem 1(a)] tells us that

$$ E_6(\mathbb {F}_p) \cong {\left\{ \begin{array}{ll} \mathbb {Z}_{3^{e_3}}\times \mathbb {Z}_{2^{e_2}} &{}\text {if } {{\,\mathrm{End}\,}}_{\mathbb {F}_p}(E)\cong \mathbb {Z}[\pi ] \,, \\ \mathbb {Z}_{3^{e_3}}\times \mathbb {Z}_{2^{e_2-1}}\times \mathbb {Z}_2 &{}\text {if } {{\,\mathrm{End}\,}}_{\mathbb {F}_p}(E)\cong \mathbb {Z}[\frac{1+\pi }{2}] \,. \end{array}\right. } $$

Consequently, there exists an \(\mathbb {F}_p\)-rational point of order \(2^{e_2-3}\) and we can choose \(Q\) to be this element. Moreover, \(p\equiv 7\!\mod 8\) implies that \(\sqrt{2}\in \mathbb {F}_p\), and therefore that \(E_6[2]\subset E_6(\mathbb {F}_p)\). In other words, \(\pi \) acts trivially on points of order 2. Since \(\pi \) fixes \(Q\) and has eigenvalues \(\pm 1\), for any other element \(P\) such that \(\langle P,Q\rangle =E_6[2^{e_2-3}]\), the action of Frobenius is given by

$$ \pi \!\mid _{\langle P,Q\rangle } = \begin{pmatrix} -1 &{} 0 \\ \mu &{} 1 \end{pmatrix} \,,\quad \text {for some } \mu \in \mathbb {Z}_{2^{e_2-3}} \,. $$

Note that \([2^{e_2-4}]P\) has order 2 and is therefore fixed under \(\pi \). As a result, \(\mu \) is even. Replacing \(P\) by \(P - [\frac{\mu }{2}]Q\) leads to a basis \(\{P,Q\}\) such that \(\pi (P) = -P\) and \(\pi (Q) = Q\). Note that the value of \(\mu \) can easily be found (e.g. by using the Pohlig-Hellman algorithm [19]) since the group order is extremely smooth.
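As a sanity check, the effect of this basis change can be simulated on a toy model in which \(E_6[2^k]\) is replaced by \((\mathbb{Z}_{2^k})^2\) and \(\pi\) by the matrix above. The sketch below is ours; the exponent \(k\) and the value of \(\mu\) are illustrative stand-ins, not SIKE parameters.

```python
# Toy model of the basis change that diagonalizes Frobenius on E_6[2^k].
# A point [a]P + [b]Q is modeled as the vector (a, b), and pi acts by the
# matrix ((-1, 0), (mu, 1)) from the text.
k = 10                      # illustrative exponent, standing in for e_2 - 3
N = 2**k
mu = 2 * 37                 # mu is even, as argued in the text
half_mu = mu // 2

def pi(v):
    """Frobenius action on (a, b) ~ [a]P + [b]Q."""
    a, b = v
    return (-a % N, (mu * a + b) % N)

def to_new_basis(v):
    """Rewrite [a]P + [b]Q in the basis {P' = P - (mu/2)Q, Q}."""
    a, b = v
    return (a, (b + a * half_mu) % N)

def from_new_basis(v):
    a, b = v
    return (a, (b - a * half_mu) % N)

# Check: in the new basis, pi acts as diag(-1, 1).
for a in range(0, N, 97):
    for b in range(0, N, 89):
        assert to_new_basis(pi(from_new_basis((a, b)))) == (-a % N, b % N)
print("pi acts as diag(-1, 1) on the basis {P - (mu/2)Q, Q}")
```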

Given such a basis \(\{P,Q\}\), the conjugate of the \(j\)-invariant determined by \(\langle R = P + [r]Q\rangle \) is given by the isogeny with kernel \(\langle -\pi (R) = P + [2^{e_2-3}-r]Q\rangle \). As a result, every class \(\{j,j^p\}\) can be uniquely represented by \(r\in \{0,1,\ldots ,2^{e_2-4}\}\). If we start the algorithm by separately testing \(r= 2^{e_2-4}\), the remainder can be reduced to searching for kernels \(\langle P + [r]Q\rangle \) where \(r\in \{0,1,\ldots ,2^{e_2-4}-1\}\). This reduces the search space to size \(2^{e_2-4}\).
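The resulting reduction of the search space can be verified by a direct count. The following sketch (with an illustrative, non-cryptographic exponent) pairs each \(r\) with \(2^{e_2-3}-r\) and counts the unique class representatives.

```python
# Sketch: pairing r with 2^(e2-3) - r (mod 2^(e2-3)) groups kernels into
# Frobenius classes {j, j^p}; counting representatives recovers the
# claimed search-space size. e2 is illustrative, not a SIKE parameter.
e2 = 20
M = 2**(e2 - 3)                       # number of kernels <P + [r]Q>
canonical = lambda r: min(r, (M - r) % M)
classes = {canonical(r) for r in range(M)}
# Representatives are 0, 1, ..., 2^(e2-4); the self-paired r = 2^(e2-4)
# is tested separately, leaving a search space of size 2^(e2-4).
assert classes == set(range(2**(e2 - 4) + 1))
print(len(classes) - 1)  # remaining search space: 2^(e2-4) = 65536
```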

By a completely analogous (and even simpler) argument, we can fix a basis of \(E[3^{e_3-1}]\) on either of the two adjacent nodes of \(E_0\) in the \(3\)-isogeny graph such that the action of \(\pi \) on this basis is described by a diagonal matrix with eigenvalues \(\pm 1\). Similar to the case of \(\ell = 2\), this allows a reduction of the search space from \(2\cdot 3^{e_3-1}\) to (approximately) \(3^{e_3-1}\).

Overall, the presence of the Frobenius endomorphism on the node with \(j = 1728\) reduces the number of equivalence classes that are at a given distance from \(j\). While the Round-2 SIKE specification has moved away from \(j = 1728\), the curve \(E_6\) still has a Frobenius endomorphism. Note that in this case it is not helpful to differentiate between \(j\)-invariants in the same equivalence class: since (almost) every equivalence class contains 2 representatives at a given depth, one fewer bit of randomness is needed to compute an isogeny of the same degree (see e.g. Fig. 2, where the final step could always move to the left node). These issues can be avoided by moving to a curve on which the Frobenius map is not an endomorphism. While this prevents the Frobenius trick, choosing such a curve is a subtle issue (see Remark 3).

Remark 3

The curve \(E_0 : y^2 = x^3 + x\) has a known endomorphism ring [20, III.4.4], which is helpful in certain attack scenarios [16]. Although one would prefer to start on a random node in the graph, there is no known way of selecting one at random other than taking a random walk in the isogeny graph. However, such a walk cannot be made public, and it is unclear how to achieve this verifiably.

3.2 Applying van Oorschot-Wiener to SIKE

In this section, we fix \(\ell = 2\) and describe in detail how to implement the van Oorschot-Wiener algorithm (with parameters defined as in Sects. 2.3 and 2.4). We point out a subtle mistake in the algorithm (appearing already in the original paper [23] and also used in the work of Adj et al. [1]) and show how to overcome it. The solution involves using a different notion of distinguishedness, and it allows us to achieve the average vOW runtime for a fixed instance. This allows us to focus on one particular instance, where we are then able to use precomputation in order to analyze the algorithm's behavior (when applied to SIKE) at a much larger scale.

Again, we assume we are given a challenge curve \(E_A\) that is isogenous of degree \(2^{e_2-3}\) to \(E_6\) and aim to find the isogeny. We write \(e = e_2 / 2\) and let \(S = \{0,1,\ldots ,2^{e-1}-1\}\). Fix points \(P,Q\in E_6\) and \(U,V\in E_A\) such that \(E_6[2^{e-1}] = \langle P,Q\rangle \) and \(E_A[2^{e-2}] = \langle U,V\rangle \), where \(\pi (P) = -P\) and \(\pi (Q) = Q\).

The Step Function. We begin by describing the function family \(f_n\). As \(f_n\) maps through classes (of size 1 or 2) in \(\mathbb {F}_{p^2}\), we first define a canonical representative of the class. Since the conjugate of \(j = a + b\cdot i\in \mathbb {F}_{p^2}\) is \(\overline{j} = a - b\cdot i\), we say that \(j\) is even whenever \({{\,\mathrm{lsb}\,}}(b) = 0\). Using >> to denote the right-shift operator, we define the function \(h\) from \(S\) to the set of supersingular \(j\)-invariants by

$$ h : r \mapsto {\left\{ \begin{array}{ll} j &{}\text {if } j \text { is even} \\ \overline{j} &{}\text {otherwise } \end{array}\right. } \,,~ \text {for } j = {\left\{ \begin{array}{ll} j(E_6 / \langle P + [r\,\texttt {>>}\,1]Q\rangle ) &{}\text {if } {{\,\mathrm{lsb}\,}}(r) = 0 \\ j(E_A / \langle U + [r\,\texttt {>>}\,1]V\rangle ) &{}\text {if } {{\,\mathrm{lsb}\,}}(r) = 1 \end{array}\right. } \,. $$

In other words, the least significant bit of \(r\) determines whether we compute an isogeny starting from \(E_6\) or \(E_A\), while we always ensure to end up on an even \(j\)-invariant. Finally, we define \(f_n : S\rightarrow S\) by \(f_n(r) = g_n(h(r))\), where \(g_n\) is a hash function indexed by \(n\) that maps \(h(r)\) back into \(S\). More concretely, we let \(g_n\) be the extendable-output function (XOF) based on AES in CBC mode using the AES-NI instruction set (see Sect. 4), with the initialization vector and plaintext set to 0 and the key determined by \(n\).
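A minimal sketch of this step-function structure follows. Both substitutions in it are ours, made to keep the sketch dependency-free: the expensive isogeny evaluation \(h\) is mocked by a random lookup table with one planted collision (mirroring the precomputation idea used later for benchmarking), and SHAKE-128 stands in for the AES-based XOF \(g_n\).

```python
import hashlib
import random

# Sketch of the step function f_n = g_n o h. The set S is {0,...,2^(e-1)-1}.
e = 16
S_BITS = e - 1
S_SIZE = 1 << S_BITS

random.seed(1)
# h: S -> "j-invariants"; injective except for one golden collision.
h_table = random.sample(range(10 * S_SIZE), S_SIZE)
golden_r, golden_s = 123, 456
h_table[golden_s] = h_table[golden_r]    # plant h(r) = h(s) with r != s

def h(r):
    return h_table[r]

def g(n, j):
    """Hash the j-invariant back into S, keyed by the function version n."""
    digest = hashlib.shake_128(n.to_bytes(8, "big") +
                               j.to_bytes(8, "big")).digest(8)
    return int.from_bytes(digest, "big") % S_SIZE

def f(n, r):
    return g(n, h(r))

# The golden collision survives in every function version f_n.
assert all(f(n, golden_r) == f(n, golden_s) for n in range(20))
```

Note that, exactly as in the text, the golden collision is a property of \(h\) alone, so it is shared by all versions \(f_n\).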

Note that the Frobenius map \(\pi \) is an endomorphism on \(E_6\), but not (necessarily) on \(E_A\). Given \(r\in \{0,1,\ldots ,2^{e-2}-1\}\), kernels of the form \(P + [r]Q\) determine isogenies of degree \(2^{e-1}\) starting from \(E_6\), yet it follows from Sect. 3.1 that they correspond to exactly \(2^{e-2}\) (distinct) equivalence classes of \(j\)-invariants. Kernels of the form \(U + [r]V\) determine \(2^{e-2}\)-isogenies from \(E_A\), all of which lead to distinct, non-conjugate \(j\)-invariants. Hence \(h\) maps into a set of size \(2^{e-1}-1\) and is injective except for a single collision, given by the isogeny from \(E_6\) to \(E_A\).

Distinguished Points and Memory. Assume the memory has size \(w\), a power of 2. This is not technically necessary, but it simplifies both the arguments and the implementation. Elements of \(S\) are represented by exactly \(e-1\) bits and we assume that \(\log w\ll e-1\).

Adj et al. [1, §4.4] determine the memory position of a triple \((r_0,r_d,d)\) using the \(\log w\) least significant bits of \({{\,\mathrm{MD5}\,}}(3,r_d)\). Moreover, the value \(r_d\) is distinguished if and only if \({{\,\mathrm{MD5}\,}}(2,r_d)\le 2^{32}\theta \!\mod 2^{32}\) (viewing the output of \({{\,\mathrm{MD5}\,}}\) as an integer). Although the algorithm will run, this approach has several complications.

  1. Calling a hash function at every step to check for distinguishedness causes overhead. Similarly, requiring a hash function computation for every read and write operation to memory causes unnecessary overhead.

  2. The algorithm (typically) requires the use of several functions \(f_n\) for distinct \(n\). Since the memory location of elements is independent of \(n\), distinguished points \((r_0,\,r_d,\,d)\) found by \(f_n\) and \((s_0,\,s_e,\,e)\) found by \(f_{n+1}\) (say), with \(s_e = r_d\), will be classified as a valid collision, triggering the backtracking subroutine. This will fail since \(f_n\) and \(f_{n+1}\) give rise to different random functions, leading to work going to waste. To counteract this, one could keep track of \(n\) in memory. As this is costly, the approach of Adj et al. is to zero out the memory when the maximum number of distinguished points for a given \(n\) is reached. This can get expensive as well, especially in the case of large distributed memory.

  3. The distinguishedness property is independent of \(n\). Although the runtime of the algorithm is estimated to be \(2.5\sqrt{|S|^3/w}\) by van Oorschot and Wiener [23, §4.2], this is only true if one takes the average over all collisions. However, for SIKE (and whenever one wants to find a specific collision), its input values are fixed. That is, if the golden collision of the function \(f\) is determined by values \(r,s\in S\) such that \(f(r) = f(s)\), then the golden collision of \(f_n\) (for all \(n\)) also occurs for \(r\) and \(s\). The runtime will be above average if one or both of \(r\) and \(s\) are distinguished. This is because the algorithm samples a new starting value every time it reaches \(r\) or \(s\), only computing \(f_n(r)\) or \(f_n(s)\) when they are sampled as initial values. Since distinguishedness is independent of \(n\), this behavior propagates throughout all the \(f_n\).

We give a solution to all of these problems. First, we note that elements of \(S\) are uniform bit strings of length \(e-1\). Since the value \(r_d\) of the triple is always the output of the (random) step function, we simply let the \(\log w\) least significant bits determine the memory location. More precisely, the triple \((r_0,r_d,d)\) is stored in the memory location indexed by \((r_d+n)\bmod w\). Notice that we choose the location to be dependent on \(n\). Therefore, if two triples \((r_0,\,r_d,\,d)\) and \((s_0,\,s_e,\,e)\) with \(s_e = r_d\) are distinguished under functions \(f_n\) and \(f_m\) respectively (with \(n \ne m\)), they will be stored at different locations \((r_d + n) \bmod w \ne (s_e + m) \bmod w\), sparing us the backtracking. Moreover, any other value \((t_0,\,t_c,\,c)\) that is stored during function version \(f_m\) at the address of \((r_0,\,r_d,\,d)\) will have \(t_c \ne r_d\), and will not be a collision, sparing us the backtracking. Of course, a memory address could be written to during both \(f_n\) and \(f_{n+w}\) and never in between. But for reasonable values of \(n\) and \(w\) this is highly unlikely, and it would only incur the (relatively small) cost of checking for an invalid collision when it happens.
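The \(n\)-dependent addressing can be sketched in a few lines; the function names `store` and `lookup` and all parameter values are ours, for illustration only.

```python
# Sketch of the n-dependent addressing: a triple (r0, rd, d) produced under
# function version n is stored at index (rd + n) mod w. Triples from
# different versions that share the same rd land at different addresses
# (as long as the versions are fewer than w apart), so they are never
# compared and no fruitless backtracking is triggered.
w = 2**10                        # memory size, a power of two
memory = [None] * w

def store(triple, n):
    _, rd, _ = triple
    memory[(rd + n) % w] = triple

def lookup(rd, n):
    return memory[(rd + n) % w]

rd = 0x2A7
store((5, rd, 17), n=3)          # distinguished under f_3
assert lookup(rd, 4) is None     # f_4 probing the same rd misses it
assert lookup(rd, 3) == (5, rd, 17)
```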

Secondly, we define a better distinguishedness property. Since it should be independent of the memory location, we use the value of \(r_d\,\texttt {>>}\,\log w\). Using all of the remaining \(e-1-\log w\) independent bits of \(r_d\), we define an integer bound \(B = \theta \cdot 2^{e-1-\log w}\). We then define \(r_d\) to be distinguished if and only if

$$ (r_d\,\texttt {>>}\,\log w) + n\cdot B\le B\,\bmod 2^{e-1-\log w} \,. $$

With that, every element of \(S\) is distinguished for approximately one in every \(B\) functions \(f_n\). Although we do not prove that this reduces every instance to the average case, it holds true heuristically.

We observe that the most significant bits \(r_d\,\texttt {>>}\,\log w\) of a distinguished element \(r_d\) are not always zero. Having them be zero would be preferable, since the top bits that are zero need not be stored, reducing the memory requirement [23, §6]. Instead, we can simply write the value \((r_d\,\texttt {>>}\,\log w) + n\cdot B\!\mod 2^{e-1-\log w}\) to memory, which by definition is at most \(B\). Adding and subtracting \(n\cdot B\) modulo \(2^{e-1-\log w}\) when writing to and reading from memory has negligible overhead.
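The following sketch combines the version-dependent distinguishedness test with the compressed storage just described and checks that the round trip recovers \(r_d\). All names and parameter values are ours and purely illustrative.

```python
# Version-dependent distinguishedness and the compressed storage it enables.
e, log_w = 24, 8
TOP_BITS = e - 1 - log_w
MOD = 1 << TOP_BITS
theta = 1 / 64
B = int(theta * MOD)

def top(rd):
    return rd >> log_w

def is_distinguished(rd, n):
    return (top(rd) + n * B) % MOD <= B

def compress(rd, n):
    """Value written to memory: at most B, so its leading bits are zero."""
    assert is_distinguished(rd, n)
    return (top(rd) + n * B) % MOD

def decompress(stored, rd_low_bits, n):
    """Recover rd from the stored value, the address bits and n."""
    return (((stored - n * B) % MOD) << log_w) | rd_low_bits

# Round-trip check over several versions and points.
for n in range(50):
    for rd in range(0, 1 << (e - 1), 997):
        if is_distinguished(rd, n):
            v = compress(rd, n)
            assert v <= B
            assert decompress(v, rd & (2**log_w - 1), n) == rd
```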

We note that making distinguishedness depend on the function version also means that a triple (—,\(\,r_d,\)—) is unlikely to be distinguished often (where time is measured in function versions), giving the algorithm time to overwrite a stored triple \((r_0,\,r_d,\,d)\) with a different triple \((s_0,\,s_e,\,e)\) with \(s_e \ne r_d\), reducing the chance of invalid collisions. Since both the \(f_n\)-dependent memory location and distinguishedness are cheap to realize, we keep both.

Remark 4

The problems we address appear for SIDH, while the above description solves them for SIKE. An analogous solution works for SIDH, but one should be careful that the elements of \(S\) are not uniform bit strings. They are elements \((i,b,k)\in \{1,2\}\times \{0,1,2\}\times \{0,\ldots ,2^{e_2/2}-1\}\) [1, §4.4], which are represented as \((3+e_2/2)\)-bit strings where the least significant bit determines \(i\) and the two next lower-order bits determine \(b\). Instead, we define the memory location by the value \(((r_d\,\texttt {>>}\,3) + n)\!\mod w\) and the distinguishedness property by

$$ (r_d\,\texttt {>>}\,(\log w+3)) + n\cdot B\le B\,\bmod 2^{e-1-\log w} \,,\quad B = \theta \cdot 2^{e-4-\log w} \,. $$

Here, one should be even more careful not to lose too much precision for \(\theta \), but again the assumption that \(e-1\gg \log w\) should alleviate this. In all of our instances this is not a concern.

Precomputing the Step Function and Experiments. The main upside to the above modifications is that every problem instance will have a guaranteed average runtime of (approximately) \(2.5\sqrt{|S|^3/w}\). As such, we do not have to worry about running into an unlucky instance.

However, there is a second useful consequence: to analyze the behavior of our modifications, it is sufficient to analyze a single instance. Now observe that any function \(f_n\) is of the form \(f_n = g_n\circ h\), where \(h\) is fixed across the different \(n\) and by far the most expensive part of the evaluation of \(f_n\). For testing any instance for which \(h(S)\) fits into our memory, we can therefore simply precompute \(h(r)\) for all \(r\in S\) and store them in a table indexed by \(r\). The evaluation of the step function \(f_n(r)\) then simply looks up \(h(r)\) in the table, and evaluates it under \(g_n\) (which is comparatively fast). This improves the speed of our benchmarks significantly, while not affecting any outcomes regarding a precise analysis of the vOW algorithm.

We summarize the results so far in Table 1, comparing the results of our implementation to the expected theoretical outcome as well as the results of Adj et al. [1]. Note that our results are close to optimal, and showcase the expected speedup of a factor \(\sqrt{6^3}\approx 15\times \) in the number of steps when moving from SIDH to SIKE. Moreover, we note that our software solves the SIDH instances using less than half the number of steps that were taken for the same instances in [1]. The primes used in Table 1 are

$$\begin{aligned} 23\cdot 2^{32}\cdot 3^{20} - 1\,,\quad 31\cdot 2^{36}\cdot 3^{22} - 1\,,\quad 71\cdot 2^{40}\cdot 3^{25} - 1\,,\quad 37\cdot 2^{44}\cdot 3^{27} - 1\,,\\ 13\cdot 2^{48}\cdot 3^{30} - 1\,,\quad 2^{52}\cdot 3^{33} - 1\,,\quad 57\cdot 2^{56}\cdot 3^{35} - 1\,. \end{aligned}$$
Table 1. The average number of function versions \(n\) and evaluations of \(f_n\) used for finding an isogeny of degree \(2^{e_2}\). The expected value (Exp.) for the number of function versions resp. steps is reported as \({0.45\cdot |S|/w}\) resp. \(\log {(2.5\cdot \sqrt{|S|^3/w})}\), for set size \(|S| = 3\cdot 2^{e_2/2}\) resp. \(|S| = 2^{e_2/2-1}\) for SIDH resp. SIKE. The numbers are averaged over 1000 iterations and use 20 cores.

3.3 Partial Isogeny Precomputation

Computationally, the most expensive part of the vOW step function is the (repeated) evaluation of isogenies of degree \(\ell ^{e_\ell /2-1}\). To alleviate this burden, one can partially precompute the isogeny tree by computing all possible isogenies of a fixed degree \(\varDelta \) and storing a table of the image curves together with some torsion points that help to complete the isogenies from these intermediate curves. Such precomputation presents a trade-off between memory and computation time for the step function. We elaborate on the method in detail. As it applies to the general case of SIDH, we discuss that first and then specialize to SIKE instances with \(\ell =2\).

Let \(E\) be a supersingular curve and \(P,Q\in E\) be such that \(\langle P,Q\rangle = E[\ell ^d]\), for some \(d > 0\) (typically \(d \approx e_\ell / 2\)). Let \(R = [s]P + [r]Q\) be a point of order \(\ell ^d\), and \(\phi : E\rightarrow E/\langle R\rangle \) an isogeny of degree \(\ell ^d\) with kernel \(\langle R\rangle \). Recall that \(\ell \) does not divide both \(r\) and \(s\). We split the isogeny \(\phi \) into two isogenies in the usual way, with the first having degree \(\ell ^\varDelta \) for some \(0< \varDelta < d\) as follows.

Write \(s = s_0 + s_1 \ell ^\varDelta \) and \(r = r_0 + r_1 \ell ^\varDelta \) for \(s_0,r_0\in \mathbb {Z}_{\ell ^\varDelta }\) and \(s_1,r_1\in \mathbb {Z}_{\ell ^{d-\varDelta }}\). Then \(R = [s_0]P + [r_0]Q + [\ell ^\varDelta ]([s_1]P + [r_1]Q)\), while the point \(R_\varDelta = [\ell ^{d-\varDelta }]R = [s_0]([\ell ^{d-\varDelta }] P) + [r_0]([\ell ^{d-\varDelta }] Q)\) generates the kernel of the isogeny \(\phi _\varDelta : E \rightarrow E/\langle R_\varDelta \rangle \) of degree \(\ell ^\varDelta \). The point \(\phi _\varDelta (R)\) on \(E/\langle R_\varDelta \rangle \) has order \(\ell ^{d-\varDelta }\) and determines an isogeny \(\psi _\varDelta : E/\langle R_\varDelta \rangle \rightarrow E/\langle R\rangle \) of degree \(\ell ^{d-\varDelta }\) such that \(\phi = \psi _\varDelta \circ \phi _\varDelta \). Crucially, the first pair of partial scalars \((s_0 = s\!\mod \ell ^\varDelta \,,r_0 = r\!\mod \ell ^\varDelta )\) determines \(\phi _\varDelta \) and the points \(\phi _\varDelta ([s_0]P + [r_0]Q)\), \(\phi _\varDelta ([\ell ^\varDelta ] P)\) and \(\phi _\varDelta ([\ell ^\varDelta ] Q)\) on \(E/\langle R_\varDelta \rangle \). Given this curve and these points, the second pair of partial scalars \((s_1 = \lfloor s/\ell ^\varDelta \rfloor \,,r_1 = \lfloor r/\ell ^\varDelta \rfloor )\) determines \(\ker \psi _\varDelta = \langle \phi _\varDelta ([s_0]P + [r_0]Q) + [s_1]\phi _\varDelta ([\ell ^\varDelta ] P) + [r_1]\phi _\varDelta ([\ell ^\varDelta ] Q)\rangle \) and allows us to complete the isogeny \(\phi \). Therefore, precomputation consists of computing a table with entries

$$\begin{aligned} \Big [E/\langle R_\varDelta \rangle , \phi _\varDelta ([s_0]P + [r_0]Q), \phi _\varDelta ([\ell ^\varDelta ] P), \phi _\varDelta ([\ell ^\varDelta ] Q)\Big ]\,, \end{aligned}$$

for all \((s_0, r_0) \in \mathbb {Z}_{\ell ^\varDelta }^2\) such that \(\ell \) does not divide both \(s_0\) and \(r_0\). Such a table entry can then be used to compute any isogeny of degree \(\ell ^d\) with kernel point \(R = [s]P + [r]Q\) such that \(s \equiv s_0\!\mod \ell ^\varDelta \) and \(r \equiv r_0\!\mod \ell ^\varDelta \), for any \((s_1, r_1)\).
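The scalar splitting behind this can be checked on a toy model in which \(E[\ell^d]\) is replaced by pairs over \(\mathbb{Z}_{\ell^d}\); all parameter values below are illustrative, not SIDH parameters.

```python
# Toy check of the scalar splitting: with E[l^d] modeled as pairs over
# Z_{l^d}, the kernel point R = [s]P + [r]Q decomposes through
# s = s0 + s1*l^D, r = r0 + r1*l^D, and R_D = [l^(d-D)]R depends only
# on (s0, r0).
l, d, D = 2, 12, 5
N = l**d

def point(s, r):                      # [s]P + [r]Q as a coordinate vector
    return (s % N, r % N)

def mul(k, pt):
    return (k * pt[0] % N, k * pt[1] % N)

s, r = 2837, 1441                     # not both divisible by l
s0, s1 = s % l**D, s // l**D
r0, r1 = r % l**D, r // l**D

R = point(s, r)
# R = [s0]P + [r0]Q + [l^D]([s1]P + [r1]Q)
lhs = R
rhs = tuple((a + b) % N for a, b in zip(point(s0, r0),
                                        mul(l**D, point(s1, r1))))
assert lhs == rhs

# R_D = [l^(d-D)]R is determined by (s0, r0) alone: the s1, r1 parts
# vanish because [l^(d-D)] * l^D = l^d kills every point.
R_D = mul(l**(d - D), R)
assert R_D == mul(l**(d - D), point(s0, r0))
```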

However, it suffices to store only two points on \(E/\langle R_\varDelta \rangle \). If \(\ell \not \mid s\), we can assume that \(s = 1\) and \(R = P + [r]Q\) for \(r\in \mathbb {Z}_{\ell ^d}\). Then \(R_\varDelta = [\ell ^{d-\varDelta }]P + [r_0\cdot \ell ^{d-\varDelta }]Q\) and the precomputed table only needs to contain entries of the form

$$\begin{aligned} \left[ E/\langle R_\varDelta \rangle , P_\varDelta = \phi _\varDelta (P + [r_0]Q), Q_\varDelta = \phi _\varDelta ([\ell ^\varDelta ] Q) \right] \end{aligned}$$
(3)

for all \(r_0\in \mathbb {Z}_{\ell ^\varDelta }\). The kernel of \(\psi _\varDelta \) (for completing \(\phi \)) can be computed as \(\phi _\varDelta (R) = P_\varDelta + [r_1]Q_\varDelta \) for any \(r\) with \(r \equiv r_0\!\mod \ell ^\varDelta \). If \(\ell \mid s\), then \(\ell \not \mid r\) and \(R = [\ell t]P + Q\) for some \(t\in \mathbb {Z}_{\ell ^{d-1}}\) such that \(s = \ell t\). In that case table entries are of the form

$$ \left[ E/\langle R_\varDelta \rangle , P_\varDelta = \phi _\varDelta ([\ell ^\varDelta ] P), Q_\varDelta = \phi _\varDelta ([\ell t_0]P + Q)\right] $$

for all \(t_0\in \mathbb {Z}_{\ell ^{\varDelta -1}}\), while \(\ker \psi _\varDelta = \langle [t_1]P_\varDelta + Q_\varDelta \rangle \). Altogether, the table contains \(\ell ^\varDelta + \ell ^{\varDelta -1} = (\ell +1)\cdot \ell ^{\varDelta -1}\) entries and reduces the cost of any isogeny of degree \(\ell ^d\) from \(d\log d\) to \((d-\varDelta )\log (d-\varDelta )\) [3, §4.2.2].
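The entry count \((\ell+1)\cdot\ell^{\varDelta-1}\) can be verified by brute force on the same toy model: it is the number of distinct cyclic subgroups of order \(\ell^\varDelta\), i.e. of kernels of \(\ell^\varDelta\)-isogenies. Parameters are again illustrative.

```python
# Sketch verifying the table size: the kernels <[s0]P + [r0]Q> with l not
# dividing both scalars fall into exactly (l+1) * l^(D-1) distinct cyclic
# subgroups, matching the two table shapes (l^D entries for s = 1,
# l^(D-1) entries for s = l*t).
l, D = 2, 5
N = l**D

def subgroup(s, r):
    """Cyclic subgroup of (Z_N)^2 generated by (s, r), as a frozenset."""
    return frozenset(((k * s) % N, (k * r) % N) for k in range(N))

kernels = {subgroup(s, r)
           for s in range(N) for r in range(N)
           if s % l != 0 or r % l != 0}
assert len(kernels) == (l + 1) * l**(D - 1)
print(len(kernels))  # 48 for l = 2, D = 5
```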

Now we move on to SIKE and fix \(\ell = 2\). That is, we assume \(s = 1\) and every table entry to be of the form (3). Recall that the function \(h\) takes as input a value \(r\in \mathbb {Z}_{\ell ^{e-1}}\) (where \(e = e_2/2\)) and computes an isogeny with kernel \(\langle P + [r\,\texttt {>>}\,1]Q\rangle \) on \(E_6\) if \({{\,\mathrm{lsb}\,}}(r) = 0\), and an isogeny with kernel \(\langle U + [r\,\texttt {>>}\,1]V\rangle \) on \(E_A\) otherwise. The latter reflects the case above with \(d = e - 2\) perfectly, leading to a precomputed table of size \(2^\varDelta \) from \(E_A\) while reducing the cost of the isogeny from \((e-2)\log (e-2)\) to \((e-2-\varDelta )\log (e-2-\varDelta )\). The case of the curve \(E_6\) is slightly different due to the presence of the Frobenius endomorphism. Although there are \(2^{e-2}\) distinct equivalence classes of \(j\)-invariants, the degree of the corresponding isogenies is \(2^{e-1}\). As such, we compute a table of size \(2^{\varDelta }\) comprising the equivalence classes of \(j\)-invariants at depth \(\varDelta +1\) away from \(E_6\). As a result, all isogenies used throughout the whole implementation have fixed degree \(2^{e-2-\varDelta }\). The isogeny cost reduces from \((e-1)\log (e-1)\) to \((e-2-\varDelta )\log (e-2-\varDelta )\) and choosing \(\varDelta \) such that \(e-2-\varDelta \equiv 0\bmod 2\) allows the use of 4-isogenies as in SIKE. Table 2 demonstrates the effect of precomputation on the SIKE step function.

Table 2. Effect of precomputation on the running time of the SIKE step function. Numbers represent the cumulative running time in seconds of 1000000 calls to the step function, for the corresponding modulus and precomputation depth \(\varDelta \). All experiments were run on Atomkohle.

Computing an Isogeny Tree. To obtain the lookup table, one computes image curves and torsion points for all isogenies of degree \(2^\varDelta \) (resp. \(2^{\varDelta +1}\)) and stores them indexed by their kernel representation. Adj et al. [1, Section 3.2] describe a depth-first-search approach to compute the required curves as the leaves of a full 2-isogeny tree of depth \(e_2/2\) for the meet-in-the-middle algorithm (cf. [1, Fig. 1]). This method is much more efficient than the naive approach of computing full \(2^{e_2/2}\)-isogenies for all possible kernel points. It can equally be applied to partial trees to compute isogenies of degree \(2^\varDelta \) (resp. \(2^{\varDelta +1}\)), and an analogous version can utilize a 4-isogeny tree.

Using Memory for Precomputation. Depending on the specific problem instances and communication properties of the network, the memory required for precomputation could alternatively be used as part of the main memory that stores distinguished points. In other words, precomputed tables might take away a certain amount of memory from the distinguished point storage space.

Assume that due to latency and communication constraints, each of the \(m\) parallel processors needs its own table of size \(\tau (\varDelta )\), and for simplicity that every processor precomputes a tree of the same depth. For example, for the SIDH case of Adj et al. [1] we would assume each processor to have precomputed a table of size \(\tau (\varDelta ) = 2\cdot (2^\varDelta + 2^{\varDelta -1}) = 3\cdot 2^{\varDelta }\). For SIKE, this size is \(\tau (\varDelta ) = 2\cdot 2^{\varDelta } = 2^{\varDelta +1}\).

As shown in Sect. 2.4, each distinguished point is represented with roughly \(e_2\) bits (i. e. about \(\frac{1}{2}\log p\) bits) since \(\log |S| = e_2/2-1\). This takes into account that the \(\lfloor -\log \theta \rfloor \) leading zeros in a distinguished point are omitted in memory. Every entry in the precomputed table can be represented by three \(\mathbb {F}_{p^2}\) elements (i. e. about \(6 \log p\) bits). Therefore, each such table element uses memory that could store about 12 distinguished points instead. For precomputation depth \(\varDelta \), the table entries thus use space for \(12\cdot \tau (\varDelta )\) distinguished points. This means that the vOW main memory is reduced from \(w\) to \(w - 12\cdot \tau (\varDelta )\cdot m\) points (when each of the \(m\) processors stores its own table). Thus, the number of function iterations increases by a factor \(1 / \sqrt{1 - 12\cdot \tau (\varDelta )\cdot m/w}\). Note that this is well-defined since \(12\cdot \tau (\varDelta )\cdot m\) cannot exceed the maximum available memory \(w\).

While taking away memory increases the expected number of function iterations, precomputation reduces the step function cost by a factor \(\sigma (\varDelta ,e)\). We have \(\sigma (\varDelta ,e) = (e-\varDelta )\log (e-\varDelta )/(e\log e)\) for SIDH (given \(e_2\) is even), while for SIKE (separating the two equally likely cases where we start from \(E_6\) resp. \(E_A\))

$$\begin{aligned} \sigma (\varDelta ,e) = \frac{1}{2} \left( \frac{(e-2-\varDelta )\log (e-2-\varDelta )}{(e-2)\log (e-2)} + \frac{(e-2-\varDelta )\log (e-2-\varDelta )}{(e-1)\log (e-1)} \right) . \end{aligned}$$
(4)

The total runtime of the van Oorschot-Wiener algorithm decreases if

$$ \frac{\sigma (\varDelta ,e)}{\sqrt{1 - 12\cdot \tau (\varDelta )\cdot m/w}} < 1\,. $$

Remark 5

In an actual distributed implementation, the situation might be different and favor precomputation more. For example, it is reasonable to assume that several processors in a multi-core machine are able to share a precomputed table. Furthermore, depending on the design of the main memory, each machine may have memory available that cannot contribute to it and might as well be used to store a table for a limited amount of precomputation. In such situations, using memory for lookup tables might not have any negative effect on the overall runtime. Example 1 shows that speed-ups for cryptographic parameters can be obtained with very small tables, making this scenario more realistic.

Example 1

Let \(p = 2^{216}\cdot 3^{137}-1\) and \((e,m,w) = (108,2^{64},2^{80})\), following the setup of [1, Remark 6]. For both SIKE and SIDH, the (near) optimal precomputation depth is \(\varDelta = 6\) and each processor precomputes a local table that takes up space for \(12\cdot \tau (\varDelta )\) distinguished elements; this requires around 41 resp. 62 kilobytes of memory per processor (totalling \(2.34\%\) resp. \(3.52\%\) of the full memory \(w\)). In both cases, the step function cost is reduced by a factor \(\sigma (\varDelta ,e) \approx 0.93\). For SIKE, we decrease the runtime of the full algorithm by a factor of approximately \(0.94\); for SIDH, by about \(0.95\).

However, a more realistic example assumes that many processors can share the precomputation table. In our setup, a machine of 40 cores can share a single table. In that case, the optimal depth is found at \(\varDelta = 12\). For SIKE, we use a table of about \(2.7\) megabytes per processor (totalling approximately \(3.75\%\) of the total memory \(w\)). The cost of the algorithm is reduced by a factor \(0.88\). For SIDH we obtain a table of size \(4.0\) megabytes (\(5.63\%\) of the total memory). The runtime is decreased by a factor \(0.89\).
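The arithmetic of Example 1 can be reproduced directly. The sketch below is ours; it uses the cost model \(k\log k\) from the text (the base of the logarithm cancels in each ratio) together with (4) and the runtime factor \(\sigma/\sqrt{1 - 12\cdot\tau(\varDelta)\cdot m/w}\).

```python
from math import log, sqrt

# Reproduce the per-processor numbers of Example 1 for
# p = 2^216 * 3^137 - 1 and (e, m, w) = (108, 2^64, 2^80).
e, m, w = 108, 2**64, 2**80
cost = lambda k: k * log(k)          # isogeny cost model; log base cancels

def sigma_sike(D):                   # equation (4)
    return 0.5 * (cost(e - 2 - D) / cost(e - 2) +
                  cost(e - 2 - D) / cost(e - 1))

def sigma_sidh(D):
    return cost(e - D) / cost(e)

def runtime_factor(sigma, tau):
    # each precomputed table entry displaces about 12 distinguished points
    return sigma / sqrt(1 - 12 * tau * m / w)

D = 6
f_sike = runtime_factor(sigma_sike(D), 2**(D + 1))   # tau = 2^(D+1)
f_sidh = runtime_factor(sigma_sidh(D), 3 * 2**D)     # tau = 3 * 2^D
assert abs(f_sike - 0.94) < 0.005 and abs(f_sidh - 0.95) < 0.005
print(round(f_sike, 2), round(f_sidh, 2))  # 0.94 0.95
```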

3.4 Fast Collision Checking

As discussed in Remark 5, in a distributed implementation processors are likely to have local memory that cannot contribute to the main memory (that which is used for storing distinguished point triples). We now describe another way to use such memory to significantly improve the overall runtime of van Oorschot-Wiener. Analogous to Sect. 3.3, even if memory is consumed that could otherwise be used to store distinguished points, we argue that dedicating a moderate amount of storage to this faster collision checking reduces the overall runtime.

Recall from Sect. 2.3 that a single walk in the vOW algorithm starts at a point \(x_0 \in S\) and produces a trail of points \(x_i = f(x_{i-1})\) for \(i=1,2,\dots \), until it reaches a distinguished point \(x_d\). Assume that the triple \((x_0,x_d,d)\) collides with a triple, say \((y_0,y_e,e)\), previously stored in main memory and that it is not a mere memory collision. To check if we have found the golden collision, we need to locate the indices \(i<d\) and \(j<e\) for which \(x_i \ne y_j\) and \(f(x_i) = f(y_j)\). Van Oorschot and Wiener note that, since d and e have expected value \(1/\theta \), retracing the two paths from their starting points to the colliding point requires \(2/\theta \) total steps on average [23, p. 9]. Our goal is to lower the overall runtime by reducing the number of function iterations for retracing.

Saving Intermediate Values. Suppose that apart from the global memory for keeping distinguished points, a processor has access to enough local memory to store \(t-1\) additional points intermittently (more on what this means in a moment). On a walk from \(x_0\) to \(x_d\), it now stores \(t+1\) points in total. These points \( (x_{d_0}=x_0, x_{d_1}, \dots , x_{d_t}=x_d )\), where \(0=d_0<d_1<\dots <d_t\), can now be used together with \((y_0,y_e)\), to locate the collision more efficiently.

We start by copying \(y_0\) to \(y'\), e to \(e'\) and iterate steps \(y' \leftarrow f(y'), e'\leftarrow e'-1\). When \(y'\) is the same distance away from the distinguished point as the closest of the saved points, say \(x_{d_j}\) (i.e. j is minimal with \(e'=d_t-d_j\)), we check whether \(y' = x_{d_j}\). If not, we set \(y_0 \leftarrow y'\) and step \(y'\) forward \(d_{j+1}-d_j\) steps and compare again. This is repeated until \(y'\) collides with one of the saved points, say \(x_{d_k}\). Note that equality checks only occur with the \(x_{d_i}\) and not at every step as in the original collision checking function. Once the minimal index k with \(y' = x_{d_k}\) is detected, we know that the collision must take place between \(x_{d_{k-1}}\) and \(x_{d_k}\). At this point, the original collision checking function without saving intermediate points can be called on the triples \((x_{d_{k-1}}, x_{d_k}, d_k - d_{k-1})\) and \((y_0, y', d_k-d_{k-1})\). Note that if the collision occurs at \(x_{d_0}\), we have a Robin Hood and return false.

What have we gained? First of all, the trail with stored points is never retraced in full; only the final call to the original collision checking routine retraces a single subinterval of length \(d_k-d_{k-1}\), which in general is much shorter than the original trail length \(d_t\). The trail starting at \(y_0\) is fully retraced to the collision, plus the additional steps that cover the colliding interval. The savings are larger when intervals are shorter, and thus when more intermediate points are saved. This approach is implemented in our software.
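The checkpoint-assisted procedure just described can be sketched as follows in Python; this is an illustrative reimplementation under our own naming, not an excerpt from our C software. The original retracing routine is inlined so it can be called on the final subinterval.

```python
def _locate_naive(f, x0, d, y0, e):
    # Original vOW collision location, used only on a short subinterval.
    while d > e:
        x0, d = f(x0), d - 1
    while e > d:
        y0, e = f(y0), e - 1
    if x0 == y0:
        return None                      # Robin Hood
    for _ in range(d):
        fx, fy = f(x0), f(y0)
        if fx == fy:
            return (x0, y0)
        x0, y0 = fx, fy
    return None

def locate_with_checkpoints(f, checkpoints, y0, e):
    """Checkpoint-assisted collision location.  `checkpoints` is a list
    [(d_0, x_{d_0}), ..., (d_t, x_{d_t})] of saved positions on the
    stored trail, with d_0 = 0 and x_{d_t} the distinguished point;
    (y0, e) is the colliding trail.  Equality checks happen only at
    checkpoint distances, and only one subinterval is retraced."""
    ds = [d for d, _ in checkpoints]
    xs = [x for _, x in checkpoints]
    d_t = ds[-1]
    yp, ep = y0, e                       # y' and its remaining distance
    y_prev, prev_ep = y0, e              # y' at the previous checkpoint
    for j in range(len(ds)):
        dist = d_t - ds[j]               # distance of x_{d_j} to the DP
        if dist > ep:
            continue                     # y-trail too short to reach x_{d_j}
        while ep > dist:                 # advance y' level with x_{d_j}
            yp, ep = f(yp), ep - 1
        if yp == xs[j]:
            if j == 0:
                return None              # merged already at x_0: Robin Hood
            # Collision lies between x_{d_{j-1}} and x_{d_j}.
            return _locate_naive(f, xs[j - 1], ds[j] - ds[j - 1],
                                 y_prev, prev_ep - ep)
        y_prev, prev_ep = yp, ep
    return None
```

Note that the stored trail itself is never stepped forward here; only the trail from \(y_0\) is walked, plus the one naive call on the colliding subinterval.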

Figure 3 shows how the number of function steps for checking and locating collisions is reduced when running vOW on an AES-based function with a set of size \(2^{30}\) and memory of size \(2^{15}\). With \(\alpha =2.25\), the average walk length is \(1/\theta \approx 80\). There is an immediate gain from allowing even a small number of intermediate points. However, the additional gains shrink as this number increases: once the maximal number of intermediate points approaches the average trail length, almost every point can already be stored, so adding more memory neither adds more intermediate points nor influences the distance between them.

Fig. 3. Number of steps used for locating a collision as a function of the maximum number of intermediate values allowed, for the AES-based random function with \(\log {|S|} = 30\), \(\log {w} = 15\). Averaged over 64 function versions, using 28 cores and run on Atomkohle.

Remark 6

There is potential for further improvement by allowing storage for \(2t-2\) points. As above, the \(t-1\) points \((x_{d_0}, \dots , x_{d_{k-2}},x_{d_{k+1}},x_{d_t})\) are stored while walking the trail. But during collision checking against \((y_0, y_e, e)\), \(t-1\) additional intermediate points are stored when retracing the trail from \(y_0\). When the collision is encountered, the latter points take the place of the \(x_{d_i}\) and \((y_0,y_e) \leftarrow (x_{d_{k-1}},x_{d_{k}})\). Storage for the \(t-1\) elements \((x_{d_0}, \dots , x_{d_{k-2}},x_{d_{k+1}},x_{d_t})\) can be reused for keeping intermittent points when retracing the trail from the new \(y_0\). Repeating this procedure, we recurse until \(y_e=f(y_0)\), at which point we check for the golden collision. Note that splitting the space for \(2t-2\) points in half eases the exposition, but might be suboptimal. The optimal allocation of memory to the different trails should be determined for a large scale cryptanalytic effort based on how much memory is available.

How to Save Points Intermittently. It remains to describe how the \(t-1\) intermediate points are stored. Given the expected trail length of \(1/\theta \), one could store points at regular intervals of length \(1/(t\theta )\). However, walks much longer than \(1/\theta \) would lead to a much larger distance between the final intermediate point and the distinguished point; walks much shorter than \(1/\theta \) would lead to unused memory that could have decreased the average gap between intermediate points. In the ideal scenario, a full set of \((t-1)\) additional points is stored and they are as close to being equally spaced as possible when the distinguished point is reached. Since trails randomly vary in length, the best approach involves overwriting previously placed points in such a way that the distances between points grow with the trail length.

We modify an algorithm for finding cycles in random walks by Sedgewick, Szymanski and Yao [18]. In the first t steps of the trail, the allocated memory is exhausted by storing a point at every step, so that \((d_0,d_1, \dots , d_{t-1}, d_t) = (0, 1, \dots , t-1, t)\) and the points are all at distance 1 from one another. At any stage of the procedure, define \(\delta = \mathrm{min}_{j >0}\{d_j-d_{j-1}\}\). From here on, every \(\delta \) steps, we look for the smallest value of j where \(d_j-d_{j-1}=\delta \), remove the point \(x_{d_j}\) from the list, and add the current point to the list. Eventually, the last point that is \(\delta \) steps away from its predecessor is deleted and replaced by a point twice as far from the last one; at that moment \(\delta \) effectively doubles, and all points in the list are once again \(\delta \) away from each other.
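A minimal Python sketch of this bookkeeping (our own naming; the capacity handling is slightly simplified relative to [18]): keep at most \(t+1\) stored pairs, record a candidate every \(\delta \) steps, and evict the earliest point sitting at the minimal gap. The distinguished point terminating the walk would be appended by the caller.

```python
class CheckpointTrail:
    """Adaptive checkpoint placement along a trail, in the spirit of
    Sedgewick-Szymanski-Yao: keep at most t+1 roughly equally spaced
    (index, point) pairs while the walk proceeds."""

    def __init__(self, x0, t):
        self.t = t
        self.delta = 1                   # current minimal gap
        self.points = [(0, x0)]          # the start x_0 is always kept
        self.i = 0                       # current trail index

    def step(self, x):
        """Record that the walk has advanced to point x."""
        self.i += 1
        if self.i - self.points[-1][0] < self.delta:
            return                       # only record every delta-th point
        self.points.append((self.i, x))
        if len(self.points) > self.t + 1:
            # Evict the earliest point at the minimal gap to its
            # predecessor, preserving the trail start x_0.
            gaps = [self.points[j][0] - self.points[j - 1][0]
                    for j in range(1, len(self.points))]
            del self.points[gaps.index(min(gaps)) + 1]
        # delta is the smallest remaining gap between stored indices.
        self.delta = min(self.points[j][0] - self.points[j - 1][0]
                         for j in range(1, len(self.points)))
```

As the trail grows, the gaps double in lockstep with \(\delta \), so the stored points stay close to equally spaced no matter where the walk happens to stop.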

4 Implementation

We produced two implementations of the van Oorschot-Wiener algorithm, one in C, optimized for efficiency, and a more modular one in C++. The C implementation makes use of the Microsoft SIDH library [12] for field and curve arithmetic when running the attack against SIDH and SIKE instances. We have modified their code to support smaller primes, and added non-constant-time operations where beneficial (e. g. finite field inversions). For parallel computations we use the gcc implementation of OpenMP 4.5 [15]. To simplify batch experiments, we wrote Python wrappers for our code using SWIG [2].

The experiments are run on two different machines. The first, referred to as Atomkohle, contains two Intel(R) Xeon(R) E5-2690 v4 CPUs running at 2.60 GHz that both have 14 physical cores (so 28 in total). The second, referred to as Solardiesel, contains two Intel(R) Xeon(R) Gold 6138 CPUs at 2.00 GHz that have 20 cores each (40 in total). Unless specified otherwise, all measurements and statistics reported in this paper have been produced using the C implementation, compiled with gcc version 6.3.0.

Optimized Implementation. The C software contains three step functions to run experiments. The first is a generic, fast random function; the other two arise from random walks in the 2-isogeny graph as determined by the SIDH (see Sect. 2.3) and SIKE (see Sect. 3.2) specifications. This allows the use of a fast random function to verify that our implementation matches the expected asymptotic values (confirming the original vOW analysis [23]) and achieves linear speed-up on larger sets, while also displaying our improvements in the SIDH and SIKE settings (e. g. as shown in Table 1).

Modular Implementation. While for all SIDH and SIKE experiments we used our C implementation on individual multi-core machines, it would be interesting to deploy the van Oorschot-Wiener algorithm in alternative settings. For example, running attacks with more cores distributed over the internet could change the balance between the cost of a step function evaluation and the cost of memory access, and would certainly present memory topology and core synchronization challenges. Furthermore, collision-finding techniques play a role in the cryptanalysis of other encryption schemes, e. g. NTRU [6, 24], where memory-constrained cryptanalytic experiments could be useful. Since it could be tricky to adapt our C code to such varied settings, we also produced a C++ implementation with the goal of obtaining a more modular, developer-friendly code base. Test results on a fast, generic random function showing that it matches the expected asymptotics can be found in Table 3. Ideally, it should not be too difficult to write “drivers” for access to different forms of memory (say, storage over the internet rather than local RAM), or for different sets S and step functions \(f_n\).

Table 3. Reproduction of Table 3 from [1], using our C++ implementation with an AES-based generic random function on Atomkohle. Experiments are run using 20 cores. \(\#f_n\) is the number of different random functions used per instance.

Selecting a XOF and PRNG. One goal of actually implementing vOW is to verify the runtime against the asymptotic theoretical values, using a fast random function. Adj et al. [1] chose to use an MD5-based random function for this purpose. We have instead opted for a custom XOF based on AES-CBC mode using AES-NI instructions. This provides much better performance on modern hardware, while guaranteeing cryptographic properties of the function. Regarding our PRNG, we use AES-CTR mode with AES-NI instructions.

To demonstrate the validity of our pseudo-random step function, in Table 4 we reproduce [23, Table 1], which computes the \(O(\cdot )\) constant in front of the expected number of steps for the optimal choice of \(\theta \) and is used to determine the constant \(\alpha \).

Table 4. Reproduction of [23, Table 1], using the AES-based XOF on Solardiesel, i. e. the number of function steps required to find the golden collision divided by \({|S|}^{3/2}/w^{1/2}\). The experiments are averaged over 1000 function versions and run with 20 cores.

5 Analysis of SIKE Round-2 Parameters

In Round 2 of the NIST standardization effort, the analyses of Adj et al. [1] and Jaques and Schanck [9] prompted the introduction of two new parameter sets to the SIKE submission, SIKEp434 and SIKEp610, as well as a security reassessment of the parameter sets SIKEp503 and SIKEp751. The four sets are based on the primes \(\text {p434}=2^{216}3^{137}-1\), \(\text {p503}=2^{250}3^{159}-1\), \(\text {p610}=2^{305}3^{192}-1\) and \(\text {p751}=2^{372}3^{239}-1\), and target security categories 1, 2, 3 and 5, respectively.

This section provides concrete classical security estimates for these parameter sets in two different ways. The first follows an approach similar to that of van Oorschot and Wiener and of Adj et al.: we count the average number of oracle calls needed to run the vOW algorithm and multiply it by the complexity of the oracle itself, measured in x64 instructions. This leads to a more informed estimate than those of Adj et al. and Jaques and Schanck, but the final result remains the same – see Sect. 5.1. A downside of this approach is that although it captures much of the algorithm’s cost, it ignores some potentially significant parts. In particular, it does not account for the cost of memory access (assumed free) or the practical difficulty of scaling across different cores (assumed linear), see [1, §5, Remark 6]. We present an alternative method in Sect. 5.2.

5.1 Concrete Security of SIKE Round-2 Parameters

In Table 5 we use the Round-2 SIKE implementation to estimate the number of x64 instructions necessary to compute half-size isogenies. More specifically, we provide estimations for \(2^{\lfloor e_2/2\rfloor -2}\)-isogenies in Table 5a and for \(3^{\lfloor e_3/2\rfloor }\)-isogenies in Table 5b. These instruction counts are intended to be lower bounds on the number of classical gates required to mount vOW, and we argue that these estimates are still conservative with respect to the true gate count. A lower bound on the runtime of the vOW algorithm can now simply be obtained by multiplying the costs of the above isogeny oracles with the number of times they are called, which we summarize in Table 6. Our analysis concludes that the number of classical gates required for (i) vOW on SIKEp434 is at least \(2^{143}\), (ii) vOW on SIKEp503 is at least \(2^{170}\), (iii) vOW on SIKEp610 is at least \(2^{210}\), and (iv) vOW on SIKEp751 is at least \(2^{262}\). Note that the counts for (i) and (iii) closely agree with classical gate counts by Jaques and Schanck, who are also rather conservative in their costing of the isogeny functions – see [9, §7.1].

Table 5. Isogeny costs in terms of the total number of x64 instructions \(i_\mathbf{sum}\), broken down into multiplication instructions \(i_\mathbf{mul}\), addition, subtraction and logical instructions \(i_\mathbf{asl}\), and move instructions \(i_\mathbf{mov}\); M denotes multiplication, S squaring, add addition and sub subtraction in \(\mathbb {F}_{p^2}\).
Table 6. Average number of x64 instructions to run vOW on the 2- and 3-torsion for the Round-2 SIKE parameters with memory size \(w=2^{80}\), set size \(N=|S|=2^{e_2/2-1}\) for the 2-torsion and \(N=|S|=3^{(e_3-1)/2}\) for the 3-torsion – see Sect. 3. Numbers are shown as the floor of their base-2 logarithms. The number of isogeny computations, #isog, is computed by setting \(m=t=1\) in Eq. (1), and the numbers \(i_\mathbf{sum}\) of instructions for each isogeny are taken from Tables 5a and b. The total number of instructions, vOW, is the product of #isog and \(i_\mathbf{sum}\) and is intended to act as a lower bound on the number of gates required to solve the CSSI problem with the vOW algorithm.

5.2 Concrete Security of SIKEp434

Finally, we focus our attention on arguably the most interesting cryptanalytic target, namely the SIKE Round-2 category-1 parameter set SIKEp434, with claimed (classical) security comparable to AES-128. Although the analysis in the previous section shows agreement between our estimates and those in the literature, all approaches so far have one thing in common: communication and memory access costs are not taken into account. As these become non-negligible when the memory and the number of cores grow — already mentioned in the context of SIDH/SIKE by Adj et al. [1, Remark 6] — one can wonder how significant they are. Since such costs are often difficult to capture in theoretical models, we take a more practical approach.

We start by noticing that the current complexity estimates are measured in the average number of oracle calls, where an oracle call corresponds to an isogeny computation (e. g. of degree \(2^{106}\) or \(3^{68}\) for SIKEp434). Given that we now have an optimized implementation of the algorithm itself, a simple alternative is to measure the complexity in the average number of cycles instead. Much of the heuristic approach of van Oorschot and Wiener [23, §4.2] remains the same; we run a single function version and measure the number of distinct collisions it generates, from which we approximate the runtime of the full algorithm. That is, we assume that each function version behaves approximately the same with respect to the number of distinct collisions it generates, which van Oorschot and Wiener heuristically show to be true for \(w \ge 2^{16}\) (the results for different function versions are within 1% of one another). Thus, writing \(N\) for the set size and \(c\) for the number of distinct collisions generated per function version, every function version has (independent) probability \(2c/N\) of finding the golden collision, and completing the vOW algorithm requires \(N/(2c)\) versions on average. If each one requires \(t\) cycles to complete, the average total runtime is therefore \(tN/(2c)\).

Equivalently, on average we need \(t/c\) cycles per generated collision, of which there are \(N/2\) in total, leading to the above average runtime. One may therefore want to simplify the analysis by generating only very few collisions and approximating the runtime from those. However, we note that \(t/c\) is very large at the beginning of the algorithm, since the memory starts out empty, while later the distribution of distinguished points in memory becomes biased towards those with a lower probability of producing a collision – see [23, §4.2]. It may be possible to run less than a full function version and still obtain a close approximation of \(t/c\), but we consider this out of scope for this work and stick with completing a function version for our estimations.

Consider the setup conjectured by Adj et al. (i. e. memory \(w = 2^{80}\) and \(m = 2^{64}\) cores) applied against SIKEp434: the number of oracle calls grows linearly with \(\sqrt{Nw}/m\), where \(N = 2^{107}\), while each oracle call takes on the order of \(2^{22}\) x64 instructions (see Table 5a). In the theoretical model where memory accesses are free and the algorithm parallelizes perfectly, a function version can be run with approximately \(2^{51.5}\) x64 instructions per core (and to run the full algorithm we need approximately \(2^{27}\) function versions, agreeing with the estimates in Table 6). If each x64 instruction were a single cycle on a machine running at 1 GHz, such a computation would finish in about 37 days. Although it should be noted that such a setup is not realistic, other combinations of resources allow for (theoretically) running a single function version within a reasonable amount of time (say, a year). It is not clear that these runtimes will hold true in practice, as for example distributing the experiment across different machines can cause significant overhead. We consider exploring this overhead, e. g. by analyzing how different network topologies affect the results, a very worthwhile research direction.
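The arithmetic behind these figures is easy to reproduce; the following back-of-the-envelope Python sketch uses only the assumptions stated above (free memory access, perfect parallelization, one instruction per cycle at 1 GHz).

```python
import math

# Assumed setup from the text: N = 2^107, w = 2^80, m = 2^64 cores,
# and an oracle cost of ~2^22 x64 instructions per call.
N, w, m = 2.0 ** 107, 2.0 ** 80, 2.0 ** 64
oracle_cost = 2.0 ** 22

# One function version costs on the order of sqrt(N*w) oracle calls,
# distributed perfectly over the m cores.
calls_per_core = math.sqrt(N * w) / m
instr_per_core = calls_per_core * oracle_cost      # ~2^51.5 instructions

# Wall-clock time at 10^9 instructions per second.
days = instr_per_core / 1e9 / 86400
print(f"log2(instructions per core) = {math.log2(instr_per_core):.1f}")
print(f"wall-clock per function version ~ {days:.0f} days")
```

Running this confirms roughly \(2^{51.5}\) instructions per core and a wall-clock time of about 37 days per function version under these idealized assumptions.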

In a more constrained environment, i. e. when running experiments on Atomkohle with \(w \in \{2^{16},2^{18},2^{20}\}\) and \(m = 28\), running a single SIKEp434 function version would require millions of years. Instead, we decrease the degree \(e\) of the isogeny we try to reconstruct (without changing the finite field) to a point where experiments run in a few hours. Crucially, if the theoretical analysis of van Oorschot and Wiener holds up for these resources, then the runtime of a function version grows linearly with \(\sqrt{N}\), and we can extrapolate the runtime of a single function version for the actual SIKEp434 parameters on such a setup by drawing a line through the data points. Interestingly, the difference between this approximation of the security of SIKEp434 and the theory can be seen as an error measure for the theoretical analysis of vOW (the better the fit, the closer the theory is to reality).

More concretely, we choose \(e = 28,30,\ldots ,42\) and measure the cycle counts needed to complete one function version, as well as the number of distinct collisions generated. We use precomputation depth \(\varDelta = 16\) and, to account for the varying cost of the oracle (a \(2^e\)-isogeny), we normalize the cycle count by a factor \(\sigma (\varDelta ,e)\cdot \zeta (e)\), with \(\zeta (e) = (1/2)\cdot \left( (e-2)\log (e-2) + (e-1)\log (e-1)\right) \) the estimated average cost of the oracle and \(\sigma (\varDelta ,e)\) given as in Eq. (4). This yields a measure for the average number \(t/c\) of cycles required to generate a single collision, which we summarize in Table 7.

For a fixed \(w\), we then extrapolate, using the least squares method, the function that maps \(\sqrt{2^{e-1}}\) to the corresponding value in the table. This leads to the three approximation functions

$$\begin{aligned} z_{16}(e)&= \sigma (\varDelta ,e)\cdot \zeta (e)\cdot (3.44..\cdot \sqrt{2^{e-1}} + 19247.78..)\,, \\ z_{18}(e)&= \sigma (\varDelta ,e)\cdot \zeta (e)\cdot (1.72..\cdot \sqrt{2^{e-1}} + 6151.88..)\,, \\ z_{20}(e)&= \sigma (\varDelta ,e)\cdot \zeta (e)\cdot (0.87..\cdot \sqrt{2^{e-1}} - 928.81..)\,, \end{aligned}$$

where the factor \(\sigma (\varDelta ,e)\cdot \zeta (e)\) merely undoes the normalization. For any such \(w\), the runtime of a single function version for SIKEp434 is then \(z_{\log (w)}(e)\) cycles, while the full algorithm has total runtime \(2^{e-2}\cdot z_{\log (w)}(e)\) cycles, since \(N = 2^{e-1}\). Thus, setting \(e = 108\), we expect vOW on SIKEp434 to have a runtime of \(2^{170.47..}\), \(2^{169.47..}\) and \(2^{168.50..}\) cycles for \(w = 2^{16}\), \(w = 2^{18}\) and \(w = 2^{20}\), respectively. For comparison, using Eq. (1) combined with the approximation of the cost of the isogeny oracle from Table 5a, we expect runtimes of \(2^{170.71..}\), \(2^{169.71..}\) and \(2^{168.71..}\) x64 instructions, respectively. These approximations match very closely, confirming that the theoretical estimates lie very close to the practical runtimes for these values of \(w\) and \(m\). Indeed, this is no surprise, as such small values should not cause significant overhead.
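The extrapolation step itself is elementary; the following Python sketch shows the closed-form least-squares fit on synthetic stand-in data (an exact line with slope 1.7, roughly the order of the \(z_{18}\) coefficient; the real inputs are the measured values behind Table 7).

```python
import math

def lstsq_line(xs, ys):
    """Ordinary least-squares fit ys ~ a*xs + b, in closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Synthetic stand-in for the measurements (NOT the measured values):
# normalized cycles t/c per collision for e = 28, 30, ..., 42, lying
# exactly on a hypothetical line 1.7*sqrt(2^(e-1)) + 6000.
es = list(range(28, 44, 2))
xs = [math.sqrt(2.0 ** (e - 1)) for e in es]
ys = [1.7 * x + 6000.0 for x in xs]

a, b = lstsq_line(xs, ys)
# The extrapolated cost of one function version at e = 108 would then
# be sigma(Delta,e)*zeta(e)*(a*sqrt(2^107) + b), and the full attack
# costs 2^(e-2) times that, since N = 2^(e-1).
print(f"a = {a:.3f}, b = {b:.1f}")
```

With the measured data the fit is of course not exact, and the quality of the fit is precisely the error measure discussed above.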

Table 7. Number of cycles (measured in thousands and rounded to the nearest multiple of \(10^3\)) to generate a single collision, for different memory sizes \(w\) and isogeny instances of degree \(2^{e_2}\), where \(e = e_2/2\). All numbers are scaled by a factor \(\sigma (\varDelta ,e)\cdot \zeta (e)\).

However, we emphasize that this is the first time a theoretical estimate of the security of SIKEp434 has been met with serious practical consideration (i. e. without ignoring memory access times and issues with parallelism). If our setup with \(w = 2^{20}\) were run on an instance with \(e = 108\), it would require (on average) \(2^{168.50..}\) cycles to complete. We believe this value can therefore be viewed as an upper bound on the security level of SIKEp434. On the other hand, the analyses of Adj et al. [1] and Jaques and Schanck [9], assuming \(w = 2^{80}\) and \(m = 2^{64}\), provide a lower bound on the security level. This gap could be closed by computing \(z_{\log (w)}\) for larger values of \(w\) and \(m\) and showing that the results agree with the theoretical estimations, a valuable effort that should be seriously considered to understand the security of SIKEp434. It is of course not clear that the gap between the upper and lower bounds will vanish completely; scaling the setup to large memory and distributed systems will cause significant overhead, which is also noticeable in cryptanalytic efforts in other domains [26].