1 Introduction

With the explosive growth in demand for secure information exchange, efficient public-key cryptography is becoming increasingly indispensable. Compared with traditional schemes such as RSA and Diffie-Hellman, elliptic curve cryptography (ECC) has become the more attractive alternative. In December 2010, the Chinese State Cryptography Administration published the national public-key cryptography standard based on ECC [1], known as SM2. The same series of national cryptographic algorithms also includes a hash function and a symmetric cipher, known as SM3 and SM4. These algorithms have been adopted and vigorously promoted by the government and have broad application prospects. SM2 is an ECC scheme defined over a 256-bit pseudo-Mersenne prime field, denoted SCA-256 in the following discussion. Given these prospects, an ASIC implementation of SM2 is highly valuable, but research on this topic is far from sufficient, and this paper aims to help fill the gap. We analyze and implement SM2 in detail, taking countermeasures against simple power analysis (SPA) into account. We then propose a high-performance implementation of SM2 with SPA resistance, using an innovative isochronous architecture for the Montgomery powering ladder (MPL) with mathematical optimization.

The remainder of this paper proceeds as follows. Section 2 briefly presents the background of ECC and SM2. Section 3 introduces our SM2 architecture in detail. The implementation performance and a comparison with previous work are given in Sect. 4, followed by the conclusion.

2 Mathematical Background

A non-supersingular elliptic curve over GF(p) is usually expressed as the Weierstrass equation in Eq. 1:

$$\begin{aligned} E: y^2=x^3+ax+b \end{aligned}$$
(1)

where \(a,b\in GF(p)\) and \(4a^3+27b^2\ne 0\pmod {p}\). All the solutions \((x,y)\in GF(p)\times GF(p)\) that satisfy this equation, together with the point at infinity \(P_\infty \), make up the curve. To form an abelian group, ECC arithmetic defines the addition of two points on this curve as in Eq. 2, which distinguishes between equal and unequal operands. Let \(P=(x_1,y_1), Q=(x_2,y_2)\in E\); then \(R=(x_3,y_3)=P+Q\in E\). If \(P\ne Q\), the point addition formulas apply; otherwise the point double formulas apply.

$$\begin{aligned} x_3&=\lambda ^2-x_1-x_2\\ y_3&=-y_1-(x_3-x_1)\lambda \end{aligned}$$

where

$$\begin{aligned} \lambda = {\left\{ \begin{array}{ll} \frac{3x_1^2+a}{2y_1} &amp; \text {if } (x_1,y_1)=(x_2,y_2) \\ \frac{y_1-y_2}{x_1-x_2} &amp; \text {otherwise.} \end{array}\right. } \end{aligned}$$
(2)
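As a minimal sketch of Eq. 2, the affine group law can be written as follows. The toy curve \(y^2=x^3+2x+3\) over GF(97) and the function name `ec_add` are assumptions chosen for illustration; they are not SM2 parameters, and `None` stands in for \(P_\infty \).

```python
# Toy curve y^2 = x^3 + 2x + 3 over GF(97): an illustrative assumption, not SM2.
p, a, b = 97, 2, 3

def ec_add(P, Q):
    """Affine addition following Eq. 2; None represents the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                      # P + (-P) = point at infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p   # point double
    else:
        lam = (y1 - y2) * pow((x1 - x2) % p, -1, p) % p          # point addition
    x3 = (lam * lam - x1 - x2) % p
    y3 = (-y1 - (x3 - x1) * lam) % p
    return x3, y3

P = (3, 6)                # lies on the toy curve: 6^2 = 36 = 27 + 6 + 3
P2 = ec_add(P, P)         # point double
P3 = ec_add(P, P2)        # point addition
```

The modular inverse uses the three-argument `pow` with exponent `-1`, which requires Python 3.8 or later.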

The operation that dominates the execution time of ECC is point multiplication (PM), also called scalar multiplication. It is defined as \(kP=\underbrace{P+P+\cdots +P}_{k\text { times}}\), where P is a point on the elliptic curve and k is a random integer. It is computed by a series of point additions (PA) and point doubles (PD), each of which decomposes into a number of finite field operations. The design of point multiplication therefore determines the final performance of SM2, and achieving a high-performance implementation of PM is the core subject of this paper.

Compared with international standard ECC algorithms, SM2 adopts its own prime field, called SCA-256. It also differs in other aspects, such as the encryption procedure and the structure of the data to be signed, which enhance its applicability and security in commercial environments. The parameters of SM2 are specified in [1]. Among them, the most important is the selected pseudo-Mersenne prime: \(p_{SCA-256}=2^{256}-2^{224}-2^{96}+2^{64}-1\).

3 Proposed ECC Processor

In this section, we first choose the main algorithm for point multiplication after weighing the alternatives. The underlying units are then built and optimized from the bottom up, and the entire architecture is presented last.

3.1 Optimization for Point Multiplication

Since it was first proposed, SPA has proved to be one of the most common threats to cryptographic devices. Countermeasures mainly operate at the algorithm level and fall into three groups: the double-and-add-always (DAA) algorithm, the normalization algorithm, and the Montgomery powering ladder (MPL). The first, used in [2], makes a simple change to the traditional left-to-right double-and-add (LR-DA) algorithm: it performs both a PA and a PD in every iteration, resisting SPA at the cost of about 50 % more PA operations on average. However, it leaves the design open to C safe-error fault analysis [3]. The second, proposed in [4], normalizes PA and PD: the addition formulas on the elliptic curve are rewritten so that the same formula applies whether the two points are different or identical. As the third method, MPL was first proposed in [5] and is shown as Algorithm 1.

Algorithm 1. Montgomery powering ladder (MPL)
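Since the listing of Algorithm 1 is not reproduced here, the following is a minimal sketch of the standard Montgomery ladder, not necessarily the exact form used in [5]. It is written over an abstract additive group, so the function name `montgomery_ladder` and the callables `add`, `dbl` and `zero` are assumptions for illustration; the demo uses integers modulo 101 in place of curve points.

```python
def montgomery_ladder(k, P, add, dbl, zero):
    """Return (kP, (k+1)P) in an additive group.

    Every iteration performs exactly one addition and one doubling,
    whatever the key bit is, and keeps the invariant q1 = q0 + P.
    """
    q0, q1 = zero, P
    for bit in bin(k)[2:]:               # scan key bits, most significant first
        if bit == "1":
            q0, q1 = add(q0, q1), dbl(q1)
        else:
            q0, q1 = dbl(q0), add(q0, q1)
    return q0, q1

# Stand-in group for the demo: integers modulo 101 under addition.
n = 101
add = lambda x, y: (x + y) % n
dbl = lambda x: (2 * x) % n

q0, q1 = montgomery_ladder(45, 7, add, dbl, 0)
```

Because both branches issue the same pair of operations, the sequence of group operations does not depend on the key bits, which is the property that the SPA argument relies on.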

By maintaining the relation \(Q_1=Q_0+P\), with \(P=(x_p,y_p)\), PA and PD are executed in every iteration without any redundant operation. Moreover, the sum of two points whose difference is fixed can be computed without the y-coordinate, reducing both computation effort and storage space. Brier and Joye [4] derived simplified PA and PD formulas for MPL over prime fields in projective coordinates, shown as Eqs. 3 and 4. This optimization removes all storage and computation for the y-coordinate during the ladder, resulting in higher efficiency. They also showed how to recover the y-coordinate of the result kP, shown as Eq. 5.

$$\begin{aligned} {\left\{ \begin{array}{ll} X(Q_0+Q_1)=-4bZ_0Z_1(X_0Z_1+X_1Z_0)+(X_0X_1-aZ_0Z_1)^2\\ Z(Q_0+Q_1)=x_p\cdot (X_1Z_0-X_0Z_1)^2 \end{array}\right. } \end{aligned}$$
(3)
$$\begin{aligned} {\left\{ \begin{array}{ll} X(2Q_1)=(X_1^2-aZ_1^2)^2-8bX_1Z_1^3\\ Z(2Q_1)=4Z_1(X_1^3+aX_1Z_1^2+bZ_1^3) \end{array}\right. } \end{aligned}$$
(4)
$$\begin{aligned} y={(2y_p)}^{-1}[2b+(a+x_px_0)(x_p+x_0)-x_1(x_p-x_0)^2] \end{aligned}$$
(5)

Since the x-coordinates of both \(Q_0\) and \(Q_1\) are needed for the final result, we convert them to affine coordinates using Eq. 6. With this design, one MPL scalar multiplication needs no more modular inversions than a conventional algorithm, because a single inversion of \(Z_0Z_1\) serves both coordinates.

$$\begin{aligned} Z_{inv}=\frac{1}{Z_0Z_1} \quad \Longrightarrow \quad x_0=Z_{inv}Z_1X_0,\quad x_1=Z_{inv}Z_0X_1 \end{aligned}$$
(6)
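The x-only ladder, the shared inversion of Eq. 6 and the y-recovery step can be put together in a short sketch. The toy curve over GF(97), the base point \((3,6)\) and the helper names `xadd`, `xdbl` and `ladder` are assumptions for illustration. The sign of the \(a\) term in the addition step and the roles of \(x_0\) and \(x_1\) in the recovery step follow a derivation from the affine formulas of Eq. 2, with \(x_0\) belonging to \(Q_0=kP\) and \(x_1\) to \(Q_1=Q_0+P\).

```python
# Toy parameters (illustrative assumptions, not SM2): y^2 = x^3 + 2x + 3 over GF(97).
p, a, b = 97, 2, 3
xp, yp = 3, 6                 # base point P; it has order 5 on this toy curve

def xadd(X0, Z0, X1, Z1):
    """x-only addition of Q0 and Q1 whose difference is P (cf. Eq. 3)."""
    t = (X0 * X1 - a * Z0 * Z1) % p
    X = (-4 * b * Z0 * Z1 * (X0 * Z1 + X1 * Z0) + t * t) % p
    Z = xp * pow(X1 * Z0 - X0 * Z1, 2, p) % p
    return X, Z

def xdbl(X, Z):
    """x-only doubling (cf. Eq. 4)."""
    X2 = (pow(X * X - a * Z * Z, 2, p) - 8 * b * X * pow(Z, 3, p)) % p
    Z2 = 4 * Z * (pow(X, 3, p) + a * X * Z * Z + b * pow(Z, 3, p)) % p
    return X2, Z2

def ladder(k):
    """Affine (x, y) of kP for k >= 1, assuming kP and (k+1)P are finite."""
    q0 = (xp, 1)              # Q0 = P
    q1 = xdbl(xp, 1)          # Q1 = 2P, so Q1 - Q0 = P
    for bit in bin(k)[3:]:    # remaining key bits after the leading 1
        if bit == "1":
            q0, q1 = xadd(*q0, *q1), xdbl(*q1)
        else:
            q0, q1 = xdbl(*q0), xadd(*q0, *q1)
    (X0, Z0), (X1, Z1) = q0, q1
    zinv = pow(Z0 * Z1, -1, p)            # Eq. 6: one inversion for both points
    x0 = zinv * Z1 * X0 % p
    x1 = zinv * Z0 * X1 % p
    # y recovery (cf. Eq. 5). (2*yp)^-1 depends only on the fixed base point;
    # computing it inline is a simplification of this sketch.
    y0 = pow(2 * yp, -1, p) * (
        2 * b + (a + xp * x0) * (xp + x0) - x1 * (xp - x0) ** 2) % p
    return x0, y0
```

On this toy curve only \(k=1,2,3\) are usable, since \(5P\) is the point at infinity; a real design would use the SM2 parameters from [1].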

The computation cost of one kP under the different algorithms is compared in Table 1. MPL with the simplified formulas offers better SPA resistance at lower cost, and it is the only one that can execute PA and PD in parallel. Without redundant operations or y-coordinates, it can also resist C safe-error and M safe-error fault attacks. We therefore adopt it as our main algorithm.

Table 1. Calculation cost of different algorithms

3.2 Optimization for Finite Field Arithmetic

Modular multiplication (MM) consists of an ordinary multiplication followed by a modular reduction. For the pseudo-Mersenne prime of SM2, we adopt the Fast Reduction scheme in [6], whose cycle count can be precisely controlled by the number of adders. By matching an appropriate number of adders to the multipliers and introducing a pipelined structure, we obtain configurable MM modules. For an MM module with \(M\) \(N\)-bit multipliers and one 2N-bit adder, a 256-bit MM takes \((\frac{256}{N})^2\cdot \frac{1}{M}+1\) cycles, which we call one unit of cycles. For modular inversion, we adopt the fast radix-4 unified division algorithm in [7]. For modular addition and subtraction, we design a combined module that executes either operation in a single cycle. With these optimizations, all finite field units achieve high hardware utilization and speed, which greatly improves the overall performance.
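The cycle formula and the idea behind reduction modulo a pseudo-Mersenne prime can be sketched in software. This is not the Fast Reduction scheme of [6], which works with word-level additions; it is a simpler folding reduction that exploits the same property, namely that \(2^{256} \bmod p\) is a sparse constant. The names `mm_cycles`, `reduce_fold`, `P_SM2` and `FOLD` are assumptions for illustration, and the prime is written out from its published hexadecimal value.

```python
# SM2 field prime, assembled from its eight 32-bit words (most significant first).
P_SM2 = int("FFFFFFFE" + "FFFFFFFF" * 4 + "00000000" + "FFFFFFFF" * 2, 16)
FOLD = (1 << 256) - P_SM2        # equals 2^256 mod p, a sparse constant
MASK = (1 << 256) - 1

def mm_cycles(n_bits, m):
    """Cycles for one 256-bit MM with m multipliers of n_bits each."""
    partial_products = (256 // n_bits) ** 2
    assert 256 % n_bits == 0 and partial_products % m == 0
    return partial_products // m + 1

def reduce_fold(x):
    """Reduce a non-negative integer modulo P_SM2 by folding the high part."""
    while x >> 256:
        # hi * 2^256 + lo is congruent to hi * FOLD + lo (mod p)
        x = (x & MASK) + (x >> 256) * FOLD
    if x >= P_SM2:               # x < 2^256 < 2p here, so one subtraction suffices
        x -= P_SM2
    return x

def mod_mul(u, v):
    """Modular multiplication as plain multiplication followed by reduction."""
    return reduce_fold(u * v)

unit = mm_cycles(64, 1)          # 16 partial products plus one cycle
```

With a single 64-bit multiplier the formula gives 17 cycles per MM, and doubling the number of multipliers roughly halves that figure.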

3.3 Optimization for Point Addition and Point Double

Since the final performance of the MPL algorithm largely depends on the implementation of the point arithmetic layer, optimizing PA and PD matters most. Their computation steps are defined by Eqs. 3 and 4. As described in the previous section, modular multiplication and modular addition/subtraction are performed by separate units, and the latter takes far less time than the former. Subject to keeping the area as small as possible, the most efficient arrangement is therefore to perform all additions and subtractions fully in parallel with the multiplications. This is not easy, and at least three issues must be considered. First, the data dependencies in the PA and PD formulas are complicated. Second, because of our two-stage pipelined architecture, a modular multiplication result becomes available only two units of cycles after its operands enter, which complicates operation scheduling. Finally, to save storage space, we want to keep as few intermediate values as possible. Together, these three issues place stringent requirements on the execution order. After careful analysis, we found a schedule that satisfies all of them. The resulting schedule for point doubling is shown in Table 2 and that for point addition in Table 3.

Table 2. Point double execution order
Table 3. Point addition execution order

In this schedule, the multipliers are kept busy all the time and the execution time of the adders is hidden in parallel with them. Since PA and PD require almost the same multiplication load, we assign each of them one MM unit with a 64-bit multiplier. We then encapsulate PA and PD into two synchronous modules that share a single modular addition/subtraction unit for all operations other than MM. Based on the above optimization, our PA and PD units achieve high performance and meet the design targets.

3.4 SM2 Architecture

Table 4. Performance comparison

The whole architecture is composed of three modules. The storage module consists of a register file and its store logic, and it handles the storage and transfer of all data in the ECC algorithm. The PA and PD units, sharing one modular addition/subtraction unit, form the arithmetic module. The main control module coordinates the other two to execute PM efficiently.

4 Comparison and Conclusion

This architecture has been described and verified in Verilog HDL and evaluated with a 0.13 \(\upmu \)m CMOS standard cell library. The results are shown in Table 4, together with previously published results over 256-bit prime fields. Although a perfectly fair comparison cannot be guaranteed because the implementation backgrounds differ, the area-time (AT) product provides a reasonably objective metric. In this comparison, our architecture offers the best AT product.

In summary, this paper proposes a high-performance ASIC implementation of point multiplication for SM2. At the field operation level, modular addition/subtraction and inversion are designed as efficient modules that fit the whole architecture well. For the most important operation, modular multiplication, we adopt the Fast Reduction scheme and obtain a configurable modular multiplier with a pipelined architecture. At the point arithmetic level, the execution order of PA and PD is carefully scheduled, yielding very high hardware efficiency. At the algorithm level, the MPL algorithm with simplified formulas brings both efficiency and security. Synthesis results show that this processor needs only 208 \(\upmu \)s and 6.8 \(\upmu \)J to complete a 256-bit point multiplication at 228 MHz, and that it can effectively resist SPA. Compared with related work, this architecture offers both a superior area-time product and strong security. However, this performance comes at the cost of flexibility. In future work, we will focus on extending the isochronous and configurable ECC architecture, aiming for a more instructive and flexible implementation.