Ultra High-Performance ASIC Implementation of SM2 with SPA Resistance

Zhang, Dan; Bai, Guoqiang

doi:10.1007/978-3-319-29814-6_17


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 9543)

Included in the following conference series:

International Conference on Information and Communications Security


Abstract

To ensure secure information exchange, demand for hardware implementations of elliptic curve cryptography (ECC) has increased rapidly in recent years. In this paper, we propose an ASIC design for ECC over the SCA-256 prime field that delivers both high performance and strong resistance to simple power analysis (SPA). For algorithm selection, we integrate calculation simplification into the classic Montgomery Powering Ladder (MPL). Building on the derivation of Fast NIST Reduction, we obtain a configurable modular multiplication module and, on top of it, isochronous point addition and point double units. Pipelined architecture, execution order optimization and modular design are all applied to improve performance. Evaluated with a 0.13 µm CMOS standard cell library, this ECC processor needs only 208 µs and 6.8 µJ for one scalar multiplication and runs at 228 MHz with an area of 156 k gates. Compared to related works, it is advantageous in both area-time product and SPA resistance.


1 Introduction

With the explosive growth in demand for secure information exchange, efficient public-key cryptography is becoming more and more indispensable. Compared to traditional schemes such as RSA and Diffie-Hellman, elliptic curve cryptography (ECC) has become the more attractive alternative. In December 2010, the Chinese State Cryptography Administration published a national public-key cryptography standard based on ECC [1], known as SM2. The same series of national cryptographic algorithms also includes a hash function and a symmetric cipher, known as SM3 and SM4. These algorithms have been accepted and vigorously promoted by the government and have broad application prospects. SM2 is ECC defined over a 256-bit pseudo-Mersenne prime field, which is denoted SCA-256 in the following discussion. Given these prospects, ASIC implementation of SM2 has become very valuable, but research on it is far from sufficient, and this paper aims to help fill that gap. We analyze and implement SM2 in detail and take defensive measures against SPA into account. We then propose a high-performance implementation of SM2 with SPA resistance, using an innovative isochronous architecture of MPL with mathematical optimization.

The remainder of this paper proceeds as follows. Section 2 briefly presents the background of ECC and SM2. Section 3 introduces our SM2 architecture in detail. The implementation performance and a comparison with previous works are given in Sect. 4, followed by the conclusion.

2 Mathematical Background

A non-supersingular elliptic curve over GF(p) is usually expressed as the Weierstrass equation in Eq. 1:

$$\begin{aligned} E: y^2=x^3+ax+b \end{aligned}$$

(1)

where $a,b\in GF(p)$ and $4a^3+27b^2\not\equiv 0\pmod {p}$. All the solutions $(x,y)\in GF(p)\times GF(p)$ of this equation make up the curve, together with the point at infinity $P_\infty $. To form an abelian group, ECC arithmetic defines the addition of two points on this curve in Eq. 2, which distinguishes the cases of equal and unequal points. Let $P=(x_1,y_1), Q=(x_2,y_2)\in E$ with $Q\ne -P$; then $R=(x_3,y_3)=P+Q\in E$. If $P\ne Q$, we have the point addition formulas; otherwise we have the point double formulas.

$$\begin{aligned} x_3&=\lambda ^2-x_1-x_2\\ y_3&=-y_1-(x_3-x_1)\lambda \end{aligned}$$

where

$$\begin{aligned} \lambda = {\left\{ \begin{array}{ll} \frac{3x_1^2+a}{2y_1} & \text{if } (x_1,y_1)=(x_2,y_2) \\ \frac{y_1-y_2}{x_1-x_2} & \text{otherwise.} \end{array}\right. } \end{aligned}$$

(2)
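The group law of Eq. 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy curve $y^2=x^3+2x+2$ over $GF(17)$, the names `P_MOD`, `A`, `INF` and `ec_add`, and the explicit handling of the point at infinity are ours and are not part of the proposed design, which works on SCA-256.

```python
# Toy curve y^2 = x^3 + 2x + 2 over GF(17); illustrative only, not SM2.
P_MOD, A = 17, 2
INF = None  # hypothetical encoding of the point at infinity


def ec_add(P, Q):
    """Affine point addition and doubling following Eq. 2."""
    if P is INF:
        return Q
    if Q is INF:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF  # P + (-P) is the point at infinity
    if P == Q:
        # point double: lambda = (3*x1^2 + a) / (2*y1)
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        # point addition: lambda = (y1 - y2) / (x1 - x2)
        lam = (y1 - y2) * pow((x1 - x2) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (-y1 - (x3 - x1) * lam) % P_MOD
    return (x3, y3)


G = (5, 1)              # a point on the toy curve
print(ec_add(G, G))     # point double
print(ec_add(G, (6, 3)))  # point addition
```

Each call needs one modular inversion (`pow(..., -1, P_MOD)`, available from Python 3.8), which is why projective coordinates are preferred in hardware.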

The operation that dominates the execution time of ECC is point multiplication (PM), also called scalar multiplication. It is defined as $kP=\sum _{i=1}^{k} P=P+P+\cdots +P$, where $P$ is a point on the elliptic curve and $k$ is an integer. It is computed by a series of point additions (PA) and point doubles (PD), which are further decomposed into a number of finite field operations. The design quality of point multiplication determines the final performance of SM2, so a high-performance implementation of PM is the core subject of this paper.

Compared with international standard ECC algorithms, SM2 adopts its own prime field, called SCA-256. It also differs in some aspects, such as the encryption procedure and the structure of the data to be signed, which enhance its applicability and security in commercial environments. The parameters of SM2 are specified in [1]. Among them, the most important one is the selected pseudo-Mersenne prime: $p_{SCA-256}=2^{256}-2^{224}-2^{96}+2^{64}-1$.
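The special structure of this prime is easiest to see word by word. The short sketch below builds the SM2 prime as $2^{256}-2^{224}-2^{96}+2^{64}-1$ and splits it into eight 32-bit words, most significant first; the names `P_SM2` and `WORDS` are our own.

```python
# Build the SCA-256 prime from its pseudo-Mersenne form and split it into
# eight 32-bit words, most significant word first.
P_SM2 = 2**256 - 2**224 - 2**96 + 2**64 - 1
WORDS = [(P_SM2 >> (32 * i)) & 0xFFFFFFFF for i in reversed(range(8))]

print(P_SM2.bit_length())
print(" ".join(f"{w:08X}" for w in WORDS))
```

Every word is all ones, all zeros, or `FFFFFFFE`, which is what allows reduction modulo this prime to be carried out with a few word-level additions and subtractions instead of a general division.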

3 Proposed ECC Processor

In this section, we first choose the main algorithm for point multiplication after weighing the alternatives. The underlying units are then built and optimized from the bottom up, and the entire architecture is presented last.

3.1 Optimization for Point Multiplication

Since it was proposed, SPA has proved to be one of the most common threats to cryptographic devices. Countermeasures concentrate mainly on the algorithm level and fall into three groups: the double-and-add always (DAA) algorithm, the normalization algorithm and the Montgomery powering ladder. The first, used in [2], makes a simple change to the traditional left-to-right double-and-add (LR-DA) algorithm. It performs both a PA and a PD operation in each iteration, resisting SPA at the cost of a redundant PA in about 50 % of the iterations on average, but it is exposed to C safe-error fault analysis [3]. The second, from Brier and Joye [4], normalizes PA and PD: the addition formulas on the elliptic curve are rewritten so that the same formula applies whether the two points are different or the same. As the third method, MPL was first proposed in [5] and is shown as Algorithm 1.
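To make the contrast between LR-DA and MPL concrete, the following sketch runs both on a toy curve and records the sequence of operations. It follows the standard textbook form of the ladder, which may differ in detail from Algorithm 1. The curve $y^2=x^3+2x+2$ over $GF(17)$, the function names and the operation trace are illustrative assumptions. The Python `if` on the key bit is itself not constant time; only the operation pattern is of interest here.

```python
P_MOD, A = 17, 2   # toy curve y^2 = x^3 + 2x + 2 over GF(17), not SM2
INF = None         # hypothetical encoding of the point at infinity


def ec_add(P, Q):
    """Affine addition/doubling (Eq. 2) with infinity handling."""
    if P is INF:
        return Q
    if Q is INF:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y1 - y2) * pow((x1 - x2) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def lr_double_and_add(k, P):
    """LR-DA: a PA is executed only for 1-bits, so the trace leaks k."""
    Q, trace = INF, []
    for bit in bin(k)[2:]:
        Q = ec_add(Q, Q)
        trace.append("D")
        if bit == "1":
            Q = ec_add(Q, P)
            trace.append("A")
    return Q, "".join(trace)


def montgomery_ladder(k, P):
    """MPL: one PA and one PD per bit, keeping Q1 = Q0 + P throughout."""
    Q0, Q1, trace = INF, P, []
    for bit in bin(k)[2:]:
        if bit == "1":
            Q0, Q1 = ec_add(Q0, Q1), ec_add(Q1, Q1)
        else:
            Q0, Q1 = ec_add(Q0, Q0), ec_add(Q0, Q1)
        trace.append("AD")
    return Q0, "".join(trace)


G = (5, 1)
print(lr_double_and_add(11, G))
print(montgomery_ladder(11, G))
```

Both functions return the same point, but only the LR-DA trace depends on the bits of `k`. In the ladder the PA and PD of one iteration take independent inputs, which is what makes parallel execution possible.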

By maintaining the relation $Q_1=Q_0+P$ with $P=(x_p,y_p)$, PA and PD are executed without any redundant operation. Moreover, the sum of two points whose difference is fixed can be computed without the y-coordinate, which reduces both computation effort and storage space. Brier and Joye [4] derived the simplified PA and PD formulas for MPL over a prime field in projective coordinates, shown as Eqs. 3 and 4. This optimization saves all storage for the y-coordinates and the effort of computing them, resulting in higher efficiency. They also showed how to recover the y-coordinate of the result $kP$, shown as Eq. 5.

$$\begin{aligned} {\left\{ \begin{array}{ll} X(Q_0+Q_1)=-4bZ_0Z_1(X_0Z_1+X_1Z_0)+(X_0X_1-aZ_0Z_1)^2\\ Z(Q_0+Q_1)=x_p\cdot (X_1Z_0-X_0Z_1)^2 \end{array}\right. } \end{aligned}$$

(3)

$$\begin{aligned} {\left\{ \begin{array}{ll} X(2Q_1)=(X_1^2-aZ_1^2)^2-8bX_1Z_1^3\\ Z(2Q_1)=4Z_1(X_1^3+aX_1Z_1^2+bZ_1^3) \end{array}\right. } \end{aligned}$$

(4)

$$\begin{aligned} y_0={(2y_p)}^{-1}[2b+(a+x_px_0)(x_p+x_0)-x_1(x_p-x_0)^2] \end{aligned}$$

(5)

Since the x-coordinates of both $Q_0$ and $Q_1$ are needed for the final result, we convert them to affine coordinates using Eq. 6. With this design, one MPL scalar multiplication needs no more modular inversions than a conventional algorithm, because a single inversion of $Z_0Z_1$ serves both conversions.

$$\begin{aligned} Z_{inv}=\frac{1}{Z_0Z_1} \quad \Longrightarrow \quad x_0=Z_{inv}Z_1X_0,\quad x_1=Z_{inv}Z_0X_1 \end{aligned}$$

(6)
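The x-only ladder can be checked end to end on a small example. The sketch below is a minimal illustration, not the hardware data path. It takes the squared term of the addition formula as $(X_0X_1-aZ_0Z_1)^2$, following Brier and Joye [4], and recovers $y_0$ for $Q_0=kP$ with $Q_1=Q_0+P$. The toy curve $y^2=x^3+2x+2$ over $GF(17)$ with base point $(5,1)$ of order 19 and all function names are our assumptions. The ladder starts from $Q_0=P$, $Q_1=2P$, so it is valid only while neither point is the point at infinity (here $1\le k\le 17$).

```python
P_MOD, A, B = 17, 2, 2   # toy curve y^2 = x^3 + 2x + 2 over GF(17), not SM2


def x_add(X0, Z0, X1, Z1, xp):
    """x-only sum of Q0 and Q1 whose difference is P = (xp, yp), per Eq. 3."""
    t = (X0 * X1 - A * Z0 * Z1) % P_MOD
    X = (-4 * B * Z0 * Z1 * (X0 * Z1 + X1 * Z0) + t * t) % P_MOD
    Z = xp * (X1 * Z0 - X0 * Z1) ** 2 % P_MOD
    return X, Z


def x_dbl(X1, Z1):
    """x-only doubling, per Eq. 4."""
    X = ((X1 * X1 - A * Z1 * Z1) ** 2 - 8 * B * X1 * Z1 ** 3) % P_MOD
    Z = 4 * Z1 * (X1 ** 3 + A * X1 * Z1 * Z1 + B * Z1 ** 3) % P_MOD
    return X, Z


def ladder_x_only(k, xp, yp):
    """Affine kP from an x-only ladder, one inversion of Z0*Z1, y recovery."""
    X0, Z0 = xp, 1              # Q0 = P (top bit of k already consumed)
    X1, Z1 = x_dbl(xp, 1)       # Q1 = 2P
    for bit in bin(k)[3:]:      # remaining bits, most significant first
        if bit == "1":
            (X0, Z0), (X1, Z1) = x_add(X0, Z0, X1, Z1, xp), x_dbl(X1, Z1)
        else:
            (X0, Z0), (X1, Z1) = x_dbl(X0, Z0), x_add(X0, Z0, X1, Z1, xp)
    z_inv = pow(Z0 * Z1 % P_MOD, -1, P_MOD)   # the only inversion of Z values
    x0 = z_inv * Z1 * X0 % P_MOD              # Eq. 6
    x1 = z_inv * Z0 * X1 % P_MOD
    y0 = pow(2 * yp % P_MOD, -1, P_MOD) * (
        2 * B + (A + xp * x0) * (xp + x0) - x1 * (xp - x0) ** 2) % P_MOD
    return x0, y0


for k in range(1, 6):
    print(k, ladder_x_only(k, 5, 1))
```

In this sketch the inverse of $2y_p$ is computed separately for clarity. Since $y_p$ belongs to the fixed input point, an implementation could fold it into the same inversion or precompute it.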

The computation costs of one $kP$ under the different algorithms are compared in Table 1. MPL with simplification offers better SPA resistance at lower cost, and it is the only one that can execute PA and PD in parallel. Without redundant operations or y-coordinates, it can also resist C safe-error and M safe-error fault attacks. We therefore use it as our main algorithm.

Table 1. Calculation cost of different algorithms

3.2 Optimization for Finite Field Arithmetic

Modular multiplication (MM) consists of a regular multiplication followed by a modular reduction. For the pseudo-Mersenne prime of SM2, we adopt the Fast Reduction scheme in [6], whose execution cycles can be precisely controlled by the number of adders. By matching an appropriate number of adders to the multipliers and introducing a pipelined structure, we obtain configurable MM modules. For one MM module with $M$ $N$-bit multipliers and one $2N$-bit adder, a 256-bit MM takes $\left(\frac{256}{N}\right)^2\cdot \frac{1}{M}+1$ execution cycles, which we call one unit of cycles. For modular inversion, we adopt the fast radix-4 unified division algorithm in [7]. For modular addition and subtraction, we design a combined module that executes both and costs only one cycle. With these optimizations, all the finite field operation units achieve high hardware utilization and high speed, which greatly improves the overall performance.
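The idea behind fast reduction for a pseudo-Mersenne prime can be sketched with arbitrary-precision integers. The code below is not the word-level schedule of [6] or of our hardware. It only uses the congruence $2^{256}\equiv 2^{224}+2^{96}-2^{64}+1 \pmod{p}$ to fold the high part of a product back into 256 bits, and it evaluates the cycle formula above. The names `reduce_sm2`, `mod_mul` and `mm_cycles` are hypothetical, and `mm_cycles` assumes that $N$ divides 256 and $M$ divides $(256/N)^2$.

```python
P_SM2 = 2**256 - 2**224 - 2**96 + 2**64 - 1
FOLD = 2**224 + 2**96 - 2**64 + 1     # equals 2^256 mod P_SM2
MASK = 2**256 - 1


def reduce_sm2(x):
    """Reduce a non-negative integer modulo P_SM2 without division."""
    while x >> 256:
        # x = hi * 2^256 + lo  is congruent to  lo + hi * FOLD  (mod P_SM2)
        x = (x & MASK) + (x >> 256) * FOLD
    return x - P_SM2 if x >= P_SM2 else x   # x < 2^256 < 2 * P_SM2


def mod_mul(a, b):
    """Modular multiplication: full product, then folding reduction."""
    return reduce_sm2(a * b)


def mm_cycles(n_bits, m_multipliers):
    """Cycle count (256/N)^2 / M + 1 for one 256-bit MM."""
    return (256 // n_bits) ** 2 // m_multipliers + 1


a, b = P_SM2 - 1, P_SM2 - 2
print(mod_mul(a, b) == (a * b) % P_SM2)
print(mm_cycles(64, 1), mm_cycles(64, 4))
```

Each pass of the loop strictly decreases `x`, so the loop terminates, and a single conditional subtraction completes the reduction. In hardware the same congruence is unrolled into a fixed set of word additions and subtractions, so the number of adders determines the cycle count.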

3.3 Optimization for Point Addition and Point Double

Since the final performance of the MPL algorithm depends largely on the implementation of the point arithmetic layer, optimization of PA and PD matters the most. Their computation steps are defined by Eqs. 3 and 4. As described in the previous section, modular multiplication and modular addition/subtraction are performed by separate units, and the latter takes far less time than the former. While saving as much area as possible, the most efficient arrangement is therefore to perform addition/subtraction fully in parallel with multiplication. This is not easy, and at least three issues must be considered. First, the data dependencies in the PA and PD formulas are complicated. Second, because of our two-stage pipelined architecture, two units of cycles pass between the arrival of the input data and the availability of the modular multiplication result, which complicates operation scheduling. Finally, to save storage space, we want to keep as little intermediate data as possible. All three place stringent requirements on the design of the execution order. After careful analysis, we found an optimal scheduling scheme. The resulting schedule for point double is shown in Table 2 and that for point addition in Table 3.

Table 2. Point double execution order


Table 3. Point addition execution order


In this optimal schedule, the multipliers are kept busy all the time and the execution time of the adders is hidden in parallel with them. Since PA and PD need almost the same amount of multiplication, we assign each of them one MM unit with a multiplier width of 64 bits. We then encapsulate PA and PD into two synchronous modules that share a single modular addition/subtraction unit for all operations other than MM. Based on these optimizations, our PA and PD units meet the design targets.

3.4 SM2 Architecture

Table 4. Performance comparison


The whole architecture is composed of three modules. The storage module is made up of a register file and storage logic; it handles the storage and transfer of all data in the ECC algorithm. The PA and PD units, together with the modular addition/subtraction unit they share, form the arithmetic module. The main control module serves as the commander, directing the other two to execute PM efficiently.

4 Comparison and Conclusion

This architecture has been verified in Verilog HDL and evaluated with a 0.13 µm CMOS standard cell library. The results are shown in Table 4, compared with previously published results over 256-bit prime fields. Although an absolutely fair comparison cannot be promised because of the different design backgrounds, the area-time (AT) product provides the most objective basis for assessment. In this comparison, our architecture offers the best AT product.
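A few derived figures help to put the headline numbers in perspective. The short calculation below uses only the values reported in this paper (208 µs, 6.8 µJ, 228 MHz, 156 k gates). The variable names and the chosen unit for the AT product (k gates × ms) are our own, and the per-bit figure is an average that also includes the final inversion and coordinate conversion.

```python
T_PM_US = 208        # time per point multiplication, microseconds
F_MHZ = 228          # clock frequency, MHz
E_PM_UJ = 6.8        # energy per point multiplication, microjoules
AREA_KGATES = 156    # area, thousand gates
SCALAR_BITS = 256

cycles_per_pm = T_PM_US * F_MHZ              # microseconds * MHz = cycles
cycles_per_bit = cycles_per_pm / SCALAR_BITS
avg_power_mw = E_PM_UJ / T_PM_US * 1000      # uJ / us = W, converted to mW
at_product = AREA_KGATES * T_PM_US / 1000    # k gates * ms

print(cycles_per_pm, cycles_per_bit)
print(round(avg_power_mw, 2), at_product)
```

Under these figures one point multiplication takes 47,424 clock cycles, or roughly 185 cycles per scalar bit on average.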

In summary, this paper proposes a high-performance ASIC implementation of point multiplication for SM2. At the field operation level, modular addition/subtraction and inversion are designed as efficient modules that fit the whole architecture well. For the most important operation, modular multiplication, we adopt the Fast Reduction scheme and obtain a configurable modular multiplier with a pipelined architecture. At the point arithmetic level, the execution order of PA and PD is carefully planned, yielding very high hardware efficiency. At the algorithm level, the MPL algorithm with computation simplification brings both efficiency and security. Synthesis results show that this processor needs only 208 µs and 6.8 µJ for a 256-bit point multiplication at 228 MHz, and that it can effectively resist SPA. Compared with related works, this architecture offers both a superior area-time product and strong security. However, this performance comes at the cost of flexibility. In future work, we will focus on extending the isochronous and configurable ECC architecture, aiming for more instructive and flexible implementations.

References

1. State Cryptography Administration of China. Public Key Cryptographic Algorithm SM2 Based on Elliptic Curves (2010). http://www.oscca.gov.cn/News/201012/News_1198.htm
2. Lee, J.-W., Chung, S.-C., Chang, H.-C., Lee, C.-Y.: Efficient power-analysis-resistant dual-field elliptic curve cryptographic processor using heterogeneous dual-processing-element architecture. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 22(1), 49–61 (2014)
3. Fan, J., Guo, X., De Mulder, E., Schaumont, P., Preneel, B., Verbauwhede, I.: State-of-the-art of secure ECC implementations: a survey on known side-channel attacks and countermeasures. In: IEEE International Symposium on Hardware-Oriented Security and Trust (HOST), pp. 76–87. IEEE (2010)
4. Brier, E., Joye, M.: Weierstraß elliptic curves and side-channel attacks. In: Naccache, D., Paillier, P. (eds.) PKC 2002. LNCS, vol. 2274, pp. 335–345. Springer, Heidelberg (2002)
5. Montgomery, P.L.: Speeding the Pollard and elliptic curve methods of factorization. Math. Comput. 48(177), 243–264 (1987)
6. Hankerson, D., Vanstone, S., Menezes, A.J.: Guide to Elliptic Curve Cryptography. Springer Professional Computing. Springer, New York (2004)
7. Chen, Y.-L., Lee, J.-W., Liu, P.-C., Chang, H.-C., Lee, C.-Y.: A dual-field elliptic curve cryptographic processor with a radix-4 unified division unit. In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 713–716. IEEE (2011)
8. Chung, S.-C., Lee, J.-W., Chang, H.-C., Lee, C.-Y.: A high-performance elliptic curve cryptographic processor over GF(p) with SPA resistance. In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1456–1459. IEEE (2012)
9. Satoh, A., Takano, K.: A scalable dual-field elliptic curve cryptographic processor. IEEE Trans. Comput. 52(4), 449–460 (2003)
10. Chen, G., Bai, G., Chen, H.: A high-performance elliptic curve cryptographic processor for general curves over GF(p) based on a systolic arithmetic unit. IEEE Trans. Circuits Syst. II Express Briefs 54(5), 412–416 (2007)


Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grants U1135004 and 61472208) and National Key Basic Research Program of China (Grant 2013CB338004).

Author information

Authors and Affiliations

Institute of Microelectronics, Tsinghua University, Beijing, China
Dan Zhang & Guoqiang Bai
Tsinghua National Laboratory for Information Science and Technology, Beijing, China
Guoqiang Bai


Corresponding author

Correspondence to Guoqiang Bai.

Editor information

Editors and Affiliations

Institute of Information Engineering, Chinese Academy of Science, Beijing, China
Sihan Qing
Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan
Eiji Okamoto
School of Computing, KAIST, Daejeon, Korea (Republic of)
Kwangjo Kim
Westone Corporation, Beijing, China
Dongmei Liu


Cite this paper

Zhang, D., Bai, G. (2016). Ultra High-Performance ASIC Implementation of SM2 with SPA Resistance. In: Qing, S., Okamoto, E., Kim, K., Liu, D. (eds) Information and Communications Security. ICICS 2015. Lecture Notes in Computer Science, vol 9543. Springer, Cham. https://doi.org/10.1007/978-3-319-29814-6_17


DOI: https://doi.org/10.1007/978-3-319-29814-6_17
Published: 05 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29813-9
Online ISBN: 978-3-319-29814-6
eBook Packages: Computer Science, Computer Science (R0)
