Montgomery Modular Multiplication on ARM-NEON Revisited

Seo, Hwajeong; Liu, Zhe; Großschädl, Johann; Choi, Jongseok; Kim, Howon

doi:10.1007/978-3-319-15943-0_20

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 8949)

Included in the following conference series:

International Conference on Information Security and Cryptology

Abstract

Montgomery modular multiplication constitutes the “arithmetic foundation” of modern public-key cryptography with applications ranging from RSA, DSA and Diffie-Hellman over elliptic curve schemes to pairing-based cryptosystems. The increased prevalence of SIMD-type instructions in commodity processors (e.g. Intel SSE, ARM NEON) has initiated a massive body of research on vector-parallel implementations of Montgomery modular multiplication. In this paper, we introduce the Cascade Operand Scanning (COS) method to speed up multi-precision multiplication on SIMD architectures. We developed the COS technique with the goal of reducing Read-After-Write (RAW) dependencies in the propagation of carries, which also reduces the number of pipeline stalls (i.e. bubbles). The COS method operates on 32-bit words in a row-wise fashion (similar to the operand-scanning method) and does not require a “non-canonical” representation of operands with a reduced radix. We show that two COS computations can be “coarsely” integrated into an efficient vectorized variant of Montgomery multiplication, which we call the Coarsely Integrated Cascade Operand Scanning (CICOS) method. Due to our sophisticated instruction scheduling, the CICOS method reaches record-setting execution times for Montgomery modular multiplication on ARM-NEON platforms. Detailed benchmarking results obtained on ARM Cortex-A9 and Cortex-A15 processors show that the proposed CICOS method outperforms Bos et al.’s implementation from SAC 2013 by up to 57 % (A9) and 40 % (A15), respectively.
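For readers unfamiliar with the row-wise structure mentioned in the abstract, the following is a minimal scalar sketch of the classical Coarsely Integrated Operand Scanning (CIOS) form of Montgomery multiplication on 32-bit words. It is not the paper's COS or CICOS method and contains no NEON vectorization or instruction scheduling; it only illustrates how one multiplication row and one reduction row alternate per word of the second operand. The function names (`to_words`, `from_words`, `mont_mul_cios`) are hypothetical, and the modular inverse via `pow(m, -1, ...)` assumes Python 3.8 or later.

```python
W = 32                    # word size in bits
MASK = (1 << W) - 1


def to_words(x, s):
    """Split integer x into s little-endian 32-bit words."""
    return [(x >> (W * i)) & MASK for i in range(s)]


def from_words(words):
    """Recombine little-endian words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(words))


def mont_mul_cios(a, b, m, s):
    """Return a*b*R^-1 mod m with R = 2^(32*s); m must be odd, a, b < m."""
    A, B, M = to_words(a, s), to_words(b, s), to_words(m, s)
    m_prime = (-pow(m, -1, 1 << W)) & MASK   # -m^-1 mod 2^32
    t = [0] * (s + 2)
    for i in range(s):
        # Multiplication row: t += A * B[i], carries propagated word by word.
        carry = 0
        for j in range(s):
            acc = t[j] + A[j] * B[i] + carry
            t[j] = acc & MASK
            carry = acc >> W
        acc = t[s] + carry
        t[s] = acc & MASK
        t[s + 1] = acc >> W
        # Reduction row: add q*M so the lowest word becomes zero, then shift
        # the accumulator down by one word.
        q = (t[0] * m_prime) & MASK
        carry = (t[0] + q * M[0]) >> W
        for j in range(1, s):
            acc = t[j] + q * M[j] + carry
            t[j - 1] = acc & MASK
            carry = acc >> W
        acc = t[s] + carry
        t[s - 1] = acc & MASK
        t[s] = t[s + 1] + (acc >> W)
    r = from_words(t[:s + 1])
    return r - m if r >= m else r            # final conditional subtraction


if __name__ == "__main__":
    m = (1 << 255) - 19                      # an odd 256-bit modulus (8 words)
    a, b = 3 ** 150 % m, 7 ** 90 % m
    print(hex(mont_mul_cios(a, b, m, 8)))
```

The carry chain inside each row is the kind of Read-After-Write dependency that the abstract says the COS method is designed to reduce; in this scalar sketch every inner-loop step depends on the carry of the previous one.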

This work was supported by the ICT R&D program of MSIP/IITP [10043907, Development of high performance IoT device and Open Platform with Intelligent Software].

Notes

1. Note that the timings in the proceedings version of Bos et al.’s paper differ from those in the version in the IACR ePrint archive at https://eprint.iacr.org/2013/519. We used the faster timings from the ePrint version for comparison with our work.
2. Operands $A[0 \sim 7]$ and $B[0 \sim 7]$ are stored in 32-bit registers. Intermediate results $C[0 \sim 15]$ are stored in 64-bit registers. Each 64-bit register is used as two packed 32-bit registers.
3. In the first round, the range is within [0, 0x1_ffff_fffd], because the higher and lower bits of the intermediate results ($C[0 \sim 7]$) lie in the ranges [0, 0xffff_fffe] and [0, 0xffff_ffff], respectively. From the second round on, the sum of the higher and lower bits lies within [0, 0x1_ffff_fffe], because both the higher and lower bits lie in the range [0, 0xffff_ffff].
4. In the first round, the intermediate results ($C[0 \sim 7]$) are in the range [0, 0x1_ffff_fffd], so the multiplication and accumulation results are in the range [0, 0xffff_ffff_ffff_fffe]. From the second round on, the intermediate results lie in [0, 0x1_ffff_fffe], so the multiplication and accumulation results are in the range [0, 0xffff_ffff_ffff_ffff].
5. The NEON engine supports sixteen 128-bit registers. We assigned four registers to the operands ($A, B$), four to the intermediate results ($C$) and four to temporary storage.
6. Operands $A[0 \sim 7]$, $B[0 \sim 7]$, $M[0 \sim 7]$, $Q[0 \sim 7]$ and $M'$ are stored in 32-bit registers. Intermediate results $C[0 \sim 15]$ are stored in 64-bit registers.
7. In the first round, the range is within [0, 0x1_ffff_fffd], because the higher and lower bits of the intermediate results ($C[0 \sim 7]$) lie in the ranges [0, 0xffff_fffe] and [0, 0xffff_ffff], respectively. From the second round on, the sum of the higher and lower bits lies within [0, 0x1_ffff_fffe], because both the higher and lower bits lie in the range [0, 0xffff_ffff].
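The bounds quoted in notes 3, 4 and 7 can be checked with a few lines of integer arithmetic. The sketch below assumes that "multiplication and accumulation" means one 32-bit by 32-bit product added to one intermediate value; that reading, and the helper names `sum_bound` and `mac_bound`, are assumptions made for illustration rather than something the notes state explicitly.

```python
WORD_MAX = 0xffff_ffff  # largest 32-bit word


def sum_bound(high_max, low_max):
    """Upper bound on the sum of the higher and lower 32-bit halves."""
    return high_max + low_max


def mac_bound(c_max):
    """Upper bound on a 32x32-bit product plus an intermediate value c."""
    return WORD_MAX * WORD_MAX + c_max


# First round: higher half at most 0xffff_fffe, lower half at most 0xffff_ffff.
first_c = sum_bound(0xffff_fffe, WORD_MAX)
# Later rounds: both halves at most 0xffff_ffff.
later_c = sum_bound(WORD_MAX, WORD_MAX)

if __name__ == "__main__":
    print(hex(first_c), hex(mac_bound(first_c)))
    print(hex(later_c), hex(mac_bound(later_c)))
```

Under that assumption the later-round bound equals $2^{64}-1$ exactly, so the accumulated value still fits in a 64-bit lane without overflow.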

References

Barrett, P.: Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In: Odlyzko, A.M. (ed.) CRYPTO 1986. LNCS, vol. 263, pp. 311–323. Springer, Heidelberg (1987)
Bernstein, D.J., Schwabe, P.: NEON crypto. In: Prouff, E., Schaumont, P. (eds.) CHES 2012. LNCS, vol. 7428, pp. 320–339. Springer, Heidelberg (2012)
Lin, B.: Solving sequential problems in parallel: An SIMD solution to RSA cryptography, Feb 2006. http://cache.freescale.com/files/32bit/doc/app_note/AN3057.pdf
Bos, J.W., Kaihara, M.E.: Montgomery multiplication on the Cell. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds.) PPAM 2009, Part I. LNCS, vol. 6067, pp. 477–485. Springer, Heidelberg (2010)
Bos, J.W., Montgomery, P.L., Shumow, D., Zaverucha, G.M.: Montgomery multiplication using vector instructions. In: Lange, T., Lauter, K., Lisoněk, P. (eds.) SAC 2013. LNCS, vol. 8282, pp. 471–490. Springer, Heidelberg (2014)
Câmara, D., Gouvêa, C.P.L., López, J., Dahab, R.: Fast software polynomial multiplication on ARM processors using the NEON engine. In: Cuzzocrea, A., Kittl, C., Simos, D.E., Weippl, E., Xu, L. (eds.) CD-ARES Workshops 2013. LNCS, vol. 8128, pp. 137–154. Springer, Heidelberg (2013)
Faz-Hernández, A., Longa, P., Sánchez, A.H.: Efficient and secure algorithms for GLV-based scalar multiplication and their implementation on GLV-GLS curves. In: Benaloh, J. (ed.) CT-RSA 2014. LNCS, vol. 8366, pp. 1–27. Springer, Heidelberg (2014)
Gueron, S., Krasnov, V.: Software implementation of modular exponentiation, using advanced vector instructions architectures. In: Özbudak, F., Rodríguez-Henríquez, F. (eds.) WAIFI 2012. LNCS, vol. 7369, pp. 119–135. Springer, Heidelberg (2012)
Intel Corporation: Using streaming SIMD extensions (SSE2) to perform big multiplications. Application note AP-941, July 2000. http://software.intel.com/sites/default/files/14/4f/24960
Montgomery, P.L.: Modular multiplication without trial division. Math. Comput. 44(170), 519–521 (1985)
Pabbuleti, K.C., Mane, D.H., Desai, A., Albert, C., Schaumont, P.: SIMD acceleration of modular arithmetic on contemporary embedded platforms. In: 2013 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6. IEEE (2013)
Quisquater, J.-J.: Procédé de codage selon la méthode dite RSA, par un microcontrôleur et dispositifs utilisant ce procédé. Demande de brevet français. (Dépôt numéro: 90 02274), 122 (1990)
Quisquater, J.-J.: Encoding system according to the so-called RSA method, by means of a microcontroller and arrangement implementing this system, 24 November 1992. US Patent 5,166,978
Sánchez, A.H., Rodríguez-Henríquez, F.: NEON implementation of an attribute-based encryption scheme. In: Jacobson, M., Locasto, M., Mohassel, P., Safavi-Naini, R. (eds.) ACNS 2013. LNCS, vol. 7954, pp. 322–338. Springer, Heidelberg (2013)

Author information

Authors and Affiliations

School of Computer Science and Engineering, Pusan National University, San-30, Jangjeon-Dong, Geumjeong-gu, Busan, 609-735, Republic of Korea
Hwajeong Seo, Jongseok Choi & Howon Kim
Laboratory of Algorithmics, Cryptology and Security (LACS), University of Luxembourg, 6, rue R. Kirchberg, 1359, Luxembourg-Kirchberg, Luxembourg
Zhe Liu & Johann Großschädl

Corresponding author

Correspondence to Howon Kim.

Editor information

Editors and Affiliations

Sejong University, Seoul, Korea, Republic of (South Korea)
Jooyoung Lee
Kookmin University, Seoul, Korea, Republic of (South Korea)
Jongsung Kim

About this paper

Cite this paper

Seo, H., Liu, Z., Großschädl, J., Choi, J., Kim, H. (2015). Montgomery Modular Multiplication on ARM-NEON Revisited. In: Lee, J., Kim, J. (eds) Information Security and Cryptology - ICISC 2014. ICISC 2014. Lecture Notes in Computer Science, vol 8949. Springer, Cham. https://doi.org/10.1007/978-3-319-15943-0_20

DOI: https://doi.org/10.1007/978-3-319-15943-0_20
Published: 17 March 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-15942-3
Online ISBN: 978-3-319-15943-0
eBook Packages: Computer Science, Computer Science (R0)