Skip to main content

Montgomery Modular Multiplication on ARM-NEON Revisited

  • Conference paper
  • First Online:
Information Security and Cryptology - ICISC 2014 (ICISC 2014)

Part of the book series: Lecture Notes in Computer Science ((LNSC,volume 8949))

Included in the following conference series:

  • 1000 Accesses

Abstract

Montgomery modular multiplication constitutes the “arithmetic foundation” of modern public-key cryptography with applications ranging from RSA, DSA and Diffie-Hellman over elliptic curve schemes to pairing-based cryptosystems. The increased prevalence of SIMD-type instructions in commodity processors (e.g. Intel SSE, ARM NEON) has initiated a massive body of research on vector-parallel implementations of Montgomery modular multiplication. In this paper, we introduce the Cascade Operand Scanning (COS) method to speed up multi-precision multiplication on SIMD architectures. We developed the COS technique with the goal of reducing Read-After-Write (RAW) dependencies in the propagation of carries, which also reduces the number of pipeline stalls (i.e. bubbles). The COS method operates on 32-bit words in a row-wise fashion (similar to the operand-scanning method) and does not require a “non-canonical” representation of operands with a reduced radix. We show that two COS computations can be “coarsely” integrated into an efficient vectorized variant of Montgomery multiplication, which we call Coarsely Integrated Cascade Operand Scanning (CICOS) method. Due to our sophisticated instruction scheduling, the CICOS method reaches record-setting execution times for Montgomery modular multiplication on ARM-NEON platforms. Detailed benchmarking results obtained on an ARM Cortex-A9 and Cortex-A15 processors show that the proposed CICOS method outperforms Bos et al’s implementation from SAC 2013 by up to 57 % (A9) and 40 % (A15), respectively.

This work was supported by the ICT R&D program of MSIP/IITP. [10043907, Development of high performance IoT device and Open Platform with Intelligent Software].

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

Notes

  1. 1.

    Note that the timings in the proceedings version of Bos et al’s paper differ from the version in the IACR eprint archive at https://eprint.iacr.org/2013/519. We used the faster timings from the eprint version for comparison with our work.

  2. 2.

    Operands \(A[0 \sim 7]\) and \(B[0 \sim 7]\) are stored in 32-bit registers. Intermediate results \(C[0 \sim 15]\) are stored in 64-bit registers. We use two packed 32-bit registers in the 64-bit register.

  3. 3.

    In the first round, the range is within [0, 0x1_ffff_fffd], because higher bits and lower bits of intermediate results \((C[0 \sim 7])\) are located in range of [0, 0xffff_fffe] and [0, 0xffff_ffff], respectively. From second round, the addition of higher and lower bits are located within [0, 0x1_ffff_fffe], because both higher and lower bits are located in range of [0, 0xffff_ffff].

  4. 4.

    In the first round, intermediate results (\(C[0\sim 7]\)) are in range of [0, 0x1_ffff_fffd] so multiplication and accumulation results are in range of [0, 0xffff_ffff_ffff_fffe]. From second round, the intermediate results are located in [0, 0x1_ffff_fffe] so multiplication and accumulation results are in range of [0, 0xffff_ffff_ffff_ffff].

  5. 5.

    NEON engine supports sixteen 128-bit registers. We assigned four registers for operands (\(A, B\)), four for intermediate results (\(C\)) and four for temporal storages.

  6. 6.

    Operands \(A[0 \sim 7]\), \(B[0 \sim 7]\), \(M[0 \sim 7]\), \(Q[0 \sim 7]\) and \(M'\) are stored in 32-bit registers. Intermediate results \(C[0 \sim 15]\) are stored in 64-bit registers.

  7. 7.

    In the first round, the range is within [0, 0x1_ffff_fffd], because higher bits and lower bits of intermediate results \((C[0 \sim 7])\) are located in range of [0, 0xffff_fffe] and [0, 0xffff_ffff], respectively. From second round, the addition of higher and lower bits are located within [0, 0x1_ffff_fffe], because both higher and lower bits are located in range of [0, 0xffff_ffff].

References

  1. Barrett, P.: Implementing the rivest shamir and adleman public key encryption algorithm on a standard digital signal processor. In: Odlyzko, A.M. (ed.) CRYPTO 1986. LNCS, vol. 263, pp. 311–323. Springer, Heidelberg (1987)

    Chapter  Google Scholar 

  2. Bernstein, D.J., Schwabe, P.: NEON crypto. In: Prouff, E., Schaumont, P. (eds.) CHES 2012. LNCS, vol. 7428, pp. 320–339. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  3. Lin, B.: Solving sequential problems in parallel: An SIMD solution to RSA cryptography, Feb 2006. http://cache.freescale.com/files/32bit/doc/app_note/AN3057.pdf

  4. Bos, J.W., Kaihara, M.E.: montgomery multiplication on the cell. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds.) PPAM 2009, Part I. LNCS, vol. 6067, pp. 477–485. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  5. Bos, J.W., Montgomery, P.L., Shumow, D., Zaverucha, G.M.: Montgomery multiplication using vector instructions. In: Lange, T., Lauter, K., Lisoněk, P. (eds.) SAC 2013. LNCS, vol. 8282, pp. 471–490. Springer, Heidelberg (2014)

    Chapter  Google Scholar 

  6. Câmara, D., Gouvêa, C.P.L., López, J., Dahab, R.: Fast software polynomial multiplication on ARM processors using the NEON engine. In: Cuzzocrea, A., Kittl, C., Simos, D.E., Weippl, E., Xu, L. (eds.) CD-ARES Workshops 2013. LNCS, vol. 8128, pp. 137–154. Springer, Heidelberg (2013)

    Chapter  Google Scholar 

  7. Faz-Hernández, A., Longa, P., Sánchez, A.H.: Efficient and secure algorithms for GLV-based scalar multiplication and their implementation on GLV-GLS curves. In: Benaloh, J. (ed.) CT-RSA 2014. LNCS, vol. 8366, pp. 1–27. Springer, Heidelberg (2014)

    Chapter  Google Scholar 

  8. Gueron, S., Krasnov, V.: Software implementation of modular exponentiation, using advanced vector instructions architectures. In: Özbudak, F., Rodríguez-Henríquez, F. (eds.) WAIFI 2012. LNCS, vol. 7369, pp. 119–135. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  9. Intel Corporation: Using streaming SIMD extensions (SSE2) to perform big multiplications. Application note AP-941, July 2000. http://software.intel.com/sites/default/files/14/4f/24960

  10. Montgomery, P.L.: Modular multiplication without trial division. Math. Comput. 44(170), 519–521 (1985)

    Article  MATH  Google Scholar 

  11. Pabbuleti, K.C., Mane, D.H., Desai, A., Albert, C., Schaumont, P.: Simd acceleration of modular arithmetic on contemporary embedded platforms. In: 2013 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6. IEEE (2013)

    Google Scholar 

  12. Quisquater, J.-J.: Procédé de codage selon la méthode dite rsa, par un microcontrôleur et dispositifs utilisant ce procédé. Demande de brevet français. (Dépôt numéro: 90 02274), 122 (1990)

    Google Scholar 

  13. Quisquater, J.-J.: Encoding system according to the so-called rsa method, by means of a microcontroller and arrangement implementing this system, 24 November 1992. US Patent 5,166,978

    Google Scholar 

  14. Sánchez, A.H., Rodríguez-Henríquez, F.: NEON implementation of an attribute-based encryption scheme. In: Jacobson, M., Locasto, M., Mohassel, P., Safavi-Naini, R. (eds.) ACNS 2013. LNCS, vol. 7954, pp. 322–338. Springer, Heidelberg (2013)

    Chapter  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Howon Kim .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Seo, H., Liu, Z., Großschädl, J., Choi, J., Kim, H. (2015). Montgomery Modular Multiplication on ARM-NEON Revisited. In: Lee, J., Kim, J. (eds) Information Security and Cryptology - ICISC 2014. ICISC 2014. Lecture Notes in Computer Science(), vol 8949. Springer, Cham. https://doi.org/10.1007/978-3-319-15943-0_20

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-15943-0_20

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-15942-3

  • Online ISBN: 978-3-319-15943-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics