
Efficient Multiplication of Somewhat Small Integers Using Number-Theoretic Transforms

  • Conference paper
Advances in Information and Computer Security (IWSEC 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13504))


Abstract

Conventional wisdom purports that FFT-based integer multiplication methods (such as the Schönhage–Strassen algorithm) begin to compete with Karatsuba and Toom–Cook only for integers of several tens of thousands of bits. In this work, we challenge this belief, leveraging recent advances in the implementation of number-theoretic transforms (NTT) stimulated by their use in post-quantum cryptography. We report on implementations of NTT-based integer arithmetic on two Arm Cortex-M CPUs on opposite ends of the performance spectrum: Cortex-M3 and Cortex-M55. Our results indicate that NTT-based multiplication is capable of outperforming the big-number arithmetic implementations of popular embedded cryptography libraries for integers as small as 2048 bits. To provide a realistic case study, we benchmark implementations of the RSA encryption and decryption operations. Our cycle counts on Cortex-M55 are about \(10\times \) lower than on Cortex-M3.

Notes

  1. https://gmplib.org/manual/FFT-Multiplication.

  2. The layers are merged as \(4+3\) resp. \(4+2+2\) in the forward NTTs, exploiting that the upper half of the input coefficients are zero, and \(3+2+2\) resp. \(3+3+2\) in the inverse NTTs. Register pressure prohibits more aggressive merging.

  3. https://github.com/libopencm3/libopencm3.

  4. https://github.com/mupq/pqm3.

  5. https://developer.arm.com/tools-and-software/development-boards/fpga-prototyping-boards/download-fpga-images.

  6. See https://github.com/ARMmbed/mbedtls/issues/5666 and https://github.com/ARMmbed/mbedtls/issues/5360.

References

  1. Agarwal, R.C., Burrus, C.S.: Fast convolution using Fermat number transforms with applications to digital filtering. IEEE Trans. Signal Process. 22(2), 87–97 (1974)

  2. Abdulrahman, A., et al.: Multi-moduli NTTs for Saber on Cortex-M3 and Cortex-M4. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022(1), 127–151 (2022)

  3. Pornin, T.: BearSSL: a smaller TLS/SSL library

  4. Becker, H., et al.: Neon NTT: faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022(1), 221–244 (2022)

  5. Becker, H., et al.: Polynomial multiplication on embedded vector architectures. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022(1), 482–505 (2022)

  6. Chung, C.-M.M., et al.: NTT multiplication for NTT-unfriendly rings: new speed records for Saber and NTRU on Cortex-M4 and AVX2. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021(2), 159–188 (2021)

  7. Fürer, M.: Faster integer multiplication. SIAM J. Comput. 39(3), 979–1005 (2009)

  8. García, L.C.C.: Can Schönhage multiplication speed up the RSA decryption or encryption? In: MoraviaCrypt 2007 (2007)

  9. Gaudry, P., Kruppa, A., Zimmermann, P.: A GMP-based implementation of Schönhage-Strassen’s large integer multiplication algorithm. In: ISSAC 2007, pp. 167–174. ACM (2007)

  10. Free Software Foundation: The GNU Multiple Precision Arithmetic Library

  11. Harvey, D., van der Hoeven, J.: Integer multiplication in time \(O(n \log n)\). Ann. Math. 193(2), 563–617 (2021)

  12. Koc, C.K., Acar, T., Kaliski, B.S.: Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro 16(3), 26–33 (1996)

  13. Karatsuba, A., Ofman, Y.: Multiplication of multidigit numbers on automata. Soviet Phys. Doklady 7, 595–596 (1963). Translated from Doklady Akademii Nauk SSSR, vol. 145, no. 2, pp. 293–294, July 1962

  14. Arm Ltd.: Mbed TLS

  15. Montgomery, P.L.: Modular multiplication without trial division. Math. Comput. 44(170), 519–521 (1985)

  16. Pollard, J.M.: The fast Fourier transform in a finite field. Math. Comput. 25, 365–374 (1971)

  17. Rivest, R., Shamir, A., Adleman, L.: A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM 21(2), 120–126 (1978)

  18. Schönhage, A., Strassen, V.: Schnelle Multiplikation großer Zahlen. Computing 7(3–4), 281–292 (1971)

  19. Toom, A.L.: The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Math. Doklady 3, 714–716 (1963)

Author information

Corresponding authors

Correspondence to Hanno Becker, Vincent Hwang, Matthias J. Kannwischer, Lorenz Panny or Bo-Yin Yang.

Appendices

A Reduction Algorithms for Cortex-M3 and Cortex-M55

figure d
figure e

B On Precomputing the Montgomery Constant

Montgomery multiplication (see Sect. 2.4) requires the precomputation of \(q^{-1} \bmod {\texttt {R}}\). When implementing RSA via “large” Montgomery multiplication, rather than a FIOS approach, this means that we need to precompute \(n^{-1} \bmod {\texttt {R}}\) for encryption and \(p^{-1} \bmod {\texttt {R}}\) and \(q^{-1} \bmod {\texttt {R}}\) for decryption. For decryption, both inverses can be computed during key generation and stored as part of the secret key. For encryption, however, \(n^{-1} \bmod {\texttt {R}}\) must be computed online.
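To illustrate where the precomputed inverse enters, the following is a minimal Python sketch of a textbook Montgomery reduction with \(\texttt{R} = 2^{r}\). It is not the paper's implementation: the function names, the parameter `r_bits`, and the use of Python's built-in `pow(n, -1, R)` as a stand-in for the precomputation are assumptions made for illustration only.

```python
def montgomery_reduce(t: int, n: int, n_inv: int, r_bits: int) -> int:
    """Return t * R^{-1} mod n for R = 2**r_bits.

    Assumes n is odd, 0 <= t < n * R, and n_inv == n^{-1} mod R
    (the precomputed Montgomery constant).
    """
    mask = (1 << r_bits) - 1
    m = ((t & mask) * n_inv) & mask   # m = t * n^{-1} mod R
    u = (t - m * n) >> r_bits         # exact: t - m*n is divisible by R
    return u + n if u < 0 else u      # bring the result into [0, n)


def montgomery_mul(a: int, b: int, n: int, n_inv: int, r_bits: int) -> int:
    """Multiply two values given in Montgomery form (a*R mod n, b*R mod n)."""
    return montgomery_reduce(a * b, n, n_inv, r_bits)


# Illustrative usage with a small odd modulus (hypothetical values).
n, r_bits = 1000003, 32
R = 1 << r_bits
n_inv = pow(n, -1, R)                 # stand-in for the precomputation
a_m, b_m = (123456 * R) % n, (654321 * R) % n
product = montgomery_reduce(montgomery_mul(a_m, b_m, n, n_inv, r_bits),
                            n, n_inv, r_bits)
```

The constant depends only on the modulus, which is why it can be stored with the secret key for decryption but has to be derived from the public modulus at encryption time.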

Modular inversion \(x^{-1}\bmod {2^r}\) for odd x can be performed using “Hensel lifting”: If \(xy-1= 2^ka\), so that y is an inverse of x modulo \(2^k\), then \(y^{\prime }=2y-xy^2\) satisfies \(xy^{\prime }-1 = -(xy-1)^2 = -2^{2k}a^2\), and hence \(y^{\prime }\) is an inverse of x modulo \(2^{2k}\). Starting from \(y=1\), which inverts x modulo 2, this yields \(x^{-1}\bmod 2^r\) after \(\mathcal {O}(\log r)\) iterations. One may observe that this is the sequence of approximate solutions to \(xy=1\) for y obtained via the Newton–Raphson method in the 2-adic integers.
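The lifting step can be sketched in a few lines of Python. This is an illustrative, variable-time sketch using arbitrary-precision integers, not the Cortex-M code evaluated in the paper; the function name `inv_mod_pow2` is hypothetical.

```python
def inv_mod_pow2(x: int, r: int) -> int:
    """Return x^{-1} mod 2**r for odd x via Hensel (Newton) lifting."""
    if x % 2 == 0:
        raise ValueError("x must be odd to be invertible modulo a power of two")
    if r < 1:
        raise ValueError("r must be at least 1")
    y = 1          # y inverts x modulo 2**1, since x is odd
    k = 1          # current precision: x*y == 1 (mod 2**k)
    while k < r:
        k = min(2 * k, r)              # each step doubles the precision
        mask = (1 << k) - 1
        y = (y * (2 - x * y)) & mask   # y' = 2y - x*y^2, reduced mod 2**k
    return y
```

Because the precision doubles in every iteration, the loop runs about \(\log_2 r\) times, and the early iterations only need to operate on short operands.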

We prototyped Hensel lifting to assess its cost relative to the modular exponentiation; we did not seek a fully optimized version. On the Cortex-M3, we implement both a variable-time variant using umlal for encryption and a constant-time variant using mla for key generation. For the Cortex-M55, we achieve the best performance using umaal. We list the performance in Table 4. Even a basic implementation adds only a small overhead compared to an exponentiation (e.g., \(<5\%\) for RSA-4096).

Table 4. Performance of Hensel lifting; numbers for RSA-4096 in bold.

C Table Lookup

figure f

D Pipeline Efficiency of Cortex-M55 Implementation

Table 5 shows Performance Monitoring Unit (PMU) statistics for the subroutines of our Cortex-M55 modular exponentiation (\(N=4096\)). We use \(\texttt {ARM\_PMU\_CYCCNT}\), \(\texttt {ARM\_PMU\_INST\_RETIRED}\), \(\texttt {ARM\_PMU\_MVE\_INST\_RETIRED}\), and \(\texttt {ARM\_PMU\_MVE\_STALL}\) for counting cycles, retired instructions, retired MVE instructions, and MVE instructions causing a stall, respectively. We derive the rate of Instructions per Cycle (IPC), as well as \(\texttt {ARM\_PMU\_MVE\_INST\_RETIRED}/\texttt {ARM\_PMU\_MVE\_STALL}\) as a measure of the MVE overlapping efficiency. Despite most MVE instructions running for 2 cycles, instruction overlapping allows achieving an IPC \(>0.9\).
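The two derived metrics are simple ratios of the raw counters. The sketch below shows the arithmetic only; the class name `PmuCounters` and the sample values are hypothetical placeholders and are not measurements from Table 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PmuCounters:
    """Raw PMU counts for one subroutine (field names are illustrative)."""
    cycles: int             # ARM_PMU_CYCCNT
    inst_retired: int       # ARM_PMU_INST_RETIRED
    mve_inst_retired: int   # ARM_PMU_MVE_INST_RETIRED
    mve_stall: int          # ARM_PMU_MVE_STALL

    @property
    def ipc(self) -> float:
        """Instructions per cycle."""
        return self.inst_retired / self.cycles

    @property
    def mve_overlap_ratio(self) -> float:
        """Retired MVE instructions per stalling MVE instruction."""
        return self.mve_inst_retired / self.mve_stall


# Made-up counter values, for illustration only.
sample = PmuCounters(cycles=1000, inst_retired=920,
                     mve_inst_retired=600, mve_stall=50)
ipc, overlap = sample.ipc, sample.mve_overlap_ratio
```

A higher overlap ratio means fewer MVE instructions stall relative to those retired, which is how an IPC close to 1 remains reachable even though most MVE instructions occupy two cycles.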

Table 5. Performance Monitoring Unit statistics for Cortex-M55 implementation.

E High-level Multiplication Structure

See Fig. 2.

Fig. 2. High-level structure of our integer multiplication algorithm. Finely dotted arrows denote a conceptual reinterpretation with no change in representation. Dashed arrows denote a canonical choice of lift, e.g., a representative of minimal degree for polynomials or a smallest non-negative representative for integers.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Becker, H., Hwang, V., Kannwischer, M.J., Panny, L., Yang, BY. (2022). Efficient Multiplication of Somewhat Small Integers Using Number-Theoretic Transforms. In: Cheng, CM., Akiyama, M. (eds) Advances in Information and Computer Security. IWSEC 2022. Lecture Notes in Computer Science, vol 13504. Springer, Cham. https://doi.org/10.1007/978-3-031-15255-9_1

  • DOI: https://doi.org/10.1007/978-3-031-15255-9_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15254-2

  • Online ISBN: 978-3-031-15255-9

  • eBook Packages: Computer Science, Computer Science (R0)
