Abstract
Conventional wisdom purports that FFT-based integer multiplication methods (such as the Schönhage–Strassen algorithm) begin to compete with Karatsuba and Toom–Cook only for integers of several tens of thousands of bits. In this work, we challenge this belief, leveraging recent advances in the implementation of number-theoretic transforms (NTT) stimulated by their use in post-quantum cryptography. We report on implementations of NTT-based integer arithmetic on two Arm Cortex-M CPUs on opposite ends of the performance spectrum: Cortex-M3 and Cortex-M55. Our results indicate that NTT-based multiplication is capable of outperforming the big-number arithmetic implementations of popular embedded cryptography libraries for integers as small as 2048 bits. To provide a realistic case study, we benchmark implementations of the RSA encryption and decryption operations. Our cycle counts on Cortex-M55 are about \(10\times \) lower than on Cortex-M3.
Notes
- 2. The layers are merged as \(4+3\) resp. \(4+2+2\) in the forward NTTs, exploiting that the upper half of the input coefficients are zero, and \(3+2+2\) resp. \(3+3+2\) in the inverse NTTs. Register pressure prohibits more aggressive merging.
Appendices
A Reduction Algorithms for Cortex-M3 and Cortex-M55


B On Precomputing the Montgomery Constant
Montgomery multiplication (see Sect. 2.4) requires the precomputation of \(q^{-1} \bmod {\texttt {R}}\). When implementing RSA via “large” Montgomery multiplication, rather than a FIOS approach, this means that we need to precompute \(n^{-1} \bmod {\texttt {R}}\) for encryption and \(p^{-1} \bmod {\texttt {R}}\) and \(q^{-1} \bmod {\texttt {R}}\) for decryption. For decryption, these constants can be computed as a part of key generation and stored as a part of the secret key. For encryption, however, \(n^{-1} \bmod {\texttt {R}}\) needs to be computed online.
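To illustrate where the precomputed constant enters, the following is a minimal Python sketch of a textbook Montgomery multiplication with \(\texttt {R}=2^r\). It is not the implementation benchmarked in this work, and since Sect. 2.4 is not reproduced here, the exact variant is an assumption: we use the convention with the positive inverse \(n^{-1} \bmod {\texttt {R}}\) and a signed intermediate result. The names `mont_setup` and `mont_mul` are hypothetical.

```python
def mont_setup(n, r_bits):
    """Precompute R = 2^r_bits and n^{-1} mod R for an odd modulus n < R."""
    R = 1 << r_bits
    n_inv = pow(n, -1, R)  # modular inverse; requires Python 3.8+
    return R, n_inv


def mont_mul(a, b, n, r_bits, n_inv):
    """Return a * b * R^{-1} mod n, assuming 0 <= a, b < n < R = 2^r_bits."""
    mask = (1 << r_bits) - 1
    t = a * b
    # m = t * n^{-1} mod R, so that t - m * n is divisible by R.
    m = ((t & mask) * n_inv) & mask
    # Exact division by R; the result lies strictly between -n and n.
    u = (t - m * n) >> r_bits
    return u + n if u < 0 else u


if __name__ == "__main__":
    n = (1 << 61) - 1  # an odd example modulus (assumption for illustration)
    R, n_inv = mont_setup(n, 64)
    a, b = 1234567890123456789 % n, 987654321987654321 % n
    c = mont_mul(a, b, n, 64, n_inv)
    print((c * R) % n == (a * b) % n)
```

The only modulus-dependent constant is `n_inv`, which is why it can be stored with the secret key for decryption but has to be derived from the public modulus for encryption.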
Modular inversion \(x^{-1}\bmod {2^r}\) for odd x can be performed using “Hensel lifting”: If \(xy-1= 2^ka\), so that y is an inverse of x modulo \(2^k\), then \(y^{\prime }=2y-xy^2\) satisfies \(xy^{\prime }-1 = -(xy-1)^2 = -2^{2k}a^2\), and hence \(y^{\prime }\) is an inverse of x modulo \(2^{2k}\). Starting from \(y=1\), which is an inverse of any odd x modulo 2, this yields \(x^{-1}\bmod 2^r\) after \(\mathcal {O}(\log r)\) iterations. One may observe that this is the sequence of approximate solutions to \(xy=1\) for y via the Newton–Raphson method in the 2-adic integers.
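The lifting step can be sketched in a few lines of Python. This is only an illustration of the iteration \(y^{\prime }=y(2-xy)\) on arbitrary-precision integers, not the umlal/mla/umaal-based prototypes discussed in this appendix. The function name `inv_mod_pow2` is hypothetical.

```python
def inv_mod_pow2(x, r):
    """Return x^{-1} mod 2^r for odd x via Hensel lifting (Newton iteration)."""
    if x % 2 == 0:
        raise ValueError("x must be odd to be invertible modulo a power of two")
    y = 1  # x * y = 1 (mod 2) holds for every odd x
    k = 1  # invariant: x * y = 1 (mod 2^k)
    while k < r:
        k = min(2 * k, r)  # each step doubles the number of correct bits
        # y' = 2y - x*y^2 = y * (2 - x*y), reduced modulo 2^k.
        # Python's & on a negative int yields the non-negative residue.
        y = (y * (2 - x * y)) & ((1 << k) - 1)
    return y


if __name__ == "__main__":
    x = 0xC0FFEE01  # arbitrary odd example value
    y = inv_mod_pow2(x, 4096)
    print((x * y) % (1 << 4096) == 1)
```

For \(r=4096\), the loop runs 12 times, and the operand size doubles in each iteration, so the final step dominates the cost.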
We prototyped Hensel lifting to assess its cost relative to the modular exponentiation; we did not seek a fully optimized version. On the Cortex-M3, we implement both a variable-time variant using umlal for encryption and a constant-time variant using mla for key generation. For the Cortex-M55, we achieve the best performance using umaal. We list the performance in Table 4. Even a basic implementation adds only a small overhead compared to an exponentiation (e.g., \(<5\%\) for RSA-4096).
C Table Lookup

D Pipeline Efficiency of Cortex-M55 Implementation
Table 5 shows Performance Monitoring Unit (PMU) statistics for the subroutines of our Cortex-M55 modular exponentiation (\(N=4096\)). We use \(\texttt {ARM\_PMU\_CYCCNT}\), \(\texttt {ARM\_PMU\_INST\_RETIRED}\), \(\texttt {ARM\_PMU\_MVE\_INST\_RETIRED}\), and \(\texttt {ARM\_PMU\_MVE\_STALL}\) for counting cycles, retired instructions, retired MVE instructions, and MVE instructions causing a stall, respectively. We derive the rate of Instructions per Cycle (IPC), as well as \(\texttt {ARM\_PMU\_MVE\_INST\_RETIRED}/\texttt {ARM\_PMU\_MVE\_STALL}\) as a measure of the MVE overlapping efficiency. Despite most MVE instructions running for 2 cycles, instruction overlapping allows achieving an IPC \(>0.9\).
E High-level Multiplication Structure
See Fig. 2.
Fig. 2. High-level structure of our integer multiplication algorithm. Finely dotted arrows denote a conceptual reinterpretation with no change in representation. Dashed arrows denote a canonical choice of lift, e.g., a representative of minimal degree for polynomials or a smallest non-negative representative for integers.
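The general shape of such a pipeline (integer to polynomial, polynomial product via NTT, lift, evaluation back to an integer) can be sketched in Python as follows. This is a simplified illustration under our own assumptions and does not reproduce the implementation described in this paper: it uses a single well-known NTT-friendly prime \(998244353 = 119\cdot 2^{23}+1\) with primitive root 3, 8-bit limbs, and a plain recursive radix-2 transform. The names `ntt`, `to_limbs`, and `ntt_mul` are hypothetical.

```python
P = 998244353   # NTT-friendly prime 119 * 2^23 + 1 (assumption for this sketch)
G = 3           # a primitive root modulo P
LIMB_BITS = 8   # limb size chosen so product coefficients stay below P


def ntt(a, invert=False):
    """Recursive radix-2 cyclic NTT over Z_P; len(a) must be a power of two.

    The inverse transform omits the 1/len(a) scaling (applied by the caller).
    """
    n = len(a)
    if n == 1:
        return a[:]
    w_n = pow(G, (P - 1) // n, P)  # primitive n-th root of unity
    if invert:
        w_n = pow(w_n, -1, P)
    even, odd = ntt(a[0::2], invert), ntt(a[1::2], invert)
    out, w = [0] * n, 1
    for i in range(n // 2):
        t = w * odd[i] % P
        out[i] = (even[i] + t) % P
        out[i + n // 2] = (even[i] - t) % P
        w = w * w_n % P
    return out


def to_limbs(x, n):
    """Reinterpret a non-negative integer as n little-endian limbs."""
    return [(x >> (LIMB_BITS * i)) & ((1 << LIMB_BITS) - 1) for i in range(n)]


def ntt_mul(a, b):
    """Multiply non-negative integers a and b via a cyclic NTT convolution."""
    limbs = -(-max(a.bit_length(), b.bit_length(), 1) // LIMB_BITS)
    n = 1
    while n < 2 * limbs:  # zero-pad so the cyclic convolution does not wrap
        n *= 2
    # Each product coefficient is below n * 2^(2 * LIMB_BITS); it must be
    # below P so that the smallest non-negative residue is the true value.
    assert n << (2 * LIMB_BITS) < P
    fa, fb = ntt(to_limbs(a, n)), ntt(to_limbs(b, n))
    fc = [x * y % P for x, y in zip(fa, fb)]
    n_inv = pow(n, -1, P)
    c = [x * n_inv % P for x in ntt(fc, invert=True)]
    # Evaluate the lifted polynomial at 2^LIMB_BITS (carry propagation).
    return sum(ci << (LIMB_BITS * i) for i, ci in enumerate(c))


if __name__ == "__main__":
    a, b = (1 << 2047) + 12345, (1 << 2048) - 1
    print(ntt_mul(a, b) == a * b)
```

The zero-padding to at least twice the number of input limbs corresponds to the observation that the upper half of the input coefficients is zero. With these parameters, the sketch is limited to inputs of at most 32768 bits by the coefficient bound.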
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Becker, H., Hwang, V., Kannwischer, M.J., Panny, L., Yang, B.-Y. (2022). Efficient Multiplication of Somewhat Small Integers Using Number-Theoretic Transforms. In: Cheng, C.-M., Akiyama, M. (eds) Advances in Information and Computer Security. IWSEC 2022. Lecture Notes in Computer Science, vol 13504. Springer, Cham. https://doi.org/10.1007/978-3-031-15255-9_1
DOI: https://doi.org/10.1007/978-3-031-15255-9_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15254-2
Online ISBN: 978-3-031-15255-9
eBook Packages: Computer Science (R0)