Efficient Multiplication of Somewhat Small Integers Using Number-Theoretic Transforms

Becker, Hanno; Hwang, Vincent; Kannwischer, Matthias J.; Panny, Lorenz; Yang, Bo-Yin

doi:10.1007/978-3-031-15255-9_1


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13504)

Included in the following conference series:

International Workshop on Security


Abstract

Conventional wisdom purports that FFT-based integer multiplication methods (such as the Schönhage–Strassen algorithm) begin to compete with Karatsuba and Toom–Cook only for integers of several tens of thousands of bits. In this work, we challenge this belief, leveraging recent advances in the implementation of number-theoretic transforms (NTT) stimulated by their use in post-quantum cryptography. We report on implementations of NTT-based integer arithmetic on two Arm Cortex-M CPUs on opposite ends of the performance spectrum: Cortex-M3 and Cortex-M55. Our results indicate that NTT-based multiplication is capable of outperforming the big-number arithmetic implementations of popular embedded cryptography libraries for integers as small as 2048 bits. To provide a realistic case study, we benchmark implementations of the RSA encryption and decryption operations. Our cycle counts on Cortex-M55 are about $10\times $ lower than on Cortex-M3.


Notes

1. https://gmplib.org/manual/FFT-Multiplication.
2. The layers are merged as $4+3$ resp. $4+2+2$ in the forward NTTs, exploiting that the upper half of the input coefficients are zero, and $3+2+2$ resp. $3+3+2$ in the inverse NTTs. Register pressure prohibits more aggressive merging.
3. https://github.com/libopencm3/libopencm3.
4. https://github.com/mupq/pqm3.
5. https://developer.arm.com/tools-and-software/development-boards/fpga-prototyping-boards/download-fpga-images.
6. See https://github.com/ARMmbed/mbedtls/issues/5666 and https://github.com/ARMmbed/mbedtls/issues/5360.


Author information

Authors and Affiliations

Arm Research, Cambridge, UK
Hanno Becker
National Taiwan University, Taipei, Taiwan
Vincent Hwang
Academia Sinica, Taipei, Taiwan
Vincent Hwang, Matthias J. Kannwischer, Lorenz Panny & Bo-Yin Yang


Corresponding authors

Correspondence to Hanno Becker, Vincent Hwang, Matthias J. Kannwischer, Lorenz Panny or Bo-Yin Yang.

Editor information

Editors and Affiliations

BTQ AG, Vaduz, Liechtenstein
Chen-Mou Cheng
NTT, Tokyo, Japan
Mitsuaki Akiyama

Appendices

A Reduction Algorithms for Cortex-M3 and Cortex-M55

B On Precomputing the Montgomery Constant

Montgomery multiplication (see Sect. 2.4) requires the precomputation of $q^{-1} \bmod \texttt{R}$. When implementing RSA via “large” Montgomery multiplication, rather than a FIOS approach, this means that we need to precompute $n^{-1} \bmod \texttt{R}$ for encryption, and $p^{-1} \bmod \texttt{R}$ and $q^{-1} \bmod \texttt{R}$ for decryption. For decryption, these constants can be computed during key generation and stored as part of the secret key. For encryption, however, the constant needs to be computed online.
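To illustrate where the precomputed constant enters, the following is a minimal Python sketch of Montgomery multiplication using the positive inverse $q^{-1} \bmod \texttt{R}$ and a signed correction step. It is not the implementation benchmarked in this paper: the function names, the choice $\texttt{R} = 2^{\texttt{r\_bits}}$, and the use of Python's built-in `pow` (Python 3.8 or later) for the inverse are assumptions made for illustration.

```python
# Illustrative sketch only; names and parameters are hypothetical,
# not taken from the paper's Cortex-M implementations.

def montgomery_setup(q: int, r_bits: int) -> int:
    """Precompute q^{-1} mod R for R = 2^r_bits (q must be odd)."""
    if q % 2 == 0:
        raise ValueError("modulus must be odd")
    return pow(q, -1, 1 << r_bits)  # modular inverse, Python >= 3.8


def montgomery_multiply(a: int, b: int, q: int, q_inv: int, r_bits: int) -> int:
    """Return a * b * R^{-1} mod q for 0 <= a, b < q < R = 2^r_bits."""
    mask = (1 << r_bits) - 1
    t = a * b
    # m is congruent to t * q^{-1} mod R, so t - m*q is divisible by R.
    m = ((t & mask) * q_inv) & mask
    u = (t - m * q) >> r_bits  # exact division by R; u lies in (-q, q)
    return u + q if u < 0 else u


if __name__ == "__main__":
    q, r_bits = (1 << 61) - 1, 64  # small odd modulus, for demonstration
    q_inv = montgomery_setup(q, r_bits)
    R = 1 << r_bits
    a, b = 123456789, 987654321
    # Convert to Montgomery form, multiply, and convert back.
    a_m, b_m = a * R % q, b * R % q
    ab_m = montgomery_multiply(a_m, b_m, q, q_inv, r_bits)
    ab = montgomery_multiply(ab_m, 1, q, q_inv, r_bits)
    print(ab == a * b % q)
```

The only modulus-dependent quantity besides $q$ itself is `q_inv`, which is why it can be cached with a private key but must be derived on the fly from a public modulus.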

Modular inversion $x^{-1} \bmod 2^r$ can be performed using “Hensel lifting”: If $xy-1 = 2^k a$, so that $y$ is an inverse of $x$ modulo $2^k$, then $y^{\prime} = 2y - xy^2$ satisfies $xy^{\prime}-1 = -(xy-1)^2 = -2^{2k}a^2$, and hence $y^{\prime}$ is an inverse of $x$ modulo $2^{2k}$. Starting from $y = 1$, which inverts any odd $x$ modulo 2, this yields $x^{-1} \bmod 2^r$ after $\mathcal{O}(\log r)$ iterations. One may observe that this is the sequence of approximate solutions to $xy=1$ for $y$ via the Newton–Raphson method in the 2-adic integers.
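The lifting step can be sketched in a few lines of Python. This is an illustrative model using arbitrary-precision integers, not the umlal/mla/umaal assembly discussed in this appendix; the function name and the starting value $y = 1$ (valid because every odd $x$ is its own inverse modulo 2) are assumptions of this sketch.

```python
# Illustrative sketch only; the function name is hypothetical.

def inverse_mod_power_of_two(x: int, r: int) -> int:
    """Return x^{-1} mod 2^r for odd x via Hensel (Newton) lifting."""
    if x % 2 == 0:
        raise ValueError("x must be odd to be invertible modulo 2^r")
    if r < 1:
        raise ValueError("r must be at least 1")
    y = 1  # inverse of x modulo 2^1
    k = 1  # y is currently correct modulo 2^k
    while k < r:
        k = min(2 * k, r)       # precision doubles with every step
        mask = (1 << k) - 1
        # Newton step y' = y * (2 - x*y); masking reduces modulo 2^k
        # (Python's & yields a non-negative result for negative inputs).
        y = (y * (2 - x * y)) & mask
    return y


if __name__ == "__main__":
    x, r = 0x10001, 4096
    y = inverse_mod_power_of_two(x, r)
    print((x * y) % (1 << r) == 1)
```

Because the number of correct bits doubles per iteration, 12 iterations suffice for $r = 4096$, and the early iterations operate on short operands.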

We prototyped Hensel lifting to assess its cost relative to the modular exponentiation; we did not seek a fully optimized version. On the Cortex-M3, we implement both a variable-time variant using umlal for encryption and a constant-time variant using mla for key generation. For the Cortex-M55, we achieve the best performance using umaal. We list the performance in Table 4. We see that even a basic implementation adds only a small overhead compared to an exponentiation (e.g., $<5\%$ for RSA-4096).

Table 4. Performance of Hensel lifting; numbers for RSA-4096 in bold.


C Table Lookup

D Pipeline Efficiency of Cortex-M55 Implementation

Table 5 shows Performance Monitoring Unit (PMU) statistics for the subroutines of our Cortex-M55 modular exponentiation ($N=4096$). We use $\texttt {ARM\_PMU\_CYCCNT}$, $\texttt {ARM\_PMU\_INST\_RETIRED}$, $\texttt {ARM\_PMU\_MVE\_INST\_RETIRED}$, and $\texttt {ARM\_PMU\_MVE\_STALL}$ for counting cycles, retired instructions, retired MVE instructions, and MVE instructions causing a stall, respectively. We derive the rate of Instructions per Cycle (IPC), as well as $\texttt {ARM\_PMU\_MVE\_INST\_RETIRED}/\texttt {ARM\_PMU\_MVE\_STALL}$ as a measure of the MVE overlapping efficiency. Despite most MVE instructions running for 2 cycles, instruction overlapping allows achieving an IPC $>0.9$.
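The two derived metrics are simple ratios of the raw counters. The Python sketch below shows the arithmetic; the class and function names are hypothetical, and the counter values in the usage example are made-up placeholders, not measurements from Table 5.

```python
# Illustrative sketch only; names and sample values are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class PmuSample:
    cycles: int            # ARM_PMU_CYCCNT
    inst_retired: int      # ARM_PMU_INST_RETIRED
    mve_inst_retired: int  # ARM_PMU_MVE_INST_RETIRED
    mve_stall: int         # ARM_PMU_MVE_STALL


def ipc(sample: PmuSample) -> float:
    """Instructions per cycle."""
    return sample.inst_retired / sample.cycles


def mve_overlap_ratio(sample: PmuSample) -> float:
    """Retired MVE instructions per stalling MVE instruction."""
    if sample.mve_stall == 0:
        return float("inf")  # no stalls recorded
    return sample.mve_inst_retired / sample.mve_stall


if __name__ == "__main__":
    # Placeholder numbers chosen for illustration, not measured data.
    sample = PmuSample(cycles=1000, inst_retired=920,
                       mve_inst_retired=600, mve_stall=30)
    print(ipc(sample), mve_overlap_ratio(sample))
```

A higher overlap ratio means fewer MVE instructions stall relative to those retired, which is how an IPC close to 1 remains reachable even though most MVE instructions occupy 2 cycles.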

Table 5. Performance Monitoring Unit statistics for Cortex-M55 implementation.


E High-level Multiplication Structure

See Fig. 2.


Cite this paper

Becker, H., Hwang, V., Kannwischer, M.J., Panny, L., Yang, BY. (2022). Efficient Multiplication of Somewhat Small Integers Using Number-Theoretic Transforms. In: Cheng, CM., Akiyama, M. (eds) Advances in Information and Computer Security. IWSEC 2022. Lecture Notes in Computer Science, vol 13504. Springer, Cham. https://doi.org/10.1007/978-3-031-15255-9_1


DOI: https://doi.org/10.1007/978-3-031-15255-9_1
Published: 12 August 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15254-2
Online ISBN: 978-3-031-15255-9
eBook Packages: Computer Science, Computer Science (R0)
