Pushing the Limit of Vectorized Polynomial Multiplications for NTRU Prime

Hwang, Vincent

doi:10.1007/978-981-97-5028-3_5

Vincent Hwang

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14896)

Included in the following conference series:

Australasian Conference on Information Security and Privacy

Abstract

We conduct a systematic examination of vector arithmetic for polynomial multiplications in software. Vector instruction sets and extensions typically specify a fixed number of registers, each holding a power-of-two number of bits, and support a wide variety of vector arithmetic on registers. Programmers then try to align mathematical computations with the vector arithmetic supported by the designated instruction set or extension. We delve into the intricacies of this process for polynomial multiplications. In particular, we introduce “vectorization-friendliness” and “permutation-friendliness”, and review “Toeplitz matrix-vector product” to systematically identify suitable mappings from homomorphisms to vectorized implementations.

To illustrate how the formalization works, we detail the vectorization of polynomial multiplication in $\mathbb{Z}_{4591}[x] / \langle x^{761} - x - 1 \rangle$ used in the parameter set sntrup761 of the NTRU Prime key encapsulation mechanism.
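As a point of reference for what is being vectorized, multiplication in this ring can be written as a schoolbook product followed by the substitution $x^{761} = x + 1$. The following is a minimal, unoptimized Python sketch. The function name `mul_sntrup761` and the coefficient-list representation are assumptions made for illustration and are not taken from the paper, whose implementations target AVX2 and Neon.

```python
Q = 4591  # coefficient modulus of sntrup761
P = 761   # degree of the modulus polynomial x^P - x - 1


def mul_sntrup761(a, b):
    """Multiply a and b in Z_Q[x] / <x^P - x - 1>.

    a and b are lists of P coefficients, lowest degree first.
    (Hypothetical helper for illustration only.)
    """
    assert len(a) == P and len(b) == P

    # Schoolbook product over Z_Q[x]; the result has degree at most 2P - 2.
    c = [0] * (2 * P - 1)
    for i, ai in enumerate(a):
        if ai == 0:
            continue
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % Q

    # Reduce from the top using x^k = x^(k-P+1) + x^(k-P) for k >= P,
    # which follows from x^P = x + 1.
    for k in range(2 * P - 2, P - 1, -1):
        c[k - P + 1] = (c[k - P + 1] + c[k]) % Q
        c[k - P] = (c[k - P] + c[k]) % Q
        c[k] = 0

    return c[:P]
```

For example, multiplying $x^{760}$ by $x$ with this sketch gives $x + 1$, as the defining relation requires.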

For practical evaluation, we implement vectorized polynomial multipliers for the ring $\mathbb{Z}_{4591}[x] / \langle \varPhi_{17}(x^{96}) \rangle$ with AVX2 and Neon. We benchmark our AVX2 implementation on Haswell and Skylake and our Neon implementation on Cortex-A72 and the “Firestorm” core of Apple M1 Pro. Our AVX2-optimized implementation is 1.99–2.16 times faster than the state-of-the-art AVX2-optimized implementation by [Bernstein, Brumley, Chen, and Tuveri, USENIX Security 2022] on Haswell and Skylake, and our Neon-optimized implementation is 1.29–1.36 times faster than the state-of-the-art Neon-optimized implementation by [Hwang, Liu, and Yang, ACNS 2024] on Cortex-A72 and Apple M1 Pro.
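One observation that may help when reading this choice of ring, offered as a remark rather than as the paper's own derivation: $\varPhi_{17}(x^{96}) = \sum_{i=0}^{16} x^{96i}$ has degree $1536$, while the product of two polynomials of degree at most $760$ has degree at most $1520$, so such a product is left unchanged by reduction modulo $\varPhi_{17}(x^{96})$. The Python sketch below illustrates the reduction rule. The helper name `reduce_phi` and the list representation are assumptions made for illustration.

```python
Q = 4591
N, STEP = 17, 96
DEG = (N - 1) * STEP  # degree of Phi_17(x^96), i.e. 1536


def reduce_phi(c):
    """Reduce a coefficient list c (lowest degree first) modulo
    Phi_17(x^96) = 1 + x^96 + ... + x^1536 over Z_Q.

    (Hypothetical helper for illustration only.)
    """
    c = [v % Q for v in c]
    if len(c) < DEG:
        c += [0] * (DEG - len(c))  # pad so the result always has DEG entries

    # x^1536 = -(1 + x^96 + ... + x^1440), hence for k >= 1536:
    # x^k = -sum_{i=0}^{15} x^(k - 1536 + 96 i). Every target index is
    # below k, so a single top-down pass suffices.
    for k in range(len(c) - 1, DEG - 1, -1):
        t = c[k]
        if t:
            for i in range(N - 1):
                idx = k - DEG + STEP * i
                c[idx] = (c[idx] - t) % Q
            c[k] = 0
    return c[:DEG]


# A product of two degree-760 polynomials has 1521 coefficients, which
# fits below degree 1536, so reduction only pads it with zeros.
assert 2 * 760 < DEG
```

As a sanity check on the rule, `reduce_phi` maps $x^{17 \cdot 96}$ to $1$, consistent with $y^{17} - 1 = (y - 1)\varPhi_{17}(y)$.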

For the overall scheme with AVX2, we reduce the batch key generation cycles (amortized with batch size 32) by 7.9%–12.0%, encapsulation cycles by 7.1%–10.3%, and decapsulation cycles by 10.7%–13.3% on Haswell and Skylake. For the overall performance with Neon, we reduce the encapsulation cycles by 3.0%–6.6% and decapsulation cycles by 12.8%–15.1% on Cortex-A72 and Apple M1 Pro.

Notes

1. https://marc.info/?l=openssh-unix-dev&m=164939371201404&w=2.
2. $\forall j = 1, \dots, n - 1, \sum_{i = 0}^{n - 1} \omega_n^{ij} = 0$.
3. Notice that $\omega_{17} = \omega_{17}^{1344} = \left( \omega_{17}^{14} \right)^{96}$.
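Notes 2 and 3 can be checked numerically in $\mathbb{Z}_{4591}$. The sketch below is a hedged illustration: the helper `find_root_of_unity` is hypothetical, and it relies on $17$ being prime and dividing $4590 = 4591 - 1$, so that any element $\omega \ne 1$ with $\omega^{17} = 1$ has order exactly $17$. The specific root it returns need not be the one used in the paper.

```python
Q = 4591
N = 17


def find_root_of_unity(n=N, q=Q):
    """Return some element of multiplicative order n in Z_q.

    Assumes q is prime, n is prime, and n divides q - 1.
    (Hypothetical helper for illustration only.)
    """
    assert (q - 1) % n == 0
    for g in range(2, q):
        w = pow(g, (q - 1) // n, q)  # w^n = g^(q-1) = 1 by Fermat
        if w != 1:
            return w  # order divides the prime n and is not 1, so it is n
    raise ValueError("no element of order n found")


w = find_root_of_unity()

# Note 2: for every j = 1, ..., n - 1, the sum of w^(i*j) over i vanishes.
note2_holds = all(
    sum(pow(w, i * j, Q) for i in range(N)) % Q == 0 for j in range(1, N)
)

# Note 3: 14 * 96 = 1344 and 1344 = 79 * 17 + 1, so (w^14)^96 = w.
note3_holds = pow(pow(w, 14, Q), 96, Q) == w
```

Both flags evaluate to true for any element of order $17$, since the identities depend only on $\omega^{17} = 1$ and $\omega^j \ne 1$ for $0 < j < 17$.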

References

Alkim, E., et al.: Polynomial multiplication in NTRU Prime: comparison of optimization strategies on Cortex-M4. IACR Trans. Crypt. Hardw. Embed. Syst. 2021(1), 217–238 (2021). https://tches.iacr.org/index.php/TCHES/article/view/8733
Alkim, E., Hwang, V., Yang, B.Y.: Multi-parameter support with NTTs for NTRU and NTRU prime on Cortex-M4. IACR Trans. Crypt. Hardw. Embed. Syst. 2022(4), 349–371 (2022)
Bernstein, D.J.: Fast norm computation in smooth-degree abelian number fields. Cryptology ePrint Archive, Paper 2022/980 (2022). https://eprint.iacr.org/2022/980
Bernstein, D.J., et al.: NTRU Prime. Submission to the NIST Post-Quantum Cryptography Standardization Project [21] (2020). https://ntruprime.cr.yp.to/
Bernstein, D.J., Brumley, B.B., Chen, M.S., Tuveri, N.: OpenSSLNTRU: faster post-quantum TLS key exchange. In: 31st USENIX Security Symposium (USENIX Security 22), pp. 845–862 (2022)
Bernstein, D.J., Duif, N., Lange, T., Schwabe, P., Yang, B.Y.: High-speed high-security signatures. J. Cryptogr. Eng. 2(2), 77–89 (2012)
Bernstein, D.J., Yang, B.Y.: Fast constant-time gcd computation and modular inversion. IACR Trans. Crypt. Hardw. Embed. Syst. 2019(3), 340–398 (2019). https://tches.iacr.org/index.php/TCHES/article/view/8298
Blake, I.F., Gao, S., Mullin, R.C.: Explicit factorization of $x^{2^k} + 1$ over $\mathbb{F}_p$ with prime $p \equiv 3 \bmod 4$. Appl. Algebra Eng. Commun. Comput. 4(2), 89–94 (1993)
Bourbaki, N.: Algebra I. Springer, Heidelberg (1989)
Bruun, G.: z-transform DFT filters and FFT’s. IEEE Trans. Acoust. Speech Sig. Process. 26(1), 56–63 (1978)
Chen, H.T., Chung, Y.H., Hwang, V., Yang, B.Y.: Algorithmic views of vectorized polynomial multipliers – NTRU. In: Chattopadhyay, A., Bhasin, S., Picek, S., Rebeiro, C. (eds.) Progress in Cryptology, INDOCRYPT 2023. LNCS, vol. 14460, pp. 177–196. Springer, Cham (2024). https://doi.org/10.1007/978-3-031-56235-8_9
Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(90), 297–301 (1965)
Franchetti, F., et al.: Spiral: extreme performance portability. Proc. IEEE 106(11), 1935–1968 (2018). https://ieeexplore.ieee.org/document/8510983
Hwang, V., Liu, C.T., Yang, B.Y.: Algorithmic views of vectorized polynomial multipliers – NTRU prime. In: Pöpper, C., Batina, L. (eds.) Applied Cryptography and Network Security, ACNS 2024. LNCS, vol. 14584, pp. 24–46. Springer, Cham (2024). https://doi.org/10.1007/978-3-031-54773-7_2
Hwang, V.B.: Case studies on implementing number-theoretic transforms with Armv7-M, Armv7E-M, and Armv8-A. Master’s thesis, National Taiwan University (2022). https://github.com/vincentvbh/NTTs_with_Armv7-M_Armv7E-M_Armv8-A
Jacobson, N.: Basic Algebra I. Courier Corporation (2012)
Jacobson, N.: Basic Algebra II. Courier Corporation (2012)
Karatsuba, A.A., Ofman, Y.P.: Multiplication of many-digital numbers by automatic computers. Dokl. Akad. Nauk 145(2), 293–294 (1962)
Keskinkurt Paksoy, İ., Cenk, M.: Faster NTRU on ARM Cortex-M4 with TMVP-based multiplication. IEEE Trans. Circuits Syst. I Regul. Pap. 69(10), 4083–4092 (2022). https://ieeexplore.ieee.org/document/9835023
Murakami, H.: Real-valued fast discrete Fourier transform and cyclic convolution algorithms of highly composite even length. In: 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 3, pp. 1311–1314 (1996)
NIST, The US National Institute of Standards and Technology: Post-quantum cryptography standardization project. https://csrc.nist.gov/Projects/post-quantum-cryptography
Rader, C.M.: Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE 56(6), 1107–1108 (1968)
Shor, P.W.: Algorithms for quantum computation: discrete logarithms and factoring. In: Proceedings 35th Annual Symposium on Foundations of Computer Science, pp. 124–134. IEEE (1994)

Author information

Authors and Affiliations

Max Planck Institute for Security and Privacy, Bochum, Germany
Vincent Hwang

Corresponding author

Correspondence to Vincent Hwang.

Editor information

Editors and Affiliations

City University of Macau, Macau, China
Tianqing Zhu
University of Wollongong, Keiraville, Wollongong, NSW, Australia
Yannan Li

About this paper

Cite this paper

Hwang, V. (2024). Pushing the Limit of Vectorized Polynomial Multiplications for NTRU Prime. In: Zhu, T., Li, Y. (eds) Information Security and Privacy. ACISP 2024. Lecture Notes in Computer Science, vol 14896. Springer, Singapore. https://doi.org/10.1007/978-981-97-5028-3_5

DOI: https://doi.org/10.1007/978-981-97-5028-3_5
Published: 16 July 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5027-6
Online ISBN: 978-981-97-5028-3
eBook Packages: Computer Science, Computer Science (R0)
