Abstract
In this paper, we explore the cost of vectorization for multiplying polynomials with coefficients in \(\mathbb {{Z}}_q\) for an odd prime q, as exemplified by NTRU Prime, a postquantum cryptosystem that found early adoption due to its inclusion in OpenSSH.
If there is a large power of two dividing \(q - 1\), we can apply radix-2 Cooley–Tukey fast Fourier transforms to multiply polynomials in \(\mathbb {{Z}}_q[x]\). The radix-2 nature admits efficient vectorization. Conversely, if 2 is the only power of two dividing \(q - 1\), we can apply Schönhage’s and Nussbaumer’s FFTs to craft radix-2 roots of unity, but these double the number of coefficients.
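The friendly case described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the prime 7681 (where \(2^9\) divides \(q - 1\)) is chosen only for the example, and the helper names `two_adic_valuation`, `root_of_unity`, `ntt`, and `cyclic_mul` are hypothetical.

```python
def two_adic_valuation(m):
    """Largest e such that 2**e divides m (m > 0)."""
    e = 0
    while m % 2 == 0:
        m //= 2
        e += 1
    return e


def root_of_unity(n, q):
    """Element of order exactly n in Z_q, for n a power of two dividing q - 1."""
    assert (q - 1) % n == 0
    for g in range(2, q):
        w = pow(g, (q - 1) // n, q)
        # For a power of two n, w has order n iff w^(n/2) != 1.
        if n == 1 or pow(w, n // 2, q) != 1:
            return w
    raise ValueError("no root of unity found")


def ntt(a, w, q):
    """Recursive radix-2 Cooley-Tukey transform: returns [a(w^i) for i < n]."""
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], w * w % q, q)
    odd = ntt(a[1::2], w * w % q, q)
    out = [0] * n
    t = 1
    for i in range(n // 2):
        # Butterfly: w^(i + n/2) = -w^i because w^(n/2) = -1.
        out[i] = (even[i] + t * odd[i]) % q
        out[i + n // 2] = (even[i] - t * odd[i]) % q
        t = t * w % q
    return out


def cyclic_mul(a, b, q):
    """Multiply a and b in Z_q[x]/(x^n - 1), n = len(a) = len(b) a power of two."""
    n = len(a)
    w = root_of_unity(n, q)
    fc = [x * y % q for x, y in zip(ntt(a, w, q), ntt(b, w, q))]
    c = ntt(fc, pow(w, -1, q), q)  # inverse transform uses w^-1 (Python 3.8+)
    n_inv = pow(n, -1, q)
    return [x * n_inv % q for x in c]
```

For comparison, `two_adic_valuation(7681 - 1)` is 9, while `two_adic_valuation(4591 - 1)` is 1, which is the unfriendly situation the rest of the abstract addresses.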
We show how to avoid the doubling while maintaining the vectorization friendliness with Good–Thomas, Rader’s, and Bruun’s FFTs. In particular, in sntrup761, the most common instance of NTRU Prime, we have \(q=4591\), and we exploit the existing Fermat-prime factor of \(q - 1\) for Rader’s FFT and the power-of-two factor of \(q + 1\) for Bruun’s FFT.
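The arithmetic behind this choice can be checked directly. The sketch below only factors \(q - 1\) and \(q + 1\) for \(q = 4591\) and lists the candidate factors; it does not reproduce the transforms themselves, and the helper names `factorize` and `is_fermat_prime` are hypothetical.

```python
def factorize(m):
    """Prime factorization of m by trial division, as {prime: exponent}."""
    factors = {}
    p = 2
    while p * p <= m:
        while m % p == 0:
            factors[p] = factors.get(p, 0) + 1
            m //= p
        p += 1
    if m > 1:
        factors[m] = factors.get(m, 0) + 1
    return factors


def is_fermat_prime(p):
    """For a prime p: True iff p - 1 is a power of two greater than 1."""
    return p > 2 and (p - 1) & (p - 2) == 0


q = 4591
minus = factorize(q - 1)  # 4590 = 2 * 3^3 * 5 * 17
plus = factorize(q + 1)   # 4592 = 2^4 * 7 * 41

# Fermat primes dividing q - 1; 17 = 2^4 + 1 is the largest of them.
fermat_factors = [p for p in minus if is_fermat_prime(p)]

# Power-of-two part of q + 1.
pow2_part = 2 ** plus.get(2, 0)
```

Since 2 divides \(q - 1\) only once, the power-of-two structure has to come from elsewhere: here `pow2_part` is 16, and a Fermat prime \(p\) dividing \(q - 1\) has \(p - 1\) equal to a power of two.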
Polynomial multiplication in \(\mathbb {{Z}}_{4591}[x]/\left\langle {x^{761}-x-1} \right\rangle \) is still a worthwhile target because, although sntrup761 is out of the NIST PQC competition, it is still going to be used with OpenSSH by default in the near future.
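To make the target ring concrete, here is a schoolbook reference multiplier for \(\mathbb{Z}_q[x]/\langle x^p - x - 1\rangle\) with the sntrup761 parameters as defaults. It is a plain quadratic-time sketch for checking results, not the vectorized method of the paper, and the function name `ring_mul` is hypothetical.

```python
def ring_mul(a, b, p=761, q=4591):
    """Multiply a and b in Z_q[x]/(x^p - x - 1).

    a and b are coefficient lists (lowest degree first) of length at most p.
    """
    # Schoolbook product in Z_q[x], degree at most 2p - 2.
    c = [0] * (2 * p - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % q
    # Reduce with x^p = x + 1, from the highest degree down:
    # c_i * x^i = c_i * x^(i-p) * (x + 1) for i >= p.
    for i in range(2 * p - 2, p - 1, -1):
        c[i - p] = (c[i - p] + c[i]) % q
        c[i - p + 1] = (c[i - p + 1] + c[i]) % q
    return c[:p]
```

For example, multiplying \(x^{760}\) by \(x\) gives \(x^{761} = x + 1\), so the result has coefficient 1 at degrees 0 and 1 and zero elsewhere.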
Our polynomial multiplication outperforms the state-of-the-art vector-optimized implementation by \(6.1 \times \). For ntrulpr761, our keygen, encap, and decap are \(2.98 \times \), \(2.79 \times \), and \(3.07 \times \) faster than the state-of-the-art vector-optimized implementation. For sntrup761, we outperform the reference implementation significantly.
Notes
- 1.
- 2.
ARMv8-A, which naturally comes with the SIMD technology Neon, is currently the most prevalent architecture for mobile devices and is used for all Apple hardware.
- 3.
There are some exceptions, including addv, smaxv, sadalp. We are not using them in this paper and refer to [ARM15] for more details.
- 4.
We wrote some assembly but obtained only comparable performance, so we keep the implementations with intrinsics for readability.
- 5.
There are several options for sign-extending vector elements: saddl{,2} and ssubl{,2}, which go to either F0 or F1; sxtl{,2}, which goes to F1; and smull{,2}, which goes to F0.
- 6.
\(\forall \text { coprime } q_0, q_1, \left\{ {\omega _{q_0}^{i_0} \omega _{q_1}^{i_1}| 0 \le i_0 < q_0, 0 \le i_1 < q_1} \right\} = \left\{ {\omega _{q_0 q_1}^i | 0 \le i < q_0 q_1} \right\} \) in the splitting field of \(x^{q_0 q_1} - 1\).
- 7.
ARM’s DIT flag, according to https://developer.arm.com/documentation/ddi0595/2021-06/AArch64-Registers/DIT--Data-Independent-Timing, does not guarantee the high half multiplications sqrdmulh and sqdmulh to be constant-time.
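The identity in note 6 can be checked numerically for a small case. The sketch below assumes \(q_0 = 17\) and \(q_1 = 5\), both of which divide \(4591 - 1\), so the relevant roots of unity already lie in \(\mathbb{Z}_{4591}\) and no extension field is needed. These parameters and the helper name `element_of_order` are chosen for illustration only.

```python
def element_of_order(n, q):
    """Return an element of multiplicative order exactly n in Z_q (n divides q - 1)."""
    assert (q - 1) % n == 0
    proper_divisors = [d for d in range(1, n) if n % d == 0]
    for g in range(2, q):
        w = pow(g, (q - 1) // n, q)  # the order of w divides n
        if all(pow(w, d, q) != 1 for d in proper_divisors):
            return w
    raise ValueError("no element of the requested order")


q = 4591
q0, q1 = 17, 5  # coprime, and q0 * q1 = 85 divides q - 1 = 4590

w0 = element_of_order(q0, q)
w1 = element_of_order(q1, q)
w01 = element_of_order(q0 * q1, q)

# Left-hand side of note 6: all products of a q0-th root and a q1-th root.
products = {pow(w0, i0, q) * pow(w1, i1, q) % q
            for i0 in range(q0) for i1 in range(q1)}

# Right-hand side: all powers of a primitive (q0 * q1)-th root of unity.
powers = {pow(w01, i, q) for i in range(q0 * q1)}

same_set = products == powers
```

Both sets contain all 85 solutions of \(x^{85} = 1\) in \(\mathbb{Z}_{4591}\), so `same_set` is true regardless of which elements of the required orders the search returns.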
References
Alagic, G., et al.: NISTIR 8413 – status report on the third round of the NIST post-quantum cryptography standardization process (2022). https://doi.org/10.6028/NIST.IR.8413-upd1
Alkim, E., et al.: Polynomial multiplication in NTRU Prime: comparison of optimization strategies on Cortex-M4. IACR Trans. Cryptogr. Hardware Embed. Syst. 2021(1), 217–238 (2021). https://tches.iacr.org/index.php/TCHES/article/view/8733
Alkim, E., Hwang, V., Yang, B.Y.: Multi-parameter support with NTTs for NTRU and NTRU Prime on Cortex-M4. IACR Trans. Cryptogr. Hardware Embed. Syst. 349–371 (2022)
ARM. Cortex-A72 Software Optimization Guide (2015). https://developer.arm.com/documentation/uan0016/a/
ARM. Arm Architecture Reference Manual, Armv8, for Armv8-A architecture profile (2021). https://developer.arm.com/documentation/ddi0487/gb/?lang=en
Barrett, P.: Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In: Odlyzko, A.M. (ed.) CRYPTO 1986. LNCS, vol. 263, pp. 311–323. Springer, Heidelberg (1986). https://doi.org/10.1007/3-540-47721-7_24
Bernstein, D.J., et al.: NTRU Prime. In: Submission to the NIST Post-Quantum Cryptography Standardization Project (2020). https://ntruprime.cr.yp.to/
Bernstein, D.J., Brumley, B.B., Chen, M.S., Tuveri, N.: OpenSSLNTRU: faster post-quantum TLS key exchange. In: 31st USENIX Security Symposium (USENIX Security 2022), pp. 845–862 (2022)
Brawley, J.V., Carlitz, L.: Irreducibles and the composed product for polynomials over a finite field. Disc. Math. 65(2), 115–139 (1987)
Bernstein, D.J.: Multidigit multiplication for mathematicians (2001)
Blake, I.F., Gao, S., Mullin, R.C.: Explicit factorization of \(x^{2^k} + 1\) over \(\mathbb{F}_p\) with prime \(p \equiv 3 \;mod \;4\). Appl. Algebra Eng. Commun. Comput. 4(2), 89–94 (1993)
Becker, H., Hwang, V., Kannwischer, M.J., Yang, B.Y., Yang, S.Y.: Neon NTT: faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1. IACR Trans. Cryptogr. Hardware Embed. Syst. 2022(1), 221–244 (2022). https://tches.iacr.org/index.php/TCHES/article/view/9295
Becker, H., Kannwischer, M.J.: Hybrid scalar/vector implementations of Keccak and SPHINCS+ on AArch64. Cryptology ePrint Archive (2022)
Bruun, G.: z-transform DFT filters and FFT’s. IEEE Trans. Acoust. Speech Signal Process. 26(1), 56–63 (1978)
Bernstein, D.J., Yang, B.Y.: Fast constant-time GCD computation and modular inversion. IACR Trans. Cryptogr. Hardware Embed. Syst. 2019(3), 340–398 (2019). https://tches.iacr.org/index.php/TCHES/article/view/8298
Chung, C.M.M., Hwang, V., Kannwischer, M.J., Seiler, G., Shih, C.J., Yang, B.Y.: NTT multiplication for NTT-unfriendly rings: new speed records for Saber and NTRU on Cortex-M4 and AVX2. IACR Trans. Cryptogr. Hardware Embed. Syst. 2021(2), 159–188 (2021). https://tches.iacr.org/index.php/TCHES/article/view/8791
Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(90), 297–301 (1965)
Dubois, E., Venetsanopoulos, A.: A new algorithm for the radix-3 FFT. IEEE Trans. Acoust. Speech Signal Process. 26(3), 222–225 (1978)
Good, I.J.: The interaction algorithm and practical Fourier analysis. J. Roy. Stat. Soc.: Ser. B (Methodol.) 20(2), 361–372 (1958)
Haasdijk, J.: Optimizing NTRU LPRime on the ARM Cortex-A72 (2021). https://github.com/jhaasdijk/KEMobi
Kannwischer, M.J., Schwabe, P., Stebila, D., Wiggers, T.: PQClean. https://github.com/PQClean
Meyn, H.: Factorization of the cyclotomic polynomial \(x^{2^n} + 1\) over finite fields. Finite Fields Appl. 2(4), 439–442 (1996)
Murakami, H.: Real-valued fast discrete Fourier transform and cyclic convolution algorithms of highly composite even length. In: 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 3, pp. 1311–1314 (1996)
Martínez, F.E., Vergara, C.R., de Oliveira, L.B.: Explicit factorization of \(x^n-1 \in \mathbb{F} _q[x]\). arXiv preprint arXiv:1404.6281 (2014)
Nguyen, D.T., Gaj, K.: Optimized software implementations of CRYSTALS-Kyber, NTRU, and Saber using NEON-based special instructions of ARMv8. In: Third PQC Standardization Conference (2021)
Nussbaumer, H.: Fast polynomial transform algorithms for digital convolution. IEEE Trans. Acoust. Speech Signal Process. 28(2), 205–215 (1980)
Rader, C.M.: Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE 56(6), 1107–1108 (1968)
Schönhage, A.: Schnelle multiplikation von polynomen über körpern der charakteristik 2. Acta Informatica 7(4), 395–398 (1977)
Tuxanidy, A., Wang, Q.: Composed products and factors of cyclotomic polynomials over finite fields. Des. Codes Crypt. 69(2), 203–231 (2013)
van der Hoeven, J.: The truncated Fourier transform and applications. In: Proceedings of the 2004 International Symposium on Symbolic and Algebraic Computation, pp. 290–296 (2004)
Wu, Y., Yue, Q.: Further factorization of \(x^n - 1\) over a finite field (II). Disc. Math. Algor. Appl. 13(06), 2150070 (2021)
Wu, Y., Yue, Q., Fan, S.: Further factorization of \(x^n - 1\) over a finite field. Finite Fields Appl. 54, 197–215 (2018)
Acknowledgments
This work was supported in part by the Academia Sinica Investigator Award AS-IA-109-M01, and Taiwan’s National Science and Technology Council grants 112-2634-F-001-001-MBK and 112-2119-M-001-006.
A Detailed Performance Numbers
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hwang, V., Liu, CT., Yang, BY. (2024). Algorithmic Views of Vectorized Polynomial Multipliers – NTRU Prime. In: Pöpper, C., Batina, L. (eds) Applied Cryptography and Network Security. ACNS 2024. Lecture Notes in Computer Science, vol 14584. Springer, Cham. https://doi.org/10.1007/978-3-031-54773-7_2
DOI: https://doi.org/10.1007/978-3-031-54773-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-54772-0
Online ISBN: 978-3-031-54773-7
eBook Packages: Computer Science (R0)