Algorithmic Views of Vectorized Polynomial Multipliers – NTRU

Chen, Han-Ting; Chung, Yi-Hua; Hwang, Vincent; Yang, Bo-Yin

doi:10.1007/978-3-031-56235-8_9

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14460)

Included in the following conference series:

International Conference on Cryptology in India

Abstract

The lattice-based post-quantum cryptosystem NTRU is used by Google to protect its internal communication. In NTRU, polynomial multiplication is one of the bottlenecks. In this paper, we explore the interactions between polynomial multiplications, Toeplitz matrix-vector products, and vectorization with architectural insights. For a unital commutative ring R, a positive integer n, and an element $\zeta \in R$, we reveal the benefit of vector-by-scalar multiplication instructions while multiplying in $R[x] / \langle x^n - \zeta \rangle$.

We aim at designing an algorithm that exploits no algebraic or number-theoretic properties of n and $\zeta $. An obvious way is to multiply in R[x] and reduce modulo $x^n - \zeta $. Since the product in R[x] is a polynomial of degree at most $2n - 2$, one usually chooses a polynomial modulus $\boldsymbol{g}$ such that (i) $\text{deg}(\boldsymbol{g}) \ge 2n - 1$, and (ii) there exists a well-studied fast polynomial multiplication algorithm f for multiplying in $R[x] / \langle \boldsymbol{g} \rangle$.
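The "obvious way" described above can be sketched as follows. This is a minimal illustration, assuming $R = \mathbb{Z}/q\mathbb{Z}$ with coefficient lists stored lowest degree first; the function name `mul_mod_xn_minus_zeta` and the schoolbook inner loop are choices made for this sketch, not the paper's implementation.

```python
def mul_mod_xn_minus_zeta(a, b, zeta, q):
    """Multiply a*b in (Z/qZ)[x]/(x^n - zeta): full product first, then reduce."""
    n = len(a)
    assert len(b) == n
    # Step 1: schoolbook product in R[x]; the result has degree at most 2n - 2.
    full = [0] * (2 * n - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            full[i + j] = (full[i + j] + ai * bj) % q
    # Step 2: reduce with x^n = zeta, so the coefficient of x^(n+i)
    # is folded onto x^i after scaling by zeta.
    c = full[:n]
    for i in range(n, 2 * n - 1):
        c[i - n] = (c[i - n] + zeta * full[i]) % q
    return c


if __name__ == "__main__":
    # x^3 - 1 with q = 2048 (an illustrative modulus).
    print(mul_mod_xn_minus_zeta([1, 2, 3], [4, 5, 6], 1, 2048))
```

In practice, step 1 would be replaced by a fast multiplication in $R[x] / \langle \boldsymbol{g} \rangle$ with $\text{deg}(\boldsymbol{g}) \ge 2n - 1$, which leaves the product in R[x] unchanged.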

We deviate from common approaches and point out a novel insight with dual modules and vector-by-scalar multiplications. Conceptually, we relate the module-theoretic duals of $R[x] / \langle x^n - \zeta \rangle$ and $R[x] / \langle \boldsymbol{g} \rangle$ with Toeplitz matrix-vector products, and demonstrate the benefit of Toeplitz matrix-vector products with vector-by-scalar multiplication instructions. This approach greatly reduces the register pressure, and allows us to multiply with essentially no permutation instructions, which are commonly used in vectorized implementations.
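The basic identity behind this paragraph can be sketched in a few lines, assuming $R = \mathbb{Z}/q\mathbb{Z}$ and coefficient lists stored lowest degree first. Multiplication by a fixed polynomial in $R[x] / \langle x^n - \zeta \rangle$ is a Toeplitz matrix-vector product, and accumulating that product column by column uses only vector-by-scalar multiply-accumulate steps. The helper names `toeplitz_from_poly` and `tmvp_by_columns` are hypothetical, and the sketch does not reproduce the dual-module construction or the Neon code.

```python
def toeplitz_from_poly(a, zeta, q):
    """Matrix of b -> a*b in (Z/qZ)[x]/(x^n - zeta); entry (i, j) depends only on i - j."""
    n = len(a)
    return [[(a[i - j] if i >= j else zeta * a[n + i - j]) % q
             for j in range(n)]
            for i in range(n)]


def tmvp_by_columns(T, b, q):
    """Compute T*b as sum_j b[j] * column_j(T).

    Each loop iteration scales a whole vector (one column) by a single
    scalar b[j] and accumulates it, which models a vector-by-scalar
    multiply-accumulate instruction.
    """
    n = len(b)
    acc = [0] * n
    for j in range(n):
        column = [T[i][j] for i in range(n)]
        acc = [(x + b[j] * y) % q for x, y in zip(acc, column)]
    return acc


if __name__ == "__main__":
    T = toeplitz_from_poly([1, 2, 3], 1, 2048)
    print(T)
    print(tmvp_by_columns(T, [4, 5, 6], 2048))
```

No lane shuffles appear in the accumulation loop: every step is "vector times one scalar, added to an accumulator".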

We implement the ideas for the NTRU parameter sets ntruhps2048677 and ntruhrss701 on a Cortex-A72 implementing the Armv8.0-A architecture with the single-instruction-multiple-data (SIMD) technology Neon. For polynomial multiplications, our implementation is $2.18 \times $ and $2.23 \times $ faster for ntruhps2048677 and ntruhrss701, respectively, than the state-of-the-art optimized implementation. We also vectorize the polynomial inversions and sorting network by employing existing techniques and translating AVX2-optimized implementations into Neon. Compared to the state-of-the-art optimized implementation, our key generation, encapsulation, and decapsulation for ntruhps2048677 are $7.67 \times $, $2.48 \times $, and $1.77 \times $ faster, respectively. For ntruhrss701, our key generation, encapsulation, and decapsulation are $7.99 \times $, $1.47 \times $, and $1.56 \times $ faster, respectively.

Notes

1. Number-theoretic transform refers to a broad family of algebra monomorphisms that doesn’t contain Toom–Cook.
2. For possibly non-commutative unital rings, we only have $R[x] / \left( \langle \boldsymbol{g}_i \rangle \cap \langle \boldsymbol{g}_j \rangle \right) \cong R[x] / \langle \boldsymbol{g}_i \rangle \times R[x] / \langle \boldsymbol{g}_j \rangle$ for coprime polynomials $\boldsymbol{g}_i$ and $\boldsymbol{g}_j$. If R is commutative, R[x] is also commutative and we have $\langle \boldsymbol{g}_i \rangle \cap \langle \boldsymbol{g}_j \rangle = \langle \boldsymbol{g}_i \rangle \langle \boldsymbol{g}_j \rangle = \langle \boldsymbol{g}_i \boldsymbol{g}_j \rangle$. This leads to $R[x] / \langle \boldsymbol{g}_i \boldsymbol{g}_j \rangle \cong R[x] / \langle \boldsymbol{g}_i \rangle \times R[x] / \langle \boldsymbol{g}_j \rangle$ in our context.
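
As a small illustration of the isomorphism in Note 2, the sketch below multiplies in $(\mathbb{Z}/q\mathbb{Z})[x] / \langle x^2 - 1 \rangle$ by splitting into the factors $x - 1$ and $x + 1$. The modulus $q = 17$ and the function name `mul_via_crt` are assumptions made for the example. The two factors are coprime only when 2 is invertible in the coefficient ring, so this particular split does not apply to a power-of-two modulus.

```python
def mul_via_crt(a, b, q):
    """Multiply a, b in (Z/qZ)[x]/(x^2 - 1) via R[x]/(x - 1) x R[x]/(x + 1), q odd."""
    # Forward map: reduce modulo x - 1 and x + 1, i.e. evaluate at 1 and -1.
    a_plus, a_minus = (a[0] + a[1]) % q, (a[0] - a[1]) % q
    b_plus, b_minus = (b[0] + b[1]) % q, (b[0] - b[1]) % q
    # Multiplication in the product ring is component-wise.
    c_plus, c_minus = (a_plus * b_plus) % q, (a_minus * b_minus) % q
    # Inverse map: interpolate back; needs 2^(-1) mod q (Python 3.8+ for pow with -1).
    inv2 = pow(2, -1, q)
    return [((c_plus + c_minus) * inv2) % q, ((c_plus - c_minus) * inv2) % q]


if __name__ == "__main__":
    # (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2, which reduces to 11 + 10x modulo x^2 - 1.
    print(mul_via_crt([1, 2], [3, 4], 17))
```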

References

1. ARM: Cortex-A72 Software Optimization Guide (2015). https://developer.arm.com/documentation/uan0016/a/
2. ARM: Arm Architecture Reference Manual, Armv8, for Armv8-A architecture profile (2021). https://developer.arm.com/documentation/ddi0487/gb/?lang=en
3. Becker, H., Hwang, V., Kannwischer, M.J., Yang, B.Y., Yang, S.Y.: Neon NTT: faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022(1), 221–244 (2022). https://tches.iacr.org/index.php/TCHES/article/view/9295
4. Bernstein, D.J.: Multidigit multiplication for mathematicians (2001). https://cr.yp.to/papers.html#m3
5. Bernstein, D.J., Yang, B.Y.: Fast constant-time GCD computation and modular inversion. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2019(3), 340–398 (2019). https://tches.iacr.org/index.php/TCHES/article/view/8298
6. Chen, C., et al.: NTRU. Submission to the NIST Post-Quantum Cryptography Standardization Project [15] (2020). https://ntru.org/
7. Cook, S.A., Aanderaa, S.O.: On the minimum computation time of functions. Trans. Am. Math. Soc. 142, 291–314 (1969)
8. Hwang, V.B.: Case Studies on Implementing Number-Theoretic Transforms with Armv7-M, Armv7E-M, and Armv8-A. Master’s thesis (2022). https://github.com/vincentvbh/NTTs_with_Armv7-M_Armv7E-M_Armv8-A
9. Kannwischer, M.J., Rijneveld, J., Schwabe, P.: Faster multiplication in $\mathbb{Z}_{2^m}[x]$ on Cortex-M4 to speed up NIST PQC candidates. In: Deng, R.H., Gauthier-Umaña, V., Ochoa, M., Yung, M. (eds.) ACNS 2019. LNCS, vol. 11464, pp. 281–301. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-21568-2_14
10. Kannwischer, M.J., Schwabe, P., Stebila, D., Wiggers, T.: PQClean. https://github.com/PQClean
11. Karatsuba, A.A., Ofman, Y.P.: Multiplication of many-digital numbers by automatic computers. In: Doklady Akademii Nauk, vol. 145, no. 2, pp. 293–294 (1962)
12. Keskinkurt Paksoy, İ., Cenk, M.: TMVP-based Multiplication for Polynomial Quotient Rings and Application to Saber on ARM Cortex-M4. Cryptology ePrint Archive (2020). https://eprint.iacr.org/2020/1302
13. Keskinkurt Paksoy, İ., Cenk, M.: Faster NTRU on ARM Cortex-M4 with TMVP-based multiplication (2022). https://eprint.iacr.org/2022/300
14. Nguyen, D.T., Gaj, K.: Optimized Software Implementations of CRYSTALS-Kyber, NTRU, and Saber Using NEON-Based Special Instructions of ARMv8 (2021). Third PQC Standardization Conference
15. NIST, the US National Institute of Standards and Technology: Post-quantum cryptography standardization project. https://csrc.nist.gov/Projects/post-quantum-cryptography
16. Sanal, P., Karagoz, E., Seo, H., Azarderakhsh, R., Kermani, M.M.: Kyber on ARM64: compact implementations of Kyber on 64-bit ARM Cortex-A processors. Cryptology ePrint Archive, Report 2021/561 (2021). https://eprint.iacr.org/2021/561
17. Toom, A.L.: The complexity of a scheme of functional elements realizing the multiplication of integers. In: Soviet Mathematics Doklady, vol. 3, no. 4, pp. 714–716 (1963)
18. Winograd, S.: Arithmetic Complexity of Computations, vol. 33. SIAM, New Delhi (1980)

Author information

Authors and Affiliations

National Taiwan University, Taipei, Taiwan
Han-Ting Chen
Academia Sinica, Taipei, Taiwan
Yi-Hua Chung, Vincent Hwang & Bo-Yin Yang
Max Planck Institute for Security and Privacy, Bochum, Germany
Vincent Hwang

Corresponding authors

Correspondence to Vincent Hwang or Bo-Yin Yang.

Editor information

Editors and Affiliations

Nanyang Technological University, Singapore, Singapore
Anupam Chattopadhyay
Nanyang Technological University, Singapore, Singapore
Shivam Bhasin
Radboud University, Nijmegen, The Netherlands
Stjepan Picek
Indian Institute of Technology Madras, Chennai, India
Chester Rebeiro

Appendices

A Proof for the Toeplitz Transformation

For an algebra homomorphism $f: R[x]_{<n} \rightarrow S$ with $f_k := f|_{R[x]_{<k}}$ a monomorphism, and the module homomorphism $(\boldsymbol{a}, -): R^k \rightarrow R^n$, $\boldsymbol{b} \mapsto \boldsymbol{a}\boldsymbol{b}$, where $n \ge 2k - 1$, we have

$$ \left( {{\textbf {Toeplitz}}}_{k \times k}(-) \right) (\boldsymbol{a}) = \text {rev}_{k \times k} \circ f_k^* \circ (f_k(\boldsymbol{a}), -)^* \circ (f^{-1})^* \circ \text {id}_{(2k - 1) \rightarrow n}. $$

Proof

Since $(\boldsymbol{a}, -)^* = f_k^* \circ \left( f_k(\boldsymbol{a}), - \right)^* \circ \left( f^{-1} \right)^* \circ \text{id}_{(2k - 1) \rightarrow n}$, it remains to show $\left( \textbf{Toeplitz}_{k \times k}(-) \right)(\boldsymbol{a}) = \text{rev}_{k \times k} \circ (\boldsymbol{a}, -)^*$. Let $\boldsymbol{z} = (z_0, \dots , z_{2k - 2})$, $[k] = \left\{ 0, \dots , k - 1 \right\}$, and let $\boldsymbol{0}_{m_0, m_1}$ be the $m_0 \times m_1$ matrix of zeros. We have:

Applying $\text {rev}_{k \times k}$ from the left finishes the proof (cf. [18, Theorem 6]).
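
The identity $\left( \textbf{Toeplitz}_{k \times k}(-) \right)(\boldsymbol{a}) = \text{rev}_{k \times k} \circ (\boldsymbol{a}, -)^*$ can be checked numerically on small cases. The sketch below takes $R = \mathbb{Z}$ and $n = 2k - 1$, stores coefficients lowest degree first, reads $(\boldsymbol{a}, -)^*$ as the transpose of the multiplication matrix, and reads $\text{rev}_{k \times k}$ as reversing the order of the k outputs. These index conventions and all function names are assumptions of the sketch and may differ from the ordering used elsewhere in the paper.

```python
def mul_matrix(a):
    """(2k-1) x k matrix of the map b -> a*b on polynomials of degree < k."""
    k = len(a)
    return [[a[i - j] if 0 <= i - j < k else 0 for j in range(k)]
            for i in range(2 * k - 1)]


def transpose_apply(M, z):
    """Apply the transpose of M to z, i.e. the dual map (a, -)^*."""
    return [sum(M[i][j] * z[i] for i in range(len(M))) for j in range(len(M[0]))]


def toeplitz(z, k):
    """k x k Toeplitz matrix with entry (r, c) = z[k - 1 - r + c]."""
    return [[z[k - 1 - r + c] for c in range(k)] for r in range(k)]


def mat_vec(T, v):
    return [sum(t * x for t, x in zip(row, v)) for row in T]


if __name__ == "__main__":
    a, z = [1, 2, 3], [4, 5, 6, 7, 8]
    lhs = mat_vec(toeplitz(z, len(a)), a)
    rhs = list(reversed(transpose_apply(mul_matrix(a), z)))
    print(lhs, rhs, lhs == rhs)
```

For $k = 2$, `toeplitz` returns the matrix $\begin{pmatrix} z_1 & z_2 \\ z_0 & z_1 \end{pmatrix}$.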

B Examples of Toeplitz Transformations

We give some examples of f’s implementing $\begin{pmatrix} z_1 & z_2 \\ z_0 & z_1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_0 \end{pmatrix}$:

where ${\textbf {F}}_k^{-1} = \left( {\textbf {F}}_k^{-1} \right) ^T$ is the inverse of the cyclic size-k FFT.
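
One well-known way to evaluate this $2 \times 2$ Toeplitz matrix-vector product, not necessarily among the paper's own examples, is the Karatsuba-style scheme that uses three multiplications instead of four. The sketch below works over the integers; the function names are hypothetical.

```python
def tmvp2_direct(z0, z1, z2, a1, a0):
    """Direct evaluation of [[z1, z2], [z0, z1]] * (a1, a0)^T: four multiplications."""
    return (z1 * a1 + z2 * a0, z0 * a1 + z1 * a0)


def tmvp2_three_mults(z0, z1, z2, a1, a0):
    """Same product with three multiplications (Karatsuba-style TMVP).

    p1 carries the shared diagonal entry z1; p0 and p2 correct the
    off-diagonal contributions.
    """
    p0 = (z2 - z1) * a0
    p1 = z1 * (a1 + a0)
    p2 = (z0 - z1) * a1
    return (p0 + p1, p1 + p2)


if __name__ == "__main__":
    print(tmvp2_direct(4, 5, 6, 2, 3))
    print(tmvp2_three_mults(4, 5, 6, 2, 3))
```

Applied recursively with blocks in place of scalars, the same three-product pattern handles larger Toeplitz matrices, since each block of a Toeplitz matrix is again Toeplitz.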

About this paper

Cite this paper

Chen, HT., Chung, YH., Hwang, V., Yang, BY. (2024). Algorithmic Views of Vectorized Polynomial Multipliers – NTRU. In: Chattopadhyay, A., Bhasin, S., Picek, S., Rebeiro, C. (eds) Progress in Cryptology – INDOCRYPT 2023. INDOCRYPT 2023. Lecture Notes in Computer Science, vol 14460. Springer, Cham. https://doi.org/10.1007/978-3-031-56235-8_9

DOI: https://doi.org/10.1007/978-3-031-56235-8_9
Published: 29 March 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56234-1
Online ISBN: 978-3-031-56235-8
eBook Packages: Computer Science (R0)