Fast NEON-Based Multiplication for Lattice-Based NIST Post-quantum Cryptography Finalists

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 12841)

Abstract

This paper focuses on high-speed, constant-time NEON-based implementations of large-polynomial multiplication in the NIST PQC KEM finalists NTRU, Saber, and CRYSTALS-Kyber. We use Number Theoretic Transform (NTT)-based multiplication in Kyber, the Toom-Cook algorithm in NTRU, and both types of multiplication in Saber. Using these algorithms on the Apple M1, we improve the decapsulation performance of the NTRU-, Kyber-, and Saber-based KEMs at security level 3 by factors of 8.4, 3.0, and 1.6, respectively, compared to the reference implementations. On Cortex-A72, we achieve speed-ups of between \(5.7\times\) and \(7.5\times\) for the forward/inverse NTT in Kyber, and between \(6.0\times\) and \(7.8\times\) for Toom-Cook in NTRU, over the best existing implementations in pure C. For Saber, when using NEON instructions on Cortex-A72, the NTT-based implementation outperforms the Toom-Cook-based implementation by \(14\%\) for the MatrixVectorMul function but is slower by \(21\%\) for the InnerProduct function. Because keys in Saber are not available in the NTT domain, the overall performance of the NTT-based version is very close to that of the Toom-Cook version: the differences for the entire decapsulation at the three major security levels (1, 3, and 5) are \(-4\%\), \(-2\%\), and \(+2\%\), respectively. Our benchmarking results demonstrate that our NEON-based implementations running on an Apple M1 ARM processor are comparable to the best AVX2-based implementations running on an AMD EPYC 7742 processor. Our work is the first NEON-based ARMv8 implementation of each of the three NIST PQC KEM finalists.
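
To make the idea of NTT-based polynomial multiplication concrete, the following is a minimal Python sketch; it is not the paper's NEON code. It multiplies two polynomials in \(\mathbb{Z}_q[x]/(x^n+1)\) by twisting the coefficients, applying a forward transform, multiplying pointwise, and transforming back. The parameters \(n = 8\), \(q = 17\), and \(\psi = 3\), as well as all function names, are assumptions chosen for readability; Kyber itself uses \(n = 256\) and \(q = 3329\).

```python
# Toy parameters (assumptions for illustration, not Kyber's real ones).
Q = 17    # prime modulus with Q ≡ 1 (mod 2N)
N = 8     # number of coefficients, a power of two
PSI = 3   # primitive 2N-th root of unity modulo Q


def ntt(vec, root):
    """Recursive radix-2 Cooley-Tukey transform.

    Returns [sum_j vec[j] * root**(i*j) mod Q for each i], where `root`
    must be a primitive len(vec)-th root of unity modulo Q.
    """
    n = len(vec)
    if n == 1:
        return list(vec)
    root_sq = root * root % Q
    even = ntt(vec[0::2], root_sq)
    odd = ntt(vec[1::2], root_sq)
    out = [0] * n
    twiddle = 1
    for i in range(n // 2):
        t = twiddle * odd[i] % Q
        out[i] = (even[i] + t) % Q           # butterfly: upper half
        out[i + n // 2] = (even[i] - t) % Q  # butterfly: lower half
        twiddle = twiddle * root % Q
    return out


def negacyclic_mul_ntt(a, b):
    """Multiply a and b modulo (x^N + 1, Q) using the transform above."""
    omega = PSI * PSI % Q                    # primitive N-th root of unity
    # Twisting by powers of PSI turns the negacyclic product into a cyclic one.
    a_hat = ntt([a[j] * pow(PSI, j, Q) % Q for j in range(N)], omega)
    b_hat = ntt([b[j] * pow(PSI, j, Q) % Q for j in range(N)], omega)
    c_hat = [x * y % Q for x, y in zip(a_hat, b_hat)]  # pointwise product
    # Inverse transform: use omega^-1, scale by N^-1, and undo the twist.
    omega_inv = pow(omega, Q - 2, Q)
    n_inv = pow(N, Q - 2, Q)
    psi_inv = pow(PSI, Q - 2, Q)
    c_twisted = ntt(c_hat, omega_inv)
    return [c_twisted[j] * n_inv * pow(psi_inv, j, Q) % Q for j in range(N)]


def negacyclic_mul_schoolbook(a, b):
    """Quadratic-time reference: multiply modulo (x^N + 1, Q)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % Q
            else:
                c[k - N] = (c[k - N] - a[i] * b[j]) % Q  # x^N wraps to -1
    return c


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [8, 7, 6, 5, 4, 3, 2, 1]
    print(negacyclic_mul_ntt(a, b) == negacyclic_mul_schoolbook(a, b))
```

The schoolbook routine serves only as a cross-check. In a vectorized implementation the butterfly loop is the natural candidate for SIMD, since each iteration is independent of the others within a layer.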

Author information

Corresponding author

Correspondence to Duc Tri Nguyen.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Nguyen, D.T., Gaj, K. (2021). Fast NEON-Based Multiplication for Lattice-Based NIST Post-quantum Cryptography Finalists. In: Cheon, J.H., Tillich, J.-P. (eds) Post-Quantum Cryptography. PQCrypto 2021. Lecture Notes in Computer Science, vol 12841. Springer, Cham. https://doi.org/10.1007/978-3-030-81293-5_13

  • DOI: https://doi.org/10.1007/978-3-030-81293-5_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-81292-8

  • Online ISBN: 978-3-030-81293-5

  • eBook Packages: Computer Science, Computer Science (R0)
