Parallel implementation of Nussbaumer algorithm and number theoretic transform on a GPU platform: application to qTESLA

Lee, Wai-Kong; Akleylek, Sedat; Wong, Denis Chee-Keong; Yap, Wun-She; Goi, Bok-Min; Hwang, Seong-Oun

doi:10.1007/s11227-020-03392-x

Published: 11 August 2020

Volume 77, pages 3289–3314, (2021)

The Journal of Supercomputing

Abstract

Among the popular post-quantum schemes, lattice-based cryptosystems have received renewed interest because they are relatively simple, highly parallelizable, and provably secure under a worst-case hardness assumption. However, polynomial multiplication over rings is the most time-consuming operation in most lattice-based cryptosystems. To further improve the performance of lattice-based cryptosystems for large-scale use, polynomial multiplication must be implemented in parallel. Polynomial multiplication can be performed using either the number theoretic transform (NTT) or the Nussbaumer algorithm. However, the Nussbaumer algorithm is inherently serial, and the efficient implementation of NTT using various indexing methods on a GPU platform has not yet been established. In this paper, we explore the best combination of indexing methods for implementing NTT on a GPU platform, as well as an efficient way to parallelize the Nussbaumer algorithm. Our results suggest that the combination of the Gentleman–Sande and Cooley–Tukey (GS-CT) indexing methods gives the best performance on an RTX2060 GPU (i.e. 422,638 polynomial multiplications per second). We also present a technique for parallelizing the Nussbaumer algorithm that reduces the non-coalesced global memory accesses by half. To the best of our knowledge, this is the first GPU implementation of the Nussbaumer algorithm, and it outperforms the best NTT implementation above (GS-CT) by 14.5%. For illustration purposes, the proposed GPU implementation techniques are applied to qTESLA, a state-of-the-art lattice-based signature scheme. We emphasize that the proposed implementation techniques are not specific to any cryptosystem; they can be easily adapted to other lattice-based cryptosystems.
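
To make the NTT route to polynomial multiplication concrete, the following is a minimal sequential sketch in Python, not the GPU kernels described in the paper. It assumes one common way of pairing the two butterflies: a Gentleman–Sande (decimation-in-frequency) forward transform that takes natural-order input and leaves its output in bit-reversed order, followed by a Cooley–Tukey (decimation-in-time) inverse that consumes that order directly, so no explicit bit-reversal pass is needed. Whether this matches the exact GS-CT arrangement evaluated by the authors is an assumption. The function names and the toy parameters (n = 8, q = 17, psi = 3) are hypothetical and are not qTESLA parameters.

```python
# Toy sketch: multiplication in Z_q[x]/(x^n + 1) via NTT.
# Assumed illustration parameters: n = 8, q = 17, psi = 3, where psi is a
# primitive 2n-th root of unity mod q. Requires Python 3.8+ for pow(x, -1, q).

def ntt_gs(a, omega, q):
    """Forward cyclic NTT with Gentleman-Sande butterflies.
    Natural-order input, bit-reversed-order output."""
    a = list(a)
    n = len(a)
    length = n // 2
    w_m = omega                      # twiddle step for the current stage
    while length >= 1:
        for start in range(0, n, 2 * length):
            w = 1
            for j in range(start, start + length):
                u, v = a[j], a[j + length]
                a[j] = (u + v) % q
                a[j + length] = (u - v) * w % q
                w = w * w_m % q
        w_m = w_m * w_m % q
        length //= 2
    return a

def intt_ct(a, omega_inv, q):
    """Inverse cyclic NTT with Cooley-Tukey butterflies.
    Bit-reversed-order input, natural-order output."""
    a = list(a)
    n = len(a)
    length = 1
    while length < n:
        w_m = pow(omega_inv, n // (2 * length), q)
        for start in range(0, n, 2 * length):
            w = 1
            for j in range(start, start + length):
                u, v = a[j], a[j + length] * w % q
                a[j] = (u + v) % q
                a[j + length] = (u - v) % q
                w = w * w_m % q
        length *= 2
    n_inv = pow(n, -1, q)            # scale by 1/n
    return [x * n_inv % q for x in a]

def polymul_negacyclic(a, b, psi, q):
    """Multiply a and b modulo (x^n + 1, q) using the two transforms above."""
    omega = psi * psi % q
    psi_inv = pow(psi, -1, q)
    # Pre-scale by powers of psi to turn the negacyclic product into a cyclic one.
    a_s = [x * pow(psi, i, q) % q for i, x in enumerate(a)]
    b_s = [x * pow(psi, i, q) % q for i, x in enumerate(b)]
    fa, fb = ntt_gs(a_s, omega, q), ntt_gs(b_s, omega, q)
    # Both spectra share the same (bit-reversed) order, so pointwise works.
    fc = [x * y % q for x, y in zip(fa, fb)]
    c_s = intt_ct(fc, pow(omega, -1, q), q)
    return [x * pow(psi_inv, i, q) % q for i, x in enumerate(c_s)]

def schoolbook_negacyclic(a, b, q):
    """Quadratic-time reference used only to check the NTT result."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:                    # x^n = -1 wraps with a sign flip
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c
```

Calling `polymul_negacyclic(a, b, 3, 17)` on two length-8 coefficient lists should agree with `schoolbook_negacyclic(a, b, 17)`. In a GPU setting, the inner butterfly loops are the part that would be distributed across threads; how that is done, and how memory accesses are arranged, is the subject of the paper and is not reproduced here.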

Acknowledgements

Wai-Kong Lee was supported by the Korea Research Fellowship program funded by the Ministry of Science and ICT, Korea, through the National Research Foundation (NRF) of Korea (2019H1D3A1A01102607). Seong-Oun Hwang was supported by an NRF grant funded by the Korea government (MSIT) (2020R1A2B5B01002145). Denis Chee-Keong Wong, Wun-She Yap and Bok-Min Goi were supported by the Fundamental Research Grant Scheme (FRGS), Malaysia, under Project Number FRGS/1/2018/STG06/UTAR/03/1. Sedat Akleylek was partially supported by TUBITAK under Grant No. EEEAG-117E636.

Author information

Authors and Affiliations

Gachon University, Seongnam, South Korea
Wai-Kong Lee & Seong-Oun Hwang
Ondokuz Mayıs University, Samsun, Turkey
Sedat Akleylek
Universiti Tunku Abdul Rahman, Bandar Sungai Long, Malaysia
Denis Chee-Keong Wong, Wun-She Yap & Bok-Min Goi

Corresponding author

Correspondence to Seong-Oun Hwang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lee, WK., Akleylek, S., Wong, D.CK. et al. Parallel implementation of Nussbaumer algorithm and number theoretic transform on a GPU platform: application to qTESLA. J Supercomput 77, 3289–3314 (2021). https://doi.org/10.1007/s11227-020-03392-x

Published: 11 August 2020
Issue Date: April 2021
DOI: https://doi.org/10.1007/s11227-020-03392-x
