A High-Throughput Toom-Cook-4 Polynomial Multiplier for Lattice-Based Cryptography Using a Novel Winograd-Schoolbook Algorithm


Abstract:

Polynomial multiplication over rings is a significant bottleneck of ring learning with errors (RLWE)-based encryption. The number theoretic transform (NTT) and Toom-Cook-4 (TC4) are the algorithms commonly used to speed it up. Compared with NTT, TC4 is less restrictive and more flexible. However, there is considerable room at the algorithm level to improve the Schoolbook algorithm and the postprocessing of TC4. We therefore propose a novel and efficient Winograd-Schoolbook algorithm that reduces the number of multiplications by 29.1% (N = 256). We also propose a fused, low-density postprocessing that simplifies the algorithm flow and reduces the number of multiplications by 56.25%. Together, these two improvements reduce the number of multiplications in TC4 by 32.47%. A high-throughput, high-efficiency TC4 polynomial multiplier (TCMW) is proposed to speed up polynomial multiplication over rings. In TCMW, a highly parallel, fully pipelined structure with no data waiting between modules is designed so that the parallelism of each module matches and intermediate results need not be stored. In addition, based on the improved TC4, data buffers with data reuse, elementwise vector multiplication (EWVM) arrays, and efficient interpolation arrays are designed to improve the performance and efficiency of TCMW. Implemented on the Xilinx VC709 field programmable gate array (FPGA) platform, TCMW can perform a TC4-based 256 × 256 polynomial multiplication over rings with an unrestricted modulus (as long as its factors do not include 3 or 5) every 1.89 μs at a 385 MHz clock frequency. Compared with prior TC4 designs under the same conditions, TCMW improves throughput by 1.91× to 7.71×, and improves LUT and DSP efficiency by 1.31× to 3.67× and 1.87× to 4.92×, respectively.
Published in: IEEE Transactions on Circuits and Systems I: Regular Papers ( Volume: 71, Issue: 1, January 2024)
Page(s): 359 - 372
Date of Publication: 05 December 2023
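
For orientation, the baseline that the abstract builds on can be sketched as a textbook Toom-Cook-4 multiplication. This is a minimal illustration, not the paper's Winograd-Schoolbook algorithm or its fused postprocessing, neither of which is described in the abstract. The evaluation points, the rational-arithmetic interpolation, the ring x^N + 1, and all function names are assumptions made for the example.

```python
from fractions import Fraction

# Assumed evaluation points for a textbook TC4; "inf" selects the top limb.
POINTS = [0, 1, -1, 2, -2, 3, "inf"]

def schoolbook(a, b):
    """Plain quadratic product of two coefficient lists (lowest degree first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def evaluate(limbs, point):
    """Evaluate sum(limbs[i] * y**i) at y = point, coefficient by coefficient."""
    if point == "inf":
        return list(limbs[-1])
    width = len(limbs[0])
    return [sum(l[k] * point**i for i, l in enumerate(limbs)) for k in range(width)]

def invert(matrix):
    """Gauss-Jordan inverse over the rationals."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Row j maps the 7 unknown product limbs to the product's value at POINTS[j].
VANDERMONDE = [[0] * 6 + [1] if p == "inf" else [p**i for i in range(7)]
               for p in POINTS]
INTERPOLATE = invert(VANDERMONDE)

def toom4(a, b):
    """Integer product of two equal-length polynomials, length divisible by 4."""
    n = len(a)
    assert n == len(b) and n % 4 == 0
    m = n // 4
    a_limbs = [a[i * m:(i + 1) * m] for i in range(4)]
    b_limbs = [b[i * m:(i + 1) * m] for i in range(4)]
    # Seven small products replace the sixteen limb products of schoolbook.
    prods = [schoolbook(evaluate(a_limbs, p), evaluate(b_limbs, p)) for p in POINTS]
    out = [0] * (2 * n - 1)
    for i in range(7):
        for k in range(2 * m - 1):
            c = sum(INTERPOLATE[i][j] * prods[j][k] for j in range(7))
            assert c.denominator == 1  # exact interpolation yields integers
            out[i * m + k] += int(c)
    return out

def reduce_ring(c, n, q):
    """Reduce modulo x^n + 1 and modulo q (the ring choice is an assumption)."""
    padded = c + [0] * (2 * n - len(c))
    return [(padded[k] - padded[k + n]) % q for k in range(n)]
```

A call such as `reduce_ring(toom4(a, b), len(a), q)` then gives the product in the assumed ring. Because this sketch interpolates over the rationals before reducing, it places no condition on `q`; the interpolation matrix for these points has denominators built from the small primes 2, 3 and 5, which is why a design that interpolates directly modulo `q` has to treat such factors with care.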
