Abstract
This work presents a hardware design for a constant-time implementation of the HQC (Hamming Quasi-Cyclic) code-based key encapsulation mechanism. HQC has been selected for the fourth round of NIST’s Post-Quantum Cryptography standardization process, and this work presents the first hand-optimized design of HQC key generation, encapsulation, and decapsulation written in Verilog and targeting FPGAs. The three modules share a common SHAKE256 hash module to reduce area overhead. All hardware modules are parametrizable at compile time so that designs for the different security levels can be easily generated. The design currently outperforms the other hardware designs for HQC, as well as many designs for the other fourth-round candidates, and has one of the best time-area products. For the combined HighSpeed design targeting the lowest security level, we show that the HQC design can perform key generation in 0.09 ms, encapsulation in 0.13 ms, and decapsulation in 0.21 ms when synthesized for a Xilinx Artix 7 FPGA. Our work shows that, when hardware performance is compared, HQC can be a competitive alternative candidate from the fourth round of the NIST PQC competition.
Notes
- 1. We report both slices and LUTs in our tables because slices are often only partially used, depending on the optimization strategy of the synthesis tool, which makes slice utilization an incomplete indication of the density of the design.
- 2.
References
Aguilar Melchor, C., et al.: HQC. Technical report, National Institute of Standards and Technology (2020). https://pqc-hqc.org/doc/hqc-specification_021-06-06.pdf
Aguilar Melchor, C., et al.: HQC. Technical report, National Institute of Standards and Technology (2023). http://pqc-hqc.org/doc/hqc-specification_2023-04-30.pdf
Aguilar-Melchor, C., et al.: Towards automating cryptographic hardware implementations: a case study of HQC. Cryptology ePrint Archive, Paper 2022/1425 (2022). https://eprint.iacr.org/2022/1425
Aragon, N., et al.: BIKE. Technical report, National Institute of Standards and Technology (2020). https://csrc.nist.gov/projects/post-quantum-cryptography/round-3-submissions
Azad, A.A., Shahed, I.: A compact and fast FPGA based implementation of encoding and decoding algorithm using Reed Solomon codes. Int. J. Future Comput. Commun. 31–35 (2014)
Barrett, P.: Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In: Odlyzko, A.M. (ed.) CRYPTO 1986. LNCS, vol. 263, pp. 311–323. Springer, Heidelberg (1987). https://doi.org/10.1007/3-540-47721-7_24
Bernstein, D.J.: Fast multiplication and its applications (2008)
Chen, P., et al.: Complete and improved FPGA implementation. https://doi.org/10.46586/tches.v2022.i3.71-113
Dang, V.B., Mohajerani, K., Gaj, K.: High-speed hardware architectures and FPGA benchmarking of CRYSTALS-Kyber, NTRU, and saber. Cryptology ePrint Archive, Paper 2021/1508 (2021). https://eprint.iacr.org/2021/1508
Deshpande, S., del Pozo, S.M., Mateu, V., Manzano, M., Aaraj, N., Szefer, J.: Modular inverse for integers using fast constant time GCD algorithm and its applications. In: Proceedings of the International Conference on Field Programmable Logic and Applications. FPL (2021)
Deshpande, S., Xu, C., Nawan, M., Nawaz, K., Szefer, J.: Fast and efficient hardware implementation of HQC. Cryptology ePrint Archive, Paper 2022/1183 (2022). https://eprint.iacr.org/2022/1183
Gigliotti, P.: Implementing barrel shifters using multipliers. Technical report, XAPP195, Xilinx (2004). https://www.xilinx.com/support/documentation/application_notes/xapp195.pdf
Guo, Q., Hlauschek, C., Johansson, T., Lahr, N., Nilsson, A., Schröder, R.L.: Don’t reject this: key-recovery timing attacks due to rejection-sampling in HQC and bike. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022(3), 223–263 (2022). https://doi.org/10.46586/tches.v2022.i3.223-263. https://tches.iacr.org/index.php/TCHES/article/view/9700
Hashemipour-Nazari, M., Goossens, K., Balatsoukas-Stimming, A.: Hardware implementation of iterative projection-aggregation decoding of reed-muller codes. In: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), ICASSP 2021, pp. 8293–8297 (2021). https://doi.org/10.1109/ICASSP39728.2021.9414655
Hu, J., Wang, W., Cheung, R.C., Wang, H.: Optimized polynomial multiplier over commutative rings on FPGAS: a case study on bike. In: 2019 International Conference on Field-Programmable Technology (ICFPT), pp. 231–234 (2019). https://doi.org/10.1109/ICFPT47387.2019.00035
Massolino, P.M.C., Longa, P., Renes, J., Batina, L.: A compact and scalable hardware/software co-design of SIKE. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020(2), 245–271 (2020). https://doi.org/10.13154/tches.v2020.i2.245-271. https://tches.iacr.org/index.php/TCHES/article/view/8551
Richter-Brockmann, J., Chen, M.S., Ghosh, S., Güneysu, T.: Racing bike: improved polynomial multiplication and inversion in hardware. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022(1), 557–588 (2021). https://doi.org/10.46586/tches.v2022.i1.557-588. https://tches.iacr.org/index.php/TCHES/article/view/9307
Richter-Brockmann, J., Mono, J., Guneysu, T.: Folding bike: scalable hardware implementation for reconfigurable devices. IEEE Trans. Comput. 71(5), 1204–1215 (2022). https://doi.org/10.1109/TC.2021.3078294
Sandoval-Ruiz, C.: VHDL optimized model of a multiplier in finite fields. Ingenieria y Universidad 21(2), 195–212 (2017). https://doi.org/10.11144/Javeriana.iyu21-2.vhdl. https://revistas.javeriana.edu.co/index.php/iyu/article/view/195
Schöffel, M., Feldmann, J., Wehn, N.: Code-based cryptography in IoT: a HW/SW co-design of HQC. CoRR abs/2301.04888 (2023). https://doi.org/10.48550/arXiv.2301.04888
Scholl, S., Wehn, N.: Hardware implementation of a Reed-Solomon soft decoder based on information set decoding. In: 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1–6 (2014). https://doi.org/10.7873/DATE.2014.222
Sendrier, N.: Secure sampling of constant-weight words - application to bike. Cryptology ePrint Archive, Paper 2021/1631 (2021). https://eprint.iacr.org/2021/1631
Wang, W., Tian, S., Jungk, B., Bindel, N., Longa, P., Szefer, J.: Parameterized hardware accelerators for lattice-based cryptography and their application to the HW/SW co-design of qTESLA. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020(3), 269–306 (2020). https://doi.org/10.13154/tches.v2020.i3.269-306. https://tches.iacr.org/index.php/TCHES/article/view/8591
Xing, Y., Li, S.: A compact hardware implementation of CCA-secure key exchange mechanism CRYSTALS-KYBER on FPGA. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021(2), 328–356 (2021). https://doi.org/10.46586/tches.v2021.i2.328-356. https://tches.iacr.org/index.php/TCHES/article/view/8797
Acknowledgement
We would like to thank the reviewers for the valuable feedback and Dr. Cuauhtemoc Mancillas López for constructive comments and shepherding our article. We would like to thank Dr. Victor Mateu and Dr. Carlos Aguilar Melchor for their helpful discussions. We would like to thank Dr. Shanquan Tian for his optimization recommendations for the SHAKE256 module. The work was supported in part by a research grant from Technology Innovation Institute.
Appendices
Appendix 1.A Fast and Non-Biased (FNB) Fixed-Weight Vector Generation
Although the CWW design is constant in time, it does have a small bias. As an alternative, we propose a new FNB fixed-weight vector generation design, which is based on the fixed-weight vector generation technique given in [1]. Our FNB fixed-weight generation module can be parametrized to make the probability that a timing attack is possible arbitrarily small. The hardware module has a parameter, ACCEPTABLE_REJECTIONS, which specifies how many indices can be rejected, in either rejection sampling or duplicate detection, while the design still behaves in constant time. The parameter can be set based on the user’s target failure probability. If the actual failures are within the failure probability set by the selected parameter value, then the timing side channel given in [13] is not possible.
We use the SHAKE256 module described in Sect. 2.1 to expand a 320-bit seed to a \(24 \times w\)-bit string. Since the SHAKE256 module has a 32-bit interface, the seed is loaded in 32-bit chunks and stored in a BRAM. The preprocess unit splits the 32-bit chunks from SHAKE256 into 24-bit integers and stores them in the ctx_RAM, after which the threshold check and reduction are performed. For the reduction, we use Barrett reduction [6]. Unlike the variable Barrett reduction discussed in Sect. 2.1.A, this Barrett reduction is optimized because we always reduce the inputs modulo a fixed value (n). After the reduction, the integer values (locations) are stored in a BRAM.
Once the locations BRAM is filled, the duplicate detection module is triggered. It detects potential duplicate values in the locations BRAM by traversing all address locations and updating the value stored in a dual-ported BRAM. While the duplicate detection module checks for duplicates, the SHAKE256 module generates the next \(24 \times w\)-bit string, which is used to replace any duplicates, and stores it in another BRAM. This way, we can hide the clock cycles taken for seed expansion.
Our hardware design uses a PRNG to generate the uniformly random bits required for fixed-weight vector generation from an input seed of 320 bits. The design includes this PRNG in the form of SHAKE256 and assumes that the seed is initialized by some other hardware module implementing a true random number generator. Keeping the design constant time for up to ACCEPTABLE_REJECTIONS rejected indices comes at the cost of extra area for more storage and extra clock cycles.
The extra area is needed because we generate additional uniformly random bits (their number depends on the parameter value) in advance and store them in a BRAM. The extra clock cycles are needed because, even after we have found the required number of indices below the threshold value, we still go over all the locations to maintain constant-time behavior. For the duplicate detection logic inside the duplicate detection module, the control logic is designed to take the same number of cycles whether or not a duplicate is detected.
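The flow described above can be modelled in software. The following Python sketch is only an illustration and not the authors' Verilog: it assumes the hqc-128 values n = 17669 and w = 66, a big-endian packing of each 24-bit integer, a plain SHAKE256 call without any domain separation, and an arbitrary ACCEPTABLE_REJECTIONS value of 8. The function names are hypothetical. It shows the threshold check, a fixed-modulus Barrett reduction, and a selection loop whose trip count does not depend on how many candidates are rejected.

```python
import hashlib

# All constants below are assumptions made for this sketch.
N = 17669                    # vector length n (assumed hqc-128 value)
W = 66                       # target weight w (assumed hqc-128 value)
ACCEPTABLE_REJECTIONS = 8    # illustrative value only
K = 24                       # width of each candidate integer in bits

BARRETT_M = 2**K // N        # precomputed constant, since n is fixed
THRESHOLD = BARRETT_M * N    # largest multiple of n not exceeding 2^24


def barrett_reduce(x):
    """Reduce a 24-bit integer modulo the fixed value N."""
    q = (x * BARRETT_M) >> K         # estimate of x // N, at most 1 too small
    r = x - q * N
    return r - N if r >= N else r    # single correction step


def fnb_sample(seed):
    """Return (positions, success) for one fixed-weight vector."""
    total = W + ACCEPTABLE_REJECTIONS
    # Stand-in for the seed expansion: all candidates are generated up front.
    stream = hashlib.shake_256(seed).digest(3 * total)
    seen = bytearray(N)              # models the duplicate-detection memory
    positions = []
    for i in range(total):           # trip count is independent of the data
        x = int.from_bytes(stream[3 * i:3 * i + 3], "big")
        pos = barrett_reduce(x)
        passes_threshold = x < THRESHOLD
        is_new = seen[pos] == 0
        if passes_threshold and is_new and len(positions) < W:
            seen[pos] = 1
            positions.append(pos)
    # success is False only if more than ACCEPTABLE_REJECTIONS
    # candidates were rejected or duplicated.
    return positions, len(positions) == W


if __name__ == "__main__":
    seed = bytes(40)                 # 320-bit all-zero seed, for demonstration
    positions, success = fnb_sample(seed)
    print(len(positions), success)
```

Python itself gives no timing guarantees, so the sketch only mirrors the control flow: every one of the w + ACCEPTABLE_REJECTIONS candidates is visited regardless of how early the required w positions are found.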
Table 5 compares our new FNB design to the CWW design. The area results shown in Table 5 exclude the SHAKE256 module, as SHAKE256 is shared among all primitives. Although the CWW algorithm ensures constant-time behavior when generating fixed-weight vectors, there is a small bias between the uniform distribution and the algorithm’s output. The new FNB algorithm, in contrast, has no bias. Further, FNB is faster than CWW, and its time-area product is better. These benefits come at the cost of an extremely small probability that the design is not constant time, which happens only if there are more rejections than \(w_r\). Table 5 shows that the probability of non-constant-time behavior for FNB can be \(2^{-200}\) or even smaller. To compute the failure probability (given in Table 5) for each parameter set, we take into account both the threshold check failure and the duplicate detection probabilities for the respective parameter set.
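One way to reason about such a failure probability is sketched below in Python. This is a simplified upper-bound model and not necessarily the computation behind Table 5: it assumes each candidate is rejected with probability at most p = p_thr + w/n, where p_thr is the chance of failing the threshold check and w/n bounds the chance of hitting an already chosen position, and it then takes the binomial tail for more than `extra` rejections among w + `extra` candidates. The function names and the example values (n = 17669, w = 66) are assumptions for illustration.

```python
from fractions import Fraction
from math import comb, log2


def failure_bound(n, w, extra, k=24):
    """Upper bound on P[more than `extra` of the w + extra candidates are rejected]."""
    threshold = (2**k // n) * n
    p_thr = Fraction(2**k - threshold, 2**k)   # threshold-check rejection
    p_dup = Fraction(w, n)                     # crude bound on a duplicate
    p = p_thr + p_dup
    total = w + extra
    # Probability that at most `extra` candidates are rejected.
    at_most = sum(
        comb(total, j) * p**j * (1 - p)**(total - j) for j in range(extra + 1)
    )
    return 1 - at_most


def log2_fraction(f):
    """log2 of a positive Fraction without converting it to a float first."""
    return log2(f.numerator) - log2(f.denominator)


if __name__ == "__main__":
    for extra in (0, 4, 8, 16):
        bound = failure_bound(17669, 66, extra)
        print(extra, round(log2_fraction(bound), 1))
```

Exact rational arithmetic is used so that very small tail probabilities do not underflow. Under this model the bound shrinks quickly as the number of tolerated rejections grows, which is the trade-off between storage and failure probability that the text describes.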
Comparison to Hardware Designs for Other Round 4 Algorithms
We also provide Table 14, which tabulates the latest hardware implementations of all other post-quantum cryptographic algorithms from the fourth round of NIST’s standardization process, plus the to-be-standardized Kyber algorithm. We focus on comparing the hardware designs for the lowest security level, Level 1, as all publications give clear time and area numbers for this level.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Deshpande, S., Xu, C., Nawan, M., Nawaz, K., Szefer, J. (2024). Fast and Efficient Hardware Implementation of HQC. In: Carlet, C., Mandal, K., Rijmen, V. (eds) Selected Areas in Cryptography – SAC 2023. SAC 2023. Lecture Notes in Computer Science, vol 14201. Springer, Cham. https://doi.org/10.1007/978-3-031-53368-6_15
DOI: https://doi.org/10.1007/978-3-031-53368-6_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53367-9
Online ISBN: 978-3-031-53368-6
eBook Packages: Computer Science (R0)