Abstract
GPUs are widely used in applications that require large computational power. In this paper, we contribute to the cryptography and high-performance computing research communities by presenting techniques to accelerate symmetric block ciphers (AES-128, CAST-128, Camellia, SEED, IDEA, Blowfish and Threefish) on the NVIDIA GTX 980, which uses the Maxwell architecture. The proposed techniques consider various aspects of block cipher implementation on a GPU, including the placement of encryption keys and T-boxes in memory, thread block size, cipher operating mode, parallel granularity and data copy between CPU and GPU. We propose a new method that stores the encryption keys in registers, which have high access speed, and exchanges them with other threads by using the GPU's warp shuffle operation. The block ciphers implemented in this paper operate in counter (CTR) mode and achieve encryption speeds of 149 Gbps (AES-128), 143 Gbps (CAST-128), 124 Gbps (Camellia), 112 Gbps (SEED), 149 Gbps (IDEA), 111 Gbps (Blowfish) and 197 Gbps (Threefish). To the best of our knowledge, this is the first implementation of block ciphers that exploits warp shuffle, an advanced feature of NVIDIA GPUs. In addition, a block cipher operating in CTR mode can be used as a pseudorandom number generator (PRNG), but it is usually slower than PRNGs built from lighter operations. Hence, we modify IDEA and Blowfish to achieve faster pseudorandom number generation. The modified IDEA and Blowfish pass all tests in the NIST Statistical Test Suite and TestU01 SmallCrush, but not the more stringent TestU01 batteries (Crush and BigCrush).
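To illustrate why CTR mode suits both GPU parallelism and pseudorandom number generation, the following is a minimal Python sketch, not the paper's GPU implementation. It assumes a 64-bit block (as in IDEA and Blowfish) and uses a truncated SHA-256 call as a hypothetical stand-in for the block cipher; the names `toy_block_function`, `ctr_keystream_block` and `ctr_xor` are invented for this example. The point it shows is that each keystream block depends only on the key, the nonce and the block index, so blocks can be computed independently and in any order.

```python
import hashlib

BLOCK_BYTES = 8  # 64-bit block size, as in IDEA and Blowfish (assumption for this sketch)


def toy_block_function(key: bytes, counter: int) -> bytes:
    """Hypothetical stand-in for E_k(counter); NOT one of the paper's ciphers."""
    data = key + counter.to_bytes(BLOCK_BYTES, "big")
    return hashlib.sha256(data).digest()[:BLOCK_BYTES]


def ctr_keystream_block(key: bytes, nonce: int, index: int) -> bytes:
    """Return keystream block `index`; it depends only on (key, nonce, index)."""
    counter = (nonce + index) % (1 << (8 * BLOCK_BYTES))  # wrap to the block width
    return toy_block_function(key, counter)


def ctr_xor(key: bytes, nonce: int, data: bytes) -> bytes:
    """Encrypt or decrypt `data` in CTR mode (both directions are the same XOR)."""
    out = bytearray()
    n_blocks = (len(data) + BLOCK_BYTES - 1) // BLOCK_BYTES
    for index in range(n_blocks):  # each iteration is independent of the others
        chunk = data[index * BLOCK_BYTES:(index + 1) * BLOCK_BYTES]
        stream = ctr_keystream_block(key, nonce, index)
        out.extend(c ^ s for c, s in zip(chunk, stream))
    return bytes(out)
```

Applying `ctr_xor` to an all-zero buffer returns the raw keystream, which is the sense in which a CTR-mode cipher acts as a PRNG. On a GPU, the loop body would typically map to one thread per block index, although the actual granularity is a design choice.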




Acknowledgments
This work was partially supported by the Universiti Tunku Abdul Rahman Research Fund (UTARRF) under Grant IPSR/RMC/UTARRF/2012-C2/L04. We would also like to thank all members of the Accelerative Technology Lab, MIMOS, Malaysia for their great support. This research is also partially supported by the Ministry of Science, Technology and Innovation (MOSTI), Malaysia under Grant 01-02-11-SF0202.
Cite this article
Lee, WK., Cheong, HS., Phan, R.CW. et al. Fast implementation of block ciphers and PRNGs in Maxwell GPU architecture. Cluster Comput 19, 335–347 (2016). https://doi.org/10.1007/s10586-016-0536-2
DOI: https://doi.org/10.1007/s10586-016-0536-2