Abstract
The PHAST library is a high-level, heterogeneous, STL-like C++ library that can target both multi-core processors and Nvidia GPUs. It lets programmers exploit the performance of modern parallel architectures without the complexity of parallel programming. The library handles the programming and the critical fine-tuning of parallel execution on both platforms without interfering with the structure of the application code, while still allowing the use of architecture-specific features and instructions. In cryptography, the performance and architectural efficiency of software implementations are crucial, as witnessed by the extensive research on highly optimized and specialized versions of many protocols. In this paper, we assess the performance overhead and the productivity improvement achievable through the PHAST library. We implement a pseudo-random number generator (PRNG) based on an AES implementation that is resistant to cache-timing attacks, and we compare it with the fastest implementations in both the CPU and the Nvidia GPU domains. The results show that the PHAST code is shorter and simpler than the state-of-the-art implementations: its source length is 59.59% of the reference CUDA C implementation and 88.18% of the sequential C++ version for CPUs, even though the same PHAST source serves both targets. It is also far less complex in terms of McCabe’s and Halstead’s metrics. These productivity improvements come with a limited performance overhead from the library layer: less than 5% in single-thread execution on CPUs and around 10% on Nvidia GPUs. Furthermore, the performance of the PHAST PRNG scales nearly linearly with the number of available cores, allowing programmers to fully exploit multi-core resources with the same source code.
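As a rough illustration of the STL-like, functor-based style the abstract refers to, the sketch below uses only the standard C++ library; it is not the PHAST API. The names `counter_mix` and `generate_chunked` are hypothetical, the counter-based structure is an assumption made for the example, and the mixing function is a simple non-cryptographic stand-in rather than AES. Each output word depends only on the key and on its own counter value, so the range can be split into independent chunks without changing the result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the block cipher: a SplitMix64-style mixer.
// It is NOT AES and NOT cryptographically secure; it only plays the role
// of "keyed function applied to a counter" for this illustration.
struct counter_mix {
    std::uint64_t key;

    std::uint64_t operator()(std::uint64_t counter) const {
        std::uint64_t z = counter + key + 0x9E3779B97F4A7C15ULL;
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        return z ^ (z >> 31);
    }
};

// Fills a vector with n pseudo-random words for counters first, first+1, ...
// The work is done chunk by chunk; every chunk is independent of the others,
// which is the property a parallel back end could exploit.
std::vector<std::uint64_t> generate_chunked(std::uint64_t key,
                                            std::uint64_t first,
                                            std::size_t n,
                                            std::size_t chunks) {
    assert(chunks > 0);
    std::vector<std::uint64_t> out(n);
    std::iota(out.begin(), out.end(), first);  // counter values

    const std::size_t step = (n + chunks - 1) / chunks;  // ceil(n / chunks)
    for (std::size_t begin = 0; begin < n; begin += step) {
        const std::size_t end = std::min(begin + step, n);
        // STL-style algorithm + functor: transform counters in place.
        std::transform(out.begin() + begin, out.begin() + end,
                       out.begin() + begin, counter_mix{key});
    }
    return out;
}
```

Calling `generate_chunked(key, 0, n, 1)` and `generate_chunked(key, 0, n, 4)` yields the same sequence, which is why the chunks could in principle be handed to separate cores or GPU threads with no change to the functor.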
Notes
The exact number of these components depends on the generation and model of the graphics card.
Within the global memory of the video card.
Acknowledgements
We would like to thank Rone Kwei Lim for sharing with us the source code of his CUDA AES-based PRNG, which constituted a valuable reference for the experimental work described in this paper.
Cite this article
Peccerillo, B., Bartolini, S. & Koç, Ç.K. Parallel bitsliced AES through PHAST: a single-source high-performance library for multi-cores and GPUs. J Cryptogr Eng 9, 159–171 (2019). https://doi.org/10.1007/s13389-017-0175-4
DOI: https://doi.org/10.1007/s13389-017-0175-4