Parallel bitsliced AES through PHAST: a single-source high-performance library for multi-cores and GPUs

Peccerillo, Biagio; Bartolini, Sandro; Koç, Çetin Kaya

doi:10.1007/s13389-017-0175-4

Regular Paper
Published: 29 October 2017

Journal of Cryptographic Engineering, Volume 9, pages 159–171 (2019)

Abstract

The PHAST library is a high-level, heterogeneous, STL-like C++ library that can target multi-core processors and Nvidia GPUs. It lets programmers exploit the performance of modern parallel architectures without the complexity of parallel programming. The library manages the programming and critical fine-tuning of the parallel execution on both platforms without interfering with the application code structure, while still allowing the use of architecture-specific features and instructions. In cryptography, the performance and architectural efficiency of software implementations are crucial, as witnessed by the extensive research on highly optimized and specialized versions of many protocols. In this paper, we assess the performance overhead and productivity improvement achievable through the PHAST library. We implement a pseudo-random number generator (PRNG) based on cache-timing-attack-resistant AES and compare it with the fastest implementations in both the CPU and Nvidia GPU domains. The results show that the PHAST code is shorter and simpler than the state-of-the-art implementations. Its source length is 59.59% of the reference CUDA C implementation and 88.18% of the sequential C++ version for CPUs, even though the same source serves both targets. It is also far less complex in terms of McCabe’s and Halstead’s metrics. These productivity improvements come with a limited performance overhead from the library layer: less than 5% on single-thread execution for CPUs and around 10% on Nvidia GPUs. Furthermore, the performance of the PHAST PRNG automatically scales with the available cores in a nearly linear fashion, allowing programmers to fully exploit multi-core resources with the same source code.
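To make the term "bitsliced" in the title concrete, the following is a minimal sketch in plain C++. It is not the authors' code and uses no PHAST API; the helper names (`bitslice`, `unbitslice`, `xtime_sliced`, `xtime_scalar`) are hypothetical. It assumes a 32-lane layout in which bit b of 32 independent bytes is packed into one 32-bit word, so that a single XOR processes all 32 bytes at once. The example applies this to the GF(2^8) doubling step used in AES MixColumns, which in bitsliced form needs neither table lookups nor data-dependent branches, the property that makes such code resistant to cache-timing attacks.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bytes  = std::array<std::uint8_t, 32>;   // 32 independent bytes (one per lane)
using Planes = std::array<std::uint32_t, 8>;   // planes[b] = bit b of every lane

// Hypothetical helper: transpose 32 bytes into 8 bit-planes.
Planes bitslice(const Bytes& bytes) {
    Planes planes{};
    for (unsigned lane = 0; lane < 32; ++lane)
        for (unsigned bit = 0; bit < 8; ++bit)
            planes[bit] |= static_cast<std::uint32_t>((bytes[lane] >> bit) & 1u) << lane;
    return planes;
}

// Hypothetical helper: inverse transposition, back to one byte per lane.
Bytes unbitslice(const Planes& planes) {
    Bytes bytes{};
    for (unsigned lane = 0; lane < 32; ++lane)
        for (unsigned bit = 0; bit < 8; ++bit)
            bytes[lane] |= static_cast<std::uint8_t>(((planes[bit] >> lane) & 1u) << bit);
    return bytes;
}

// Scalar reference: multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1.
std::uint8_t xtime_scalar(std::uint8_t a) {
    return static_cast<std::uint8_t>((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

// Bitsliced version: every plane moves up by one position and the old top
// plane is XORed into positions 0, 1, 3 and 4 (the set bits of 0x1b).
// Only XORs and moves are used, applied to all 32 lanes simultaneously.
Planes xtime_sliced(const Planes& p) {
    return { p[7], p[0] ^ p[7], p[1], p[2] ^ p[7], p[3] ^ p[7], p[4], p[5], p[6] };
}

// Self-check: the sliced result must match the scalar one in every lane.
bool sliced_matches_scalar(const Bytes& in) {
    const Bytes out = unbitslice(xtime_sliced(bitslice(in)));
    for (unsigned lane = 0; lane < 32; ++lane)
        if (out[lane] != xtime_scalar(in[lane])) return false;
    return true;
}
```

Calling `sliced_matches_scalar` on any 32-byte input checks the sliced path against the scalar reference. A full bitsliced AES applies the same idea to every round operation, with the S-box expressed as a Boolean circuit instead of a lookup table.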

Notes

The exact number of these components depends on the generation and model of the graphics card.
Within the global memory of the video card.

Acknowledgements

We would like to thank Rone Kwei Lim for sharing with us the source code of his CUDA AES-based PRNG, which constituted a valuable reference for the experimental work described in this paper.

Author information

Authors and Affiliations

Dipartimento di Ingegneria dell’Informazione e Scienze Matematiche, Università degli Studi di Siena, Siena, Italy
Biagio Peccerillo & Sandro Bartolini
Department of Computer Science, University of California, Santa Barbara, CA, 93106, USA
Çetin Kaya Koç

Corresponding author

Correspondence to Biagio Peccerillo.

About this article

Cite this article

Peccerillo, B., Bartolini, S. & Koç, Ç.K. Parallel bitsliced AES through PHAST: a single-source high-performance library for multi-cores and GPUs. J Cryptogr Eng 9, 159–171 (2019). https://doi.org/10.1007/s13389-017-0175-4

Received: 24 January 2017
Accepted: 16 October 2017
Published: 29 October 2017
Issue Date: 01 June 2019
DOI: https://doi.org/10.1007/s13389-017-0175-4
