
Auto-tuning GEMM kernels on the Intel KNL and Intel Skylake-SP processors

The Journal of Supercomputing

Abstract

The general matrix–matrix multiplication (GEMM) is a core building block for implementing the Basic Linear Algebra Subprograms (BLAS). This paper presents a methodology for automatically producing matrix–matrix multiplication kernels tuned for the Intel Xeon Phi processor code-named Knights Landing and for Intel Skylake-SP processors, using AVX-512 intrinsic functions. The architectures of the latest manycore processors have grown complex in their levels of parallelism and cache hierarchies, so it is not easy to find the best combination of optimization techniques for a given application. Our approach produces matrix multiplication kernels through heuristic auto-tuning: it generates multiple candidate kernels and selects the fastest ones through performance tests. The tuning parameters include the sizes of the block matrices for registers and caches, prefetch distances, and loop unrolling depths. Parameters for multithreaded execution, such as which loops to parallelize and the optimal number of threads for each, are also investigated. We also present a method to reduce the parameter search space based on our previous research results.
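To make these tuning parameters concrete, below is a minimal sketch, in C with AVX-512 intrinsics, of the kind of micro-kernel such an auto-tuner generates and times. It is not the paper's generated code: the register-blocking sizes MR and NR, the prefetch distance PF_DIST, and the kernel name are illustrative assumptions. A tuner would instantiate this template for many (MR, NR, PF_DIST, unroll) combinations, benchmark each candidate, and keep the fastest.

```c
/* Hypothetical AVX-512 double-precision micro-kernel sketch.
 * MR, NR, and PF_DIST stand in for the register-blocking and
 * prefetch-distance tuning parameters described in the abstract.
 * Compile with, e.g., gcc -O3 -mavx512f.
 */
#include <immintrin.h>

#define MR      8   /* rows of C held in registers              */
#define NR      8   /* columns of C: one 8-wide zmm per row     */
#define PF_DIST 64  /* candidate prefetch distance, in elements */

/* C (MR x NR, row-major, leading dimension ldc) +=
 *   A (MR x k, row-major, packed) * B (k x NR, row-major, packed) */
static void micro_kernel_8x8(int k, const double *A, const double *B,
                             double *C, int ldc)
{
    __m512d acc[MR];
    for (int i = 0; i < MR; i++)
        acc[i] = _mm512_loadu_pd(&C[i * ldc]);    /* load the C block */

    for (int p = 0; p < k; p++) {
        /* Software prefetch of a future row of B; the distance is a
         * tuning parameter.  Prefetch never faults, so reading past
         * the end of B is safe. */
        _mm_prefetch((const char *)&B[(p + PF_DIST) * NR], _MM_HINT_T0);

        __m512d b = _mm512_loadu_pd(&B[p * NR]);  /* one row of packed B */
        for (int i = 0; i < MR; i++) {
            __m512d a = _mm512_set1_pd(A[i * k + p]); /* broadcast A(i,p) */
            acc[i] = _mm512_fmadd_pd(a, b, acc[i]);   /* fused multiply-add */
        }
    }

    for (int i = 0; i < MR; i++)
        _mm512_storeu_pd(&C[i * ldc], acc[i]);    /* write the C block back */
}
```

With NR = 8, each row of the C block occupies one 512-bit zmm register (eight doubles), so this kernel needs MR + 2 registers: the accumulators plus the a and b operands. The 32 zmm registers of AVX-512 therefore bound the feasible (MR, NR) candidates, which is one way a tuner can prune the parameter search space before timing anything.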



Acknowledgements

The work was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Korea government (MSIT) (NRF-2015M3C4A7065662).

Author information


Corresponding author

Correspondence to Jaeyoung Choi.


About this article


Cite this article

Lim, R., Lee, Y., Kim, R. et al. Auto-tuning GEMM kernels on the Intel KNL and Intel Skylake-SP processors. J Supercomput 75, 7895–7908 (2019). https://doi.org/10.1007/s11227-018-2702-1
