An implementation of matrix–matrix multiplication on the Intel KNL processor with AVX-512

Lim, Roktaek; Lee, Yeongha; Kim, Raehyun; Choi, Jaeyoung

doi:10.1007/s10586-018-2810-y

Published: 01 June 2018

Cluster Computing, Volume 21, pages 1785–1795 (2018)

Roktaek Lim¹,
Yeongha Lee¹,
Raehyun Kim¹ &
…
Jaeyoung Choi ORCID: orcid.org/0000-0002-7321-9682¹

Abstract

The second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL), has recently emerged with a 2D tile mesh architecture and the Intel AVX-512 instructions. However, it is very difficult for general users to get the maximum performance from the new architecture, since they are not familiar with optimal cache reuse, efficient vectorization, and assembly language. In this paper, we illustrate several development strategies for achieving good performance in the C programming language, without the use of assembly language, using general matrix–matrix multiplication as the example. Our implementation is based on blocked matrix multiplication, an optimization technique that improves data reuse. We use data prefetching, loop unrolling, and the Intel AVX-512 instructions to optimize the blocked matrix multiplications. On a single core of the KNL, our implementation achieves up to 98% of the performance of SGEMM and 99% of that of DGEMM in the Intel MKL, the current state-of-the-art library. Our parallel DGEMM implementation using all 68 cores of the KNL achieves up to 90% of the performance of the Intel MKL DGEMM.
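To make the idea of blocked matrix multiplication with loop unrolling concrete, the following is a minimal sketch in portable C. It is not the authors' implementation: the function names `dgemm_naive` and `dgemm_blocked`, the row-major square-matrix layout, the block size `BLOCK`, and the unroll factor of 4 are all assumptions chosen for illustration, and the sketch uses neither prefetching nor AVX-512 intrinsics.

```c
#include <stddef.h>

/* Hypothetical block size for illustration only; a tuned implementation
   would choose block sizes to match the cache hierarchy. */
#define BLOCK 64

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Reference version: C += A * B for n x n row-major matrices. */
void dgemm_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < n; ++k) {
            const double a = A[i * n + k];
            for (size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
}

/* Blocked version: the same computation, but performed on
   BLOCK x BLOCK sub-matrices so that each tile of A, B, and C is
   reused while it is still likely to be in cache. */
void dgemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        const size_t i_end = min_size(ii + BLOCK, n);
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            const size_t k_end = min_size(kk + BLOCK, n);
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                const size_t j_end = min_size(jj + BLOCK, n);
                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        const double a = A[i * n + k];
                        const double *b = &B[k * n];
                        double *c = &C[i * n];
                        size_t j = jj;
                        /* Inner loop unrolled by 4 (assumed factor). */
                        for (; j + 4 <= j_end; j += 4) {
                            c[j]     += a * b[j];
                            c[j + 1] += a * b[j + 1];
                            c[j + 2] += a * b[j + 2];
                            c[j + 3] += a * b[j + 3];
                        }
                        /* Remainder when the tile width is not a multiple of 4. */
                        for (; j < j_end; ++j)
                            c[j] += a * b[j];
                    }
                }
            }
        }
    }
}
```

Both functions accumulate into `C`, so the caller is expected to initialize it (for example to zero) beforehand. In a vectorized variant, the unrolled inner loop is the natural place where wide SIMD operations would be substituted, though how the paper's implementation does this is not described in the abstract.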

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2015M3C4A7075662).

Author information

Authors and Affiliations

Soongsil University, Seoul, 06978, Korea
Roktaek Lim, Yeongha Lee, Raehyun Kim & Jaeyoung Choi

Corresponding author

Correspondence to Jaeyoung Choi.

About this article

Cite this article

Lim, R., Lee, Y., Kim, R. et al. An implementation of matrix–matrix multiplication on the Intel KNL processor with AVX-512. Cluster Comput 21, 1785–1795 (2018). https://doi.org/10.1007/s10586-018-2810-y

Received: 18 October 2017
Revised: 28 February 2018
Accepted: 08 May 2018
Published: 01 June 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s10586-018-2810-y
