Superscalar GEMM-based level 3 BLAS—The on-going evolution of a portable and high-performance library

Gustavson, Fred; Henriksson, André; Jonsson, Isak; Kågström, Bo; Ling, Per

doi:10.1007/BFb0095338

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1541)

Included in the following conference series:

International Workshop on Applied Parallel Computing

Abstract

Recently, a first version of our GEMM-based level 3 BLAS for superscalar processors was announced. A new feature is the inclusion of DGEMM itself. This DGEMM routine contains, inline, what we call a level 3 kernel routine, which is based on register blocking. Additionally, it features level 1 cache blocking and data copying of submatrix operands for the level 3 kernel. Our other BLAS’s that have triangular operands, e.g., DTRSM and DSYRK, use a similar level 3 kernel routine to handle the triangular blocks that appear on the diagonal of the larger input triangular operand. As in our previous GEMM-based work, all other BLAS’s perform the dominating part of the computations in calls to DGEMM. We are seeing the adoption of our BLAS’s by several organizations, including the ATLAS and PHiPAC projects on automatic generation of fast DGEMM kernels for superscalar processors, and some computer vendors. We present the evolution of the superscalar GEMM-based level 3 BLAS and describe new developments, including techniques that make the library applicable to symmetric multiprocessing (SMP) systems.
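The structure described in the abstract can be illustrated with a small sketch. It is not the authors' code: the library itself is not shown here, and the function names (`kernel`, `gemm`, `trsm_lower`), the 2x2 register block, and the block size `NB = 4` are assumptions chosen for readability. Python is used only to show the control flow; it says nothing about performance. The sketch shows a kernel with register-style accumulators, a GEMM that blocks for cache and copies submatrix operands, and a triangular solve that handles the diagonal blocks directly and leaves the rest of the work to GEMM calls.

```python
# Illustrative sketch only. All names and block sizes are hypothetical,
# not taken from the library described in the abstract.
NB = 4  # assumed cache block size; a real code would tune this to the L1 cache


def kernel(a_blk, b_blk, c, i0, j0):
    """C[i0.., j0..] += a_blk * b_blk, using 2x2 register blocking."""
    m, k, n = len(a_blk), len(b_blk), len(b_blk[0])
    for i in range(0, m, 2):
        for j in range(0, n, 2):
            has_i1, has_j1 = i + 1 < m, j + 1 < n  # guards for odd edges
            c00 = c01 = c10 = c11 = 0.0  # four scalars stand in for registers
            for p in range(k):
                a0 = a_blk[i][p]
                a1 = a_blk[i + 1][p] if has_i1 else 0.0
                b0 = b_blk[p][j]
                b1 = b_blk[p][j + 1] if has_j1 else 0.0
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            c[i0 + i][j0 + j] += c00
            if has_j1:
                c[i0 + i][j0 + j + 1] += c01
            if has_i1:
                c[i0 + i + 1][j0 + j] += c10
            if has_i1 and has_j1:
                c[i0 + i + 1][j0 + j + 1] += c11


def gemm(alpha, a, b, c, nb=NB):
    """C += alpha * A * B in place (non-empty matrices as lists of rows)."""
    m, k, n = len(a), len(b), len(c[0])
    for p0 in range(0, k, nb):
        for i0 in range(0, m, nb):
            # Copy the A block into a contiguous buffer, folding in alpha.
            a_blk = [[alpha * a[i][p] for p in range(p0, min(p0 + nb, k))]
                     for i in range(i0, min(i0 + nb, m))]
            for j0 in range(0, n, nb):
                # Copy the matching B block, then call the kernel on the copies.
                b_blk = [row[j0:j0 + nb] for row in b[p0:p0 + nb]]
                kernel(a_blk, b_blk, c, i0, j0)


def trsm_lower(l, b, nb=NB):
    """Solve L * X = B for lower-triangular L, overwriting B with X."""
    n = len(l)
    for d0 in range(0, n, nb):
        d1 = min(d0 + nb, n)
        # Diagonal block: small triangular solve by forward substitution.
        for i in range(d0, d1):
            for j in range(len(b[0])):
                s = b[i][j]
                for p in range(d0, i):
                    s -= l[i][p] * b[p][j]
                b[i][j] = s / l[i][i]
        # Remaining rows: B[d1:] -= L[d1:, d0:d1] * X[d0:d1], done by GEMM.
        if d1 < n:
            l_panel = [row[d0:d1] for row in l[d1:]]
            # b[d1:] shares its row lists with b, so the update is in place.
            gemm(-1.0, l_panel, b[d0:d1], b[d1:], nb)
```

In this sketch the diagonal blocks cost O(n·nb) operations per right-hand-side column, while the GEMM updates account for the remaining O(n²) work per column, which is the sense in which the dominating part of the computation ends up in GEMM.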

Author information

Authors and Affiliations

IBM T.J. Watson Research Center, P.O. Box 218, 10598, Yorktown Heights, NY, U.S.A.
Fred Gustavson
Department of Computing Science and HPC2N, Umeå University, S-901 87, Umeå, Sweden
André Henriksson, Isak Jonsson, Bo Kågström & Per Ling

Editor information

Bo Kågström Jack Dongarra Erik Elmroth Jerzy Waśniewski

About this paper

Cite this paper

Gustavson, F., Henriksson, A., Jonsson, I., Kågström, B., Ling, P. (1998). Superscalar GEMM-based level 3 BLAS—The on-going evolution of a portable and high-performance library. In: Kågström, B., Dongarra, J., Elmroth, E., Waśniewski, J. (eds) Applied Parallel Computing Large Scale Scientific and Industrial Problems. PARA 1998. Lecture Notes in Computer Science, vol 1541. Springer, Berlin, Heidelberg . https://doi.org/10.1007/BFb0095338

DOI: https://doi.org/10.1007/BFb0095338
Published: 20 October 2006
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65414-8
Online ISBN: 978-3-540-49261-0
eBook Packages: Springer Book Archive
