Abstract
We describe a model of hierarchical memories and use it to determine an optimal strategy for blocking the operand matrices of matrix multiplication. The model extends an earlier related model by three of the authors. As before, the model predicts the form of current, state-of-the-art L1 kernels. Additionally, it shows that current L1 kernels can continue to deliver their high performance on operand matrices as large as the L2 cache. For a hierarchical memory with L memory levels (main memory and L-1 caches), our model reduces the number of potential matrix multiply algorithms from 6^L to four, and we use the shape of the input matrix operands to select one of these four algorithms. Previously, the count was 2^L and the model was independent of the operand shapes. Because of space limitations, we do not include performance results.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
Cite this paper
Gunnels, J.A., Gustavson, F.G., Henry, G.M., van de Geijn, R.A. (2006). A Family of High-Performance Matrix Multiplication Algorithms. In: Dongarra, J., Madsen, K., Waśniewski, J. (eds) Applied Parallel Computing. State of the Art in Scientific Computing. PARA 2004. Lecture Notes in Computer Science, vol 3732. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11558958_30
DOI: https://doi.org/10.1007/11558958_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29067-4
Online ISBN: 978-3-540-33498-9