Using Non-canonical Array Layouts in Dense Matrix Operations

  • Conference paper
Applied Parallel Computing. State of the Art in Scientific Computing (PARA 2006)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4699)

Abstract

We present two implementations of dense matrix multiplication based on two different non-canonical array layouts: one based on a hypermatrix data structure (HM) where data submatrices are stored using a recursive layout; the other based on a simple block data layout with square blocks (SB) where blocks are arranged in column-major order. We show that the iterative code using SB outperforms a recursive code using HM and obtains competitive results on a variety of platforms.

This work was supported by the Ministerio de Educación y Ciencia of Spain (TIN2004-07739-C02-01).
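The square-block (SB) layout described in the abstract can be made concrete with a short sketch. The following is a minimal, illustrative Python example, not the authors' code: it copies a matrix into an SB layout (square b x b blocks arranged in column-major block order, elements row-major within each block, both ordering choices being assumptions for illustration) and multiplies two matrices directly in that layout, so each inner kernel touches only contiguous blocks.

```python
def to_sb(A, n, b):
    """Copy a row-major n x n matrix (list of lists) into SB layout:
    b x b blocks in column-major block order, row-major within a block."""
    assert n % b == 0
    nb = n // b
    out = [0.0] * (n * n)
    for J in range(nb):          # block column
        for I in range(nb):      # block row: column-major block order
            base = (J * nb + I) * b * b
            for i in range(b):
                for j in range(b):
                    out[base + i * b + j] = A[I * b + i][J * b + j]
    return out

def sb_matmul(A, B, n, b):
    """C = A * B with all three matrices stored in SB layout."""
    nb = n // b
    C = [0.0] * (n * n)
    for J in range(nb):
        for I in range(nb):
            cbase = (J * nb + I) * b * b
            for K in range(nb):
                # A block (I, K) and B block (K, J) are each contiguous,
                # which is the locality benefit the SB layout provides.
                abase = (K * nb + I) * b * b
                bbase = (J * nb + K) * b * b
                for i in range(b):
                    for j in range(b):
                        s = C[cbase + i * b + j]
                        for k in range(b):
                            s += A[abase + i * b + k] * B[bbase + k * b + j]
                        C[cbase + i * b + j] = s
    return C
```

In a tuned implementation the innermost b x b loop nest would be replaced by a hand- or compiler-optimized kernel; the point of the sketch is only the index arithmetic that maps block coordinates to a contiguous, column-major-by-block storage order.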



Editor information

Bo Kågström, Erik Elmroth, Jack Dongarra, Jerzy Waśniewski

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Herrero, J.R., Navarro, J.J. (2007). Using Non-canonical Array Layouts in Dense Matrix Operations. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds) Applied Parallel Computing. State of the Art in Scientific Computing. PARA 2006. Lecture Notes in Computer Science, vol 4699. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75755-9_70

  • DOI: https://doi.org/10.1007/978-3-540-75755-9_70

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-75754-2

  • Online ISBN: 978-3-540-75755-9

  • eBook Packages: Computer Science (R0)
