Abstract
As computer architectures evolve, placing more caches onto multicore chips, locality becomes ever more important. With the bandwidth between caches and RAM now even more precious, the additional locality offered by new matrix representations will be essential to keeping multiple processors busy. The default storage representations of C and Fortran, row-major and column-major respectively, have fundamental deficiencies for many matrix computations. By switching the storage representation from Cartesian to block indices, one can take better advantage of cache locality at every level, from L1 to paging. This paper changes only the storage representation, from row-major to Morton-hybrid, and applies it to matrix multiplication. Its purpose is to show that, even with only traditional iterative algorithms, simply changing the storage representation offers significant speedups.
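For concreteness, the index mapping behind such a layout can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function names and the block size are hypothetical. Blocks are laid out in Morton (Z-curve) order by interleaving the bits of their block coordinates, while elements within each block stay in ordinary row-major order:

```python
def interleave_bits(x: int) -> int:
    """Spread the low 16 bits of x into the even bit positions (0, 2, 4, ...)."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def morton_index(row: int, col: int) -> int:
    """Pure Morton order: interleave row bits (odd positions) with col bits (even)."""
    return (interleave_bits(row) << 1) | interleave_bits(col)

def morton_hybrid_index(row: int, col: int, block: int = 4) -> int:
    """Morton-hybrid: Morton order over blocks, row-major within each block.

    `block` is an illustrative block size; in practice it would be tuned
    so that one block fits in a cache line or L1 cache.
    """
    br, bc = row // block, col // block     # block coordinates, Morton-ordered
    ir, ic = row % block, col % block       # element coordinates, row-major
    return morton_index(br, bc) * block * block + ir * block + ic
```

Because neighboring blocks along the Z-curve are stored contiguously, a blocked matrix-multiplication loop touches memory in runs that stay within a cache line, a page, or a TLB entry, which is the source of the locality the abstract describes.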
Index Terms
- Analyzing block locality in Morton-order and Morton-hybrid matrices

Published in MEDEA '06: Proceedings of the 2006 workshop on MEmory performance: DEaling with Applications, systems and architectures.