Analyzing block locality in Morton-order and Morton-hybrid matrices

Published: 01 September 2007

Abstract

As computer architectures change, introducing more caches onto multicore chips, even more locality becomes necessary. With the bandwidth between caches and RAM now even more valuable, additional locality from new matrix representations will be important to keep multiple processors busy. The default storage representations of C and Fortran, row-major and column-major respectively, have fundamental deficiencies for many matrix computations. By switching the storage representation from Cartesian to block indices, one is able to take better advantage of cache locality at all levels, from L1 to paging. This paper changes only the storage representation, from row-major to Morton-hybrid, and applies it to matrix multiplication. Its purpose is to show that, even with only traditional iterative algorithms, simply changing the storage representation offers significant speedups.
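The index arithmetic behind such a layout can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes that "Morton-hybrid" means square tiles stored row-major internally, with the tiles themselves laid out in Morton (bit-interleaved) order, and that the matrix order `n` is the tile size times a power of two. The names `dilate`, `morton_index`, `morton_hybrid_index`, `matmul_hybrid`, and the default tile size of 4 are hypothetical choices made for this example.

```python
def dilate(x):
    """Spread the bits of x so that bit k moves to bit 2k."""
    result = 0
    bit = 0
    while x:
        result |= (x & 1) << (2 * bit)
        x >>= 1
        bit += 1
    return result


def morton_index(row, col):
    """Interleave bits: row bits in odd positions, column bits in even ones."""
    return (dilate(row) << 1) | dilate(col)


def morton_hybrid_index(row, col, tile=4):
    """Flat offset of (row, col): tiles in Morton order, row-major inside a tile.

    Assumption: this is one plausible reading of "Morton-hybrid".
    """
    tile_offset = morton_index(row // tile, col // tile) * tile * tile
    return tile_offset + (row % tile) * tile + (col % tile)


def matmul_hybrid(a, b, n, tile=4):
    """Traditional iterative i-j-k product; only the storage layout differs.

    a and b are flat lists of length n * n in the hybrid layout above,
    with n assumed to be tile times a power of two.
    """
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += (a[morton_hybrid_index(i, k, tile)]
                        * b[morton_hybrid_index(k, j, tile)])
            c[morton_hybrid_index(i, j, tile)] = acc
    return c
```

Recomputing the offset on every access keeps the sketch short but is slow; a practical kernel would more likely walk tile by tile and update offsets incrementally.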



Published in

ACM SIGARCH Computer Architecture News, Volume 35, Issue 4
September 2007
59 pages
ISSN: 0163-5964
DOI: 10.1145/1327312

Copyright © 2007 Authors

Publisher: Association for Computing Machinery, New York, NY, United States
