Abstract
As computer architectures evolve, placing more caches onto multicore chips, locality becomes ever more important. With the bandwidth between caches and RAM now even more precious, the additional locality offered by new matrix representations will be essential to keeping multiple processors busy. The default storage representations of C and Fortran, row-major and column-major respectively, have fundamental deficiencies for many matrix computations. By switching the storage representation from Cartesian to block indices, one can take better advantage of cache locality at every level, from L1 to paging. This paper changes only the storage representation, from row-major to Morton-hybrid, and applies it to matrix multiplication. Its purpose is to show that, even with only traditional iterative algorithms, simply changing the storage representation offers significant speedups.
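For concreteness, the index mapping behind such a layout can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function names and the block size are hypothetical. Blocks are laid out in Morton (Z-curve) order by interleaving the bits of their block coordinates, while elements within each block stay in ordinary row-major order:

```python
def interleave_bits(x: int) -> int:
    """Spread the low 16 bits of x into the even bit positions (0, 2, 4, ...)."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def morton_index(row: int, col: int) -> int:
    """Pure Morton order: interleave row bits (odd positions) with col bits (even)."""
    return (interleave_bits(row) << 1) | interleave_bits(col)

def morton_hybrid_index(row: int, col: int, block: int = 4) -> int:
    """Morton-hybrid: Morton order over blocks, row-major within each block.

    `block` is an illustrative block size; in practice it would be tuned
    so that one block fits in a cache line or L1 cache.
    """
    br, bc = row // block, col // block     # block coordinates, Morton-ordered
    ir, ic = row % block, col % block       # element coordinates, row-major
    return morton_index(br, bc) * block * block + ir * block + ic
```

Because neighboring blocks along the Z-curve are stored contiguously, a blocked matrix-multiplication loop touches memory in runs that stay within a cache line, a page, or a TLB entry, which is the source of the locality the abstract describes.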
Index Terms
- Analyzing block locality in Morton-order and Morton-hybrid matrices

Published in MEDEA '06: Proceedings of the 2006 workshop on MEmory performance: DEaling with Applications, systems and architectures.