Abstract
A style for programming problems from matrix algebra is developed with a familiar example and new tools, yielding high performance with a couple of surprising exceptions. The underlying philosophy is to use block recursion as the exclusive control structure, at least down to a 2^p × 2^p base case, where hardware favors iterative style to fill its pipe. Use of Morton-ordered matrices yields excellent locality within the memory hierarchy, including block sharing among distributed computers. The recursion generalizes nicely to an SPMD program where such sharing is the only communication.
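To make the locality claim concrete, the following is a minimal sketch of Morton-order indexing by bit interleaving. The function names `dilate` and `morton_index` are hypothetical, and the convention of placing row bits in the odd positions is an assumption made for illustration, not necessarily the one used in the paper.

```python
def dilate(x: int) -> int:
    """Spread the bits of x apart so that bit k of x lands on bit 2k."""
    result = 0
    bit = 0
    while x:
        result |= (x & 1) << (2 * bit)
        x >>= 1
        bit += 1
    return result


def morton_index(row: int, col: int) -> int:
    """Offset of element (row, col) in a Morton-ordered array.

    Assumed convention: row bits occupy odd positions, column bits even ones.
    """
    return (dilate(row) << 1) | dilate(col)


def quadrant_offsets(row0: int, col0: int, size: int) -> list:
    """Sorted Morton offsets of the size-by-size block whose corner is (row0, col0)."""
    return sorted(
        morton_index(r, c)
        for r in range(row0, row0 + size)
        for c in range(col0, col0 + size)
    )
```

Under this convention each aligned 2^k × 2^k block occupies a contiguous run of offsets, so a recursive descent into quadrants touches one contiguous region of memory at every level, and a whole block can be handed to another node as a single contiguous segment.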
Cholesky factorization of an n × n symmetric positive definite (SPD) matrix is used as a simple nontrivial example to expose the paradigm. The program amounts to four functions, two of which are finalizers for the other two. This insight allows final blocks to be shared, with inter-node communication ∈ Θ(n^2) for an algorithm that performs Θ(n^3) flops.
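As an illustration of the block-recursive style, here is a small sequential sketch of a recursive Cholesky factorization. It is not the authors' four-function program: the names `cholesky_in_place`, `solve_panel`, `downdate`, and `lower` are hypothetical, the matrix is a plain row-major list of lists rather than a Morton-ordered array, the recursion runs down to 1 × 1 blocks, and the two helpers use loops for brevity where the paper would recur.

```python
import math


def cholesky_in_place(a, lo, n):
    """Overwrite the lower triangle of the n-by-n diagonal block at (lo, lo)
    with its Cholesky factor L, where the block equals L * L^T."""
    if n == 1:
        a[lo][lo] = math.sqrt(a[lo][lo])
        return
    h = n // 2
    mid = lo + h
    cholesky_in_place(a, lo, h)        # factor the northwest block: L11
    solve_panel(a, lo, h, mid, n - h)  # southwest block: L21 = A21 * L11^{-T}
    downdate(a, lo, h, mid, n - h)     # southeast block: A22 -= L21 * L21^T
    cholesky_in_place(a, mid, n - h)   # factor the updated southeast block: L22


def solve_panel(a, lo, h, mid, m):
    """Solve X * L11^T = A21 in place by forward substitution along each row."""
    for i in range(mid, mid + m):
        for j in range(lo, lo + h):
            s = a[i][j] - sum(a[i][k] * a[j][k] for k in range(lo, j))
            a[i][j] = s / a[j][j]


def downdate(a, lo, h, mid, m):
    """Subtract L21 * L21^T from the lower triangle of the southeast block."""
    for i in range(mid, mid + m):
        for j in range(mid, i + 1):
            a[i][j] -= sum(a[i][k] * a[j][k] for k in range(lo, lo + h))


def lower(a):
    """Copy of the lower triangle of a, with zeros above the diagonal."""
    n = len(a)
    return [[a[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
```

For example, calling `cholesky_in_place(a, 0, 3)` on `a = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]` leaves the factor with rows `[2, 0, 0]`, `[6, 1, 0]`, `[-8, 5, 3]` in the lower triangle of `a`. Once the northwest factor and the southwest panel are computed they are never written again, which is the sense in which finished blocks could be shared read-only among nodes.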
Supported, in part, by the National Science Foundation under grants numbered CCR-0073491, ACI–0219884, and EIA–0202048. Copyright on twelve pages intact transferred, with rights reserved for anyone to make digital or hard copies of part or all of this work for personal or classroom use, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full Springer citation on the first page. Rights are similarly reserved for any library to share a hard copy through interlibrary loan.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Wise, D.S., Citro, C., Hursey, J., Liu, F., Rainey, M. (2005). A Paradigm for Parallel Matrix Algorithms:. In: Cunha, J.C., Medeiros, P.D. (eds) Euro-Par 2005 Parallel Processing. Euro-Par 2005. Lecture Notes in Computer Science, vol 3648. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11549468_76
DOI: https://doi.org/10.1007/11549468_76
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28700-1
Online ISBN: 978-3-540-31925-2
eBook Packages: Computer Science, Computer Science (R0)