Matrix Multiplication on Multidimensional Torus Networks

  • Conference paper
High Performance Computing for Computational Science - VECPAR 2012 (VECPAR 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7851)

Abstract

Blocked matrix multiplication algorithms such as Cannon's algorithm and SUMMA have a 2-dimensional communication structure. We introduce a generalized "Split-Dimensional" version of Cannon's algorithm (SD-Cannon) with a higher-dimensional and bidirectional communication structure. This algorithm is useful for torus interconnects that can achieve more injection bandwidth than single-link bandwidth. On a bidirectional torus network of dimension d, SD-Cannon can lower the algorithmic bandwidth cost by a factor of up to d. With rectangular collectives, SUMMA also achieves the lower bandwidth cost but has a higher latency cost. We use Charm++ virtualization to efficiently map SD-Cannon onto unbalanced and odd-dimensional torus network partitions. Our performance study on Blue Gene/P demonstrates that an MPI version of SD-Cannon can exploit multiple communication links and improve performance.
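For readers unfamiliar with the baseline, the following is a minimal sketch of the standard 2-dimensional Cannon's algorithm that the abstract takes as its starting point. It is not SD-Cannon and not the authors' code: it simulates a q x q processor grid sequentially in plain Python, and the names `cannon`, `split`, `join`, `block_mul`, and `block_add` are hypothetical helpers chosen for illustration. Each cyclic shift of the block grid stands in for one nearest-neighbour message on a torus row or column.

```python
def block_mul(X, Y):
    """Triple-loop product of two square blocks stored as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def block_add(X, Y):
    """Element-wise sum of two equally sized blocks."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(M, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    b = len(M) // q
    return [[[row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
             for j in range(q)] for i in range(q)]


def join(blocks):
    """Reassemble a q x q grid of blocks into one matrix."""
    q, b = len(blocks), len(blocks[0][0])
    return [sum((blocks[i][j][r] for j in range(q)), [])
            for i in range(q) for r in range(b)]


def cannon(A, B, q):
    """Simulate Cannon's algorithm on a q x q grid (assumes q divides n)."""
    bs = len(A) // q
    a, b = split(A, q), split(B, q)
    # Initial skew: block row i of A shifts left by i,
    # block column j of B shifts up by j.
    a = [[a[i][(j + i) % q] for j in range(q)] for i in range(q)]
    b = [[b[(i + j) % q][j] for j in range(q)] for i in range(q)]
    c = [[[[0] * bs for _ in range(bs)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        # Every simulated processor multiplies the two blocks it holds.
        c = [[block_add(c[i][j], block_mul(a[i][j], b[i][j]))
              for j in range(q)] for i in range(q)]
        # Shift A one step left along rows and B one step up along columns.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return join(c)


if __name__ == "__main__":
    A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    B = [[2, 0, 1, 3], [1, 1, 0, 2], [4, 3, 2, 1], [0, 5, 6, 7]]
    # The simulated distributed result matches the direct product.
    assert cannon(A, B, 2) == block_mul(A, B)
```

In this sketch every shift moves data in a single direction along a single grid dimension at a time, which is the 2-dimensional, unidirectional communication pattern the abstract describes SD-Cannon as generalizing.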

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Solomonik, E., Demmel, J. (2013). Matrix Multiplication on Multidimensional Torus Networks. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science - VECPAR 2012. VECPAR 2012. Lecture Notes in Computer Science, vol 7851. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38718-0_21

  • DOI: https://doi.org/10.1007/978-3-642-38718-0_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38717-3

  • Online ISBN: 978-3-642-38718-0

  • eBook Packages: Computer Science, Computer Science (R0)