High Performance Matrix Multiplication on Many Cores

Yuan, Nan; Zhou, Yongbin; Tan, Guangming; Zhang, Junchao; Fan, Dongrui

doi:10.1007/978-3-642-03869-3_87

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5704)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

Moore’s Law suggests that the number of processing cores on a single chip will increase exponentially. Future performance gains will come mainly from thread-level parallelism exploited by multi/many-core processors (MCP). It is therefore necessary to determine how to build MCP hardware and how to program the parallelism on such MCPs. In this work, we aim to identify the key architectural mechanisms and software optimizations needed to guarantee high performance for multithreaded programs. As a case study, we customize a dense matrix multiplication algorithm for the Godson-T MCP to demonstrate the efficient synergy and interaction between hardware and software. Experiments conducted on a cycle-accurate simulator show that the optimized matrix multiplication achieves 97.1% (124.3 GFLOPS) of the peak performance of Godson-T.
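To make the abstract's terms concrete, the following is a minimal sketch of a generic blocked (tiled) dense matrix multiplication, together with the arithmetic that relates the two reported figures. It is not the authors' Godson-T implementation, which the abstract does not describe in detail; the function names `blocked_matmul` and `implied_peak_gflops` and the `block` parameter are illustrative assumptions.

```python
def blocked_matmul(a, b, block=2):
    """Multiply matrices given as lists of lists, one tile at a time.

    Generic textbook tiling for data reuse; NOT the paper's algorithm.
    `block` is a hypothetical tile edge length.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):          # tile rows of C
        for j0 in range(0, m, block):      # tile columns of C
            for k0 in range(0, k, block):  # tiles along the shared dimension
                for i in range(i0, min(i0 + block, n)):
                    for kk in range(k0, min(k0 + block, k)):
                        a_ik = a[i][kk]
                        for j in range(j0, min(j0 + block, m)):
                            c[i][j] += a_ik * b[kk][j]
    return c


def implied_peak_gflops(achieved_gflops, fraction_of_peak):
    """Peak rate implied by an achieved rate and its fraction of peak."""
    return achieved_gflops / fraction_of_peak


if __name__ == "__main__":
    print(blocked_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
    # Figures taken from the abstract: 124.3 GFLOPS at 97.1% of peak.
    print(round(implied_peak_gflops(124.3, 0.971), 1))
```

Taken together, the two reported figures imply a peak of roughly 128 GFLOPS for the simulated configuration, since 124.3 / 0.971 ≈ 128.0.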

Author information

Authors and Affiliations

Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, P.R. China
Nan Yuan, Yongbin Zhou, Guangming Tan, Junchao Zhang & Dongrui Fan
Graduate University of Chinese Academy of Sciences, Beijing, 100039, P.R. China
Nan Yuan & Yongbin Zhou

Editor information

Editors and Affiliations

Department of Software Technology, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands
Henk Sips, Dick Epema & Hai-Xiang Lin

About this paper

Cite this paper

Yuan, N., Zhou, Y., Tan, G., Zhang, J., Fan, D. (2009). High Performance Matrix Multiplication on Many Cores. In: Sips, H., Epema, D., Lin, HX. (eds) Euro-Par 2009 Parallel Processing. Euro-Par 2009. Lecture Notes in Computer Science, vol 5704. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03869-3_87

DOI: https://doi.org/10.1007/978-3-642-03869-3_87
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03868-6
Online ISBN: 978-3-642-03869-3
eBook Packages: Computer Science, Computer Science (R0)
