Abstract
Moore’s Law suggests that the number of processing cores on a single chip increases exponentially. Future performance gains will therefore come mainly from thread-level parallelism exploited by multi/many-core processors (MCP). It is thus necessary to determine how to build MCP hardware and how to program parallelism on it. In this work, we aim to identify the key architectural mechanisms and software optimizations that guarantee high performance for multithreaded programs. As a case study, we customize a dense matrix multiplication algorithm for the Godson-T MCP to demonstrate efficient synergy and interaction between hardware and software. Experiments conducted on a cycle-accurate simulator show that the optimized matrix multiplication achieves 97.1% (124.3 GFLOPS) of the peak performance of Godson-T.
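The two figures reported above (124.3 GFLOPS achieved, 97.1% of peak) imply a peak throughput, which can be recovered by simple division. The sketch below is only an illustration of that arithmetic; the function names `implied_peak_gflops` and `efficiency` are hypothetical and do not come from the paper, and no hardware parameters of Godson-T are assumed beyond the two numbers quoted in the abstract.

```python
def implied_peak_gflops(achieved_gflops: float, fraction_of_peak: float) -> float:
    """Return the peak throughput implied by an achieved rate and its
    reported fraction of peak (hypothetical helper, not from the paper)."""
    if not 0.0 < fraction_of_peak <= 1.0:
        raise ValueError("fraction_of_peak must be in (0, 1]")
    return achieved_gflops / fraction_of_peak


def efficiency(achieved_gflops: float, peak_gflops: float) -> float:
    """Return achieved throughput as a fraction of peak throughput."""
    if peak_gflops <= 0.0:
        raise ValueError("peak_gflops must be positive")
    return achieved_gflops / peak_gflops


if __name__ == "__main__":
    # Values quoted in the abstract: 124.3 GFLOPS at 97.1% of peak.
    peak = implied_peak_gflops(124.3, 0.971)
    print(f"implied peak: {peak:.1f} GFLOPS")
    print(f"efficiency at that peak: {efficiency(124.3, peak):.3f}")
```

Under these figures the implied peak is roughly 128 GFLOPS; the value is derived from the rounded percentage, so it is accurate only to the precision of the reported numbers.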
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yuan, N., Zhou, Y., Tan, G., Zhang, J., Fan, D. (2009). High Performance Matrix Multiplication on Many Cores. In: Sips, H., Epema, D., Lin, HX. (eds) Euro-Par 2009 Parallel Processing. Euro-Par 2009. Lecture Notes in Computer Science, vol 5704. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03869-3_87
DOI: https://doi.org/10.1007/978-3-642-03869-3_87
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03868-6
Online ISBN: 978-3-642-03869-3
eBook Packages: Computer Science (R0)