A Performance Model of Dense Matrix Operations on Many-Core Architectures

Long, Guoping; Fan, Dongrui; Zhang, Junchao; Song, Fenglong; Yuan, Nan; Lin, Wei

doi:10.1007/978-3-540-85451-7_14

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5168)

Abstract

Current many-core architectures (MCA) have a much larger ratio of arithmetic throughput to memory bandwidth than traditional processors (vector, superscalar, multi-core, etc.). As a result, memory bandwidth has become an important performance bottleneck for MCA. Previous work has demonstrated promising performance of MCA for dense matrix operations. However, there is still little quantitative understanding of the relationship between the performance of matrix computation kernels and the limited memory bandwidth. This paper presents a performance model for dense matrix multiplication (MM), LU decomposition, and Cholesky decomposition. The input parameters are the memory bandwidth \(B\) and the on-chip SRAM capacity \(C\); the output is the maximum core number \(P_{max}\). We show that \(P_{max}=\Theta(B \cdot \sqrt{C})\). \(P_{max}\) indicates that, when the problem size is large enough, the given memory bandwidth will not be a performance bottleneck as long as the number of cores \(P < P_{max}\). The model is validated by comparing the theoretical performance with experimental data from previous work.
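
To make the scaling concrete, the following is a minimal sketch of how a bound of the form \(P_{max}=\Theta(B \cdot \sqrt{C})\) can arise for blocked matrix multiplication. It uses the standard tiling argument and is not taken from the chapter's full text. The assumptions are ours: three b x b tiles are held in SRAM at once, the output tile stays resident while input tiles stream in, \(B\) and \(C\) are measured in words, and each core has a hypothetical peak rate `core_flops_per_s`. The function names and the constant factor 3 are illustrative only.

```python
import math


def tile_size(sram_words: int) -> int:
    """Largest b such that three b x b tiles (A, B, C blocks) fit in SRAM.

    Assumption: one word per matrix element, no other SRAM usage.
    """
    return math.isqrt(sram_words // 3)


def p_max_estimate(bandwidth_words_per_s: float,
                   sram_words: int,
                   core_flops_per_s: float) -> float:
    """Rough upper bound on the core count before memory bandwidth saturates.

    One tile multiply-accumulate performs 2*b**3 flops and loads two
    b x b input tiles (2*b**2 words), so it does about b flops per
    word moved. Bandwidth therefore sustains roughly
    bandwidth * b flops per second in total.
    """
    b = tile_size(sram_words)
    flops_per_word = b  # (2 * b**3) / (2 * b**2)
    return bandwidth_words_per_s * flops_per_word / core_flops_per_s


# Hypothetical machine: 1e9 words/s of bandwidth, 1e9 flops/s per core,
# and enough SRAM for three 256 x 256 tiles.
estimate = p_max_estimate(1e9, 3 * 256 * 256, 1e9)
```

Under these assumptions the estimate grows linearly with bandwidth and with the square root of SRAM capacity: doubling `bandwidth_words_per_s` doubles the result, while quadrupling `sram_words` is needed for the same effect.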

Author information

Authors and Affiliations

Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, 100080, Beijing, China
Guoping Long, Dongrui Fan, Junchao Zhang, Fenglong Song, Nan Yuan & Wei Lin

Editor information

Emilio Luque, Tomàs Margalef, Domingo Benítez

About this paper

Cite this paper

Long, G., Fan, D., Zhang, J., Song, F., Yuan, N., Lin, W. (2008). A Performance Model of Dense Matrix Operations on Many-Core Architectures. In: Luque, E., Margalef, T., Benítez, D. (eds) Euro-Par 2008 – Parallel Processing. Euro-Par 2008. Lecture Notes in Computer Science, vol 5168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85451-7_14

DOI: https://doi.org/10.1007/978-3-540-85451-7_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85450-0
Online ISBN: 978-3-540-85451-7
eBook Packages: Computer Science, Computer Science (R0)
