CPU versus GPU: which can perform matrix computation faster—performance comparison for basic linear algebra subprograms

Li, Feng; Ye, Yunming; Tian, Zhaoyang; Zhang, Xiaofeng

doi:10.1007/s00521-018-3354-z

Original Article
Published: 25 January 2018

Volume 31, pages 4353–4365, (2019)

Neural Computing and Applications

Feng Li ORCID: orcid.org/0000-0002-0011-859X¹,
Yunming Ye¹,
Zhaoyang Tian² &
…
Xiaofeng Zhang¹

Abstract

Matrix computation is a core component of machine learning and artificial intelligence, and fast matrix computation can greatly benefit many large-scale computational projects. The basic linear algebra subprograms (BLAS) classify different types of matrices and provide a standardized interface for operating on them. Currently, the most commonly used heterogeneous computing platforms combine the central processing unit (CPU) and the graphics processing unit (GPU), and BLAS has been implemented on both. However, because algorithms and hardware have different characteristics, a matrix method should be designed for a particular processor, and it is important to choose the right processor for a particular matrix computation. This paper first briefly reviews BLAS and then introduces the architecture and optimization methods of the CPU and the GPU. The performance of different BLAS subroutines is studied through experiments. Finally, we discuss the reasons for the results and a processor selection scheme for matrix computations.
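For readers unfamiliar with how BLAS groups its routines, the following is a minimal sketch of the three conventional BLAS levels (vector-vector, matrix-vector, matrix-matrix) written as plain Python reference loops. It is not the paper's benchmark code and does not call any BLAS library; the function names `axpy`, `gemv` and `gemm` merely echo the standard routine names, and `flop_counts` is a hypothetical helper added here for illustration.

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(A, x):
    """Level 2 (matrix-vector): return A @ x, with A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def gemm(A, B):
    """Level 3 (matrix-matrix): return A @ B, with both given as lists of rows."""
    cols = list(zip(*B))  # columns of B, so each entry is a row-by-column dot product
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def flop_counts(n):
    """Approximate floating-point operation counts for operands of size n.

    Hypothetical helper: counts one multiply and one add per inner step.
    """
    return {"axpy": 2 * n, "gemv": 2 * n * n, "gemm": 2 * n ** 3}


if __name__ == "__main__":
    print(axpy(2.0, [1.0, 2.0], [3.0, 4.0]))
    print(gemv([[1, 2], [3, 4]], [1, 1]))
    print(gemm([[1, 2], [3, 4]], [[0, 1], [1, 0]]))
    print(flop_counts(1000))
```

The operation counts grow as n, n² and n³ across the three levels while the data touched grows only as n, n² and n², so the work per element loaded rises with the level. That is one plausible reason, offered here as an assumption rather than as the paper's finding, why the best processor can differ from one subroutine to another.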

Acknowledgements

This research was supported in part by the NSFC under Grant Nos. 61572158 and 61602132 and by the Shenzhen Science and Technology Program under Grant Nos. JSGG20150512145714247, JCYJ20160330163900579 and JCYJ20170413105929681. The manuscript is approved by all authors for publication. I would like to declare on behalf of my co-authors that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part. All the listed authors have approved the enclosed manuscript.

Author information

Authors and Affiliations

Shenzhen Key Laboratory of Internet Information Collaboration, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
Feng Li, Yunming Ye & Xiaofeng Zhang
Department of Electronic Engineering, City University of Hong Kong, Kowloon Tong, Hong Kong
Zhaoyang Tian

Corresponding authors

Correspondence to Feng Li or Xiaofeng Zhang.

Ethics declarations

Conflict of interest

No conflict of interest exists in the submission of this manuscript.

About this article

Cite this article

Li, F., Ye, Y., Tian, Z. et al. CPU versus GPU: which can perform matrix computation faster—performance comparison for basic linear algebra subprograms. Neural Comput & Applic 31, 4353–4365 (2019). https://doi.org/10.1007/s00521-018-3354-z

Received: 08 November 2017
Accepted: 08 January 2018
Published: 25 January 2018
Issue Date: August 2019
DOI: https://doi.org/10.1007/s00521-018-3354-z
