BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing

Authors:

Jianxiong Xiao,

Yi Yang

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 20, Pages 1 - 11

https://doi.org/10.1145/2925426.2926256

Published: 01 June 2016

Abstract

Basic Linear Algebra Subprograms (BLAS) are a set of low-level linear algebra kernels widely adopted by applications in deep learning and scientific computing. The massive and economical computing power of emerging GPU architectures drives interest in implementing compute-intensive level-3 BLAS on multi-GPU systems. In this paper, we investigate existing multi-GPU level-3 BLAS implementations and show that 1) issues such as improper load balancing, inefficient communication, and insufficient GPU stream-level concurrency and data caching prevent current implementations from fully harnessing heterogeneous computing resources; and 2) inter-GPU Peer-to-Peer (P2P) communication remains unexplored. We then present BLASX, a highly optimized multi-GPU level-3 BLAS library. We adopt the concept of algorithms-by-tiles, treating a matrix tile as the basic data unit and an operation on tiles as the basic task. Tasks are driven by a dynamic asynchronous runtime that is cache- and locality-aware. The communication cost under BLASX becomes trivial because the runtime perfectly overlaps communication and computation across multiple streams during asynchronous task progression. BLASX also takes the current tile cache scheme one step further by proposing an innovative 2-level hierarchical tile cache that takes advantage of inter-GPU P2P communication. As a result, BLASX exhibits linear speedup under multi-GPU configurations, and extensive benchmarks demonstrate that it consistently outperforms leading industrial and academic implementations such as cuBLAS-XT, SuperMatrix, and MAGMA.
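To make the algorithms-by-tiles idea concrete, the following is a minimal CPU-only sketch in plain Python, not the BLASX implementation. It assumes one task per output tile of C and a round-robin assignment of tasks to simulated GPUs, standing in for the dynamic runtime. It also assumes a lookup order of local tile cache ("L1"), then a peer GPU's cache ("L2", standing in for a P2P copy), then host memory. All names (`tile`, `make_tasks`, `fetch_tile`, `run_task`, `tiled_gemm`) are hypothetical and chosen only for illustration.

```python
from itertools import product

def tile(mat, r, c, t):
    """Return the tile at tile coordinates (r, c); edge tiles may be smaller than t x t."""
    return [row[c * t:(c + 1) * t] for row in mat[r * t:(r + 1) * t]]

def make_tasks(m, n, k, t):
    """One task per output tile C[i][j], listing the k-steps it must accumulate."""
    ntiles = lambda dim: (dim + t - 1) // t  # ceiling division
    return [(i, j, range(ntiles(k)))
            for i, j in product(range(ntiles(m)), range(ntiles(n)))]

def fetch_tile(key, gpu, caches, host, t):
    """Assumed 2-level lookup: own cache, then a peer's cache, then host memory."""
    if key in caches[gpu]:
        return caches[gpu][key], "L1"
    for peer, cache in enumerate(caches):
        if peer != gpu and key in cache:
            # Stands in for a GPU-to-GPU P2P copy.
            caches[gpu][key] = [row[:] for row in cache[key]]
            return caches[gpu][key], "L2"
    name, r, c = key
    caches[gpu][key] = tile(host[name], r, c, t)  # stands in for a host-to-GPU copy
    return caches[gpu][key], "host"

def run_task(task, gpu, caches, host, C, t, log):
    """Accumulate C_ij += A_ip * B_pj over all p for one output tile."""
    i, j, steps = task
    for p in steps:
        a, src_a = fetch_tile(("A", i, p), gpu, caches, host, t)
        b, src_b = fetch_tile(("B", p, j), gpu, caches, host, t)
        log.extend([src_a, src_b])
        for x, a_row in enumerate(a):
            for y in range(len(b[0])):
                C[i * t + x][j * t + y] += sum(a_row[z] * b[z][y] for z in range(len(b)))

def tiled_gemm(A, B, t, num_gpus=2):
    """Compute C = A * B tile by tile; return C and a log of where each tile came from."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    caches = [{} for _ in range(num_gpus)]  # one unbounded tile cache per simulated GPU
    host = {"A": A, "B": B}
    log = []
    for idx, task in enumerate(make_tasks(m, n, k, t)):
        # Round-robin assignment is a placeholder for a dynamic scheduler.
        run_task(task, idx % num_gpus, caches, host, C, t, log)
    return C, log
```

Calling `tiled_gemm(A, B, t)` returns the product together with a log of tile sources, which shows how often a tile was served locally, from a peer, or from the host. The sketch omits cache eviction, coherence, and stream-level overlap of transfers with computation, all of which the abstract attributes to the BLASX runtime.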

Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICS '16

Sponsor:

SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics

Total Citations: 31
Total Downloads: 495
Downloads (Last 12 months): 29
Downloads (Last 6 weeks): 2

Reflects downloads up to 16 Feb 2025
