DOI: 10.1145/2925426.2926256

BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing

Published: 01 June 2016

Abstract

Basic Linear Algebra Subprograms (BLAS) are a set of low-level linear algebra kernels widely adopted by applications in deep learning and scientific computing. The massive and economical computing power of emerging GPU architectures drives interest in implementing compute-intensive level-3 BLAS on multi-GPU systems. In this paper, we investigate existing multi-GPU level-3 BLAS implementations and show that 1) issues such as improper load balancing, inefficient communication, and insufficient GPU stream-level concurrency and data caching prevent current implementations from fully harnessing heterogeneous computing resources; and 2) inter-GPU Peer-to-Peer (P2P) communication remains unexplored. We then present BLASX, a highly optimized multi-GPU level-3 BLAS library. We adopt the concept of algorithms-by-tiles, treating a matrix tile as the basic data unit and operations on tiles as the basic tasks. Tasks are scheduled by a dynamic asynchronous runtime that is cache and locality aware. The communication cost under BLASX becomes trivial because the runtime perfectly overlaps communication and computation across multiple streams during asynchronous task progression. BLASX also takes the current tile cache scheme one step further by proposing an innovative 2-level hierarchical tile cache that takes advantage of inter-GPU P2P communication. As a result, linear speedup is observable with BLASX under multi-GPU configurations, and extensive benchmarks demonstrate that BLASX consistently outperforms leading industrial and academic implementations such as cuBLAS-XT, SuperMatrix, and MAGMA.
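
To make the algorithms-by-tiles idea concrete, the following is a minimal CPU-only sketch in Python, not the BLASX implementation or its API. It assumes that each output tile of C = A x B is one task and that idle workers pull tasks from a shared queue, which is one simple form of dynamic, demand-driven scheduling. The names `tile_ranges`, `tile_task`, and `tiled_gemm` are hypothetical, threads stand in for GPUs, and the tile cache, stream-level overlap, and P2P transfers described in the abstract are not modeled.

```python
import queue
import threading


def tile_ranges(extent, tile):
    """Yield (start, stop) index pairs covering range(extent) in steps of `tile`."""
    for start in range(0, extent, tile):
        yield start, min(start + tile, extent)


def tile_task(A, B, C, rows, cols, tile):
    """One task: compute the output tile C[rows, cols] from tiles of A and B."""
    (i0, i1), (j0, j1) = rows, cols
    inner = len(B)
    # Accumulate the products of A-tiles and B-tiles along the inner dimension.
    for p0, p1 in tile_ranges(inner, tile):
        for i in range(i0, i1):
            for j in range(j0, j1):
                C[i][j] += sum(A[i][p] * B[p][j] for p in range(p0, p1))


def tiled_gemm(A, B, tile=2, workers=2):
    """Return A x B, handling each output tile as an independent task."""
    m, n = len(A), len(B[0])
    C = [[0] * n for _ in range(m)]

    # Output tiles are disjoint, so tasks never write to the same entry of C.
    tasks = queue.Queue()
    for rows in tile_ranges(m, tile):
        for cols in tile_ranges(n, tile):
            tasks.put((rows, cols))

    def worker():
        # Demand-driven: a worker fetches the next task whenever it is idle.
        while True:
            try:
                rows, cols = tasks.get_nowait()
            except queue.Empty:
                return
            tile_task(A, B, C, rows, cols, tile)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(tiled_gemm(A, B, tile=1, workers=3))  # [[58, 64], [139, 154]]
```

Because every task owns a distinct output tile, the workers need no locking on C; the only shared state is the task queue. A faster worker simply drains more tasks, which is the load-balancing property a static partition lacks.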

Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing
June 2016
547 pages
ISBN:9781450343619
DOI:10.1145/2925426

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. BLAS
  2. Cache Hierarchy
  3. MultiGPU
  4. Runtime Scheduler
  5. Tile Algorithms

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICS '16

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Cited By

  • (2025) High Performance Householder QR Factorization on Emerging GPU Architectures Using Tensor Cores. IEEE Transactions on Parallel and Distributed Systems 36:3 (422-436). DOI: 10.1109/TPDS.2024.3522776. Online publication date: Mar-2025
  • (2024) Quantitative Performance Analysis of BLAS Libraries on GPU Architectures (BLAS Kütüphanelerinin GPU Mimarilerindeki Nicel Performans Analizi). Deu Muhendislik Fakultesi Fen ve Muhendislik 26:76 (40-48). DOI: 10.21205/deufmd.2024267606. Online publication date: 23-Jan-2024
  • (2024) Survey of a class of iterative row-action methods: The Kaczmarz method. Numerical Algorithms. DOI: 10.1007/s11075-024-01945-2. Online publication date: 26-Sep-2024
  • (2023) Hiperwalk: Simulation of Quantum Walks with Heterogeneous High-Performance Computing. 2023 IEEE International Conference on Quantum Computing and Engineering (QCE) (424-433). DOI: 10.1109/QCE57702.2023.00055. Online publication date: 17-Sep-2023
  • (2023) DeltaSPARSE: High-Performance Sparse General Matrix-Matrix Multiplication on Multi-GPU Systems. 2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC) (194-202). DOI: 10.1109/HiPC58850.2023.00037. Online publication date: 18-Dec-2023
  • (2023) Supporting efficient overlapping of host-device operations for heterogeneous programming with CtrlEvents. Journal of Parallel and Distributed Computing 179:C. DOI: 10.1016/j.jpdc.2023.04.009. Online publication date: 1-Sep-2023
  • (2023) A Task-Duplication Based Clustering Scheduling Algorithm for Heterogeneous Computing System. Advanced Intelligent Computing Technology and Applications (181-193). DOI: 10.1007/978-981-99-4755-3_16. Online publication date: 30-Jul-2023
  • (2022) Competitive and Collaborative Learning Accelerates the Convergence of Deep Convolutional Neural Networks. 2022 7th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA) (431-438). DOI: 10.1109/ICCCBDA55098.2022.9778930. Online publication date: 22-Apr-2022
  • (2022) Scalable and accurate multi-GPU-based image reconstruction of large-scale ptychography data. Scientific Reports 12:1. DOI: 10.1038/s41598-022-09430-3. Online publication date: 29-Mar-2022
  • (2022) Multi-GPU GEMM Algorithm Performance Analysis for Nvidia and AMD GPUs Connected by NVLink and PCIe. Mathematical Modeling and Supercomputer Technologies (281-292). DOI: 10.1007/978-3-031-24145-1_23. Online publication date: 24-Dec-2022
