ABSTRACT
A recent advance in heterogeneous computing, the NVLink interconnect enables high-speed communication between CPUs and GPUs and among GPUs. In this paper we show how NVLink changes the role GPUs can play in graph analytics and, more generally, in data analytics. With the technology preceding NVLink, GPUs could process data efficiently only when it fit into their local memory.
The increased bandwidth provided by NVLink calls for a reassessment of many algorithms, including those used in data analytics, that in the past could not exploit GPUs efficiently because of the limited bandwidth to host memory.
Our contributions are twofold: we introduce the basic properties of one of the first systems built around NVLink, and we describe how one of the most pervasive data analytics kernels, sparse matrix-vector multiplication (SpMV), can be tailored to this system. We evaluate the resulting SpMV implementation on a variety of data sets and compare it favorably against the best results available in the literature.