ABSTRACT
A recent advance in heterogeneous computing, the NVLink interconnect enables high-speed communication between CPUs and GPUs and among GPUs. In this paper we show how NVLink changes the role GPUs can play in graph analytics and, more generally, in data analytics. With the technology preceding NVLink, GPUs could process data efficiently only when it fit into their local memory.
The increased bandwidth provided by NVLink calls for a reassessment of many algorithms, including those used in data analytics, that in the past could not exploit GPUs efficiently because of the limited bandwidth to host memory.
Our contributions are twofold: we introduce the basic properties of one of the first systems built around NVLink, and we describe how one of the most pervasive data analytics kernels, sparse matrix-vector multiplication (SpMV), can be tailored to this system. We evaluate the resulting SpMV implementation on a variety of data sets and compare it favorably against the best results available in the literature.