ABSTRACT
Sparse matrix-vector multiplication (SpMV) is an important kernel in many scientific applications and is known to be memory-bandwidth limited. On modern processors with wide SIMD and large numbers of cores, we identify and address several bottlenecks that can limit performance even before memory bandwidth does: (a) low SIMD efficiency due to sparsity, (b) overhead due to irregular memory accesses, and (c) load imbalance due to non-uniform matrix structures.
We describe an efficient implementation of SpMV on the Intel® Xeon Phi™ Coprocessor, codenamed Knights Corner (KNC), that addresses the above challenges. Our implementation exploits the salient architectural features of KNC, such as large caches and hardware support for irregular memory accesses. By using a specialized data structure with careful load balancing, we attain performance on average close to 90% of KNC's achievable memory bandwidth on a diverse set of sparse matrices. Furthermore, we demonstrate that our implementation is 3.52x and 1.32x faster, respectively, than the best available implementations on dual Intel® Xeon® Processor E5-2680 and the NVIDIA Tesla K20X architecture.
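To make the bottlenecks above concrete, the following is a minimal sketch of a baseline SpMV in the standard CSR (compressed sparse row) format. It is illustrative only and is not the paper's specialized data structure: the indexed read of `x[col_idx[k]]` is the irregular gather the abstract refers to, and rows of very different lengths are what cause load imbalance across threads.

```python
# Baseline CSR SpMV sketch (illustrative; not the paper's KNC implementation).
# CSR stores A as three arrays: row_ptr (row start offsets into the other
# two arrays), col_idx (column index of each nonzero), vals (nonzero values).

def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Inner loop length varies per row -> load imbalance if rows are
        # split naively across threads.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]  # irregular (gather) access to x
        y[i] = acc
    return y

# Tiny example: A = [[4, 0, 1],
#                    [0, 2, 0],
#                    [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [4.0, 1.0, 2.0, 3.0, 5.0]
x       = [1.0, 2.0, 3.0]
print(spmv_csr(row_ptr, col_idx, vals, x))  # -> [7.0, 4.0, 18.0]
```

Because only two nonzeros per row are typical here, a 512-bit SIMD lane processing this row-by-row would be mostly idle, which is the SIMD-efficiency problem that motivates formats like the one described in the paper.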
Index Terms
- Efficient sparse matrix-vector multiplication on x86-based many-core processors