Optimizing memory bandwidth use and performance for matrix-vector multiplication in iterative methods

Authors:

George A. Constantinides

ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 4, Issue 3

Article No.: 22, Pages 1 - 14

https://doi.org/10.1145/2000832.2000834

Published: 22 August 2011

Abstract

Computing the solution to a system of linear equations is a fundamental problem in scientific computing, and its acceleration has drawn wide interest in the FPGA community [Morris et al. 2006; Zhang et al. 2008; Zhuo and Prasanna 2006]. One class of algorithms to solve these systems, iterative methods, has drawn particular interest, with recent literature showing large performance improvements over General-Purpose Processors (GPPs) [Lopes and Constantinides 2008]. In several iterative methods, this performance gain is largely a result of parallelization of the matrix-vector multiplication, an operation that occurs in many applications and hence has also been widely studied on FPGAs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006]. However, whilst the performance of matrix-vector multiplication on FPGAs is generally I/O bound [Zhuo and Prasanna 2005], the nature of iterative methods allows the use of on-chip memory buffers to increase the bandwidth, providing the potential for significantly more parallelism [deLorimier and DeHon 2005]. Unfortunately, existing approaches have generally either been capable of solving large matrices with only limited improvement over GPPs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006; deLorimier and DeHon 2005], or achieved high performance only for relatively small matrices [Lopes and Constantinides 2008; Boland and Constantinides 2008]. This article proposes hardware designs to take advantage of symmetrical and banded matrix structure, as well as methods to optimize the RAM use, in order to both increase the performance and retain this performance for larger-order matrices.
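To illustrate why symmetric, banded structure reduces memory traffic, the following is a minimal software sketch and not the hardware design proposed in the article. It assumes a hypothetical storage layout in which only the main diagonal and the upper band are kept, so that each stored off-diagonal element is read once and used for two multiply-accumulates. The names `sym_band_matvec` and `dense_matvec` are illustrative only.

```python
def sym_band_matvec(band, x):
    """Compute y = A x for a symmetric banded matrix A.

    Assumed layout (for illustration only): band[k][i] holds A[i][i + k]
    for k = 0..b, where b is the half-bandwidth, so band[k] has n - k entries.
    """
    n = len(x)
    y = [0.0] * n
    # Main diagonal: one multiply per stored element.
    for i in range(n):
        y[i] += band[0][i] * x[i]
    # Off-diagonals: each stored element is read once and used twice.
    for k in range(1, len(band)):
        for i in range(n - k):
            a = band[k][i]          # single read of A[i][i + k]
            y[i] += a * x[i + k]    # contribution of A[i][i + k]
            y[i + k] += a * x[i]    # contribution of A[i + k][i], by symmetry
    return y


def dense_matvec(A, x):
    """Reference dense product, reading all n * n entries."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


if __name__ == "__main__":
    # 4 x 4 symmetric tridiagonal example (half-bandwidth 1).
    A = [[4.0, -1.0, 0.0, 0.0],
         [-1.0, 4.0, -1.0, 0.0],
         [0.0, -1.0, 4.0, -1.0],
         [0.0, 0.0, -1.0, 4.0]]
    band = [[4.0, 4.0, 4.0, 4.0], [-1.0, -1.0, -1.0]]
    x = [1.0, 2.0, 3.0, 4.0]
    print(sym_band_matvec(band, x))   # same result as dense_matvec(A, x)
    print(sum(len(d) for d in band), "stored values instead of", len(A) ** 2)
```

In this example the banded symmetric form stores 7 values rather than 16, and the saving grows with the matrix order when the bandwidth stays fixed.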

References

[1]

Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., and van der Vorst, H. 1994. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Ed. SIAM, Philadelphia, PA.

[2]

Boland, D. and Constantinides, G. 2008. An FPGA-based implementation of the MINRES algorithm. In Proceedings of the International Conference on Field Programmable Logic and Applications. 379--384.

[3]

Boland, D. and Constantinides, G. 2010. Optimising memory bandwidth use for matrix-vector multiplication in iterative methods. In Proceedings of the International Symposium on Applied Reconfigurable Computing. 169--181.

[4]

deLorimier, M. and DeHon, A. 2005. Floating-point sparse matrix-vector multiply for FPGAs. In Proceedings of the ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays. ACM, New York, 75--85.

[5]

El-Kurdi, Y., Gross, W. J., and Giannacopoulos, D. 2006. Sparse matrix-vector multiplication for finite element method matrices on FPGAs. In Proceedings of the International Symposium on Field-Programmable Custom Computing Machines. 293--294.

[6]

Golub, G. H. and Van Loan, C. F. 1996. Matrix Computations, 3rd Ed. Johns Hopkins University Press, Baltimore, MD.

[7]

Heath, M. T. 2001. Scientific Computing. McGraw-Hill Higher Education.

[8]

Hoekstra, A. G., Sloot, P., Hoffmann, W., and Hertzberger, L. 1992. Time complexity of a parallel conjugate gradient solver for light scattering simulations: Theory and SPMD implementation. Tech. rep., University of Amsterdam.

[9]

Ilog, Inc. 2009. Solver cplex. http://www.ilog.fr/products/cplex/.

[10]

Lopes, A., Constantinides, G., and Kerrigan, E. C. 2008. A floating-point solver for band structured linear equations. In Proceedings of the International Conference on Field Programmable Technology. 353--356.

[11]

Lopes, A. R. and Constantinides, G. A. 2008. A high throughput FPGA-based floating point conjugate gradient implementation. In Proceedings of Applied Reconfigurable Computing. 75--86.

[12]

Morris, G. R., Prasanna, V. K., and Anderson, R. D. 2006. A hybrid approach for mapping conjugate gradient onto an FPGA-augmented reconfigurable supercomputer. In Proceedings of the 14th IEEE Symposium on Field-Programmable Custom Computing Machines. 3--12.

[13]

Sewell, G. 1988. The Numerical Solution of Ordinary and Partial Differential Equations. Academic Press Professional, San Diego, CA.

[14]

Winston, W. L. 2003. Introduction to Mathematical Programming: Applications and Algorithms. Duxbury Resource Center.

[15]

Xilinx. 2010. Virtex-5 FPGA User Guide. http://www.xilinx.com/support/documentation/user-guides/ug190.pdf.

[16]

Zhang, W., Betz, V., and Rose, J. 2008. Portable and scalable FPGA-based acceleration of a direct linear system solver. In Proceedings of the International Conference on Field Programmable Technology. 17--24.

[17]

Zhuo, L., Morris, G. R., and Prasanna, V. K. 2007. High-performance reduction circuits using deeply pipelined operators on FPGAs. IEEE Trans. Parall. Distrib. Syst. 18, 10, 1377--1392.

[18]

Zhuo, L. and Prasanna, V. K. 2005. Sparse matrix-vector multiplication on FPGAs. In Proceedings of the International Symposium on Field-Programmable Gate Arrays. ACM, New York, 63--74.

[19]

Zhuo, L. and Prasanna, V. K. 2006. High-performance and parameterized matrix factorization on FPGAs. In Proceedings of the International Conference on Field Programmable Logic and Applications. 1--6.

Index Terms

Optimizing memory bandwidth use and performance for matrix-vector multiplication in iterative methods
1. Hardware
  1. Electronic design automation
    1. Logic synthesis
      1. Circuit optimization

Published In

ACM Transactions on Reconfigurable Technology and Systems Volume 4, Issue 3

August 2011

204 pages

ISSN:1936-7406

EISSN:1936-7414

DOI:10.1145/2000832

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 August 2011

Accepted: 01 April 2011

Revised: 01 March 2011

Received: 01 September 2010

Published in TRETS Volume 4, Issue 3

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 10
Total Downloads: 592
Downloads (Last 12 months): 7
Downloads (Last 6 weeks): 1

Reflects downloads up to 17 Jan 2025

Cited By

Dong H, Sun K, Egbert G, Kelbert A, Meqbel N (2024). Hybrid CPU-GPU solution to regularized divergence-free curl-curl equations for electromagnetic inversion problems. Computers & Geosciences 184:C. Online publication date: 1-Feb-2024.
https://dl.acm.org/doi/10.1016/j.cageo.2024.105518

Tezcan E, Torun T, Kosar F, Kaya K, Unat D (2022). Mixed and Multi-Precision SpMV for GPUs with Row-wise Precision Selection. 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 31-40. Online publication date: Nov-2022.
https://doi.org/10.1109/SBAC-PAD55451.2022.00014

AlAhmadi S, Mohammed T, Albeshri A, Katib I, Mehmood R (2020). Performance Analysis of Sparse Matrix-Vector Multiplication (SpMV) on Graphics Processing Units (GPUs). Electronics 9:10, 1675. Online publication date: 13-Oct-2020.
https://doi.org/10.3390/electronics9101675

Khusainov B, Kerrigan E, Suardi A, Constantinides G (2018). Nonlinear predictive control on a heterogeneous computing platform. Control Engineering Practice 78, 105-115. Online publication date: Sep-2018.
https://doi.org/10.1016/j.conengprac.2018.06.016

Schmidt S, Boland D (2017). Dynamic bitwidth assignment for efficient dot products. 2017 27th International Conference on Field Programmable Logic and Applications (FPL), 1-8. Online publication date: Sep-2017.
https://doi.org/10.23919/FPL.2017.8056829

Khusainov B, Kerrigan E, Suardi A, Constantinides G (2017). Nonlinear predictive control on a heterogeneous computing platform. IFAC-PapersOnLine 50:1, 11877-11882. Online publication date: Jul-2017.
https://doi.org/10.1016/j.ifacol.2017.08.1413

Tessier R, Pocek K, DeHon A (2015). Reconfigurable Computing Architectures. Proceedings of the IEEE 103:3, 332-354. Online publication date: Mar-2015.
https://doi.org/10.1109/JPROC.2014.2386883

Ong K, Fahmy S, Ling K (2014). A scalable and compact systolic architecture for linear solvers. 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors, 186-187. Online publication date: Jun-2014.
https://doi.org/10.1109/ASAP.2014.6868658

Boland D, Constantinides G (2013). Revisiting the reduction circuit: A case study for simultaneous architecture and precision optimisation. 2013 International Conference on Field-Programmable Technology (FPT), 410-413. Online publication date: Dec-2013.
https://doi.org/10.1109/FPT.2013.6718401

Boland D, Constantinides G, Compton K, Hutchings B (2012). A scalable approach for automated precision analysis. Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays, 185-194. Online publication date: 22-Feb-2012.
https://dl.acm.org/doi/10.1145/2145694.2145726
