Reducing Memory Requirements for High-Performance and Numerically Stable Gaussian Elimination

Author:

David Boland

FPGA '16: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Pages 244 - 253

https://doi.org/10.1145/2847263.2847281

Published: 21 February 2016

Abstract

Gaussian elimination is a well-known technique for computing the solution to a system of linear equations, and boosting its performance is highly desirable. While straightforward parallel techniques are limited either by I/O or on-chip memory bandwidth, block-based algorithms offer the potential to bridge this gap by interleaving I/O with computation. However, these algorithms require the amount of on-chip memory to be at least the square of the number of processing elements available. On the latest generation of Altera FPGAs with hardened floating-point units, this requirement is no longer met. It follows that the amount of on-chip memory limits performance, a problem that is likely to worsen unless on-chip memory comes to dominate FPGA architecture. In addition to this limitation, existing FPGA implementations of block-based Gaussian elimination sacrifice either numerical stability or efficiency. The former restricts the usefulness of these implementations to a small class of matrices; the latter limits their performance.

This paper presents a high-performance and numerically stable method to perform Gaussian elimination on an FPGA. This modified algorithm makes use of a deep pipeline to store the matrix and ensures that the peak performance is once again limited by the number of floating-point units that can fit on the FPGA. When applied to large matrices, this technique can obtain a sustained performance of up to 256 GFLOPs on an Arria 10, beginning to tap into the full potential of these devices. This performance is comparable to the peak that could be achieved using a simple block-based algorithm, with the performance on a Stratix 10 predicted to be superior. This is in spite of the fact that the underlying algorithm for the implementation in this paper, Gaussian elimination with pairwise pivoting, is more complex and applicable to a wider range of practical problems.
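
For readers unfamiliar with pairwise pivoting, the following is a minimal software sketch of the textbook idea, not the FPGA architecture described in this paper. Each entry below the diagonal is eliminated using only the row directly above it, and the two rows are swapped first whenever that keeps the multiplier at most 1 in magnitude. The function name `solve_pairwise` and the dense list-of-lists layout are illustrative assumptions.

```python
def solve_pairwise(a, b):
    """Solve a x = b by Gaussian elimination with pairwise (neighbour) pivoting.

    Illustrative sketch only: `a` is a dense n x n list of lists, `b` a list of
    length n. This is not the pipelined hardware algorithm from the paper.
    """
    n = len(a)
    # Work on an augmented copy [a | b] so the inputs are left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    for k in range(n - 1):
        # Clear column k from the bottom row upwards. Each step involves only
        # two adjacent rows, i - 1 and i.
        for i in range(n - 1, k, -1):
            # Keep the larger entry in row i - 1 so that |factor| <= 1.
            if abs(m[i][k]) > abs(m[i - 1][k]):
                m[i], m[i - 1] = m[i - 1], m[i]
            if m[i - 1][k] == 0.0:
                continue  # both entries are already zero
            factor = m[i][k] / m[i - 1][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[i - 1][j]
            m[i][k] = 0.0  # remove rounding residue in the eliminated entry

    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        if m[i][i] == 0.0:
            raise ValueError("matrix is singular")
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x
```

Because every elimination step touches only two neighbouring rows, no search over a whole column is needed to choose a pivot, which is the property that makes this variant a natural fit for a deeply pipelined design.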

Cited By

Li X, Maskell D, Li C, Leong P, Boland D (2022) A Scalable Systolic Accelerator for Estimation of the Spectral Correlation Density Function and Its FPGA Implementation. ACM Transactions on Reconfigurable Technology and Systems 16:1 (1-24). Online publication date: 22-Dec-2022
https://dl.acm.org/doi/10.1145/3546181
Chen L, Xia T, Zhao W, Ren P, Savidis I, Sasan A, Thapliyal H, DeMara R (2022) MI2D: Accelerating Matrix Inversion with 2-Dimensional Tile Manipulations. Proceedings of the Great Lakes Symposium on VLSI 2022 (423-429). Online publication date: 6-Jun-2022
https://dl.acm.org/doi/10.1145/3526241.3530314
Tatsumura K, Yazdanshenas S, Betz V (2018) Enhancing FPGAs with Magnetic Tunnel Junction-Based Block RAMs. ACM Transactions on Reconfigurable Technology and Systems 11:1 (1-22). Online publication date: 26-Jan-2018
https://dl.acm.org/doi/10.1145/3154425
Boland D, Cheng C, Kahng A, Leong P (2017) Reconfigurable Computing. Wiley Encyclopedia of Electrical and Electronics Engineering (1-17). Online publication date: 15-Feb-2017
https://doi.org/10.1002/047134608X.W7603.pub3

Published In

FPGA '16: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

February 2016

298 pages

ISBN: 9781450338561

DOI: 10.1145/2847263

General Chair: Deming Chen, University of Illinois at Urbana-Champaign, USA

Program Chair: Jonathan Greene, Microsemi, USA

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGDA: ACM Special Interest Group on Design Automation

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

FPGA'16

Sponsor:

SIGDA

FPGA'16: The 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

February 21 - 23, 2016

Monterey, California, USA

Acceptance Rates

FPGA '16 Paper Acceptance Rate 20 of 111 submissions, 18%;

Overall Acceptance Rate 125 of 627 submissions, 20%
