Abstract
We present two designs (I and II) for IEEE 754 double-precision floating-point matrix multiplication, optimized for implementation on high-end FPGAs. Matrix multiplication forms the kernel of many important tile-based BLAS algorithms, making it an excellent candidate for acceleration. The designs, both based on the rank-1 update scheme, can handle arbitrary matrix sizes and sustain their peak performance except during an initial latency period. Through these designs, we demonstrate the trade-offs between local memory and bandwidth in an FPGA implementation and present an analysis for the optimal choice of design parameters. The designs, implemented on a Virtex-5 SX240T FPGA, scale gracefully from 1 to 40 processing elements (PEs) with less than 1% degradation in the design frequency of 373 MHz. With 40 PEs and a design speed of 373 MHz, a sustained performance of 29.8 GFLOPS is possible with a bandwidth requirement of 750 MB/s for design II and 5.9 GB/s for design I. This compares favourably with both related art and general-purpose CPU implementations.
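As an illustration of the rank-1 update scheme named above, the following minimal Python sketch accumulates C = AB as a sum of outer products, one per column of A and matching row of B. It is a software analogy only, not the hardware designs themselves. The function names `matmul_rank1` and `peak_gflops` are hypothetical, and the peak-rate helper assumes that each PE completes one multiply and one add per clock cycle, which the abstract does not state explicitly.

```python
def matmul_rank1(A, B):
    """Compute C = A * B as a sum of rank-1 (outer-product) updates.

    A is m x k and B is k x n, both given as lists of rows.
    """
    m, k = len(A), len(B)
    n = len(B[0])
    if any(len(row) != k for row in A):
        raise ValueError("inner dimensions do not match")

    C = [[0.0] * n for _ in range(m)]
    for p in range(k):
        # Step p adds the outer product of column p of A and row p of B.
        col = [A[i][p] for i in range(m)]
        row = B[p]
        for i in range(m):
            a = col[i]
            for j in range(n):
                C[i][j] += a * row[j]
    return C


def peak_gflops(num_pes, freq_mhz, flops_per_pe_per_cycle=2):
    """Peak rate in GFLOPS.

    Assumption (not from the source): each PE performs one multiply
    and one add per cycle, i.e. 2 floating-point operations.
    """
    return num_pes * flops_per_pe_per_cycle * freq_mhz * 1e6 / 1e9


if __name__ == "__main__":
    print(matmul_rank1([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    print(peak_gflops(40, 373))
```

Under the stated assumption, 40 PEs at 373 MHz give 40 × 2 × 373 MHz ≈ 29.84 GFLOPS, which is consistent with the 29.8 GFLOPS figure reported in the abstract.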
Cite this article
Kumar, V.B.Y., Joshi, S., Patkar, S.B. et al. FPGA Based High Performance Double-Precision Matrix Multiplication. Int J Parallel Prog 38, 322–338 (2010). https://doi.org/10.1007/s10766-010-0131-8
DOI: https://doi.org/10.1007/s10766-010-0131-8