Abstract
This paper explores the capability and flexibility of field-programmable gate arrays (FPGAs) for implementing variable-precision floating-point (VP) arithmetic. First, the VP exact dot product algorithm, which uses exact fixed-point operations to obtain an exact result, is presented. A VP multiply-and-accumulate unit (VPMAC) on FPGA is then proposed. In the proposed design, parallel multipliers generate the partial products of the mantissa multiplication, the most time-consuming part of a VP multiply-accumulate operation, concurrently. This method fully exploits the DSP resources on FPGAs to enhance the performance of the VPMAC unit. Several other schemes, such as a two-level RAM bank, carry-save accumulation, and partial summation, are used to achieve high frequency and pipeline throughput in the product accumulation stage. Typical algorithms from the Basic Linear Algebra Subprograms (i.e., vector dot product, general matrix-vector product, and general matrix-matrix product), LU decomposition, and Modified Gram–Schmidt QR decomposition are used to evaluate the performance of the VPMAC unit. Two schemes, called the VPMAC coprocessor and the matrix accelerator, are presented to implement these applications. Finally, prototypes of the VPMAC unit and of the matrix accelerator based on it are created on a Xilinx XC6VLX760 FPGA chip.
Compared with a parallel software implementation based on OpenMP running on an Intel Xeon Quad-core E5620 CPU, the VPMAC coprocessor, equipped with one VPMAC unit, achieves a maximum acceleration factor of 18X. Moreover, the matrix accelerator, which mainly consists of a linear array of eight processing elements, achieves 12X–65X better performance.
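The exact dot product idea described above can be illustrated in software: every product of two doubles is representable exactly as an integer significand times a power of two, so all products can be accumulated in a Kulisch-style wide fixed-point accumulator and rounded only once at the end. The sketch below is an assumption-laden software emulation using Python's arbitrary-precision integers, not the paper's hardware design; the function name `exact_dot` is ours.

```python
# Illustrative sketch (not the paper's VPMAC hardware): an exact dot
# product emulated with a Kulisch-style fixed-point accumulator, using
# Python's arbitrary-precision integers as the wide accumulator.
import math

def exact_dot(xs, ys):
    """Accumulate all products exactly as scaled integers; round once."""
    acc_num = 0   # integer significand of the accumulator
    acc_exp = 0   # common binary exponent: acc = acc_num * 2**acc_exp
    for x, y in zip(xs, ys):
        mx, ex = math.frexp(x)          # x = mx * 2**ex, 0.5 <= |mx| < 1
        my, ey = math.frexp(y)
        # A double's significand fits in 53 bits, so these are exact ints.
        num = int(mx * 2**53) * int(my * 2**53)
        exp = (ex - 53) + (ey - 53)     # product = num * 2**exp, exactly
        # Align accumulator and product to a common (smaller) exponent.
        if exp < acc_exp:
            acc_num <<= (acc_exp - exp)
            acc_exp = exp
        acc_num += num << (exp - acc_exp)
    # Single rounding at the very end (a careful implementation would
    # guard against over/underflow in this final conversion).
    return acc_num * 2.0 ** acc_exp

# Cancellation-heavy example: naive summation loses the small term.
xs = [1e16, 1.0, -1e16]
ys = [1.0, 1.0, 1.0]
print(exact_dot(xs, ys))                 # 1.0
print(sum(x * y for x, y in zip(xs, ys)))  # 0.0 (catastrophic cancellation)
```

The hardware version described in the paper obtains the same single-rounding guarantee by summing aligned products into a fixed-point register wide enough to cover the full double-precision exponent range, which is what makes carry-save accumulation and partial summation applicable.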
Acknowledgements
This work is partially supported by NSFC (grants 61125201, 61202127, and 60903057).
Cite this article
Lei, Y., Dou, Y., Dong, Y. et al. FPGA implementation of an exact dot product and its application in variable-precision floating-point arithmetic. J Supercomput 64, 580–605 (2013). https://doi.org/10.1007/s11227-012-0860-0