Measuring and Modeling the Power Consumption of Energy-Efficient FPGA Coprocessors for GEMM and FFT

Giefers, Heiner; Polig, Raphael; Hagleitner, Christoph

doi:10.1007/s11265-015-1057-6

Published: 15 October 2015

Volume 85, pages 307–323, (2016)

Journal of Signal Processing Systems

Abstract

In this paper we analyze the power consumption and energy efficiency of general matrix-matrix multiplication (GEMM) and Fast Fourier Transform (FFT) implemented as streaming applications for an FPGA-based coprocessor card. The power consumption is measured with internal voltage sensors, and the power draw is broken down by system component in order to attribute the energy consumed to the processor cores, the memory, the I/O links, and the FPGA card. We present an abstract model for estimating the power consumption of FPGA accelerators at the system level and validate the model using the measured kernels. Performance and energy consumption are compared against optimized multi-threaded software running on the POWER7 host CPUs. Our experimental results show that the accelerator can improve energy efficiency by an order of magnitude when the computations can be performed in a fixed-point format. With floating-point data, the measured gain in energy efficiency was up to 30% for the double-precision GEMM accelerator and up to 5× for a 1k complex FFT.
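As a rough illustration of the bookkeeping the abstract describes, the sketch below sums a per-component power breakdown into system power and derives energy efficiency as operations per joule. It is not the authors' model: the function name, the component keys, and every number are hypothetical placeholders, not measurements from the paper.

```python
def energy_efficiency(ops, runtime_s, component_power_w):
    """Return (total power in W, energy in J, operations per joule).

    component_power_w maps a component name to its average power draw.
    """
    total_power_w = sum(component_power_w.values())
    energy_j = total_power_w * runtime_s
    return total_power_w, energy_j, ops / energy_j


# Hypothetical placeholder values, NOT measurements from the paper.
cpu_only = {"cores": 100.0, "memory": 20.0, "io": 5.0, "fpga_card": 0.0}
with_fpga = {"cores": 30.0, "memory": 20.0, "io": 10.0, "fpga_card": 20.0}

# A dense n x n x n GEMM performs 2 * n**3 floating-point operations.
ops = 2 * 1024**3

# Same (assumed) runtime for both configurations, so only power differs.
_, _, eff_cpu = energy_efficiency(ops, 1.0, cpu_only)
_, _, eff_fpga = energy_efficiency(ops, 1.0, with_fpga)

# Relative gain in energy efficiency of the accelerated configuration.
gain = eff_fpga / eff_cpu
```

Keeping the breakdown as a dictionary makes it easy to see which component dominates the total before comparing configurations.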

Notes

BLAS Level-2 and Level-3 functions support in-place addition of the result matrix/vector and scaling via scalar parameters. The FPGA architecture as presented in this paper is optimized for the basic matrix multiplication but can be extended to support these features.
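For reference, the full BLAS GEMM operation is C ← αAB + βC, where the β term is the in-place addition and α, β are the scalar parameters mentioned in the note. A minimal pure-Python sketch of these semantics follows; the function name and the list-of-lists layout are illustrative assumptions and do not describe the FPGA design.

```python
def gemm(alpha, a, b, beta, c):
    """Update c in place: c <- alpha * (a @ b) + beta * c.

    a is m x k, b is k x n, c is m x n, all given as lists of rows.
    """
    m, k, n = len(a), len(b), len(b[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Scale the product and add the scaled previous value of c.
            c[i][j] = alpha * acc + beta * c[i][j]
    return c


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
result = gemm(2.0, a, b, 0.5, c)
```

Setting `alpha = 1` and `beta = 0` reduces this to the basic matrix multiplication for which the note says the architecture is optimized.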

Author information

Authors and Affiliations

IBM Research – Zurich, Rüschlikon, Switzerland
Heiner Giefers, Raphael Polig & Christoph Hagleitner

Corresponding author

Correspondence to Heiner Giefers.

About this article

Cite this article

Giefers, H., Polig, R. & Hagleitner, C. Measuring and Modeling the Power Consumption of Energy-Efficient FPGA Coprocessors for GEMM and FFT. J Sign Process Syst 85, 307–323 (2016). https://doi.org/10.1007/s11265-015-1057-6

Received: 09 November 2014
Revised: 18 June 2015
Accepted: 30 September 2015
Published: 15 October 2015
Issue Date: December 2016
DOI: https://doi.org/10.1007/s11265-015-1057-6
