Measuring and Modeling the Power Consumption of Energy-Efficient FPGA Coprocessors for GEMM and FFT

Journal of Signal Processing Systems

Abstract

In this paper, we analyze the power consumption and energy efficiency of general matrix-matrix multiplication (GEMM) and the Fast Fourier Transform (FFT) implemented as streaming applications for an FPGA-based coprocessor card. Power consumption is measured with internal voltage sensors, and the power draw is broken down by system component in order to attribute the energy consumed to the processor cores, the memory, the I/O links, and the FPGA card. We present an abstract model for estimating the power consumption of FPGA accelerators at the system level and validate the model using the measured kernels. Performance and energy consumption are compared against optimized multi-threaded software running on the POWER7 host CPUs. Our experimental results show that the accelerator can improve energy efficiency by an order of magnitude when the computations can be carried out in a fixed-point format. Using floating-point data, the gain in energy efficiency was measured as up to 30% for the double-precision GEMM accelerator and up to 5× for a 1k complex FFT.
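As a rough illustration of how an energy-efficiency figure of this kind can be derived, the sketch below computes energy to solution from an average power draw and a runtime, and expresses efficiency as operations per joule (GFLOPS/W). It is a minimal sketch and not the model presented in the paper: the function names are hypothetical, the GEMM operation count assumes the conventional 2n³ floating-point operations for square n × n matrices, and any inputs passed to it are placeholders rather than measured values.

```python
def energy_joules(avg_power_w, runtime_s):
    """Energy to solution: average power (W) times runtime (s)."""
    return avg_power_w * runtime_s


def gemm_flops(n):
    """Conventional operation count for an n x n by n x n GEMM (assumption)."""
    return 2 * n ** 3


def gflops_per_watt(flops, avg_power_w, runtime_s):
    """Energy efficiency as operations per joule, scaled to GFLOPS/W."""
    return flops / energy_joules(avg_power_w, runtime_s) / 1e9


def efficiency_gain(baseline_eff, accelerator_eff):
    """Ratio of accelerator efficiency to baseline efficiency."""
    return accelerator_eff / baseline_eff
```

A gain of 1.3 from `efficiency_gain` would correspond to a 30% improvement, and a gain of 5.0 to a 5× improvement.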

Notes

  1. BLAS Level-2 and Level-3 functions support in-place addition of the result matrix/vector and scaling via scalar parameters. The FPGA architecture as presented in this paper is optimized for the basic matrix multiplication but can be extended to support these features.
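The difference between the basic matrix multiplication and the full BLAS Level-3 operation described in this note can be sketched in plain Python. The sketch assumes the standard GEMM form C ← αAB + βC without transposition options; the function name `gemm` and the list-of-lists matrix layout are illustrative choices and say nothing about the FPGA architecture itself.

```python
def gemm(alpha, a, b, beta, c):
    """Update c in place with alpha * (a @ b) + beta * c.

    a is m x k, b is k x n, c is m x n, all as lists of rows.
    Setting alpha = 1 and beta = 0 gives the basic product a @ b.
    """
    m, k, n = len(a), len(b), len(b[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Scaling and in-place addition of the result matrix.
            c[i][j] = alpha * acc + beta * c[i][j]
    return c
```

With `alpha = 1` and `beta = 0`, the call reduces to the basic multiplication for which the architecture is optimized; other values exercise the scaling and in-place accumulation that the note lists as possible extensions.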

Author information

Corresponding author

Correspondence to Heiner Giefers.

About this article

Cite this article

Giefers, H., Polig, R. & Hagleitner, C. Measuring and Modeling the Power Consumption of Energy-Efficient FPGA Coprocessors for GEMM and FFT. J Sign Process Syst 85, 307–323 (2016). https://doi.org/10.1007/s11265-015-1057-6

  • DOI: https://doi.org/10.1007/s11265-015-1057-6
