Energy-Constrained Multiplication of Non-square Matrices on FPGA-Based SIMD-MIMD Hybrid Multi-core Processors

Journal of Signal Processing Systems

Abstract

Constantly growing demands on embedded systems for better performance, lower cost, more flexibility, longer battery life, better user experience, and shorter time-to-market (TTM) call for more flexible, higher-performance computing platforms. Although significant research results demonstrate the benefits of state-of-the-art FPGAs, many system designers have not yet adopted them because of their hardware-oriented design methodology and low portability. We believe that programmability supported by an established architecture is essential to close this gap. In this paper, we explore the multiplication of non-square matrices by exploiting the benefits of both SIMD (single-instruction, multiple-data) and MIMD (multiple-instruction, multiple-data) execution, which are simultaneously present in a reconfigurable and programmable multi-core processor. A novel memory design is proposed to facilitate data communication and to overlap computation with communication. With ever-increasing concerns about energy consumption, performance-energy trade-offs are often necessary, especially for embedded systems. Our performance-energy trade-off techniques give the user opportunities to meet the performance-energy challenges of various scenarios. Comprehensive experimental results on the Xilinx ML605 FPGA board featuring a Virtex-6 device demonstrate the effectiveness of the proposed approach.
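As a rough illustration of the general idea of splitting a non-square product across independent workers, the following minimal Python sketch partitions C = A × B into rectangular tiles and computes each tile as a separate task. It is not the algorithm, memory design, or energy technique of the paper, which the abstract does not detail. The names `block_product` and `tiled_matmul`, the tile sizes, and the use of a thread pool as a stand-in for independent cores are all assumptions made for illustration only.

```python
# Illustrative sketch only: NOT the paper's algorithm or memory design.
# Each tile of C is an independent task (loosely, the MIMD side); the
# inner dot product applies one operation across many elements
# (loosely, the SIMD side). All names here are hypothetical.
from concurrent.futures import ThreadPoolExecutor


def block_product(a_rows, b, col_start, col_end):
    """Multiply the given rows of A by columns [col_start, col_end) of B."""
    k = len(b)
    tile = []
    for row in a_rows:
        out = []
        for j in range(col_start, col_end):
            out.append(sum(row[p] * b[p][j] for p in range(k)))
        tile.append(out)
    return tile


def tiled_matmul(a, b, tile_rows=2, tile_cols=2, workers=4):
    """Compute C = A x B for an m-by-k A and a k-by-n B (m, k, n may differ)."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0] * n for _ in range(m)]
    tasks = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit one task per tile; edge tiles may be smaller than the others.
        for i in range(0, m, tile_rows):
            for j in range(0, n, tile_cols):
                i_end = min(i + tile_rows, m)
                j_end = min(j + tile_cols, n)
                fut = pool.submit(block_product, a[i:i_end], b, j, j_end)
                tasks.append((i, j, fut))
        # Copy each finished tile into its place in C.
        for i, j, fut in tasks:
            for di, row in enumerate(fut.result()):
                c[i + di][j:j + len(row)] = row
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]                   # 2 x 3
    b = [[1, 0, 2, 1], [0, 1, 1, 0], [1, 1, 0, 2]]  # 3 x 4
    print(tiled_matmul(a, b, tile_rows=1, tile_cols=3))
```

The tile shape is the tunable quantity in this sketch: because the matrices are not square, tiles at the right and bottom edges can be smaller than the rest, and the code handles that by clamping the tile bounds. Python threads are used here only to show the task decomposition; they do not model FPGA cores, timing, or energy.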


Author information

Corresponding author

Correspondence to Xiaofang (Maggie) Wang.

About this article

Cite this article

Wang, X. Energy-Constrained Multiplication of Non-square Matrices on FPGA-Based SIMD-MIMD Hybrid Multi-core Processors. J Sign Process Syst 80, 209–224 (2015). https://doi.org/10.1007/s11265-013-0867-7

  • DOI: https://doi.org/10.1007/s11265-013-0867-7
