Energy-Constrained Multiplication of Non-square Matrices on FPGA-Based SIMD-MIMD Hybrid Multi-core Processors

Wang, Xiaofang (Maggie)

doi:10.1007/s11265-013-0867-7

Published: 16 January 2014

Volume 80, pages 209–224, (2015)

Journal of Signal Processing Systems

Abstract

Constantly growing demands for embedded systems with better performance, lower cost, more flexibility, longer battery life, better user experience, and shorter time-to-market (TTM) call for more flexible, higher-performance computing platforms. Although significant research results demonstrate the benefits of state-of-the-art FPGAs, many system designers have not yet adopted them because of their hardware-oriented design methodology and low portability. We believe programmability supported by an established architecture is essential to close the gap. In this paper, we explore multiplication of non-square matrices by exploiting the benefits of both SIMD (single-instruction, multiple-data) and MIMD (multiple-instruction, multiple-data) execution, which are simultaneously present in a reconfigurable and programmable multi-core processor. A novel memory design is proposed to facilitate data communication and the overlap of computation and communication. With ever-increasing concerns about energy consumption, performance-energy trade-offs are often necessary, especially for embedded systems. Our performance-energy trade-off techniques give the user opportunities to meet the performance-energy challenges of various scenarios. Comprehensive experimental results on the Xilinx ML605 FPGA board featuring a Virtex-6 device demonstrate the effectiveness of the proposed approach.
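As a rough illustration of the problem the abstract describes, the sketch below multiplies an m x k matrix by a k x n matrix after splitting the rows of the first operand into contiguous blocks, one per core. It is only a sequential Python sketch under stated assumptions: the function names (`matmul`, `split_rows`, `blocked_matmul`) and the `cores` parameter are hypothetical, the row-block partitioning is a generic textbook scheme rather than the algorithm proposed in the paper, and nothing here models the SIMD-MIMD hardware, the memory design, or energy.

```python
def matmul(a, b):
    """Reference product of an m x k matrix and a k x n matrix (lists of rows)."""
    k = len(b)
    n = len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(k)) for j in range(n)] for row in a]


def split_rows(m, cores):
    """Split m rows into `cores` contiguous ranges whose sizes differ by at most one."""
    base, extra = divmod(m, cores)
    ranges, start = [], 0
    for c in range(cores):
        size = base + (1 if c < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def blocked_matmul(a, b, cores):
    """Multiply a (m x k) by b (k x n), one row block of a per hypothetical core."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    result = []
    for lo, hi in split_rows(len(a), cores):
        # The blocks are independent of each other, so they could run in parallel;
        # this sketch simply processes them one after another.
        result.extend(matmul(a[lo:hi], b))
    return result


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]    # 2 x 3
    b = [[1, 0], [0, 1], [1, 1]]  # 3 x 2
    print(blocked_matmul(a, b, cores=2))
```

For non-square operands, the block sizes returned by `split_rows` can be uneven when the row count is not a multiple of the core count, which is one reason the partitioning of arbitrary-shape matrices needs more care than the square case.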

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Villanova University, 800 Lancaster Ave, Villanova, PA, 19085, USA
Xiaofang (Maggie) Wang

Corresponding author

Correspondence to Xiaofang (Maggie) Wang.

About this article

Cite this article

Wang, X. Energy-Constrained Multiplication of Non-square Matrices on FPGA-Based SIMD-MIMD Hybrid Multi-core Processors. J Sign Process Syst 80, 209–224 (2015). https://doi.org/10.1007/s11265-013-0867-7

Received: 18 May 2012
Revised: 18 December 2013
Accepted: 19 December 2013
Published: 16 January 2014
Issue Date: August 2015
DOI: https://doi.org/10.1007/s11265-013-0867-7
