Abstract
Constantly growing demands on embedded systems for better performance, lower cost, more flexibility, longer battery life, better user experience, and shorter time-to-market (TTM) call for more flexible, higher-performance computing platforms. Although research has demonstrated significant benefits of state-of-the-art FPGAs, they have not yet been widely adopted by system designers because of their hardware-oriented design methodology and low portability. We believe that programmability supported by an established architecture is essential to close this gap. In this paper, we explore the multiplication of non-square matrices by exploiting the benefits of both SIMD (single-instruction, multiple-data) and MIMD (multiple-instruction, multiple-data) parallelism, which are simultaneously present in a reconfigurable and programmable multi-core processor. A novel memory design is proposed to facilitate data communication and to overlap computation with communication. With ever-increasing concerns about energy consumption, performance-energy trade-offs are often necessary, especially for embedded systems. Our performance-energy trade-off techniques give the user opportunities to meet the performance and energy requirements of various scenarios. Comprehensive experimental results on the Xilinx ML605 FPGA board, which features a Virtex-6 device, demonstrate the effectiveness of the proposed approach.
Wang, X. Energy-Constrained Multiplication of Non-square Matrices on FPGA-Based SIMD-MIMD Hybrid Multi-core Processors. J Sign Process Syst 80, 209–224 (2015). https://doi.org/10.1007/s11265-013-0867-7
DOI: https://doi.org/10.1007/s11265-013-0867-7