Hardware–software optimizations of reconfigurable multi-core processors for floating-point computations of large sparse matrices

  • Special Issue
Journal of Real-Time Image Processing

Abstract

State-of-the-art field-programmable gate array (FPGA) technologies offer exciting opportunities to develop more flexible, less expensive, and higher-performance floating-point computing platforms for embedded systems. To better harness the power of FPGAs and make them accessible to more system designers, we investigate the unique advantages and optimization opportunities, in both software and hardware, offered by multi-core processors on a programmable chip (MPoPCs). In this paper, we present our hardware customization and dynamic software scheduling solutions for LU factorization of large sparse matrices on in-house developed MPoPCs. Theoretical analysis is provided to guide the design. We present implementation results on an Altera Stratix III FPGA for five benchmark matrices of size up to 7,917 × 7,917. Our hardware customization alone reduces the execution time by up to 17.22%. The integrated hardware–software optimization improves the speedup by an average of 60.30%.

Author information

Corresponding author

Correspondence to Xiaofang Wang.

About this article

Cite this article

Wang, X. Hardware–software optimizations of reconfigurable multi-core processors for floating-point computations of large sparse matrices. J Real-Time Image Proc 9, 187–204 (2014). https://doi.org/10.1007/s11554-012-0277-2

  • DOI: https://doi.org/10.1007/s11554-012-0277-2
