Abstract
State-of-the-art field-programmable gate array (FPGA) technologies offer opportunities to build floating-point computing platforms for embedded systems that are more flexible, less expensive, and higher performing. To better harness the power of FPGAs and to make them accessible to more system designers, we investigate the advantages and optimization opportunities, in both software and hardware, offered by multi-core processors on a programmable chip (MPoPCs). In this paper, we present hardware customization and dynamic software scheduling solutions for LU factorization of large sparse matrices on in-house developed MPoPCs. Theoretical analysis is provided to guide the design. We report implementation results on an Altera Stratix III FPGA for five benchmark matrices of size up to 7,917 × 7,917. Hardware customization alone reduces the execution time by up to 17.22%. The integrated hardware–software optimization improves the speedup by an average of 60.30%.
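For orientation, the kernel named in the abstract can be sketched in a few lines. The Python below is a minimal illustration of right-looking LU factorization on a row-wise sparse representation. It is not the MPoPC implementation described in the paper. The function name `sparse_lu`, the dict-of-rows storage, and the absence of pivoting are all assumptions made for the example.

```python
def sparse_lu(rows):
    """Factor A = L * U without pivoting (illustrative sketch only).

    rows: list of dicts; rows[i][j] holds A[i][j] for nonzero entries only.
    Returns (L, U) as lists of dicts. L has an implicit unit diagonal.
    Assumes every pivot U[k][k] is present and nonzero.
    """
    n = len(rows)
    U = [dict(r) for r in rows]        # working copy that becomes U
    L = [dict() for _ in range(n)]     # strictly lower-triangular factors
    for k in range(n):
        pivot = U[k][k]
        for i in range(k + 1, n):
            a_ik = U[i].pop(k, 0.0)    # entry below the pivot, if any
            if a_ik == 0.0:
                continue               # sparsity: row i is untouched at step k
            factor = a_ik / pivot
            L[i][k] = factor
            # Update row i with the pivot row; new keys created here are fill-in.
            for j, u_kj in U[k].items():
                if j > k:
                    U[i][j] = U[i].get(j, 0.0) - factor * u_kj
    return L, U
```

Rows whose entry below the pivot is zero are skipped entirely, so the work per elimination step varies with the sparsity pattern. Uneven work of this kind is what makes run-time (dynamic) scheduling across cores attractive for sparse LU.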
About this article
Cite this article
Wang, X. Hardware–software optimizations of reconfigurable multi-core processors for floating-point computations of large sparse matrices. J Real-Time Image Proc 9, 187–204 (2014). https://doi.org/10.1007/s11554-012-0277-2
DOI: https://doi.org/10.1007/s11554-012-0277-2