Hardware–software optimizations of reconfigurable multi-core processors for floating-point computations of large sparse matrices

Wang, Xiaofang

doi:10.1007/s11554-012-0277-2

Special Issue
Published: 28 September 2012

Volume 9, pages 187–204, (2014)

Journal of Real-Time Image Processing

Abstract

State-of-the-art field-programmable gate array (FPGA) technologies offer new opportunities to build floating-point computing platforms for embedded systems that are more flexible, less expensive, and higher-performing. To better harness the full power of FPGAs and to make them accessible to more system designers, we investigate the unique advantages and optimization opportunities, in both software and hardware, offered by multi-core processors on a programmable chip (MPoPCs). In this paper, we present our hardware customization and dynamic software scheduling solutions for LU factorization of large sparse matrices on MPoPCs developed in-house. Theoretical analysis is provided to guide the design. We report implementation results on an Altera Stratix III FPGA for five benchmark matrices of size up to 7,917 × 7,917. Hardware customization alone reduces the execution time by up to 17.22%. The integrated hardware–software optimization improves the speedup by an average of 60.30%.
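
As a reading aid for the first figure above, a reduction in execution time can be restated as a speedup factor relative to the uncustomized design. The sketch below assumes the conventional definition, speedup = baseline time / optimized time; the function name is hypothetical and does not come from the paper. It applies only to the 17.22% execution-time figure. The 60.30% figure is already stated in terms of speedup, and the abstract does not give the baseline needed to convert it.

```python
def speedup_from_time_reduction(reduction_pct: float) -> float:
    """Convert a percentage reduction in execution time into a speedup factor.

    Assumes speedup = T_baseline / T_optimized, where
    T_optimized = T_baseline * (1 - reduction_pct / 100).
    """
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction_pct must be in the range [0, 100)")
    remaining_fraction = 1.0 - reduction_pct / 100.0
    return 1.0 / remaining_fraction


if __name__ == "__main__":
    # Best-case execution-time reduction quoted in the abstract.
    print(f"{speedup_from_time_reduction(17.22):.3f}")  # prints 1.208
```

Under that assumption, the best-case 17.22% reduction corresponds to running roughly 1.21 times faster than the baseline hardware configuration.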

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Villanova University, 800 Lancaster Ave, Villanova, PA, 19085, USA
Xiaofang Wang

Corresponding author

Correspondence to Xiaofang Wang.

About this article

Cite this article

Wang, X. Hardware–software optimizations of reconfigurable multi-core processors for floating-point computations of large sparse matrices. J Real-Time Image Proc 9, 187–204 (2014). https://doi.org/10.1007/s11554-012-0277-2

Received: 06 February 2012
Accepted: 06 September 2012
Published: 28 September 2012
Issue Date: March 2014
DOI: https://doi.org/10.1007/s11554-012-0277-2
