Scalability and Parallel Execution of Warp Processing: Dynamic Hardware/Software Partitioning

Lysecky, Roman

doi:10.1007/s10766-008-0079-0

Published: 19 September 2008

Volume 36, pages 478–492 (2008)
International Journal of Parallel Programming

Abstract

Warp processors are a novel architecture capable of autonomously optimizing an executing application by dynamically re-implementing critical kernels within the software as custom hardware circuits in an on-chip FPGA. Previous research on warp processing focused on low-power embedded systems, incorporating a low-end ARM processor as the main software execution resource. We provide a thorough analysis of the scalability of warp processing by evaluating several possible warp processor implementations, from low-power to high-performance, and by evaluating the potential for parallel execution of the partitioned software and hardware. We further demonstrate that, even with a high-performance 1 GHz embedded processor, warp processing provides the equivalent performance of a 2.4 GHz processor. By further enabling parallel execution between the processor and the FPGA, parallel warp processor execution provides the equivalent performance of a 3.2 GHz processor.
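
The "equivalent performance" figures in the abstract can be read as speedups over the 1 GHz baseline: 2.4x when software and hardware run one after the other, and 3.2x when they overlap. The sketch below illustrates how such a figure could be derived from a simple timing model. It is not the evaluation method of the article: the function name, the parameters `kernel_fraction` and `kernel_speedup`, and the assumption that overlapped execution is limited only by the slower of the two sides are all hypothetical simplifications chosen for illustration.

```python
def equivalent_frequency_ghz(base_ghz, kernel_fraction, kernel_speedup, parallel=False):
    """Estimate the clock rate a software-only processor would need to match
    a warp processor, under a simplified (hypothetical) timing model.

    base_ghz        -- clock rate of the main processor
    kernel_fraction -- share of the original run time spent in the kernels
                       that are moved to the FPGA (0.0 to 1.0)
    kernel_speedup  -- how much faster those kernels run in hardware
    parallel        -- if True, assume the processor and the FPGA overlap
                       perfectly (an idealized upper bound)
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")

    # Normalize the original software-only run time to 1.0.
    software_time = 1.0 - kernel_fraction
    hardware_time = kernel_fraction / kernel_speedup

    if parallel:
        # Idealized overlap: total time is set by the slower side.
        total_time = max(software_time, hardware_time)
    else:
        # Sequential execution: the processor waits while the FPGA runs.
        total_time = software_time + hardware_time

    # A speedup of 1 / total_time scales the baseline clock rate.
    return base_ghz / total_time


if __name__ == "__main__":
    # Illustrative inputs only; these are not measurements from the article.
    print(equivalent_frequency_ghz(1.0, 0.5, 2.0))
    print(equivalent_frequency_ghz(1.0, 0.5, 2.0, parallel=True))
```

With these made-up inputs, the sequential case reduces to the familiar Amdahl-style bound, while the parallel case shows why overlapping the processor and the FPGA can raise the equivalent clock rate further. Real applications have data dependencies between software and hardware, so the perfectly overlapped case should be treated as an upper bound.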

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85721, USA
Roman Lysecky

Corresponding author

Correspondence to Roman Lysecky.

About this article

Cite this article

Lysecky, R. Scalability and Parallel Execution of Warp Processing: Dynamic Hardware/Software Partitioning. Int J Parallel Prog 36, 478–492 (2008). https://doi.org/10.1007/s10766-008-0079-0

Received: 04 March 2008
Accepted: 23 July 2008
Published: 19 September 2008
Issue Date: October 2008
DOI: https://doi.org/10.1007/s10766-008-0079-0
