
Scalability and Parallel Execution of Warp Processing: Dynamic Hardware/Software Partitioning

International Journal of Parallel Programming

Abstract

Warp processors are a novel architecture capable of autonomously optimizing an executing application by dynamically re-implementing critical kernels within the software as custom hardware circuits in an on-chip FPGA. Previous research on warp processing focused on low-power embedded systems, incorporating a low-end ARM processor as the main software execution resource. We provide a thorough analysis of the scalability of warp processing by evaluating several possible warp processor implementations, from low-power to high-performance, and by evaluating the potential for parallel execution of the partitioned software and hardware. We further demonstrate that, even with a high-performance 1 GHz embedded processor, warp processing provides the equivalent performance of a 2.4 GHz processor. By further enabling parallel execution between the processor and the FPGA, the parallel warp processor provides the equivalent performance of a 3.2 GHz processor.
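The equivalent-frequency figures in the abstract can be read as speedups over the 1 GHz baseline. The following is a minimal sketch of that arithmetic, assuming that performance scales linearly with clock frequency (an assumption made here for illustration, not a claim from the article). The function and variable names are hypothetical.

```python
def implied_speedup(base_ghz: float, equivalent_ghz: float) -> float:
    """Speedup implied by an equivalent clock frequency.

    Assumes performance scales linearly with clock frequency,
    which is a simplification made for this illustration.
    """
    return equivalent_ghz / base_ghz


BASE_GHZ = 1.0            # baseline embedded processor (from the abstract)
WARP_EQUIV_GHZ = 2.4      # warp processing, equivalent frequency (from the abstract)
PARALLEL_EQUIV_GHZ = 3.2  # parallel processor/FPGA execution (from the abstract)

warp_speedup = implied_speedup(BASE_GHZ, WARP_EQUIV_GHZ)
parallel_speedup = implied_speedup(BASE_GHZ, PARALLEL_EQUIV_GHZ)

# Additional gain attributable to running the processor and FPGA in parallel.
parallel_gain = parallel_speedup / warp_speedup

print(f"warp speedup:          {warp_speedup:.2f}x")
print(f"parallel warp speedup: {parallel_speedup:.2f}x")
print(f"gain from parallelism: {parallel_gain:.2f}x")
```

Under this reading, parallel execution contributes roughly a further one-third improvement on top of the sequential warp processor.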



Author information

Corresponding author

Correspondence to Roman Lysecky.


About this article

Cite this article

Lysecky, R. Scalability and Parallel Execution of Warp Processing: Dynamic Hardware/Software Partitioning. Int J Parallel Prog 36, 478–492 (2008). https://doi.org/10.1007/s10766-008-0079-0


  • DOI: https://doi.org/10.1007/s10766-008-0079-0
