Skip to main content

Advertisement

Log in

A Dynamic Modulo Scheduling with Binary Translation: Loop optimization with software compatibility

  • Published:
Journal of Signal Processing Systems Aims and scope Submit manuscript

Abstract

In the past years, many works have demonstrated the applicability of Coarse-Grained Reconfigurable Array (CGRA) accelerators to optimize loops by using software pipelining approaches. They are proven to be effective in reducing the total execution time of multimedia and signal processing applications. However, the run-time reconfigurability of CGRAs is hampered overheads introduced by the needed translation and mapping steps. In this work, we present a novel run-time translation technique for the modulo scheduling approach that can convert binary code on-the-fly to run on a CGRA. We propose a greedy approach, since the modulo scheduling for CGRA is an NP-complete problem. In addition to read-after-write dependencies, the dynamic modulo scheduling faces new challenges, such as register insertion to solve recurrence dependences and to balance the pipelining paths. Our results demonstrate that the greedy run-time algorithm can reach a near-optimal ILP rate, better than an off-line compiler approach for a 16-issue VLIW processor. The proposed mechanism ensures software compatibility as it supports different source ISAs. As proof of concept of scaling, a change in the memory bandwidth has been evaluated. In this analysis it is demonstrated that when changing from one memory access per cycle to two memory accesses per cycle, the modulo scheduling algorithm is able to exploit this increase in memory bandwidth and enhance performance accordingly. Additionally, to measure area and performance, the proposed CGRA was prototyped on an FPGA. The area comparisons show that a crossbar CGRA (with 16 processing elements and including an 4-issue VLIW host processor) is only 1.11 × bigger than a standalone 8-issue VLIW softcore processor.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Figure 13
Figure 14
Figure 15
Figure 16
Figure 17
Figure 18
Figure 19
Figure 20
Figure 21
Figure 22
Figure 23
Figure 24
Figure 25
Figure 26

Similar content being viewed by others

References

  1. Ahn, M., Yoon, J.W., Paek, Y., Kim, Y., Kiemb, M., Choi, K. (2006). A spatial mapping algorithm for heterogeneous coarse-grained reconfigurable architectures. In: Proceedings DATE, pp. 363–368.

  2. Arnold, O., Matus, E., Noethen, B., Winter, M., Limberg, T., Fettweis, G. (2014). Tomahawk: Parallelism and heterogeneity in communications signal processing mpsocs. ACM Transactions on Embedded Computing Systems, 13(3s), 107:1–107:241.

    Article  Google Scholar 

  3. Beck, A.C.S., Rutzig, M.B., Gaydadjiev, G., Carro, L. (2008). Transparent reconfigurable acceleration for heterogeneous embedded applications. In: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 1208–1213.

  4. Bispo, J., Paulino, N., Cardoso, J.M., Ferreira, J.C. (2013). Transparent runtime migration of loop-based traces of processor instructions to reconfigurable processing units. International Journal of Reconfigurable Computing

  5. Bispo, J., Paulino, N., Ferreira, J., Cardoso, J. (2012). Transparent trace-based binary acceleration for reconfigurable hw/sw systems. IEEE Transactions on Industrial Informatics.

  6. Boppu, S., Hannig, F., Teich, J. (2014). Compact code generation for tightly-coupled processor arrays. Journal of Signal Processing Systems, 77(1-2), 5–29.

    Article  Google Scholar 

  7. Bouwens, F., Berekovic, M., Kanstein, A., Gaydadjiev, G. (2007). Architectural exploration of the adres coarse-grained reconfigurable array. In Proceedings ARC pp. 1–13.

  8. Chen, L., & Mitra, T. (2012). Graph minor approach for application mapping on cgras. In: Proceedings FPT.

  9. De Sutter, B., Coene, P., Vander Aa, T., Mei, B. (2008). Placement-and-routing-based register allocation for coarse-grained reconfigurable arrays. In Proceedings LCTES (pp. 151–160).

  10. Ferreira, R., Duarte, V., Meireles, W., Pereira, M., Carro, L., Wong, S. (2013). A just-in-time modulo scheduling for virtual coarse-grained reconfigurable architectures. In SAMOS XIII.

  11. Ferreira, R., Vendramini, J.G., Mucida, L., Pereira, M.M., Carro, L. (2011). An fpga-based heterogeneous coarse-grained dynamically reconfigurable architecture. In: Proceedings CASES.

  12. Friedman, S., Carroll, A., Van Essen, B., Ylvisaker, B., Ebeling, C., Hauck, S. (2009). Spr: an architecture-adaptive cgra mapping tool. In: Proceeding of the ACM/SIGDA international symposium on field programmable gate arrays, FPGA ’09 (pp. 191–200). New York: ACM

  13. Goel, N., Kumar, A., Panda, P.R. (2014). Shared-port register file architecture for low-energy vliw processors. ACM Transactions Architectural Code Optimization, 11(1).

  14. Hamzeh, M., Shrivastava, A., Vrudhula, S. (2012). Epimap: Using epimorphism to map applications on CGRAs. In Proceeding of DAC conference (pp. 1280–1287).

  15. Hamzeh, M., Shrivastava, A., Vrudhula, S. (2014). Branch-aware loop mapping on CGRAs. In Proceeding of DAC conference on design automation conference (pp. 1–6). ACM.

  16. Hamzeh, M., Shrivastava, A., Vrudhula, S.B. (2013). Regimap: register-aware application mapping on coarse-grained reconfigurable architectures (CGRAs). In Proceeding of DAC conference (p. 18).

  17. Hartenstein, R. (2001). Coarse grain reconfigurable architecture (embedded tutorial). In: Proceedings of the 2001 asia and south pacific design automation conference, ASP-DAC ’01.

  18. Hatanaka, A., & Bagherzadeh, N. (2007). A modulo scheduling algorithm for a coarse-grain reconfigurable array template. In IPDPS 2007 (pp. 1–8).

  19. Hoogerbrugge, J., & Corporaal, H. (1994). Register file port requirements of transport triggered architectures. In: Proceedings of the 27th annual international symposium on microarchitecture. (pp. 191–195). ACM.

  20. Jääskeläinen, P., Kultala, H., Viitanen, T., Takala, J. (2014). Code density and energy efficiency of exposed datapath architectures. Journal of Signal Processing Systems, 1–16.

  21. Kim, Y., Lee, J., Shrivastava, A., Yoon, J., Cho, D., Paek, Y. (2011). High throughput data mapping for coarse-grained reconfigurable architectures. IEEE Transactions on CAD of International Circuits and Systems, 30 (11), 1599 –1609. doi:10.1109/TCAD.2011.2161217.

  22. Laboratories, & H.P. (2014). Vex toolchain. http://www.hpl.hp.com/downloads/vex/.

  23. Lee, G., Choi, K., Dutt, N. (2011). Mapping multi-domain applications onto coarse-grained reconfigurable architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(5), 637–650.

    Article  Google Scholar 

  24. Lin, T.J., Chen, S.K., Kuo, Y.T., Liu, C.W., Hsiao, P.C. (2008). Design and implementation of a high-performance and complexity-effective vliw dsp for multimedia applications. Journal of Signal Processing Systems, 51(3), 209–223.

    Article  Google Scholar 

  25. Loeffler, C., Ligtenberg, A., Moschytz, G.S. (1989). Practical fast 1-d dct algorithms with 11 multiplications. IEEE international conference on acoustics, speech, and signal processing, 1989. ICASSP-89 (pp. 988–991).

  26. McCool, M. (2007). Signal processing and general-purpose computing and gpus [exploratory dsp]. IEEE Signal Processing Magazine, 24(3), 109–114.

    Article  MathSciNet  Google Scholar 

  27. Mei, B., Vernalde, S., Verkest, D., De Man, H., Lauwereins, R. (2002). Dresc: a retargetable compiler for coarse-grained reconfigurable architectures. In Proceedings FPT pp. 166–173.

  28. Mei, B., Vernalde, S., Verkest, D., Man, H.D., Lauwereins, R. (2003). Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling. In: Proceedings DATE.

  29. Oh, T., Egger, B., Park, H., Mahlke, S. (2009). Recurrence cycle aware modulo scheduling for coarse-grained reconfigurable architectures. In Proceedings LCTES pp. 21–30.

  30. Paek, J.K., Choi, K., Lee, J. (2011). Binary acceleration using coarse-grained reconfigurable architecture. SIGARCH Computers Architecture News, 38(4), 33–39.

    Article  Google Scholar 

  31. Park, H., Fan, K., Kudlur, M., Mahlke, S. (2006). Modulo graph embedding: mapping applications onto coarse-grained reconfigurable architectures. In Proceedings CASES (pp. 136–146).

  32. Park, H., Fan, K., Mahlke, S.A., Oh, T., Kim, H., Kim, H.s. (2008). Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. In: Proceedings PACT.

  33. Park, H., Park, Y., Mahlke, S. (2009). Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications. In Proceedings MICRO (pp. 370–380).

  34. Rau, B.R. (1994). Iterative modulo scheduling: an algorithm for software pipelining loops. In Proceedings MICRO (pp. 63–74).

  35. Wong, S., Van As, T., Brown, G. (2008). ρ-vex: A reconfigurable and extensible softcore vliw processor. In International conference on field-programmable technology FPT (pp. 369–372). IEEE.

  36. Yoon, J., Shrivastava, A., Park, S., Ahn, M., Jeyapaul, R., Paek, Y. (2008). Spkm: A novel graph drawing based algorithm for application mapping onto coarse-grained reconfigurable architectures. In Proceedings ASPDAC (pp. 776–782).

  37. Zhou, L., Liu, H., Zhang, J. (2013). Loop acceleration by cluster-based cgra. IEICE Electronics Express, 10(16).

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ricardo Ferreira.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Ferreira, R., Denver, W., Pereira, M. et al. A Dynamic Modulo Scheduling with Binary Translation: Loop optimization with software compatibility. J Sign Process Syst 85, 45–66 (2016). https://doi.org/10.1007/s11265-015-0974-8

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11265-015-0974-8

Keywords

Navigation