A Dynamic Modulo Scheduling with Binary Translation: Loop optimization with software compatibility

Ferreira, Ricardo; Denver, Waldir; Pereira, Monica; Wong, Stephan; Lisbȏa, Carlos A.; Carro, Luigi

doi:10.1007/s11265-015-0974-8

Published: 17 February 2015

Volume 85, pages 45–66, (2016)

Journal of Signal Processing Systems

Ricardo Ferreira¹, Waldir Denver¹, Monica Pereira², Stephan Wong³, Carlos A. Lisbȏa⁴ & Luigi Carro⁴

Abstract

In recent years, many works have demonstrated the applicability of Coarse-Grained Reconfigurable Array (CGRA) accelerators to optimize loops by using software pipelining approaches. They have proven effective in reducing the total execution time of multimedia and signal processing applications. However, the run-time reconfigurability of CGRAs is hampered by overheads introduced by the needed translation and mapping steps. In this work, we present a novel run-time translation technique for the modulo scheduling approach that can convert binary code on-the-fly to run on a CGRA. We propose a greedy approach, since modulo scheduling for CGRAs is an NP-complete problem. In addition to read-after-write dependencies, dynamic modulo scheduling faces new challenges, such as register insertion to solve recurrence dependences and to balance the pipelining paths. Our results demonstrate that the greedy run-time algorithm can reach a near-optimal ILP rate, better than an off-line compiler approach for a 16-issue VLIW processor. The proposed mechanism ensures software compatibility, as it supports different source ISAs. As a proof of concept of scaling, a change in the memory bandwidth has been evaluated. This analysis demonstrates that, when changing from one memory access per cycle to two memory accesses per cycle, the modulo scheduling algorithm is able to exploit the increase in memory bandwidth and enhance performance accordingly. Additionally, to measure area and performance, the proposed CGRA was prototyped on an FPGA. The area comparisons show that a crossbar CGRA (with 16 processing elements and including a 4-issue VLIW host processor) is only 1.11× larger than a standalone 8-issue VLIW softcore processor.
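
The memory bandwidth observation can be illustrated with the standard lower bound on the initiation interval (II) used in modulo scheduling. The sketch below is not the authors' greedy algorithm; it only computes the textbook bound MII = max(ResMII, RecMII). The loop body, the resource names ("mem", "alu"), and all operation counts are hypothetical values chosen for illustration.

```python
from math import ceil


def res_mii(op_counts, unit_counts):
    """Resource-constrained lower bound on the initiation interval.

    For each resource class, the loop body needs at least
    ceil(operations / available units) cycles per iteration.
    """
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)


def rec_mii(recurrences):
    """Recurrence-constrained lower bound on the initiation interval.

    Each recurrence is a pair (total latency of the dependence cycle,
    iteration distance it spans). With no recurrences the bound is 1.
    """
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)


def min_ii(op_counts, unit_counts, recurrences):
    """Minimum initiation interval: the larger of the two bounds."""
    return max(res_mii(op_counts, unit_counts), rec_mii(recurrences))


# Hypothetical loop body (assumed numbers, not taken from the article):
# 4 memory operations, 12 ALU operations, and one accumulator
# recurrence with latency 1 carried across 1 iteration.
loop_ops = {"mem": 4, "alu": 12}
recurrences = [(1, 1)]

# Assume 16 processing elements and either 1 or 2 memory accesses per cycle.
one_port = min_ii(loop_ops, {"mem": 1, "alu": 16}, recurrences)
two_ports = min_ii(loop_ops, {"mem": 2, "alu": 16}, recurrences)
print(one_port, two_ports)  # prints "4 2"
```

In this assumed loop the memory port is the bottleneck, so doubling the memory accesses per cycle halves the lower bound on the II. A loop limited by a recurrence or by ALU operations would not see the same gain.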

Author information

Authors and Affiliations

Universidade Federal de Viçosa, Viçosa, Minas Gerais, Brazil
Ricardo Ferreira & Waldir Denver
Universidade Federal do Rio Grande do Norte, Natal, Rio Grande do Norte, Brazil
Monica Pereira
TU Delft, Delft, Netherlands
Stephan Wong
Universidade Federal do Rio Grande do Sul, Porto Alegre, Rio Grande do Sul, Brazil
Carlos A. Lisbȏa & Luigi Carro

Corresponding author

Correspondence to Ricardo Ferreira.

About this article

Cite this article

Ferreira, R., Denver, W., Pereira, M. et al. A Dynamic Modulo Scheduling with Binary Translation: Loop optimization with software compatibility. J Sign Process Syst 85, 45–66 (2016). https://doi.org/10.1007/s11265-015-0974-8

Received: 20 October 2014
Revised: 26 December 2014
Accepted: 21 January 2015
Published: 17 February 2015
Issue Date: October 2016
DOI: https://doi.org/10.1007/s11265-015-0974-8
