Compact Code Generation for Tightly-Coupled Processor Arrays

Boppu, Srinivas; Hannig, Frank; Teich, Jürgen

doi:10.1007/s11265-014-0891-2

Published: 31 May 2014

Volume 77, pages 5–29, (2014)

Journal of Signal Processing Systems

Abstract

In this paper, we consider programmable tightly-coupled processor arrays consisting of interconnected small, light-weight VLIW cores that can exploit both loop-level and instruction-level parallelism. These arrays are well suited for compute-intensive nested loop applications and often provide higher power and area efficiency than commercial off-the-shelf processors. They are ideal candidates for accelerating nested loop programs in future heterogeneous systems, where energy efficiency is one of the most important design goals for the overall system-on-chip design. In this context, we present a novel design methodology for mapping nested loop programs onto such processor arrays. Key features of our approach are: (1) design entry in the form of a functional programming language and loop parallelization in the polyhedron model, and (2) support for zero-overhead looping not only for innermost loops but also for arbitrarily nested loops. Processors of such arrays are often limited in instruction memory size to reduce area and power consumption. Hence, (3) we present methods for code compaction and code generation and have integrated these methods into a design tool. Finally, (4) we evaluated selected benchmarks by comparing our code generator with the Trimaran and VEX compiler frameworks. As the results show, our approach can reduce the size of the generated processor code by up to 64 % (Trimaran) and 55 % (VEX) while at the same time achieving significantly higher throughput.

Notes

In practice, n is usually 2 or 3.
An iteration vector consists of the iteration variables of a loop; e.g., for a two-dimensional loop nest with iteration variables i and j, the iteration vector is I = (i j)^T.
Moldovan [27] already uses a similar linear transformation, but he requires the transformation to be bijective. The same definition is used by Lengauer [22], who refers to it as space-time mapping.
These are similar to rotating registers, used mainly to store the value of variables to deal with loop-carried data dependencies.
The iteration interval II denotes the number of clock cycles between the evaluation of two successive iterations.
If G_CF contains a node with a self-edge, the instructions from different iterations of that node can be packed contiguously, since more than one iteration is executed consecutively. The number of consecutive iterations is given by the sum of the weights of all incoming edges divided by the number of times execution enters the node. If there is no self-edge, instructions from different iterations cannot be packed contiguously, since only one iteration is executed.
The average iteration interval is the average time between the start of two successive loop iterations. It is calculated by dividing the total execution time of a loop nest GL by the total number of iterations executed.
The overhead indicates the amount of time spent executing code other than the innermost loop, relative to the total execution time.
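The last two notes define simple ratios, and a small worked example may make them concrete. The following Python sketch is only an illustration under stated assumptions: the function names and the cycle counts are hypothetical and do not come from the article, and the overhead is read here as the fraction of total cycles spent outside the innermost loop, which is one plausible interpretation of the note.

```python
def average_iteration_interval(total_cycles: int, iterations: int) -> float:
    """Average number of clock cycles between the starts of successive iterations.

    Computed as total execution time of the loop nest divided by the
    total number of iterations executed.
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    return total_cycles / iterations


def overhead_fraction(total_cycles: int, innermost_cycles: int) -> float:
    """Fraction of the total execution time spent outside the innermost loop.

    Assumption: overhead = (total - innermost) / total.
    """
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    if not 0 <= innermost_cycles <= total_cycles:
        raise ValueError("innermost_cycles must lie in [0, total_cycles]")
    return (total_cycles - innermost_cycles) / total_cycles


if __name__ == "__main__":
    # Hypothetical measurements, chosen only for illustration.
    total_cycles = 1200      # cycles for the whole loop nest
    iterations = 1000        # iterations executed in total
    innermost_cycles = 1000  # cycles spent inside the innermost loop

    print(average_iteration_interval(total_cycles, iterations))  # 1.2
    print(round(overhead_fraction(total_cycles, innermost_cycles), 4))  # 0.1667
```

With these made-up numbers, an average iteration interval of 1.2 against an innermost iteration interval of 1 would mean that about one sixth of the run time is spent on outer-loop control, which is the kind of cost that zero-overhead looping for arbitrarily nested loops is meant to reduce.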

References

Boppu, S., Hannig, F., Teich, J. (2013). Loop program mapping and compact code generation for programmable hardware accelerators. In Proceedings of the 24th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), (pp. 10–17): IEEE.
Dutta, H., Hannig, F., Teich, J. (2006). Hierarchical partitioning for piecewise linear algorithms. In Proceedings of the 5th International Conference on Parallel Computing in Electrical Engineering (PARELEC), (pp. 153–160): IEEE Computer Society.
Feautrier, P., & Lengauer, C. (2011). Polyhedron model. In Padua, D. (Ed.), Encyclopedia of Parallel Computing, (pp. 1581–1592): Springer.
Fisher, J. (1983). Very long instruction word architectures and the ELI-512. In Proceedings of the 10th Annual International Symposium on Computer Architecture (ISCA), (pp. 140–150): IEEE.
GCC, the GNU Compiler Collection. http://gcc.gnu.org.
Gupta, S., Gupta, R., Dutt, N., Nicolau, A. (2004). SPARK: A Parallelizing Approach to the High-Level Synthesis of Digital Circuits: Kluwer Academic Publishers.
Hannig, F. (2009). Scheduling Techniques for High-Throughput Loop Accelerators. Ph.D. thesis, University of Erlangen-Nuremberg, Germany. Verlag Dr. Hut, Munich, Germany, ISBN: 978-3-86853-220-3.
Hannig, F., Dutta, H., Teich, J. (2006). Mapping a Class of Dependence Algorithms to Coarse-Grained Reconfigurable Arrays: Architectural parameters and methodology. International Journal of Embedded Systems, 2(1/2), 114–127. doi:10.1504/IJES.2006.010170.
Hannig, F., Lari, V., Boppu, S., Tanase, A., Reiche, O. (2014). Invasive Tightly-Coupled Processor Arrays: A Domain-Specific Architecture/Compiler Co-Design Approach. ACM Transactions on Embedded Computing Systems (TECS), 13(4s), 133:1–133:29. doi:10.1145/2584660.
Hannig, F., Ruckdeschel, H., Dutta, H., Teich, J. (2008). PARO: synthesis of hardware accelerators for multi-dimensional dataflow-intensive applications. In Proceedings of the 4th International Workshop on Applied Reconfigurable Computing (ARC), Lecture Notes in Computer Science (LNCS) (Vol. 4943, pp. 287–293): Springer.
Hannig, F., Ruckdeschel, H., Teich, J. (2008). The PAULA language for designing multi-dimensional dataflow-intensive applications. In Proceedings of the GI/ITG/GMM-Workshop – Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen, (pp. 129–138): Shaker.
Hannig, F., & Teich, J. (2004). Resource constrained and speculative scheduling of an algorithm class with run-time dependent conditionals. In Proceedings of the 15th IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP), (pp. 17–27): IEEE Computer Society.
Hewlett-Packard Laboratories: Vex toolchain. http://www.hpl.hp.com/downloads/vex.
ILOG (2011). CPLEX Division: ILOG CPLEX 12.1, User’s Manual.
Irigoin, F., & Triolet, R. (1988). Supernode partitioning. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), (pp. 319–329). San Diego, CA, USA: ACM.
Jainandunsing, K. (1986). Optimal partitioning scheme for wavefront/systolic array processors. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), (pp. 940–943). San Jose, CA, USA.
Kissler, D., Hannig, F., Kupriyanov, A., Teich, J. (2006). A dynamically reconfigurable weakly programmable processor array architecture template. In Proceedings of the 2nd International Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC), (pp. 31–37). Montpellier.
Kissler, D., Hannig, F., Kupriyanov, A., Teich, J. (2006). A highly parameterizable parallel processor array architecture. In Proceedings of the International Conference on Field Programmable Technology (FPT), (pp. 105–112): IEEE.
Kroupis, N., Raghavan, P., Jayapala, M., Catthoor, F., Soudris, D. (2009). Compilation technique for loop overhead minimization. In Proceedings of 12th Euromicro Conference on Digital System Design, Architectures, Methods and Tools (DSD), (pp. 419–426).
Lattner, C., & Adve, V. (2004). LLVM: a compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), (pp. 75–86).
Lee, J., Choi, K., Dutt, N. (2003). An algorithm for mapping loops onto coarse-grained reconfigurable architectures. In Proceedings of Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), (pp. 183–188): ACM.
Lengauer, C. (1993). Loop parallelization in the polytope model. In Best, E. (Ed.) Proceedings of the 4th International Conference on Concurrency Theory (CONCUR), Lecture Notes in Computer Science (LNCS) (Vol. 715, pp. 398–416). Hildesheim: Springer.
Lengauer, C., Barnett, M., Hudson III, D.G. (1991). Towards Systolizing Compilation. Distributed Computing, 5, 7–24. doi:10.1007/BF02311229.
Mei, B., Vernalde, S., Verkest, D., De Man, H., Lauwereins, R. (2002). DRESC: a retargetable compiler for coarse-grained reconfigurable architectures. In Proceedings of the IEEE International Conference on Field-Programmable Technology (FPT), (pp. 166–173).
Mei, B., Vernalde, S., Verkest, D., Lauwereins, R. (2004). Design methodology for a tightly coupled VLIW/reconfigurable matrix architecture: a case study. In Proceedings of Design, Automation and Test in Europe Conference and Exhibition (DATE) (Vol. 2, pp. 1224–1229).
Melpignano, D., Benini, L., Flamand, E., Jego, B., Lepley, T., Haugou, G., Clermidy, F., Dutoit, D. (2012). Platform 2012, a many-core computing accelerator for embedded SoCs: performance evaluation of visual analytics applications. In Proceedings of the 49th Annual Design Automation Conference (DAC), (pp. 1137–1142): ACM.
Moldovan, D. (1983). On the Design of Algorithms for VLSI Systolic Arrays. Proceedings of the IEEE, 71(1), 113–120.
Muddasani, S., Boppu, S., Hannig, F., Kuzmin, B., Lari, V., Teich, J. (2012). A prototype of an invasive tightly-coupled processor array. In Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP), (pp. 393–394): IEEE.
Munshi, A. (2012). The OpenCL specification version 1.2: Khronos OpenCL Working Group.
Rau, B.R. (1994). Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture, (pp. 63–74).
Rau, B.R., Schlansker, M.S., Tirumalai, P.P. (1992). Code Generation Schema for Modulo Scheduled Loops. SIGMICRO Newsletter, 23(1–2), 158–169.
Schmid, M., Hannig, F., Tanase, A., Teich, J. (2014). High-level synthesis revised – generation of FPGA accelerators from a domain-specific language using the polyhedron model. Parallel Computing: Accelerating Computational Science and Engineering (CSE), Advances in Parallel Computing (Vol. 25, pp. 497–506). Amsterdam: IOS Press.
Singh, H., Lee, M., Lu, G., Bagherzadeh, N., Kurdahi, F., Filho, E. (2000). MorphoSys: An integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications. IEEE Transactions on Computers, 49(5), 465–481. doi:10.1109/12.859540.
Sousa, É., Tanase, A., Hannig, F., Teich, J. (2013). Accuracy and performance analysis of harris corner computation on tightly-coupled processor arrays. In Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP), (pp. 88–95): IEEE.
Teich, J. (1993). A Compiler for Application-Specific Processor Arrays. Saarbrücken: Shaker Verlag. Ph.D. thesis, Institut für Mikroelektronik, Universität des Saarlandes, ISBN: 3-86111-701-0.
Teich, J., & Thiele, L. (1993). Partitioning of Processor Arrays: A Piecewise Regular Approach. Integration, the VLSI Journal, 14(3), 297–332. doi:10.1016/0167-9260(93)90013-3.
Teich, J., Thiele, L., Zhang, L. (1997). Partitioning Processor Arrays Under Resource Constraints. Journal of VLSI Signal Processing, 17(1), 5–20. doi:10.1023/A:1007935215591.
The Trimaran Consortium: An infrastructure for research in backend compilation and architecture exploration. http://www.trimaran.org.
Thiele, L. (1988). On the hierarchical design of VLSI processor arrays. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS) (Vol. 3, pp. 2517–2520).
Thiele, L. (1995). Resource Constrained Scheduling of Uniform Algorithms. Journal of VLSI Signal Processing, 10, 295–310.
Thiele, L., & Roychowdhury, V. (1991). Systematic design of local processor arrays for numerical algorithms. In Deprettere, E., & van der Veen, A. (Eds.) Proceedings of the International Workshop on Algorithms and Parallel VLSI Architectures 1990 (Vol. A: Tutorials, pp. 329–339). Amsterdam: Elsevier.
Uh, G.R., Wang, Y., Whalley, D., Jinturkar, S., Burns, C., Cao, V. (1999). Effective exploitation of a zero overhead loop buffer. In Proceedings of the Workshop on Languages, Compilers, and Tools for Embedded Systems (LCTES), (pp. 10–19).
Venkataramani, G., Najjar, W., Kurdahi, F., Bagherzadeh, N., Böhm, W., Hammes, J. (2003). Automatic compilation to a coarse-grained reconfigurable system-on-chip. ACM Transactions on Embedded Computing Systems (TECS), 2(4), 560–589. doi:10.1145/950162.950167.
Wolfe, M. (1996). High Performance Compilers for Parallel Computing: Addison-Wesley.

Acknowledgements

This work was supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre “Invasive Computing” (SFB/TR 89).

Author information

Authors and Affiliations

Hardware/Software Co-Design, Department of Computer Science, University of Erlangen-Nuremberg, Cauerstr. 11, 91058, Erlangen, Germany
Srinivas Boppu, Frank Hannig & Jürgen Teich

Corresponding author

Correspondence to Srinivas Boppu.

About this article

Cite this article

Boppu, S., Hannig, F. & Teich, J. Compact Code Generation for Tightly-Coupled Processor Arrays. J Sign Process Syst 77, 5–29 (2014). https://doi.org/10.1007/s11265-014-0891-2

Received: 09 September 2013
Revised: 03 January 2014
Accepted: 31 March 2014
Published: 31 May 2014
Issue Date: October 2014
DOI: https://doi.org/10.1007/s11265-014-0891-2
