Symbolic Mapping of Loop Programs onto Processor Arrays

Teich, Jürgen; Tanase, Alexandru; Hannig, Frank

doi:10.1007/s11265-014-0905-0

Published: 11 July 2014

Volume 77, pages 31–59, (2014)

Journal of Signal Processing Systems

Abstract

In this paper, we present a solution to the problem of jointly tiling and scheduling a given loop nest with uniform data dependencies symbolically. This challenge arises when the size and number of processors available for parallel loop execution are not known at compile time. Still, to avoid the overhead of dynamic (run-time) recompilation, a schedule of loop iterations must be computed and optimized statically. We show that parameterized latency-optimal schedules can be derived statically using a two-step approach: First, the iteration space of a loop program is tiled symbolically into orthotopes of parameterized extensions. Subsequently, the resulting tiled program is also scheduled symbolically, resulting in a set of latency-optimal parameterized schedule candidates. At run time, once the size of the processor array becomes known, simple comparisons of latency-determining expressions determine which of these schedules is selected dynamically, and the corresponding program configuration is executed on the resulting processor array, avoiding any further run-time optimization or expensive recompilation. Our theory of symbolic loop parallelization is applied to a number of loop programs from the domains of signal processing and linear algebra. Finally, as a proof of concept, we demonstrate our methodology on a massively parallel processor array architecture called the tightly coupled processor array (TCPA), on which applications may dynamically claim regions of processors in the context of invasive computing.
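
The run-time selection step described above can be illustrated with a minimal Python sketch. It is not taken from the article: it assumes a two-dimensional n1 × n2 iteration space with the uniform dependencies (1, 0) and (0, 1), one tile per processor, and unit-time iterations. The two latency expressions are worked out for this illustration only, and the names `tile_sizes`, `latency_a`, `latency_b` and `select_schedule` are hypothetical.

```python
def tile_sizes(n1, n2, rows, cols):
    """Tile extents (p1, p2) so that rows x cols tiles cover an n1 x n2 space."""
    # Integer ceiling division; assumes one tile per processor.
    return -(-n1 // rows), -(-n2 // cols)


def latency_a(p1, p2, rows, cols):
    """Candidate A (assumed): inside a tile, dimension 1 is scanned fastest.

    A neighbouring tile in dimension 1 may start p1 steps later, a
    neighbouring tile in dimension 2 only p1 * (p2 - 1) + 1 steps later.
    """
    return p1 * p2 + p1 * (rows - 1) + (p1 * (p2 - 1) + 1) * (cols - 1)


def latency_b(p1, p2, rows, cols):
    """Candidate B (assumed): inside a tile, dimension 2 is scanned fastest."""
    return p1 * p2 + (p2 * (p1 - 1) + 1) * (rows - 1) + p2 * (cols - 1)


def select_schedule(n1, n2, rows, cols):
    """Pick the candidate with the smaller latency once the array size is known."""
    p1, p2 = tile_sizes(n1, n2, rows, cols)
    la = latency_a(p1, p2, rows, cols)
    lb = latency_b(p1, p2, rows, cols)
    # A single comparison of two precomputed expressions; no re-optimization.
    return ("A", la) if la <= lb else ("B", lb)


if __name__ == "__main__":
    print(select_schedule(8, 8, 4, 2))
    print(select_schedule(8, 8, 2, 4))
```

In this sketch, both candidates are fixed ahead of time as closed-form expressions in the tile and array parameters, so the only work left at run time is evaluating and comparing them for the array shape that was actually claimed.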

Notes

Following the well-known LSGP (locally sequential, globally parallel) static mapping principle.
In the following, we assume w.l.o.g. that we start from a UDA, as any linear dependence algorithm may be systematically transformed into a UDA using localization, see, e.g., [42, 43].
In the following, we do not necessarily need to assume perfect tilings, as due to Eq. 2, each variable may be defined over an individual subspace of the global loop iteration space $\mathcal {I}$. Moreover, we may assume in the following w.l.o.g. that $\mathcal {I}$ is the rectangular hull of the union of all the $\mathcal {I}_i$ in Eq. 2.
We assume for simplicity that a single iteration of a given loop may be executed in a unit of time.
We assume for regularity of a schedule that each tile is scheduled equally in exactly det(P) time steps, even if the covering of the union of the iteration spaces of all G equations of a given UDA might lead to some non-perfectly filled tiles.
Loops with affine data dependencies [39, 40] or certain classes of dynamic data dependencies [14] may first be converted into this form, e.g., by localization of data dependencies [43] or by hiding data-value-dependent computations in data-dependent functions [10, 14].
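
The assumptions in the notes above (LSGP mapping, unit-time iterations, every tile scheduled in exactly det(P) time steps) can be checked by brute force on a small example. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: it uses a rectangular tiling P = diag(p1, p2) of a two-dimensional iteration space, the uniform dependencies (1, 0) and (0, 1), and one particular linear schedule chosen for this example. The names `start_time`, `is_valid` and `tile_steps` are hypothetical.

```python
from itertools import product


def start_time(i1, i2, p1, p2):
    """Start time of iteration (i1, i2) under an assumed linear schedule."""
    j1, k1 = i1 % p1, i1 // p1  # intra-tile index and tile (processor) index
    j2, k2 = i2 % p2, i2 // p2
    intra = j1 + p1 * j2  # sequential scan inside the tile
    inter = p1 * k1 + (p1 * (p2 - 1) + 1) * k2  # start offsets between tiles
    return intra + inter


def is_valid(n1, n2, p1, p2, deps=((1, 0), (0, 1))):
    """True if no processor runs two iterations at once and all deps are met."""
    busy = set()
    for i1, i2 in product(range(n1), range(n2)):
        t = start_time(i1, i2, p1, p2)
        slot = (i1 // p1, i2 // p2, t)  # (processor row, processor column, time)
        if slot in busy:
            return False
        busy.add(slot)
        for d1, d2 in deps:
            s1, s2 = i1 - d1, i2 - d2  # source iteration of the dependency
            if 0 <= s1 < n1 and 0 <= s2 < n2:
                # The source takes one time unit and must finish before t.
                if start_time(s1, s2, p1, p2) + 1 > t:
                    return False
    return True


def tile_steps(p1, p2):
    """Number of distinct time steps used by a single tile."""
    return len({start_time(j1, j2, p1, p2)
                for j1, j2 in product(range(p1), range(p2))})


if __name__ == "__main__":
    print(is_valid(8, 8, 2, 4), tile_steps(2, 4))
```

For p1 = 2 and p2 = 4, `tile_steps` counts the time steps one tile occupies, which can be compared with det(P) = p1 · p2. Passing a dependency that the assumed schedule does not respect, such as `deps=((1, -1),)`, makes `is_valid` return `False`.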

References

Baskaran, M.M., Ramanujam, J., Sadayappan, P. (2010). Automatic C-to-CUDA code generation for affine programs. In Proceedings of the 19th joint European conference on theory and practice of software, international conference on compiler construction (pp. 244–263). Paphos, Cyprus: Springer.
Bondhugula, U., Hartono, A., Ramanujam, J., Sadayappan, P. (2008). A practical automatic polyhedral parallelizer and locality optimizer. ACM SIGPLAN Notices, 43(6), 101–113.
Boppu, S., Hannig, F., Teich, J. (2013). Loop program mapping and compact code generation for programmable hardware accelerators. In Proceedings of the 24th IEEE international conference on application-specific systems, architectures and processors (ASAP) (pp. 10–17). IEEE.
Boppu, S., Hannig, F., Teich, J., Perez-Andrade, R. (2011). Towards symbolic run-time reconfiguration in tightly-coupled processor arrays. In ReConFig (pp. 392–397).
Darte, A., Khachiyan, L., Robert, Y. (1992). Linear scheduling is close to optimality. In Proceedings of the international conference on application specific array processors (ASAP) (pp. 37–46). Berkeley, CA, USA. doi:10.1109/ASAP.1992.218583.
Darte, A., & Robert, Y. (1995). Affine-by-statement scheduling of uniform and affine loop nests over parametric domains. Journal of Parallel and Distributed Computing, 29(1), 43–59.
Darte, A., Schreiber, R., Rau, B.R., Vivien, F. (2000). A constructive solution to the juggling problem in systolic array synthesis. In Proceedings of the international parallel and distributed processing symposium (IPDPS) (pp. 815–821).
Di, P., & Xue, J. (2011). Model-driven tile size selection for doacross loops on GPUs. In Proceedings of the 17th international conference on parallel processing - Volume Part II, Euro-Par (pp. 401–412). Berlin, Heidelberg: Springer-Verlag.
Di, P., Ye, D., Su, Y., Sui, Y., Xue, J. (2012). Automatic parallelization of tiled loop nests with enhanced fine-grained parallelism on GPUs. In Proceedings of the 41st international conference on parallel processing (ICPP) (pp. 350–359). Pittsburgh: IEEE Computer Society.
Hannig, F. (2009). Scheduling techniques for high-throughput loop accelerators. Dissertation, University of Erlangen-Nuremberg, Germany. Verlag Dr. Hut, Munich, Germany.
Hannig, F., Dutta, H., Teich, J. (2006). Mapping a class of dependence algorithms to coarse-grained reconfigurable arrays: Architectural parameters and methodology. International Journal of Embedded Systems, 2(1/2), 114–127. doi:10.1504/IJES.2006.010170.
Hannig, F., Lari, V., Boppu, S., Tanase, A., Reiche, O. (2014). Invasive tightly-coupled processor arrays: A domain-specific architecture/compiler co-design approach. ACM Transactions on Embedded Computing Systems (TECS), 13(4s), 133:1–133:29. doi:10.1145/2584660.
Hannig, F., Roloff, S., Snelting, G., Teich, J., Zwinkau, A. (2011). Resource-aware programming and simulation of MPSoC architectures through extension of X10. In Proceedings of the 14th international workshop on software and compilers for embedded systems (SCOPES) (pp. 48–55). ACM Press. doi:10.1145/1988932.1988941.
Hannig, F., Ruckdeschel, H., Dutta, H., Teich, J. (2008). PARO: Synthesis of hardware accelerators for multi-dimensional dataflow-intensive applications. In Proceedings of the Fourth international workshop on applied reconfigurable computing (ARC), Lecture Notes in Computer Science (LNCS) (vol. 4943, pp. 287–293). Springer.
Hannig, F., Schmid, M., Lari, V., Boppu, S., Teich, J. (2013). System integration of tightly-coupled processor arrays using reconfigurable buffer structures. In Proceedings of the ACM international conference on computing frontiers (CF) (pp. 2:1–2:4). ACM. doi:10.1145/2482767.2482770.
Hartono, A., Baskaran, M., Ramanujam, J., Sadayappan, P. (2010). DynTile: Parametric tiled loop generation for parallel execution on multicore processors. In Proceedings of the international parallel and distributed processing symposium (IPDPS) (pp. 1–12). Atlanta: IEEE.
Hartono, A., Baskaran, M.M., Bastoul, C., Cohen, A., Krishnamoorthy, S., Norris, B., Ramanujam, J., Sadayappan, P. (2009). Parametric multi-level tiling of imperfectly nested loops. In Proceedings of the 23rd international conference on supercomputing (ICS) (pp. 147–157). Yorktown Heights: ACM.
Henkel, J., Narayanan, V., Parameswaran, S., Teich, J. (2013). Run-time adaptation for highly-complex multi-core systems. In Proceedings of the IEEE international conference on hardware/software codesign and system synthesis (CODES+ISSS).
Högstedt, K., Carter, L., Ferrante, J. (1999). Selecting tile shape for minimal execution time. In Proceedings of the 11th annual ACM symposium on parallel algorithms and architectures (pp. 201–211). Saint Malo, France.
Irigoin, F., & Triolet, R. (1988). Supernode partitioning. In Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on principles of programming languages (POPL) (pp. 319–329). San Diego: ACM.
Kissler, D., Gran, D., Salcic, Z., Hannig, F., Teich, J. (2011). Scalable many-domain power gating in coarse-grained reconfigurable processor arrays. IEEE Embedded Systems Letters, 3(2), 58–61.
Kissler, D., Hannig, F., Kupriyanov, A., Teich, J. (2006). A highly parameterizable parallel processor array architecture. In Proceedings of the IEEE International Conference on Field Programmable Technology (FPT) (pp. 105–112). Bangkok: IEEE.
Lamport, L. (1974). The parallel execution of DO loops. Communications of the ACM, 17(2), 83–93. doi:10.1145/360827.360844.
Lari, V., Hannig, F., Teich, J. (2011). Distributed resource reservation in massively parallel processor arrays. In Proceedings of the international parallel and distributed processing symposium workshops (IPDPSW) (pp. 313–316). IEEE Computer Society. doi:10.1109/IPDPS.2011.157.
Lari, V., Muddasani, S., Boppu, S., Hannig, F., Schmid, M., Teich, J. (2012). Hierarchical power management for adaptive tightly-coupled processor arrays. ACM Transactions on Design Automation of Electronic Systems (TODAES), 18(1), 2:1–2:25. doi:10.1145/2390191.2390193.
Lari, V., Narovlyanskyy, A., Hannig, F., Teich, J. (2011). Decentralized dynamic resource management support for massively parallel processor arrays. In Proceedings of the 22nd IEEE International Conference on Application-specific Systems, Architectures, and Processors (ASAP) (pp. 87–94). IEEE Computer Society. doi:10.1109/ASAP.2011.6043240.
Mehrara, M., Jablin, T., Upton, D., August, D., Hazelwood, K., Mahlke, S. (2009). Compilation strategies and challenges for multicore signal processing. IEEE Signal Processing Magazine, 26(6), 55–63.
Moore, G. (1965). Cramming more components onto integrated circuits. Electronics, 38(8), 114–117.
Muchnick, S. (1997). Advanced compiler design and implementation. Morgan Kaufmann.
Radivojevic, I.P., & Brewer, F. (1995). Symbolic scheduling techniques. IEICE Transactions, 78-D(3), 224–230.
Rao, S., & Kailath, T. (1988). Regular iterative algorithms and their implementation on processor arrays. Proceedings of the IEEE, 76(3), 259–269. doi:10.1109/5.4402.
Renganarayanan, L., Kim, D., Rajopadhye, S., Strout, M.M. (2007). Parameterized tiled loops for free. In Proceedings of the Conference on Programming Language Design and Implementation (pp. 405–414). San Diego.
Renganarayanan, L., Kim, D., Strout, M.M., Rajopadhye, S. (2012). Parameterized loop tiling. ACM Transactions on Programming Languages and Systems, 34(1), 3:1–3:41.
Shang, W., & Fortes, J.A.B. (1991). Time optimal linear schedules for algorithms with uniform dependencies. IEEE Transactions on Computers, 40(6), 723–742.
Tavarageri, S., Hartono, A., Baskaran, M., Pouchet, L.N., Ramanujam, J., Sadayappan, P. (2010). Parametric tiling of affine loop nests. In Proceedings of the 15th workshop on compilers for parallel computing (CPC). Vienna, Austria.
Teich, J. (2008). Invasive algorithms and architectures. Information Technology, 50(5), 300–310.
Teich, J., Henkel, J., Herkersdorf, A., Schmitt-Landsiedel, D., Schröder-Preikschat, W., Snelting, G. (2011). Invasive computing: an overview. In Multiprocessor system-on-chip – hardware design and tool integration (pp. 241–268). Springer.
Teich, J., Tanase, A., Hannig, F. (2013). Symbolic parallelization of loop programs for massively parallel processor arrays. In Proceedings of the 24th IEEE international conference on application-specific systems, architectures and processors (ASAP) (pp. 1–9). IEEE.
Teich, J., & Thiele, L. (2002). Exact partitioning of affine dependence algorithms. In Embedded processor design challenges (pp. 135–153).
Teich, J., Thiele, L., Zhang, L. (1997). Scheduling of partitioned regular algorithms on processor arrays with constrained resources. Journal of VLSI Signal Processing, 17(1), 5–20.
Teich, J., Weichslgartner, A., Oechslein, B., Schröder-Preikschat, W. (2012). Invasive computing - concepts and overheads. In Forum on specification & design languages (FDL) (pp. 193–200).
Thiele, L. (1989). On the design of piecewise regular processor arrays. In IEEE international symposium on circuits and systems (vol. 3, pp. 2239–2242).
Thiele, L., & Roychowdhury, V. (1991). Systematic design of local processor arrays for numerical algorithms. In Proceedings of the international workshop on algorithms and parallel VLSI architectures, vol. A: Tutorials (pp. 329–339). Amsterdam: Elsevier.
Xue, J. (2000). Loop tiling for parallelism. Norwell: Kluwer Academic Publishers.
Yang, T., & Ibarra, O.H. (1995). On symbolic scheduling and parallel complexity of loops. In Proceedings IEEE symposium on parallel and distributed processing (pp. 360–367).

Acknowledgments

This work was supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre “Invasive Computing” (CRC 89).

Author information

Authors and Affiliations

University of Erlangen-Nuremberg (FAU), Cauerstr. 11, 91058, Erlangen, Germany
Jürgen Teich, Alexandru Tanase & Frank Hannig

Corresponding authors

Correspondence to Jürgen Teich or Alexandru Tanase.

About this article

Cite this article

Teich, J., Tanase, A. & Hannig, F. Symbolic Mapping of Loop Programs onto Processor Arrays. J Sign Process Syst 77, 31–59 (2014). https://doi.org/10.1007/s11265-014-0905-0

Received: 09 September 2013
Revised: 23 April 2014
Accepted: 07 May 2014
Published: 11 July 2014
Issue Date: October 2014
DOI: https://doi.org/10.1007/s11265-014-0905-0

Keywords

Symbolic Loop Parallelisation
