## Symbolic Loop Compilation for Tightly Coupled Processor Arrays

MICHAEL WITTERAUF, Friedrich-Alexander-Universität Erlangen-Nürnberg DOMINIK WALTER, Friedrich-Alexander-Universität Erlangen-Nürnberg FRANK HANNIG, Friedrich-Alexander-Universität Erlangen-Nürnberg JÜRGEN TEICH, Friedrich-Alexander-Universität Erlangen-Nürnberg

Loop compilation for Tightly Coupled Processor Arrays (TCPAs), a class of massively parallel loop accelerators, entails solving NPhard problems, yet depends on the loop bounds and number of available processing elements (PEs), parameters known only at runtime because of dynamic resource management and input sizes. Therefore, this article proposes a two-phase approach called symbolic loop compilation: At compile time, the necessary NP-complete problems are solved and the solutions compiled into a space-efficient *symbolic configuration*. At runtime, a *concrete configuration* is generated from the symbolic configuration according to the parameters values. We show that the latter phase, called *instantiation*, runs in polynomial time with its most complex step, program instantiation, not depending on the number of PEs.

As validation, we performed symbolic loop compilation on real-world loops and measured time and space requirements. Our experiments confirm that a symbolic configuration is space-efficient and suited for systems with little memory—often, a symbolic configuration is smaller than a single concrete configuration—and that program instantiation scales well with the number of PEs—for example, when instantiating a symbolic configuration of a matrix-matrix multiplication, the execution time is similar for  $4 \times 4$  and  $32 \times 32$  PEs.

# $\label{eq:ccs} Concepts: \bullet \textbf{Computer systems organization} \rightarrow \textbf{Systolic arrays}; \textbf{Embedded and cyber-physical systems}; \bullet \textbf{Software and its engineering} \rightarrow \textbf{Compilers}.$

Additional Key Words and Phrases: systolic arrays, compilation, polyhedral model

#### **ACM Reference Format:**

Michael Witterauf, Dominik Walter, Frank Hannig, and Jürgen Teich. 2018. Symbolic Loop Compilation for Tightly Coupled Processor Arrays. *ACM Trans. Arch. Code Optim.* xx, n, Article iii (March 2018), 26 pages. https://doi.org/10.1145/1122445.1122456

## **1 INTRODUCTION**

Tightly Coupled Processor Arrays (TCPAs) [9] are loop accelerators with the goal to be energy efficient by offering *comprehensive* loop acceleration, meaning they handle all parts of loop execution: computation, control, and communication. For this purpose, TCPAs have a grid of numerous, simple processing elements (PEs) to exploit task- (multiple loops in parallel), loop- (multiple parts of a loop in parallel), iteration- (multiple subsequent iterations in parallel),

© 2018 Association for Computing Machinery.

Authors' addresses: Michael Witterauf, michael.witterauf@fau.de, Friedrich-Alexander-Universität Erlangen-Nürnberg; Dominik Walter, dominik.l. walter@fau.de, Friedrich-Alexander-Universität Erlangen-Nürnberg; Frank Hannig, frank.hannig@fau.de, Friedrich-Alexander-Universität Erlangen-Nürnberg; Jürgen Teich, juergen.teich@fau.de, Friedrich-Alexander-Universität Erlangen-Nürnberg.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

and instruction-level parallelism; they have global controllers to centrally compute control flow and unburden the PEs, a circuit-switched interconnect network to locally communicate intermediate data, and I/O buffers with address generators to autonomously stream I/O data only at the borders.

Synchronization and efficient utilization of these components rely on a parallelizing compiler, in particular cycleaccurate scheduling of operations as well as high-quality register allocation and routing, all of which are NP-complete problems. Because of this tight synchronization, the components require distinct programs and configuration data for any distinct combination of loop bounds and number of allocated PEs, but these two parameters are in general unknown a priori. The number of allocated PEs is unknown because multiple applications may *dynamically* allocate regions of PEs sized in accordance with, for example, non-functional properties such as latency and energy consumption. We face a conundrum: Both programs and configuration data must be generated at runtime despite the NP-complete problems compilation involves. This renders just-in-time compilation unsuitable. Instead, we propose to split compilation into two phases, as illustrated in Figure 1:

- (1) Symbolic mapping (Section 5) is performed off-line and solves the involved NP-complete problems, generating a symbolic configuration. A symbolic configuration is a novel compact representation of configurations parameterized in the loop bounds and number of PEs. Here, we contribute the first solution to allocating and representing routes on the interconnect network (Section 5.4) despite not yet knowing the number of allocated PEs. The PE programs are represented symbolically and compactly by a polyhedral syntax tree [24].
- (2) Instantiation (Section 6) is performed once the parameter values are known and generates a concrete configuration from a symbolic configuration. In particular, we show for the first time how to instantiate PE programs from a polyhedral syntax tree and that it is possible to instantiate them in polynomial time independently of the number of PEs (Sections 6.1–6.3).

In Section 7, we present experimental results showing the time and space efficiency of this hybrid compilation approach for a number of real-world loop programs—but first, we distinguish our work from related approaches to loop parallelization.

## 2 RELATED WORK

Loop acceleration is a wide field of research; as for related work, we are interested in two aspects: How are loops mapped to architectures similar to TCPAs, and what other *symbolic* compilation approaches for parallelizing loops have been proposed?

#### 2.1 Loop mapping on similar architectures

Like TCPAs, the following classes of architectures are characterized by a grid of processing elements: massively parallel processor arrays (MPPAs), coarse-grain reconfigurable arrays (CGRAs), and systolic neural network accelerators.

MPPAs are aimed at accelerating entire applications, not only loops; consequently, the PEs of an MPPA are more complex, close to a general-purpose CPU, and usually communicate via shared memory and message passing. Because cycle-accurate synchronization is not necessary then, an MPPA allows for more traditional compilation flows, such as the Kalray MPPA-256 [6], which is programmed using OpenCL or POSIX multi-threading with automatic loop parallelization via OpenMP. Another example is the KiloCore [1], which is programmed using C++ or assembly, but requires manual parallelization. While the compilation flow proposed in this article only applies to a certain class of loops (see Section 4), they are automatically and comprehensively parallelized up to the loop level. This includes the Manuscript submitted to ACM



Fig. 1. Hybrid compilation flow for symbolically compiling loop programs for TCPAs. It is divided into symbolic mapping, performed offline, and instantiation, in general performed at runtime.

generation of programs and configuration data of tightly synchronized components dedicated to energy-efficient loop acceleration, whereas on MPPAs, the corresponding functionalities are usually part of the PE programs.

CGRAs, on the other hand, only accelerate the kernel of a loop, that is, the body of the innermost level of a loop nest. Their processing elements are usually simple, reconfigurable functional units interconnected by a circuit-switched network. We refer to the survey [23] for details on CGRA architectures. Simply put, loops are mapped onto a CGRA by embedding the data flow graph of the loop body onto the CGRA (nodes onto processing elements, edges onto interconnections). While this type of mapping allows for arbitrary loop bodies, it not only disregards the vast parallelism offered by the regularity of a loop, but leaves loop control and I/O communication to the host CPU, making acceleration less autonomous and thus potentially less useful. The generation of corresponding parallelized loop control and I/O code is usually not addressed in the respective papers. By contrast, we offer a comprehensive approach to loop acceleration.

Finally, systolic neural network accelerators have very special-purpose compilation flows, starting from a neural network description, not a loop program. They hence have a narrower scope than the compilation flow proposed in this article. (Note that our compilation flow does support neural networks if formulated as a loop program.)

#### 2.2 Other symbolic loop compilation approaches

Several works investigate the generation of loop code for a number of processors unknown at compile time. Dyn-Tile [10] and D-Tiling [12] target general-purpose multi-cores; Kong et al. [14] generate vectorized code for cores supporting SIMD processing; Konstantinidis et al. [15] generate parallelized code for GPUs. However, none of these approaches apply to TCPAs because the target architectures do neither rely on cycle-accurate synchronization of components nor require PE-specific compact programs (see Section 6.1) to save space and keep instruction memories small.

These approaches also have in common that they finish code generation at compile time. In contrast, the speculative loop optimizer Appollo offers two approaches for runtime code generation: code skeletons [11] and, more recently, code bones [4]. While these get assembled at runtime, similar to the instantiation phase described in this article, they lack the capability to represent instructions parameterized in the current iteration, which is necessary for generating modulo-scheduled programs in the compact manner required by TCPAs (see Section 5.5).

Finally, while a symbolic configuration sounds similar to a template (static content mixed with placeholders that are replaced later), a symbolic configuration represents differently *structured* configurations. Next, before describing our hybrid compilation flow in detail, we discuss the fundamentals of TCPAs and how we model loops using reduced dependence graphs.

## 3 TIGHTLY COUPLED PROCESSOR ARRAYS

TCPAs are massively parallel loop accelerators highly configurable at synthesis time, featuring a grid of simple programmable *processing elements* (PEs) [9, 13] interconnected by a circuit-switched network. As illustrated in Figure 2, the processor array is surrounded by *I/O buffers* on all four borders, responsible for decoupling data streaming from the rest of the system. Finally, in each of the corners, there is a so-called *global controller*, responsible for synchronizing the parallel execution of a loop nest on a rectangular region of PEs.

#### 3.1 Interconnect network

Instead of accessing shared memory, loop-carried dependences are communicated locally between processors to save energy. Each PE is embedded into a so-called interconnect wrapper that acts as a switch between its four neighbors and the PE. For that, each wrapper provides two independent layers, data and control, each with an individually configurable number of both input and output ports in the four cardinal directions, called wrapper ports, while the contained PE provides input and output ports called PE ports. A *port* is a triple (*location*, *orientation*, *i*) where  $i \in \mathbb{N}_0$  is an index, *location* is one of *north*, *east*, *south*, *west*, or *pe*, and *orientation* specifies whether it is a *sink* (wrapper output and PE input ports) or a *source* (wrapper input and PE output ports). As a shorthand notation, we use *location*<sup>cation</sup> for source ports and *location*<sup>cation</sup> for sink ports.

Ports of opposing orientation may be connected. Between wrappers, each wrapper port is connected to its *sibling*, which is the port of both opposing cardinal direction and opposing orientation; for example *south*<sub>0</sub><sup> $\triangleleft$ </sup> of a wrapper is connected to *north*<sub>0</sub><sup> $\triangleright$ </sup> of the wrapper to its south. At the borders of the TCPA, the wrapper ports facing the border are connected to corresponding input and output ports of the I/O buffer. Within a wrapper, two ports can be connected at runtime if they are made *adjacent* at synthesis time; two ports may be adjacent if they are of opposing orientation.

#### 3.2 Processing elements

Each PE houses two types of registers: data and control. In this paper, we focus on the data registers, which have been specially designed to exploit the regularity of loops based on the insight that each PE processes a neighborhood of loop iterations locally:

- General-purpose registers  $R_{rd} = \{ rd0, rd1, ... \}$  are conventional registers and used for storing intermediate values during a single loop iteration.
- Feedback registers  $R_{fd} = \{fd0, fd1, \ldots\}$  are rotating shift registers with reconfigurable depth used for storing intermediate values across multiple loop iterations (loop-carried dependences). Each feedback register manages separate read and write pointers that advance (and potentially wrap around) after each read or write independently.
- Input registers R<sub>id</sub> = {id0, id1,...} each allow read access to a FIFO connected to the corresponding PE input port: id0 to pe<sup>d</sup><sub>0</sub>, id1 to pe<sup>d</sup><sub>1</sub>, and so on.



Fig. 2. A Tightly Coupled Processor Array contains a grid of processing elements (PEs) interconnected using a circuit-switched network with two independent layers, data (solid lines) and control (dotted lines). Each PE is surrounded by an interconnect wrapper, whose internal architecture is shown on the right side for the data layer. Additionally, there are two kinds of peripheral components: I/O buffers, each providing one border with individual memory banks, and global controllers, one per corner, each orchestrating the parallel execution of one loop program. The figure shows two simultaneously running loop programs, indicated by the two colors.

• Output registers  $R_{od} = \{ od0, od1, ... \}$  each allow write access to the corresponding PE output port: od0 to  $pe_0^{\triangleright}$ , od1 to  $pe_1^{\triangleright}$ , and so on.

The registers are connected to a set of functional units (FUs) { $fu_1, fu_2, \ldots$ } composed at synthesis time from a variety of types such as integer and floating-point ALUs, MAC units, and so on. Each type offers its own instruction set, but most follow the three-register format; for example, add rd0 fd3 id1 means rd0  $\leftarrow$  fd3 + id1. For convenience, each functional unit is assigned a unique identifier such as alu0 or mac1.

To enable instruction-level parallelism, each FU executes an individual program, an architecture known as orthogonal instruction processing (OIP) that aims to reduce the overall program size compared to VLIW architectures [3]. In OIP, each FU contains an individual branch unit, which computes the program counter of the next instruction. Conditional branches may depend on flags, which are shared between FUs, and on control registers, allowing the synchronization of programs across FUs. The number of targets per conditional branch can be chosen at synthesis time, but is usually two. Section 6.3 introduces branch instructions in more detail.

## 3.3 Periphery

There are two types of supporting components: four I/O buffers, one on each border, and up to four global controllers, one in each corner. Each I/O buffer is flexibly connected to the respective border PEs, but the details irrelevant for this article, so we assume each border PE is connected to an individual set of memory banks via its wrapper ports (compare Figure 2). Each memory bank contains a reconfigurable *address generator* for providing affine address sequences, unburdening the connected PE. Each global controller orchestrates a single running loop application by generating control signals according to its iteration space and control flow, which are then propagated from PE to PE using the control layer of the interconnect network.

## 4 LOOPS, THE POLYHEDRAL MODEL, AND REDUCED DEPENDENCE GRAPHS

As shown in Figure 1, we assume that compilation for a TCPA starts from a loop program given as a reduced dependence graph (RDG). For transforming other loop representations, in particular sequential loops, into RDGs, we refer to the literature [7].

Reduced dependence graphs are closely tied to the polyhedral model, where the *iteration space* I of an *n*-dimensional loop nest is represented as a subset I of  $\mathbb{Z}^n$ , and each iteration is identified by a vector  $I \in I$ . In particular, our loop model is based on *piecewise linear dependence algorithms* (PLAs) [21]. A PLA is a set of equations  $S_i$  quantified over I that interrelate the *instances* of a set X of affinely indexed variables, where the instance of  $x \in X$  at index I is denoted x[I]. In this article, we assume the form

$$S_i: x_i[Q_iI - d_i] = op_i(x_{i,1}[Q_{i,1}I - d_{i,1}], x_{i,2}[Q_{i,2}I - d_{i,2}], \dots) \text{ if } I \in \mathcal{I}_i,$$
(1)

where  $x_i \in X$  is the variable with instances *defined* by  $S_i$  as the result of operation  $op_i$  and  $x_{i,j} \in X$  are the variables with instances *used* by the definition. Each variable is  $m_i$ -dimensionally indexed using an affine transform given by a matrix  $Q \in \mathbb{Z}^{m_i \times n}$  and a vector  $d \in \mathbb{Z}^{m_i}$ . Which instances of  $x_i$  are defined by an equation  $S_i$  is restricted by its *condition space*  $I_i \subset \mathbb{Z}^n$ . Note that no instance of a variable  $x \in X$  may be defined more than once by a PLA<sup>1</sup>. We assume the iteration space I of a PLA to be the union of its condition spaces.

Without loss of generality, we use a restricted form of Equation (1) that describes uniform dependence algorithms with affinely indexed I/O variables. Here, each variable  $x \in X$  satisfies exactly one of the following conditions: (1) if instances of x are only used, but none are defined (intuitively: x only appears on the right-hand side), x is an *input* variable; (2) if instances of x are only defined, but none are used (intuitively: x only appears on the left-hand side), x is an output variable; and (3) if all used instances of x are defined, x is an *internal variable*. These conditions partition X into the set of input variables  $X_{in}$ , the set of output variables  $X_{out}$ , and the set of internal variables  $X_{var}$ .<sup>2</sup> Furthermore, internal variables may only be indexed uniformly<sup>3</sup>: When defining an internal variable  $x_i \in X_{var}$ , the matrix  $Q_i$  must be the identity matrix and the vector  $d_i$  must be 0; when using an internal variable  $x_{i,j} \in X_{var}$ , then  $Q_{i,j}$  must be the identity matrix. In the latter case, we call  $d_{i,j}$  the dependence vector of the uniform dependence between  $x_{i,j}$  and  $x_i$ .

Note that input and output scalars are modeled as zero-dimensionally indexed input and output variables. If the value *c* of an input scalar  $x \in X_{in}$  is known a priori, we call it a constant and refer to it by *c*, that is  $x \coloneqq c$ .

*Running example.* Throughout this article, we use the following artificial, yet illustrative example that writes the first *N* bits of an integer scalar *in* into an array *bits*:

| for $0 \le i < N$ do                    | iteration space is union of all condition spaces             |
|-----------------------------------------|--------------------------------------------------------------|
| $S_1: x[i] = in[]$ if $i = 0$           | ▷ read scalar input <i>in</i> in first iteration             |
| $S_2: x[i] = y[i-1]$ if $i \ge 1$       | $\triangleright$ otherwise, use value shifted <i>i</i> times |
| $S_3 \colon y[i] = x[i] \text{ shr } 1$ | ▹ shift right by constant 1 for next iteration               |
| $S_4$ : $bits[i] = x[i]$ and 1          | extract bit and output into <i>bits</i>                      |
|                                         |                                                              |

Note that  $1 \in X_{in}$  is a constant used in both  $S_3$  and  $S_4$ .

PLAs prescribe neither place nor time of execution; feasible execution orders are only implied by the dependences between equations. A *reduced dependence graph* (RDG) is a directed graph (V, E) that makes these explicit. In the

Δ

<sup>&</sup>lt;sup>1</sup>This corresponds to the array single-assignment property.

<sup>&</sup>lt;sup>2</sup>Any PLA may be transformed to allow for such a partitioning.

<sup>&</sup>lt;sup>3</sup>PLAs may be systematically transformed into this form by localization and embedding [19, 20].

Manuscript submitted to ACM

following, we rely on node annotations and edge annotations to structure all relevant information. An annotation is a function f that maps either a node  $v \in V$  or an edge  $e \in E$  to a value z of arbitrary type. To disambiguate function application from annotations, we write z = f[v] to access an annotated value and  $f[v] \leftarrow z$  to annotate a value.

Given a uniform dependence algorithm with affinely indexed I/O variables, the corresponding RDG is a directed multigraph (V, E) with the following nodes:

- An operation node v for each equation  $S_i$ , annotated with the operation  $op[v] \leftarrow op_i$  and condition space  $I[v] \leftarrow I_i$ .
- An *input node* v for each x<sub>in</sub> ∈ X<sub>in</sub> that is not a constant, annotated with the variable x[v] ← x<sub>in</sub>, and an *output* node for each x<sub>out</sub> ∈ X<sub>out</sub>, annotated with the variable x[v] ← x<sub>out</sub>.
- A *constant node* v for each constant  $c \in X_{in}$ , annotated with its value  $c[v] \leftarrow c$ .

The set of edges *E* contains the following edges:

- If w is an operation node representing equation  $S_i$  and v is an operation node representing  $S_k$ , a *dependence edge*  e = (v, w) from v to w is inserted for each j where  $x_k = x_{i,j}$  (that is, for each use of  $x_k$  in  $S_i$ ), annotated with dependence vector  $d[e] \leftarrow d_{i,j}$  and operand index  $pos[e] \leftarrow j$ .
- If w is an operation node representing equation  $S_i$  and v is the input node representing  $x_{in} \in X_{in}$ , an *input edge* e = (v, w) is inserted for each j where  $x_{in} = x_{i,j}$  (that is, for each use of  $x_{in}$  in  $S_i$ ), annotated with the indexing function  $Q[e] \leftarrow Q_{i,j}$  and  $d[e] \leftarrow d_{i,j}$ , as well as the input variable  $x[e] \leftarrow x_{in}$  and operand index  $pos[e] \leftarrow j$ .
- If *w* is an operation node representing equation  $S_i$  and *v* is a constant node representing value *c*, a *constant edge* e = (v, w) is inserted for each *j* where  $c = x_{i,j}$  (that, is for each use of *c* in  $S_i$ ), annotated with the operand index  $pos[e] \leftarrow j$ .
- If v is an operation node representing  $S_i$  and w is the output node representing  $x_i \in X_{out}$ , an output edge is inserted, annotated with the indexing function  $Q[e] \leftarrow Q_i$  and  $d[e] \leftarrow d_i$ , as well as the output variable  $x[e] \leftarrow x_i$ .

Figure 3 shows the RDG of the running example, which we use as the basis in the next section to explain how to compile a loop program given as an RDG into a symbolic configuration.

## 5 SYMBOLIC MAPPING

Because the loop bounds and number of PEs are assumed to become known only at run time, mapping an RDG onto a TCPA is split into two phases (see Figure 1): Symbolic mapping, which front-loads as many steps as possible to be performed at compile time and produces a symbolic configuration, and instantiation, which generates a concrete configuration from it. In this section, we discuss the former, consisting of the following steps:

- (1) Processor allocation by tiling and symbolic modulo scheduling define a space-time mapping of the RDG, assigning each operation node a time, a PE, as well as a functional unit and corresponding instruction. Both tiling and modulo scheduling are well known, but we review them in Sections 5.1 and 5.2.
- (2) Register allocation (Section 5.3) assigns registers to edges in the RDG that store and communicate both intermediate results and I/O data.
- (3) **Routing of propagation channels** (Section 5.4) allocates routes on the interconnect network for propagating intermediate values according to the data dependences.
- (4) Generation of a polyhedral syntax tree. This data structure is a compact representation of all programs that adhere to a given space-time mapping, register allocation, and channel routing. It is independent of runtime Manuscript submitted to ACM



Fig. 3. Reduced dependence graph for the running example (bit extraction). Each rectangle is an operation node that represents an equation  $S_i$  and is annotated with its sink variable  $x_i$ , its operation  $op_i$  (in brackets) and its iteration-dependent condition space  $I_i$  (second line). Each rounded rectangle is an input or output node, and each circle is a constant node. Additionally, each operation node is annotated with the result of scheduling (Section 5.2) in the form  $fu: [\tau]$  mnemo. Each edge represents a dependence and is annotated with the corresponding indexing function (here only shown for input and output edges and if  $d \neq 0$ ) and the its operand index (here either A := 1 or B := 2). The two dashed edges  $e_1^1$  and  $e_4^2$  are the result of splitting edge  $e_4$  according to its dependence's processor displacements (Section 5.1). Finally, each edge is annotated with its allocated register(s) (Section 5.3).

parameters such as loop bounds and number of available PEs in the sense that given any valid values of these parameters, the corresponding program can be generated from it. While polyhedral syntax trees were first introduced in [24], we review them in Section 5.5.

These steps are performed at compile time because they contain NP-complete problems but heavily influence the quality (code size, number of registers, and so on) of the generated configuration and therefore benefit from not being time-constrained. The final output of the symbolic mapping phase is a symbolic configuration, which retains parameters as symbols, but from which concrete configurations can be instantiated once the parameter values become known.

#### 5.1 Allocation of processing elements by tiling

To distribute the loop iterations across PEs for execution, our compilation starts with *orthogonal tiling*: partitioning a loop's iteration space into  $t_1 \times t_2 \times \ldots t_n$  rectangular tiles, each of size  $p_1 \times p_2 \times \ldots p_n$ . Assuming that at most two tile counts  $t_r$  and  $t_c$  are not 1, the tiles are mapped one-to-one to a rectangular region  $t_r \times t_c$  of PEs on the TCPA, with each PE being assigned the execution of the iterations within one tile.

Mathematically, tiling decomposes an iteration space I into an intra-tile iteration space  $\mathcal{J}$  and inter-tile iteration space  $\mathcal{K}$  [18]:

$$I \subseteq \mathcal{J} \oplus \mathcal{K} = \left\{ (J, K)^{\mathrm{T}} \mid J \in \mathcal{J}, K \in \mathcal{K} \right\},$$

where we call  $I^* = \mathcal{J} \oplus \mathcal{K} \in \mathbb{Z}^{2n}$  the *tiled iteration space*. We assume a rectangular inter-tile iteration space, which describes the set of tiles:

$$\mathcal{K} = \left\{ \mathbf{K} = (k_1, k_2, \dots, k_n)^{\mathrm{T}} \mid 0 \le k_i < t_i \right\}$$

Likewise, the intra-tile iteration space describes the set of iterations within every tile:

$$\mathcal{J} = \left\{ J = (j_1, j_2, \dots, j_n)^{\mathrm{T}} \mid 0 \le j_i < p_i \right\}.$$

Explained informally, tiling transforms an *n*-dimensional loop into a 2n-dimensional loop where *n* dimensions iterate over the tiles and *n* dimensions iterate over the iterations within a tile. Since tiling doubles the dimension of a given loop nest, condition spaces  $\mathcal{I}[v]$  are embedded accordingly:

$$\mathcal{I}^*[v] \leftarrow \left\{ (J, K)^{\mathrm{T}} \mid PK + J \in \mathcal{I}[v] \right\}, \text{ where } P = \operatorname{diag}(p_1, \dots, p_n)$$

*Running example.* Tiling  $I = \{0 \le i < N\}$  into *t* tiles, each of width *p*, yields

$$\mathcal{I}^* = \left\{ I^* = (j,k)^{\mathrm{T}} \mid 0 \le j < p, 0 \le k < t = \lceil N/p \rceil \right\}$$

Embedding, for example,  $I[v_4] = \{i < N\}$  into  $I^*$  yields  $I^*[v_4] \leftarrow \{(j,k)^T \mid pk + j < N\}$ .

After tiling, each tile  $K \in \mathcal{K}$  is assigned to a PE of the TCPA by the space mapping<sup>4</sup>:

$$pe(K): \mathcal{K} \mapsto \mathbb{N}^2 := \Phi \cdot K, \quad \Phi \in \mathbb{Z}^{2 \times n}$$

where  $\Phi$  is the *allocation matrix*, which is chosen such that  $pe(K) = (k_r, k_c)^T$ ,  $1 \le r, c \le n$  with  $r \ne c$ . In the sequel, we use K and pe(K) interchangeably because in K all elements other than  $k_r$  and  $k_c$  are 0 by allowing only tile counts  $t_r$  and  $t_c$  to differ from 1.<sup>5</sup>

The distribution of iterations across multiple PEs may entail inter-processor communication for loop-carried dependences ( $d \neq 0$ , for example x[i] = y[i - 1]). Here, we assume that communication always takes place between neighboring PEs (including the diagonal neighbors), that is dependence vectors never "skip" a tile; in other words, we assume  $p_i \ge d_i \forall 1 \le i \le n$ . Which PEs do communicate for such a dependence  $d \neq 0$ ? Suppose tiling maps an iteration I to tile K. Then the source iteration I - d is on tiles  $\{K - \theta\}$  where  $\theta$  is from the set of tile displacements

$$\Theta(\boldsymbol{d}) := \left\{ \boldsymbol{\theta} = (\theta_1, \theta_2, \dots, \theta_n)^{\mathrm{T}} \mid \theta_i \in \{0, \operatorname{sign}(d_i)\} \right\}$$

The set of processor displacements—which processors  $\{\Phi K - \delta\}$  needs to communicate the result—is

$$\Delta(\boldsymbol{d}) \coloneqq \{\boldsymbol{\delta} \mid \boldsymbol{\delta} = \Phi\boldsymbol{\theta} \colon \forall \boldsymbol{\theta} \in \Theta(\boldsymbol{d})\}$$

*Running example.* Only  $d[e_4] = (1)$  is loop-carried (corresponding to equation x[i] = y[i-1]), resulting in the set of tile and processor displacements

$$\Theta(\boldsymbol{d}[e_4]) = \{(0), (1)\} \implies \Delta(\boldsymbol{d}[e_4]) := \{\boldsymbol{\delta}_1 = (0), \boldsymbol{\delta}_2 = (1)\}.$$

Consequently, the dependence results in communication from processor k - 1 to processor k.

Δ

#### 5.2 Scheduling of operations

Next, *modulo scheduling* [17], a software pipelining technique, is performed to obtain a schedule that specifies a) when to start each PE, b) when to start each iteration, c) when to start each operation within an iteration. We assume that tiles are executed in parallel (since each is assigned to a different PE) in a wavefront-like fashion (to not violate data dependences), but iterations within a tile are executed sequentially (since PEs execute sequential programs).

Modulo scheduling constructs a cyclic schedule with period  $\pi$ , called the *initiation interval*, consisting of a linear part  $\lambda^* = (\lambda^J, \lambda^K)$  and a start offset  $\tau[v]$  for each operation node v. Given an iteration  $I^*$ , the start time of v then is

$$t_{v}(\boldsymbol{I}^{*}) = \boldsymbol{\lambda}^{*} \cdot \boldsymbol{I}^{*} + \boldsymbol{\tau}[v],$$

<sup>5</sup>For a one-dimensional mapping, either  $k_r$  or  $k_c$  is also 0.

Δ

<sup>&</sup>lt;sup>4</sup>This tiling and PE assignment is also known as *locally sequential, globally parallel* (LSGP) in the literature [16].

that is,  $\lambda^{K}$  determines the start times of the mapped PEs,  $\lambda^{J}$  determines the start times of the iterations assigned to a PE, and  $\tau[v]$  determines the relative start time of the operation. Modulo scheduling also allocates a functional unit fu[v] to execute operation op[v] and selects the corresponding instruction mnemo[v], including its latency l[v]. Note that because we do not know the number and size of tiles in advance, we use *symbolic modulo scheduling* as introduced in [25].

*Running example.* For the sake of illustration, we assume each PE has only two functional units alu0 and alu1, both generic ALUs. Because there are three different operations, we have to settle for an initiation interval  $\pi = 2$ . A linear schedule not violating the loop-carried dependence  $d[e_4]$  is  $\lambda^* = (\pi, \pi \cdot p)$ . For each operation node, the start offset  $\tau$ , allocated functional unit *fu*, and selected instruction *mnemo* are shown in Figure 3. We assume latency  $l[v] = 1 \forall v$ .

## 5.3 Register allocation

Each edge in an RDG represents a dependence between two nodes: the sink node consumes the value produced by the source node. On TCPAs, these intermediate values are stored and communicated using different types of registers allocated per edge. For convenience, in the following we simply write "value of edge e" instead of "value produced/consumed via e" or similar. Two aspects determine the type of register that is allocated for an edge: how long a value is *alive*, and whether it requires communication. Inter-processor communication is required if an annotated dependence vector d[e] results in at least one processor displacement  $\delta \neq 0$ . To make this more explicit in the RDG, we split each edge e with  $d[e] \neq 0$  into  $|\Delta(d[e])|$  new edges  $e^i$ , all with the same annotations as e, but each additionally annotated with one of the processor displacements  $\delta[e^i] \leftarrow \delta_i \in \Delta(d[e])$ . The original edge e is removed.

*Running example.* Edge  $e_4$  corresponds to  $|\Delta(d[e_4])| = 2$  processor displacements  $\delta_1 = (0)$  and  $\delta_2 = (1)$ . Therefore, it is split into two new edges:  $e_4^1$  with  $\delta[e_4^1] \leftarrow (0)$  and  $e_4^2$  with  $\delta[e_4^2] \leftarrow (1)$  as illustrated in Figure 3.

After splitting, for each edge  $e = (v, w) \in E$ , one or more registers are allocated according to the following classification. If *e* is a dependence edge, that is if both *v* and *w* are operation nodes [8]:

- If d[e] = 0, the edge value is alive for  $l' = \tau[w] \tau[v] l[v]$  time steps within a single iteration and a k-tuple of general-purpose registers  $regs[e] \leftarrow (r_1, r_2, ..., r_k)$  with  $r_i \in R_{rd}$  is allocated, where  $k = \lceil l'/\pi \rceil$ . Multiple registers are necessary if the edge value is still alive when the next value is produced, that is if  $l' > \pi$ .
- If  $d[e] \neq 0$  and  $\delta[e] = 0$ , the edge value is alive for  $\lambda^{J}d$  iterations, but only within the current PE. A feedback register  $reg[e] \leftarrow r \in R_{fd}$  with depth  $\lambda^{J}d$  is allocated.
- If d[e] ≠ 0 and δ[e] ≠ 0, the edge value is alive across PEs and requires communication. Two registers are allocated: an input register reg<sup>read</sup>[e] ← r ∈ R<sub>id</sub> for reading on PE ΦK and an output register reg<sup>write</sup>[e] ← r ∈ R<sub>od</sub> for writing on the PE ΦK δ[e]. Additionally, a route from ΦK δ[e] to ΦK on the interconnect network is allocated (see Section 5.4).

Otherwise, if *e* is an input edge, an input register  $reg[e] \leftarrow r \in R_{id}$  is allocated, and if *e* is an output edge, an output register  $reg[e] \leftarrow r \in R_{od}$  is allocated; in both cases, additionally an I/O access mapping is generated (Section 5.7). For constant edges, no register is allocated (constants are immediate operands and do not require registers).

The conventional approach to register allocation—vertex coloring of an interference graph [5]—applies here, for which several optimal and heuristic solving methods exist.

#### 5.4 Routing of propagation channels

The communication induced by each edge e with a processor displacement  $\delta[e] \neq 0$  requires an interconnect route from an output port of PE  $\Phi K$  to an input port of PE  $\Phi K + \delta[e]$  for each PE K that produces at least one value of e. However, because the tiling is performed symbolically, we do not know which PEs will be involved; we therefore assume that the interconnect network is homogeneous—all interconnect wrappers have the same ports and adjacencies—, and utilizing this homogeneity, for each affected edge only one template route is allocated to be replicated across all PEs pairs during instantiation (see Section 6). Since this route forms a dedicated communication channel between equidistant pairs of PEs in order to propagate data across the TCPA, we call it a *propagation channel*. The set R of propagation channels is found using a reduced topology graph.

Definition 1. A reduced topology graph T is a directed graph that represents the topology of a homogeneous interconnect network. The graph contains a node v for each port in the interconnect wrapper and an edge for each possible connection between two ports weighted with their inter-processor distance. In particular, there is an edge weighted  $(0 \ 0)^T$  from each source port to each of its adjacent sink ports, as well as an edge from each wrapper sink port *location*<sup>4</sup> to its sibling port weighted by *location*:  $(1 \ 0)^T$  for *east*,  $(0 \ 1)^T$  for *south*,  $(-1 \ 0)^T$  for *west*, and  $(0 \ -1)^T$  for *north*.

For convenience, assume *T* has two polar nodes: *Source*, connected to all PE output port nodes, and *Sink*, connected to all PE input port nodes. A propagation channel  $\rho$  for an RDG edge *e* is a path  $(v_1, v_2, \ldots, v_{|\rho|})$  on *T* from *Source* to *Sink* where the sum of weights equals  $\delta[e]$ . Routing  $k = |\{e \in E : \delta[e] \neq 0\}|$  propagation channels is thus equivalent to solving the *k* node-disjoint exact-length paths problem, with the relaxation that the paths of two propagation channels  $\rho_1$  and  $\rho_2$  may share the first  $1 \leq r \leq \min(|\rho_1|, |\rho_2|)$  nodes if the two corresponding RDG edges never have values alive simultaneously. Then, the shared nodes (ports) of the reduced topology graph are never occupied at the same time. However, sharing any node (port) but not its predecessors would imply that the port has two connected sources, which is not allowed.

The resulting routes and registers—the first node in each  $\rho$  corresponds to a PE output register, the last node to a PE input register—are annotated to the corresponding edges in the RDG.

*Running example.* Only edge  $e_4^2$  has  $\delta[e_4^2] \neq 0$  and is therefore the only edge requiring a propagation channel; for example,  $\rho[e_4^2] \leftarrow (pe_0^{\triangleright}, east_0^{\triangleleft}, west_0^{\triangleright}, pe_0^{\triangleleft})$ . Consequently, the edge gets assigned registers  $reg^{write}[e_4^2] \leftarrow \text{od0}$  (from  $pe_0^{\triangleright}$ ) and  $reg^{read}[e_4^2] \leftarrow id0$  (from  $pe_0^{\triangleleft}$ ). For a more complex example, refer to Figure 4.

After tiling, scheduling, register allocation, and propagation channel routing, all necessary information is annotated to the RDG to generate a polyhedral syntax tree, a symbolic representation of the set of programs that can be generated from the RDG for any valid values of the parameters.

#### 5.5 Generation of a polyhedral syntax tree

As motivated in the introduction, the generation of concrete PE programs depends both on the loop bounds and the number of allocated PEs. However, from the annotated RDG, all instructions, their condition spaces, functional unit, and time offset have meanwhile be determined at compile time; only their composition depends on the still unknown parameter values. A *polyhedral syntax tree* [24] represents this information hierarchically such that it is equivalent to the forest of PE program syntax trees over all parameter values. Its building blocks are so-called *fragments*.

Definition 2 ([24]). A fragment F is any syntactic constituent of a program.



Fig. 4. An example reduced topology graph, where each node represents an interconnect wrapper port, and each edge represents a possible connection (weights only shown if not 0). Visualized is a solution to routing two propagation channels, one for processor displacement  $\boldsymbol{\delta} = (0 \ 1)^{\mathrm{T}}$ , depicted in blue, and one for processor displacement  $\boldsymbol{\delta} = (1 \ 1)^{\mathrm{T}}$ , depicted in red. Assuming the corresponding RDG edges never have values alive simultaneously, the first three nodes  $pe_0^{\triangleright}$ ,  $east_0^{\triangleleft}$ , and  $west_0^{\triangleright}$  can be shared. Node  $pe_1^{\triangleleft}$  cannot be shared because then the port would have two different input connections.

For example, the assembly instruction addi rd0 rd1 10 can be structured into five fragments: the mnemonic addi, the registers rd0 and rd1, the literal 10, and finally the entire instruction itself. Whether an operation is executed within an iteration  $I^*$ , for example, depends on its condition space. Fragments may thus be iteration-dependent.

## Definition 3 ([24]). A polyhedral fragment f(I) maps an iteration $I \in I$ to a fragment F.

We denote specific polyhedral fragments by *polyhedral (fragment name)*, for example polyhedral register or polyhedral instruction. Since fragments are syntactic, representation as a tree is natural.

Definition 4 (Adapted from [24]). A polyhedral syntax tree (PST) is a triple f = (I, a, G) of a condition space I, a tuple a of attributes, and a set of children G, each of which is again a polyhedral syntax tree. To avoid ambiguity, we write domain(f) for I, attr(f) for a, and children(f) for G of f, but we use the term node for both a polyhedral syntax tree itself and its children. Each node is of one of two types: If attr(f) = (F), that is if the tuple only contains a fragment F, then f is a *fragment node*. Otherwise, f is a *meta node* that stores implementation-specific syntactic meta-information. A node of a polyhedral syntax tree satisfies the following properties regarding its immediate children: All children are of the same type; if the children are fragment nodes, their condition spaces must be disjoint; if it has children, its condition space is the union of its children's condition spaces; no two children may have the same attribute values. The *evaluation* f(I) of a polyhedral syntax tree is the sub-tree where all nodes g with  $I \notin \text{domain}(g)$  are removed.

The evaluation of a PST at a concrete iteration I results in a syntax tree that represents the sequence of instructions *issued* in that iteration. Thus, concatenating the instruction sequences in execution order for all I in the iteration space I yields an unrolled assembly program for the entire loop. This observation serves as the basis for an efficient program generation algorithm in Section 6.1.

*Running example.* The PST generated from the RDG of the running example is shown below, showing the condition spaces (in brackets) after tiling only for the leaf nodes (since all other condition spaces are unions of these). Fragment nodes are set in typewriter, meta nodes in *(italic)*. For TCPAs, a PST has the following semantics: The second level represents the functional unit program fu, the third the time offset  $\tau$ , the fourth the instruction *mnemo*, and the last two the instruction's operands.



In the following, we refer to elements of attr(f) using annotation syntax, for example  $F[g_1] = rd0$ . The generation of a PST from an RDG is described in detail in previous work [24].

## 5.6 Compilation of access mappings

Still missing is information about accesses to external data, that is, to the input and output variables. To represent this information, for each input and output edge, an *access mapping a* is compiled from the edge's annotations.

Definition 5. An access mapping is a four-tuple  $a = (reg, x, \alpha, \mathcal{A})$  that maps all accesses to register *reg* in iterations  $I \in \mathcal{A}$  to variable element  $x[\alpha(I)]$ .

Given an input or output edge *e*, the indexing function is  $\alpha = (Q[e], d[e])$ , its access space  $\mathcal{A} = I[v]$  (input access) or  $\mathcal{A} = I[w]$  (output access), the register  $reg = reg^{read}[e]$  (input access) or  $reg = reg^{write}[e]$  (output access), and the associated input variable x = x[v] or output variable x = x[w]. We denote the set of all access mappings in a symbolic configuration *A*.

*Running example.* The access mappings  $A = \{a_{in}, a_{bits}\}$  are as follows:

- $a_{in}$ : Edge  $e_1$  represents accesses to the input scalar *in*. We model scalars as zero-dimensional variables, that means, the indexing function is  $\alpha = (Q[e_1] = () \in \mathbb{Z}^{0 \times n}, d[e] = () \in \mathbb{Z}^0)$ . (For these empty matrices, we assume that  $\mathbb{Z}^{0 \times n} \cdot \mathbb{Z}^n \in \mathbb{Z}^0$ .) Its access domain is  $\mathcal{A} = \mathcal{I}[v_1] = \{i = 0\}$ , the allocated input register  $reg = id\emptyset$ .
- $a_{bits}$ : Edge  $e_7$  represents accesses to the one-dimensional output variable *bits*. The indexing function is  $\alpha = (Q[e_7] = (1), d[e_7] = (0))$ . It is written to in each iteration, meaning  $\mathcal{A} = I[v_7] = \{i < N\}$ . The allocated output register is reg = od0.

### 5.7 Summary: symbolic compilation

After these steps, a symbolic configuration consisting of the following parts is obtained: A polyhedral syntax tree f to generate PE programs from, a symbolic schedule  $\lambda^*$  that schedules tiles (PEs) in parallel, iterations within a tile sequentially, a set R of propagation routes that will be replicated across the allocated TCPA region, and a set A of access mappings from which the I/O buffer and address generator configurations will be generated.

## **6** INSTANTIATION

Instantiation is the generation of a *concrete configuration* from a given *symbolic configuration* and an assignment of values to the parameters—the concrete loop bounds, the number of allocated processing elements, and the memory layouts of input and output arrays. It comprises these steps:

- Concretization substitutes all occurrences of parameters in the symbolic configuration, for example in the iteration space *I*\*, with their assigned values. Using the concretized schedule, the feedback register depths (λ<sup>J</sup>d/π) are computed.
- (2) Program instantiation is the most complex step and further divided into three sub-steps:
  - (a) **Control flow analysis** (Section 6.1) first determines a set of processor classes, that is, a partitioning of  $\mathcal{K}$  into subsets  $\mathcal{P}_{pc}$  of PEs that will execute the same PE program. For each processor class  $\mathcal{P}_{pc}$  and each functional unit *fu*, a control flow graph  $CFG_{pc,fu}$  is generated from the specialized polyhedral syntax tree  $f_{pc} = f(\mathcal{P}_{pc} \oplus \mathcal{J})$  and intra-tile schedule  $\lambda^J$ . Each node in  $CFG_{pc,fu}$  represents an atomic sequence of instructions and each edge represents a branch annotated with its *transition space*  $\mathcal{T}$  (the iterations in which the branch is taken).
  - (b) **Control signal allocation** (Section 6.2) allocates a set of binary control signals that, for each iteration  $J \in \mathcal{J}$ , encode for all transition spaces across all processor classes and control flow graphs whether  $J \in \mathcal{T}$ . The corresponding control signal values are annotated back to the control flow graph edges and used later for generating branch conditions.
  - (c) Program generation (Section 6.3) generates a PE program for each processor class P<sub>pc</sub> that contains a functional unit program for each CFG<sub>pc,fu</sub>: The instruction sequences represented by the nodes are concatenated and branch instructions generated according to the control signal values annotated to the edges.
- (3) **Periphery instantiation** generates a concrete configuration for the global controller from the allocated set of control signals (therefore also described in Section 6.2). Furthermore, for each access mapping  $a \in A$ , each involved PE is connected to a memory bank, whose address generator configuration is generated according to the indexing function  $\alpha$  (Section 6.4).
- (4) Interconnect instantiation replicates the propagation channels for all allocated interconnect wrappers and incorporates any additional routes from the I/O routing, resulting in the concrete interconnect configuration (see Figure 8 for the running example).

Instantiation is, in general, performed at runtime. Making runtime instantiation viable requires the above steps to be efficient—that is, to have low-degree polynomial time and space complexity—and scale well with an increasing number of PEs. The latter especially matters for the most complex part of instantiation: program instantiation.

*Running example.* Suppose we choose as loop bound N = 16 (extracting the first 16 bits from *in*) and allocate t = 3 PEs, resulting in a minial tile size  $p = \lceil N/t \rceil = 6$ , which is an imperfect tiling (tile k = 2 is not full since  $3 \cdot 6 = 18 > N$ ). Concretization yields the schedule  $\lambda = (12, 2)$ . The allocated feedback register fd0 has a depth of  $\lambda^J d[e_4^0]/2 = 1$ .

#### 6.1 Control flow analysis

Generating a program for each PE separately does not scale to arbitrary TCPA sizes; however, due to the regularity of loops, large subsets of PEs execute the same program. Thus, by generating each distinct PE program only once, program instantiation scales well because the number of distinct programs across PEs is bounded even if the number of PEs keeps increasing. But is it possible to determine whether multiple PEs will be configured with the same program without actually generating their programs? Yes, by using the information in the polyhedral syntax tree (PST): We consider the programs of two PEs  $K_1$  and  $K_2$  equal if *specialization* yields the same PST for both.

*Definition 6.* Given a condition space after tiling  $I^* \subseteq \mathcal{J} \oplus \mathcal{K}$ , the function

split: 
$$K, I^* \mapsto \widehat{\mathcal{J}} = \{J \in \mathcal{J} \mid (J, K)^T \in I^*\}$$

maps a tile  $K \in \mathcal{K}$  to its *intra-tile domain*  $\widehat{\mathcal{J}}$  within  $I^*$ , that is the set of iterations J within tile K that lie in  $I^*$ .

Definition 7. Given a polyhedral syntax tree f with condition space after tiling  $I^* = \text{domain}(f) \subseteq \mathcal{J} \oplus \mathcal{K}$ , specialization for a tile  $K \in \mathcal{K}$ , denoted  $f \triangleright K$ , recursively maps  $I^*$  to the intra-tile domain of K within  $I^*$ :

$$f \triangleright K \coloneqq (\operatorname{split}(K, \operatorname{domain}(f)), \operatorname{attr}(f), \{g \triangleright K \mid g \in \operatorname{children}(f)\})$$

*Running example.* Using p = 6, t = 3 and N = 16, the condition space domain $(g_1) = \{pk + j < N\}$  concretizes to  $\{6k + j < 16\}$ . For tile k = 2, the corresponding intra-tile domain (iterations *j* that satisfy  $6 \cdot 2 + j < 16$ ) is  $\{0 \le j \le 3\}$ . Consequently, the specialization  $g_1 \triangleright (2)$  yields  $g_1 \coloneqq rd0$  [ $\{0 \le j \le 3\}$ ].

We use Definition 7 to partition the inter-tile space  $\mathcal{K}$  into *processor classes*  $\mathcal{P}_{pc}$ , each of which is a set of PEs that result in the same specialized PST and thus program. But when is  $f \triangleright K_1 = f \triangleright K_2$  for two distinct PEs  $K_1$  and  $K_2$ ? Since specialization only transforms condition spaces and all condition spaces in a PST are unions of its children's condition spaces, the two specializations are equal if domain $(g \triangleright K_1) = \text{domain}(g \triangleright K_2)$  for all leaves g of f. Hence, we must investigate when  $K_1$  and  $K_2$  result in the same intra-tile domain within a condition space  $\mathcal{I}^*$ .

Definition 8. Given a condition space after tiling  $I^* \subseteq \mathcal{K} \oplus \mathcal{J}$ , the set of tiles with the same intra-tile domain  $\widehat{\mathcal{J}}$ , called its *inter-tile domain*  $\widehat{\mathcal{K}}$ , is given by the function

tiles: 
$$\widehat{\mathcal{J}}, \mathcal{I}^* \mapsto \widehat{\mathcal{K}} = \{K | K \in \mathcal{K} \land \operatorname{split}(K, \mathcal{I}^*) = \widehat{\mathcal{J}}\}.$$

We call  $\widehat{I} = (\widehat{J}, \widehat{\mathcal{K}})$  the *intra-tile pattern* of  $\widehat{\mathcal{K}}$  within  $I^*$ . We denote the set of all intra-tile patterns of  $I^*$  as  $\mathbb{I}_{I^*}$ , which always corresponds to a partitioning of  $\mathcal{K}$ .

Running example. For  $I^* = \text{domain}(g_1)$ , there are  $|\mathbb{I}_{I^*}| = 2$  intra-tile patterns:  $\widehat{I}_1 = (\{0 \le j \le 5\}, \{k < 2\})$  (full tiles) and  $\widehat{I}_2 = (\{0 \le j \le 3\}, \{k = 2\})$  (partial tile).

Definition 8 implies that if  $K_1$  and  $K_2$  are part of the same intra-tile pattern  $\widehat{I} = (\widehat{\mathcal{J}}, \widehat{\mathcal{K}})$  of a leaf g's condition space, that is if both  $K_1 \in \widehat{\mathcal{K}}$  and  $K_2 \in \widehat{\mathcal{K}}$ , specialization maps them to the same intra-tile pattern  $\widehat{\mathcal{J}}$ , making domain( $g \triangleright K_1$ ) = domain( $g \triangleright K_2$ ). Consequently, the determination of processor classes depends only on the inter-tile domains: If  $K_1$  and  $K_2$  are in the same inter-tile domain for each leaf node of f, they result in the same specialized PST. Figure 5 illustrates this relation between processor classes and intra-tile patterns, as well as why two PEs share the same program if the specialized PSTs are equal. To formalize this visual intuition, let  $\mathbb{K} := {\widehat{\mathcal{K}}_1, \widehat{\mathcal{K}}_2, \ldots}$  be the set of all inter-tile domains annotated to the leaves of f:

$$\mathbb{K} = \left\{ \widehat{\mathcal{K}} \mid \exists g \in \text{leaves}(f) \colon \exists (\widehat{\mathcal{J}}, \widehat{\mathcal{K}}') \in \mathbb{I}_{\text{domain}(g)} \colon \widehat{\mathcal{K}} = \widehat{\mathcal{K}}' \right\}.$$

For each PE  $K \in \mathcal{K}$ , there is a partitioning of  $\mathbb{K}$  into two subsets:  $\mathbb{K}_{K}^{+}$ , containing all  $\widehat{\mathcal{K}}_{i}$  such that  $K \notin \widehat{\mathcal{K}}_{i}$ , and  $\mathbb{K}_{K}^{-}$ , containing all  $\widehat{\mathcal{K}}_{i}$  such that  $K \notin \widehat{\mathcal{K}}_{i}$ . As per the reasoning above, if the partitioning is equal for two PEs  $K_{1}$  and  $K_{2}$ , they are part of the same inter-tile domain for each leaf node of f, meaning the specialized PSTs are equal as well. Given K, the set  $\mathcal{P}_{K}$  of PEs with the same partitioning is

$$\mathcal{P}_{K} = \bigcap_{\widehat{\mathcal{K}} \in \mathbb{K}_{K}^{+}} \widehat{\mathcal{K}} \cap \bigcap_{\widehat{\mathcal{K}} \in \mathbb{K}_{K}^{-}} \overline{\widehat{\mathcal{K}}}, \tag{2}$$



Fig. 5. Illustration of the relation between intra-tile patterns and processor classes. The top shows the iteration space  $I^*$  of the running example after tiling with loop bounds N = 16, tile size p = 6, and tile count t = 3. Each color represents a distinct sequence of instructions issued in that iteration  $I^* = (j, k)^T$ . Below, the intra-tile patterns of the condition spaces of the leaves  $g_1, g_2$ , and  $g_4$  of the PST f are visualized (omitting the others for brevity since they do not result in distinct intra-tile patterns). Each row visualizes one intra-tile pattern  $\widehat{I_i} = (\widehat{J_i}, \widehat{\mathcal{K}_i})$ : There is a 1 if  $(j) \in \widehat{\mathcal{J}_i} \land (k) \in \widehat{\mathcal{K}_i}$ , a 0 otherwise, and a gray box in the background if  $(k) \in \widehat{\mathcal{K}_i}$ . Reading the column of an iteration  $I^*$  as a binary number yields an encoding for the evaluation  $f(I^*)$ ; for example, for the left-most iteration, the encoding is 1010, meaning that  $g_1$  and  $g_2$  remain after evaluation, but  $g_4$  is removed. Therefore, two iterations  $I_1^*$  and  $I_2^*$  with the same encoding—visualized by color in the figure—issue the same combination of instructions because  $f(I_1^*) = f(I_2^*)$ . If two PEs have the same coloring for all iterations within its assigned tile, they thus have the same program and belong to the same processor class. Having the same coloring is equivalent to being in the same set of inter-tile domains, that is, having the same pattern of gray boxes in the figure. Hence, the examples results in two processor classes  $\mathcal{P}_1$  and  $\mathcal{P}_2$ .

making the set of processor classes  $PC = \{\mathcal{P}_K \mid K \in \mathcal{K}\}$ . By rearranging Equation (2) as in Algorithm 1, we can avoid a time complexity proportional to the number of PEs  $|\mathcal{K}|$  to obtain  $PC = \{\mathcal{P}_1, \ldots\}$  as output D using  $\mathcal{D} = \mathcal{K}$  as initial partitioning and  $C = \mathbb{K}$  as set of conditions.

| Algorithm 1 Partition $\mathcal D$ into subsets                                     | $D = \{\mathcal{D}_1, \ldots\}$ according to conditions $C = \{C_1, \ldots\}$ |
|-------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
| $D \leftarrow \{\mathcal{D}\}$                                                      | $\triangleright$ start with full space $\mathcal D$ as "partitioning"         |
| for $C \in C$ do                                                                    | ▶ for each condition, partition current set of subsets                        |
| $D' \leftarrow \emptyset$                                                           |                                                                               |
| for $\mathcal{D}' \in D$ do                                                         | ▶ partition each subset according to condition into up to two subsets         |
| if $\mathcal{D}' \cap C \neq \emptyset$ then $D' \leftarrow D$                      | $\mathcal{D}' \cup \{\mathcal{D}' \cap \mathcal{C}\}$                         |
| if $\mathcal{D}' \cap \overline{\mathcal{C}} \neq \emptyset$ then $D' \leftarrow D$ | $\mathcal{D}' \cup \{\mathcal{D}' \cap \overline{\mathcal{C}}\}$              |
| $D \leftarrow D'$                                                                   |                                                                               |

*Running example.* Assuming the same tiling parameters as before, there are three distinct inter-tile domains:  $\widehat{\mathcal{K}}_1 = \{k < 2\}, \widehat{\mathcal{K}}_2 = \{k \le 2\}, \text{ and } \widehat{\mathcal{K}}_3 = \{k = 2\}.$  Consider PE k = 0:  $\mathbb{K}_K^+ = \{\widehat{\mathcal{K}}_1, \widehat{\mathcal{K}}_2\}$  and  $\mathbb{K}_K^- = \{\widehat{\mathcal{K}}_3\}$ . According to Equation (2), all PEs in the same processor class are then

$$\widehat{\mathcal{K}}_1 \cap \widehat{\mathcal{K}}_2 \cap \widehat{\mathcal{K}}_3 = \{k < 2\} \cap \{k \le 2\} \cap \{k \ne 2\} \equiv \{k < 2\}.$$

Overall, there are two processor classes:  $PC = \{\mathcal{P}_1 = \{k < 2\}, \mathcal{P}_2 = \{k = 2\}\}$ .

Next, for each processor class  $\mathcal{P}_{pc}$ , a control flow graph for each functional unit in the specialized PST  $f_{pc} = f \triangleright K$ ,  $K \in \mathcal{P}_{pc}$  is constructed. However, generating compact programs necessitates exploiting repetition, but successive iterations may overlap in time due to software pipelining. Manuscript submitted to ACM

Δ

*Running example.* Suppose we fully unroll the functional unit program described by alu1 of  $f_1 = f \triangleright (0)$  into pseudo-assembly (time offset  $\tau$  in brackets):

[0] // j = 0 starts (prolog) nop [1] shri fd0 rd0 1 andi od1 rd0 1 // j = 1 starts --\ [2] [0] shri fd0 rd0 1 [1] // ... |-- repetition andi od1 rd0 1 // j = 4 starts | [2] [0] shri fd0 rd0 1 Γ17 [2] [0] andi od1 rd0 1 // j = 5 starts [1] shri od0 rd0 1 [2] andi od1 rd0 1 // epilog starts

Observe that iterations overlap and that between  $1 \le j \le 4$ , the same two instructions repeat. A timing-equivalent *compact* program is the following, arranged by iteration:

|     | nop  |     |     |   | ; | shri | fd0 | rd0 | 1 |   |        |   |           | // | j = 0       |
|-----|------|-----|-----|---|---|------|-----|-----|---|---|--------|---|-----------|----|-------------|
| L1: | andi | od1 | rd0 | 1 | ; | shri | fd0 | rd0 | 1 | ; | goto L | 1 | if j <= 4 | // | 1 <= j <= 4 |
|     | andi | od1 | rd0 | 1 | ; | shri | od0 | rd0 | 1 |   |        |   |           | // | j = 5       |
|     | andi | od1 | rd0 | 1 |   |      |     |     |   |   |        |   |           | // | epilog      |

The  $\pi = 2$  instructions that repeat originate from different evaluations of  $f_1((j))$ —and i from iterations  $0 \le j \le 3$ and shr i from iterations  $1 \le j \le 4$ . How can we find such a compact program *without* unrolling the program?

In a modulo-scheduled functional unit program, the next iteration is issued every  $\pi$  time steps, making the execution of the program a sequence of assembly *kernels* of length  $\pi$  *slots* each and each slot housing one instruction. (In the compact program example above, each line is a kernel.) Control flow therefore only changes each  $\pi$  time steps—each kernel is executed atomically—, an observation we use to reformulate the problem: How do we determine all kernels necessary for program instantiation and build a corresponding control flow graph?

To answer this, we first look at the formation of kernels in a functional unit program, visualized in Figure 6. Let  $f_{pc,fu}$  be the child of  $f_{pc}$  describing the functional unit program of fu. Then, at the start t(J) of each iteration J, the sequence of instructions described by the evaluation  $f_{pc,fu}(J)$  is issued. Each child g of  $f_{pc,fu}$  represents a temporal offset  $\tau[g]$  relative to t(J), from which we compute into which future iterations it overlaps: Because after  $\pi$  timesteps, the next iteration starts, any instruction at offset  $\tau$  is executed within slot ( $\tau \mod \pi$ ) of the kernel issued in iteration succ $(J, \lfloor \tau/\pi \rfloor)$ . Here, succ(J, n) is the *n*-th successor of J according to the intra-tile schedule  $\lambda^J$ . Since g is issued whenever J is in g's condition space  $\mathcal{J}_g = \text{domain}(g)$ , the instruction at  $\tau[g]$  therefore occupies the slot of the kernel issued in *all* iterations that are the  $\lfloor \tau[g]/\pi \rfloor$ -th successor of an iteration in  $\mathcal{J}_g$  (compare the red instructions in Figure 6), given by

succ 
$$(\mathcal{J}_q, n) \coloneqq \{J = \operatorname{succ}(J', n) \mid \forall J' \in \mathcal{J}_q\}, n = \lfloor \tau[g]/\pi \rfloor.$$

Using this knowledge, we *fold* all children of  $f_{pc,fu}$  into  $\pi$  slots to obtain a tranformed polyhedral syntax tree  $f'_{pc,fu}$  that does not describe the sequence of assembly instructions issued at the start of iteration J, but that instead describes the  $\pi$  slots of the *kernel* issued at the start of iteration J. The folding operation is elaborated in Algorithm 2 and visualized in Figure 6.

**Algorithm 2** Given  $\pi$ , fold PST  $f_{fu}$  representing a functional unit program into  $f'_{fu}$ 

|                                                                                                      | J                                                                                       |
|------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|
| $G \leftarrow \emptyset$                                                                             |                                                                                         |
| <b>for</b> $g \in \text{children}(f_{fu})$ <b>do</b>                                                 | $\triangleright g$ represents a time offset $\tau[g]$                                   |
| $G' \leftarrow \text{children}(\text{OFFSET}(g, \lfloor \tau[g]/\pi \rfloor))$                       |                                                                                         |
| $g' \leftarrow \left(\bigcup_{g'' \in G'} \operatorname{domain}(g''), (\tau[g] \mod \pi), G'\right)$ | ▷ new node with slot index as attribute                                                 |
| $G \leftarrow G \cup \{g'\}$                                                                         |                                                                                         |
| $f'_{fu} \leftarrow (\bigcup_{g \in G} \operatorname{domain}(g), \operatorname{attr}(f), G)$         | ▶ condition space of a parent node is union of children's                               |
| <u>j.</u>                                                                                            |                                                                                         |
| <b>function</b> OFFSET $(f, n)$                                                                      | $\triangleright$ recursively offset condition spaces in <i>f</i> by <i>n</i> iterations |
| <b>if</b> children $(f) = \emptyset$ <b>then</b>                                                     | $\triangleright$ if $f$ is a leaf                                                       |
| <b>return</b> (succ(domain( $f$ ), $n$ ), attr( $f$ ), $\emptyset$ )                                 | $\triangleright$ copy <i>f</i> , but with offset condition space                        |
| else                                                                                                 |                                                                                         |
| $G \leftarrow \emptyset$                                                                             |                                                                                         |
| for $g \in \operatorname{children}(f)$ do                                                            | offset children condition spaces recursively                                            |
| $G \leftarrow G \cup \{\text{offset}(g, n)\}$                                                        |                                                                                         |
| <b>return</b> $(\bigcup_{g \in G} \operatorname{domain}(g), \operatorname{attr}(f), G)$              | $\triangleright$ copy $f$ , but with offset children                                    |
|                                                                                                      |                                                                                         |



Fig. 6. Formation of kernels in  $f_{1,alu1}$  with  $\pi = 2$ . Each column represents an iteration  $0 \le j \le 6$  (j = 6 being a "pseudo-iteration" corresponding to the epilog) and the boxes within a column the instructions executed in the  $\pi = 2$  time steps until the next iteration starts. Above, the instructions are arranged according to the evaluation  $f_{1,alu1}((j))$ . In particular, the instruction with time offset  $\tau = 2$  overlaps with iteration j + 1, having its condition space shifted from  $\{0 \le j \le 5\}$  to  $\{1 \le j \le 6\}$  and giving rise to the epilog space  $\mathcal{E} = \{j = 6\}$ . Below, the instructions are arranged into *kernels* corresponding to the evaluation  $f'_{1,alu1}((j))$  of the folded PST (see main text). A *kernel class*  $Q_i$  is a set of instructions issuing the same kernel.

Folding introduces "iterations" J where  $J \notin \mathcal{J}$ , pseudo-iterations that contain the epilog of the pipelined functional unit program; we therefore call  $\mathcal{E}_g = \mathcal{J}_g \setminus \mathcal{J}$  the *epilog space* of g. In the following,  $\mathcal{J}_{pc,fu} = \mathcal{J} \cup \mathcal{E}_{f'_{pc,fu}}$  denotes intra-tile iteration space including the epilog space and  $\mathcal{E}$  the union of all individual epilog spaces.

*Running example.* After folding  $f_{1,alu1}$ , we obtain a transformed tree where the second level represents the slot index instead of the time offset  $\tau$ :

$$f'_{1,alu1} := alu1 (slot 0) ---- andi ---- (srcA) ---- rd0 [{1 \le j (srcB) ---- 1 [{1 \le j (slot 1) ---- shri ---- (srcA) ---- rd0 [{0 \le j (srcB) ---- 1 [{0 \le j (srcB) ---- 1 [{0 \le j < p}] (srcB) ----- 1 [{0 \le j < p}]$$

For example, the and i instruction, which was originally at *offset*  $\tau = 2$ , now resides at *slot* 0, but with an offset condition space that reflects the overlapping into the next iteration (compare Figure 6). This makes the pipelined program's epilog space  $\mathcal{E}_{1,alu1} = \{j = p\}$ .

The folded polyhedral syntax tree  $f'_{pc,fu}$  of a functional unit gives rise to a set  $QC_{pc,fu}$  of kernel classes, that is a partition of  $\mathcal{J} \cup \mathcal{E}_{f'_{pc,fu}}$  into subsets  $Q_{qc}$  of iterations in which the same kernel is issued. These are determined analogously to processor classes using Algorithm 1.

*Running example.* For  $f'_{1 \text{ alu1}}$ , we obtain 4 kernel classes:

$$QC_{1,alu1} = \{Q_1 = \{j = 0\}, Q_2 = \{1 \le j < 4\}, Q_3 = \{j = 4\}, Q_4 = \{j = 5\}\}$$

The kernel  $q_1$  of  $Q_1$ , for example, is obtained by evaluating  $f'_{1,alu1}$  at any  $J \in Q_1$ , that is, removing all nodes g where J is not in condition space domain(g):

$$f'_{1,alu1}((0)) \coloneqq \text{shift} (slot 1) - shri (srcA) - rd0$$

$$(srcB) - 1$$

Converted into an instruction sequence, this evaluation yields (slot index in brackets)

[0] nop

[1] shri rd1 rd0 1

(Note that slot 0 has no associated node in  $f'_{1,alu1}$  –we assume an implied nop in such cases.)

Finally, for each processor class and functional unit, the control flow graph  $CFG_{pc,fu}$  is constructed from the set of kernel classes  $QC_{pc,fu}$  using Algorithm 3. The algorithm inserts a node for each kernel class and an edge *e* between each pair of kernel classes  $Q_i$  and  $Q_j$  where control flow passes from  $Q_i$  to  $Q_j$ . The edge *e* is annotated with the transition space  $\mathcal{T}$ , that is, the set of iterations J where control flow passes from  $Q_i$  to  $Q_j$ . Figure 7 shows the CFG for processor class  $\mathcal{P}_1$  and functional unit alu1 of the running example.

| Algorithm 3 Generate control flow graph CFG from ke                             | ernel classes $QC$ and PST $f'$                               |
|---------------------------------------------------------------------------------|---------------------------------------------------------------|
| $V \leftarrow \{v_1, v_2, \dots, v_{ QC }\}, E \leftarrow \emptyset$            | ▷ one node for each kernel class                              |
| for $Q_i \in QC$ do                                                             |                                                               |
| $q[v_i] \leftarrow f'(Q_i), Q[v_i] \leftarrow Q_i$                              | ▷ annotate kernel and kernel class to node                    |
| for $Q_j \in QC$ do                                                             | ▷ (note: self edges represent repetition)                     |
| $\mathcal{T} \leftarrow Q_i \cap \operatorname{succ}(Q_i, 1)$                   | ▶ all iterations in $Q_i$ that have a successor in $Q_j$      |
| if $\mathcal{T} \neq \emptyset$ then                                            | ▶ if control flow passes from $Q_i$ to $Q_j$ in any iteration |
| $E \leftarrow E \cup \{e = (v_i, v_j)\}, \mathcal{T}[e] \leftarrow \mathcal{T}$ | ▹ insert edge and annotate transition space                   |
| $CFG \leftarrow (V, E)$                                                         |                                                               |

Manuscript submitted to ACM

Δ



Fig. 7. Control flow graph for processor class  $\mathcal{P}_1 = \{k < 2\}$  and functional unit alu1 in the running example. Each node represents a kernel, and each edge is annotated with its transition space  $\mathcal{T}$ , that is, in which iterations the branch represented by the edge is taken. For kernels with more than one outgoing edge, the global control signals *CS* and assignments *c* are also annotated (Section 6.2).

The constructed control flow graph represents the functional unit program to be generated: In any iteration  $J \in \mathcal{J}_{pc,fu}$ , there is exactly one node v where  $J \in Q[v]$ , representing the kernel q[v] to be issued. Among its outgoing edges, there is exactly one edge e = (v, w) where  $J \in \mathcal{T}[e]$ , otherwise it is the last node. Node w represents the kernel q[w] issued in the next iteration. Hence, the outgoing edges of v represent the set of branch targets and the transition spaces the branch conditions.<sup>6</sup> The next step is to encode these branch conditions with a set of control signals.

## 6.2 Control signal allocation

As stated above, given a  $CFG_{pc,fu} = (V, E)$ , for each iteration  $J \in \mathcal{J}_{pc,fu}$ , there is exactly one node  $v \in V$  where  $J \in Q[v]$  with deg<sup>+</sup>(v) branch targets. Only one target w will be branched to: the one where  $J \in \mathcal{T}[(v, w)]$ . Consequently, some entity—in case of a TCPA the global controller—must track the current iteration and signal to the PEs for which edges J is in  $\mathcal{T}$  in order to select to which target to branch. For that, the global controller generates (binary) control signals.

Definition 9. A control signal is a function

$$cs(J): \mathcal{J} \cup \mathcal{E} \mapsto \{0, 1, -\}$$

that maps an intra-tile iteration J to 0, 1, or don't-care (represented by –). We call a control signal *partial* if it maps at least one J to –.

For each node  $v \in V$ , there are  $N_v = \lceil \log_2 \deg^+(v) \rceil$  control signals required to encode the  $\deg^+(v)$  outgoing edges of v, each of which is given an assignment  $c[e] \leftarrow (c_1, \ldots, c_{N_v})$  with  $c_i \in \{0, 1, -\}$  such that these assignments do not overlap for any two outgoing edges. These assignments can, for example, be determined using binary decision diagrams [2]. From these assignments, for each node v, we build  $N_v$  partial control signals:

$$\forall v \in V, 1 \le i \le N_v \colon cs_{v,i}(J) \coloneqq \begin{cases} 1 & \exists e = (v, w) \in E \colon c_i[e] = 1 \land J \in \mathcal{T}[e] \\ 0 & \exists e = (v, w) \in E \colon c_i[e] = 0 \land J \in \mathcal{T}[e] \\ - & \text{else} \end{cases}$$

*Running example.* Only node  $q_2$  in Figure 7 requires a control signal because it is the only node with more than one outgoing edge. We give its two outgoing edges the assignments  $c[(q_2, q_2)] \leftarrow (0)$  and  $c[(q_2, q_3)] \leftarrow (1)$ .

Across all control flow graphs  $CFG_{pc,fu}$ , there is now a large set of partial control signals  $cs_{pc,fu,v,i}$  local to the corresponding node. Their number usually exceeds the maximum number C of control signals supported by the global

<sup>&</sup>lt;sup>6</sup>Algorithm 3 generates CFGs with arbitrary outdegree, that is, an arbitrary number of branch targets. While TCPAs support a configurable number of simultaneous branch targets, it is usually set to 2 or other low numbers. Therefore, the outdegree of the control flow graph must be reduced accordingly. For lack of space, we refer to [2].

Manuscript submitted to ACM

controller, which we call *global* control signals. We therefore combine the local, partial control signals using an interference graph where each control signal  $cs_{pc,fu,v,i}$  is a node and two control signals  $cs_1$  and  $cs_2$  interfere if

$$\exists J \in \mathcal{J} \cup \mathcal{E} \colon cs_1(J) \neq cs_2(J) \land cs_1(J) \neq - \land cs_2(J) \neq -.$$

A *C*-coloring of the vertices then corresponds to the allocation of *C* global control signals  $cs_i$ . For each  $cs_i$ , the global controller is configured with its *one domain*, that is, the subset of  $\mathcal{J}_{pc,fu}$  where  $cs_i(J) = 1$  (for all other iterations 0 is output). Additionally, each node v in each control flow graph is annotated with the  $N_v$ -tuple CS[v] of global control signals that were allocated for the node's local control signals; this information is necessary for program generation.

Running example. For node  $q_2$  in Figure 7, a global control signal  $cs_1$  is allocated and configured:

$$cs_1(\boldsymbol{J} = (j)) := \begin{cases} 1 & \text{if } j = 4 \\ 0 & \text{else} \end{cases}.$$

That means that if  $cs_1(J)$  is 0, the next kernel is again  $q_2$ , if it is 1, which is only in iteration j = 4, the next kernel is  $q_3$ .

Note that for instantiation at runtime, fast heuristics such as greedy graph coloring are sensible because graph coloring is NP-complete. The generated control flow graphs now contain all information necessary for program generation.

## 6.3 Program generation

Each processor class  $\mathcal{P}_{pc}$  requires the generation of one PE program, which is simply a container for the functional unit programs in that processor class. Generating a PE program therefore requires generating the program for each functional unit from its control flow graph  $CFG_{pc,fu}$ .

In orthogonal instruction processing (see Section 3.2), each instruction in a functional unit program is a pair of a functional instruction, specifying the operation, and a branch instruction, specifying the next instruction. While the functional instructions are explicit in the kernels q[v] annotated to the nodes in  $CFG_{pc,fu}$ , corresponding branch instructions remain to be generated. The instructions in slots  $0 \dots \pi - 2$  of a kernel q[v] are each combined with a next branch instruction because each kernel is executed atomically and in order. However, for the last instruction, the one in slot  $\pi - 1$ , a conditional multi-target branch instruction must be generated that selects the target according to the allocated global control signals  $CS[v] = \{cs_1, \dots, cs_{N_v}\}$  and the assigned values c[e] for all outgoing edges  $e \in E^+(v)$ . The subsequent concatenation of all kernels (starting with q[v] where v is the start node, that is, has no incoming edges) yields the functional unit program. Algorithm 4 summarizes these two steps.

*Running example.* We obtain the following functional unit program for functional unit alu1 in processor class  $\mathcal{P}_1$  (the comments show some applicable optional simplifications):

```
q1: nop / next
shri rd1 rd0 1 / goto q2 // can be simplified to 'next'
q2: andi od0 rd1 1 / next
shri rd1 rd0 1 / if ic0 jmp q3, q2
q3: andi od0 rd1 1 / next
shri od1 rd0 1 / goto q4 // can be simplified to 'next'
q4: andi od0 rd1 1 / next
nop / halt // can be merged into previous instruction
```

| <b>Algorithm 4</b> Generate functional unit program from annotated $CFG = (V, E)$ |  |
|-----------------------------------------------------------------------------------|--|
|-----------------------------------------------------------------------------------|--|

| program ← []                                                                                       |                                                             |
|----------------------------------------------------------------------------------------------------|-------------------------------------------------------------|
| for $v \in \text{topological\_sort}(V)$ do                                                         | ▷ begin with start node                                     |
| $targets \leftarrow \{w \mid (v, w) \in E\}$                                                       |                                                             |
| <b>for</b> <i>slot</i> := 0 <b>to</b> $\pi$ – 1 <b>do</b>                                          |                                                             |
| if $slot = \pi - 1$ then                                                                           | ▹ if it is the last instruction in the kernel               |
| $branch \leftarrow \text{MAKE}_BRANCH(v, targets)$                                                 | make an explicit branch to the next kernels                 |
| else                                                                                               |                                                             |
| $branch \leftarrow next$                                                                           | ▶ otherwise, go unconditionally to the next instruction     |
| <pre>append(program, (q[v].instruction[slot], branch))</pre>                                       |                                                             |
| <b>function</b> MAKE_BRANCH(v, targets)                                                            |                                                             |
| if $ targets  = 0$ then                                                                            |                                                             |
| return halt                                                                                        | $\triangleright$ last node $\rightarrow$ halt execution     |
| else if $ targets  = 1$ then                                                                       |                                                             |
| return goto target                                                                                 | target is the single element of targets                     |
| else if $ targets  > 1$ then                                                                       |                                                             |
| <b>return</b> if $cs_1[v], \ldots, cs_{N_n}[v]$ jmp target( <i>targets</i> , 2 <sup><i>N</i></sup> | $N_v = 1$ ,, target( <i>targets</i> , 0)                    |
|                                                                                                    | uch that control signal assignment of edge (v, w) matches i |
|                                                                                                    |                                                             |

The programs for alu0 (not shown) and alu1 together form the PE program for  $\mathcal{P}_1$ .

Δ

## 6.4 I/O access instantiation

The last step is the instantiation of configuration data for the I/O buffers from the access mappings. Recall that an access mapping  $a = (reg, x, \alpha, \mathcal{A})$  maps accesses to *reg* within iterations  $I \in \mathcal{A}$  to  $x[\alpha(I)]$ . We call a particular access in an iteration I an *access instance* a[I].

*Running example.* The access mapping  $a_{bits}$  maps all write accesses to od0 with i < N to  $bits[Q_{bits}I + d_{bits}]$ , that is, bits[i]. These write accesses correspond to instruction and i od0 rd01 in the previous example: each time od0 is written, the value is stored in bits[i].

Now, tiling distributes access instances  $a[(J, K)^T]$  in  $\mathcal{A}^* \subseteq \mathcal{K}_a \oplus \mathcal{J}$  across multiple PEs  $\mathcal{K}_a$ , possibly making them concurrent since the PEs run in parallel. Consequently, for each  $K \in \mathcal{K}_a$  of each access mapping  $a \in A$ , two parts must be instantiated:

- (1) A connection between a memory bank and the port corresponding to *reg* of PE *K*, which entails finding a free bank and a route between the bank and the PE on the interconnect.
- (2) The configuration of the allocated memory bank's address generator, consisting of the coefficients of an affine address function derived from  $\alpha$  and the memory layout of x that maps the intra-tile iteration J to an address, and the intra-tile domain  $\widehat{\mathcal{J}} = \text{split}(K, \mathcal{A}^*)$  in the access space, required to generate an enable signal.

This clearly makes the time complexity of this step linear in the number of involved PEs  $|\mathcal{K}_a|$  in the general case. However, if I/O variables are only accessed at the borders, routing becomes unnecessary—the border PEs have a direct connection to the I/O banks. This allows the compiler to generate the above two parts already at compile time. Figure 8 shows the interconnect and I/O buffers parts of the instantiated configuration for the running example. Manuscript submitted to ACM



Fig. 8. Concrete interconnect wrapper and memory bank configuration for the running example. The propagation channels are replicated across the t = 3 PEs and each is connected to memory banks according to the access mappings  $a_{in}$  (west bank) and  $a_{bits}$  (north banks). Each bank is annotated with the iterations when the enable signal is 1 (first line) and which element of the associated array is accessed (second line).

## 7 EXPERIMENTS AND DISCUSSION

In the following, we experimentally show the validity of the two claims given in the beginning: that time complexity of program instantiation does not directly depend on the number of PEs and that a symbolic configuration is a space-efficient representation.

In particular, including the running example (bit extraction), we compiled a symbolic configuration according to Section 5 for each of the following real-world loop programs: a pipelined implementation of an FIR filter (2-dimensional loop), matrix-matrix multiplication (3-dimensional loop), and a convolutional layer within a CNN (6-dimensional loop). The choice is based on the intention to cover a variety of both application domains and dimensionality. For each symbolic configuration, we instantiated, as described in Section 6, six concrete configurations corresponding to six tilings: three resulting in a 1-dimensional region of PEs (1, 16, and 32 PEs), and three resulting in a 2-dimensional region  $(4 \times 4, 8 \times 8, \text{ and } 32 \times 32)^7$ . For each instantiation run, we measured the execution time of program instantiation and the size of the concrete configuration.

Since we want to show time complexity, we are interested in normalized execution times, summarized in Table 1. Each row represents one of the examples and contains, for each of the six tilings, the execution time of program instantiation normalized to the runtime of program instantiation in the case of a single tile (1 PE). The number of resulting processor classes is given in parentheses. Clearly, the execution time of program instantiation is roughly linear in the number of processor classes and *not* in the number of PEs, as is, for example, evident for the matrix multiplication example: Program instantiation for both  $4 \times 4 = 16$  and  $32 \times 32 = 1024$  PEs takes about equally as long because both have two processor classes, meaning two programs need to be instantiated. The instantiation phase therefore effortlessly scales to the ever-increasing number of PEs. (Note that some tilings may result in more complex control flow analysis, as is for example seen in the convolution example, where instantiation for  $4 \times 4$  PEs takes about four times as long as for  $1 \times 16$  PEs, despite having only double as many processor classes. However, it is still independent of the number of PEs.)

Table 2 shows the size of each concrete configuration normalized to the size of the symbolic configuration it was instantiated from. Excluding the bit extraction example, all concrete configurations by themselves were already larger than the symbolic configuration. Consequently, runtime instantiation significantly saves memory even if only a small number of concrete configurations were necessary at runtime. For example, storing the symbolic configuration for the CNN example saves about 95 % space compared to storing both concrete configurations for  $4 \times 4$  and  $8 \times 8$  PEs.

<sup>&</sup>lt;sup>7</sup>Both loop bounds and tile sizes were chosen appropriately to result in these PE regions.

|                            |   |          | 1D (# PEs) |          |              | 2D ( $R \times C$ ) |                |  |  |
|----------------------------|---|----------|------------|----------|--------------|---------------------|----------------|--|--|
| Algorithm                  | π | 1        | 16         | 32       | $4 \times 4$ | $8 \times 8$        | $32 \times 32$ |  |  |
| Bit extract (1D)           | 2 | 1.00 (1) | 0.94 (1)   | 1.04 (1) | -            | -                   | -              |  |  |
| FIR filter (2D)            | 2 | 1.00(1)  | 2.64 (3)   | 2.72 (3) | 9.06 (9)     | 8.98 (9)            | 8.23 (9)       |  |  |
| Matrix multiplication (3D) | 2 | 1.00(1)  | 0.94 (1)   | 0.99 (1) | 1.86 (2)     | 1.74 (2)            | 1.78 (2)       |  |  |
| CNN Convolution (6D)       | 1 | 1.00 (1) | 1.77 (2)   | 1.78 (2) | 9.22 (4)     | 9.33 (4)            | 9.06 (4)       |  |  |

Table 1. Relative runtimes of program instantiation for a set of mappings of various loop programs. Each row represents a symbolic configuration of the listed algorithm and each column the instantiation for one of six tilings, three resulting in a one-dimensional and three in a two-dimensional PE allocation. For each instantiation, the runtime relative to the runtime of the first column is listed; the number of processor classes is listed in parentheses. The table clearly shows that the time complexity of instantiation is roughly proportional to the number of processor classes and *not* to the number of processing elements.

|                            |   |      | 1D (# PEs) |      |              | $2D(R \times C)$ |                |
|----------------------------|---|------|------------|------|--------------|------------------|----------------|
| Algorithm                  | π | 1    | 16         | 32   | $4 \times 4$ | $8 \times 8$     | $32 \times 32$ |
| Bit extract (1D)           | 2 | 0.77 | 4.46       | 8.20 | -            | -                | _              |
| FIR filter (2D)            | 2 | 1.17 | 3.86       | 5.85 | 5.14         | 7.18             | 21.36          |
| Matrix multiplication (3D) | 2 | 1.00 | 3.65       | 6.48 | 4.88         | 15.12            | 193.73         |
| CNN Convolution (6D)       | 1 | 1.05 | 3.73       | 6.53 | 5.83         | 16.25            | 299.41         |

Table 2. The size of each generated concrete configuration normalized to the size of the symbolic configuration it was instantiated from.

## 7.1 Practical insights

As proof of concept, we implemented the instantiation phase as described in Section 6 using isl, the integer set library [22], to represent parametric iteration and condition spaces. This implementation—which was also used for the experiments in the previous section—, is functionally complete, but was not implemented with optimizing performance in mind. Instead, it uses a high level of abstraction (which includes isl) to aid in verifying correctness of the approach and in analyzing intermediate artifacts. For employment in an embedded system for instantiation at runtime, a more optimized implementation is desirable; in particular, profiling showed that a significant part of the execution time was spent during isl calls. A dedicated, simplified, non-parametric representation of iteration and condition spaces may thus lead to a considerate speed-up. Furthermore, if the host platform supports multi-threading, the execution time of program instantiation can be significantly improved by instantiating each processor class in parallel.

On a related note, we noticed that control flow graph generation is often the most computationally expensive part of program instantiation, caused by its quadratic time complexity in the number of program blocks. In a more practical implementation, this complexity should be improved by decreasing the number of successor candidates considered for each program block. This might be achieved by bringing the program blocks in a clever order according to the loop schedule.

## 8 CONCLUSION

In this article, we presented *symbolic loop compilation*, a two-phase approach that decouples the solving of the NPcomplete mapping problem from the actual generation of configuration data, which depends on parameters only known Manuscript submitted to ACM at runtime (the loop bounds and number of available PEs). The first phase, *symbolic mapping*, generates a *symbolic configuration* that represents the set of concrete configurations over all combinations of loop bounds and numbers PEs. The second phase, *instantiation*, generates a concrete configuration from the symbolic configuration according to the concrete values of the parameters once they become known.

We show that this is a viable approach for dynamically generating configurations because not only does instantiation run in polynomial time, but a symbolic configuration is a very space-efficient representation. In particular, *program instantiation*, the most complex part of the instantiation phase, does not directly depend on the number of PEs, thus scaling to arbitrary sizes of TCPAs.

## REFERENCES

- B. Bohnenstiehl, A. Stillmaker, J. Pimentel, T. Andreas, B. Liu, A. Tran, E. Adeagbo, and B. Baas. 2017. KiloCore: A Fine-Grained 1,000-Processor Array for Task Parallel Applications. *IEEE Micro* 37, 2 (3 2017), 63–69.
- [2] Srinivas Boppu. 2015. Code Generation for Tightly Coupled Processor Arrays. Ph.D. Dissertation. Friedrich-Alexander-Universität Erlangen-Nürnberg.
- [3] M. Brand, F. Hannig, A. Tanase, and J. Teich. 2017. Orthogonal Instruction Processing: An Alternative to Lightweight VLIW Processors. In IEEE 11th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC). 5–12. https://doi.org/10.1109/MCSoC.2017.17
- [4] Juan Manuel Martinez Caamaño, Willy Wolff, and Philippe Clauss. 2016. Code bones: Fast and flexible code generation for dynamic and speculative polyhedral optimization. In Proceedings of the 22nd European Conference on Parallel Processing (Euro-Par). Springer, 225–237.
- [5] Gregory J Chaitin, Marc A Auslander, Ashok K Chandra, John Cocke, Martin E Hopkins, and Peter W Markstein. 1981. Register allocation via coloring. Computer languages 6, 1 (1981), 47–57.
- [6] Benoit Dupont De Dinechin, Renaud Ayrignac, Pierre-Edouard Beaucamps, Patrice Couvert, Benoit Ganne, Pierre Guironnet de Massas, François Jacquet, Samuel Jones, Nicolas Morey Chaisemartin, Frédéric Riss, et al. 2013. A clustered manycore processor architecture for embedded and accelerated applications. In 2013 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–6.
- [7] Paul Feautrier. 1991. Dataflow analysis of array and scalar references. International Journal of Parallel Programming 20, 1 (1991), 23-53.
- [8] Frank Hannig. 2009. Scheduling techniques for high-throughput loop accelerators. Verlag Dr. Hut, Munich.
- [9] Frank Hannig, Vahid Lari, Srinivas Boppu, Alexandru Tanase, and Oliver Reiche. 2014. Invasive tightly-coupled processor arrays: A domain-specific architecture/compiler co-design approach. ACM Transactions on Embedded Computing Systems (TECS) 13, 4s (2014), 133.
- [10] Albert Hartono, Muthu Manikandan Baskaran, J. Ramanujam, and P. Sadayappan. 2010. DynTile: Parametric tiled loop generation for parallel execution on multicore processors. In *IEEE International Symposium on Parallel Distributed Processing (IPDPS)*. 1–12. https://doi.org/10.1109/IPDPS.2010.5470459
- [11] Alexandra Jimborean, Philippe Clauss, Jean-François Dollinger, Vincent Loechner, and Juan Manuel Martinez Caamaño. 2014. Dynamic and Speculative Polyhedral Parallelization Using Compiler-Generated Skeletons. *International Journal of Parallel Programming* 42, 4 (01 Aug 2014), 529–545. https://doi.org/10.1007/s10766-013-0259-4
- [12] DaeGon Kim and Sanjay Rajopadhye. 2009. Efficient tiled loop generation: D-tiling. In Languages and compilers for parallel computing (LCPC). Springer, 293–307.
- [13] Dmitrij Kissler, Frank Hannig, Alexey Kupriyanov, and Jürgen Teich. 2006. A Dynamically Reconfigurable Weakly Programmable Processor Array Architecture Template. In Proceedings of the 2nd International Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC). 31–37.
- [14] Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-Noël Pouchet, and Ponnuswamy Sadayappan. 2013. When polyhedral transformations meet SIMD code generation. In ACM Sigplan Notices, Vol. 48. ACM, 127–138.
- [15] Athanasios Konstantinidis, Paul Kelly, J Ramanujam, and Ponnuswamy Sadayappan. 2014. Parametric GPU Code Generation for Affine Loop Programs. In Languages and compilers for parallel computing (LCPC), Vol. 8664. Springer, 136–151. https://doi.org/10.1007/978-3-319-09967-5\_8
- [16] Harry W Nelis and Ed F Deprettere. 1988. Automatic design and partitioning of systolic/wavefront arrays for VLSI. Circuits, Systems and Signal Processing 7, 2 (1988), 235–252.
- [17] B. R. Rau and C. D. Glaeser. 1981. Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing. In Proceedings of the 14th Annual Workshop on Microprogramming. IEEE, 183–198.
- [18] Jürgen Teich. 1993. A compiler for application specific processor arrays. Shaker, Aachen. ISBN 9783861117018.
- [19] Jürgen Teich and Lothar Thiele. 1991. Control generation in the design of processor arrays. Journal of VLSI signal processing systems for signal, image and video technology 3, 1-2 (1991), 77–92.
- [20] Lothar Thiele. 1989. On the design of piecewise regular processor arrays. In IEEE International Symposium on Circuits and Systems. 2239–2242 vol.3. https://doi.org/10.1109/ISCAS.1989.100823

- [21] Lothar Thiele and Vwani Roychowdhury. 1991. Systematic design of local processor arrays for numerical algorithms. Algorithms and Parallel VLSI Architectures, Vol. A: Tutorials, Elsevier, Amsterdam (1991), 329–339.
- [22] Sven Verdoolaege. 2010. isl: An integer set library for the polyhedral model. In International Congress on Mathematical Software. Springer, 299-302.
- [23] Mark Wijtvliet, Luc Waeijen, and Henk Corporaal. 2016. Coarse grained reconfigurable architectures in the past 25 years: Overview and classification. In 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS). IEEE, 235–244.
- [24] Michael Witterauf, Frank Hannig, and Jürgen Teich. 2019. Polyhedral Fragments: An Efficient Representation for Symbolically Generating Code for Processor Arrays. In Proceedings of the 17th ACM-IEEE International Conference on Formal Methods and Models for System Design (MEMOCODE '19). Association for Computing Machinery, New York, NY, USA, Article Article 8, 10 pages. https://doi.org/10.1145/3359986.3361205
- [25] M. Witterauf, A. Tanase, F. Hannig, and J. Teich. 2016. Modulo scheduling of symbolically tiled loops for tightly coupled processor arrays. In 2016 IEEE 27th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 58–66. https://doi.org/10.1109/ASAP.2016.7760773