# Simulation Environment for Link Energy Estimation in Networks-on-Chip with Virtual Channels

Jan Moritz Joseph<sup>a,1</sup>, Lennart Bamberg<sup>b,1</sup>, Imad Hajjar<sup>a</sup>, Robert Schmidt<sup>b</sup>, Thilo Pionteck<sup>a</sup>, Alberto García-Ortiz<sup>b</sup>

> <sup>a</sup> Otto-von-Guericke-Universität Magdeburg Institut für Informations- und Kommunikationstechnik (IIKT) 39106 Magdeburg, Germany <sup>b</sup> University of Bremen Institute of Electrodynamics and Microelectronics (ITEM.ids) 28359 Bremen, Germany

# Abstract

Network-on-chip (NoC) is the most promising design paradigm for the interconnect architecture of a multiprocessor system-on-chip (MPSoC). On the downside, a NoC has a significant impact on the overall energy consumption of the system. NoC simulators are highly relevant for design space exploration even at an early stage. Since links in NoCs consume up to 50% of the energy, a realistic estimate of the link energy consumption in NoC simulators is important. This work presents a simulation environment which, for the first time, implements a technique to precisely estimate the data-dependent link energy consumption in NoCs with virtual channels. Our model works at a high level of abstraction, making it feasible to estimate the energy requirements at an early design stage. Additionally, it enables the fast evaluation and early exploration of low-power coding techniques. The presented model is applicable to 2D and 3D NoCs. A case study for an image processing application shows that the current link model underestimates the link energy consumption by up to a factor of four. In contrast, the technique presented in this paper estimates the energy quantities precisely, with an error below 1% compared to results obtained by precise but computationally expensive bit-level simulation.

*Keywords:* Through Silicon Vias, 3D Integration, Low Power, Coding, Networks-on-Chip

Email addresses: jan.joseph@ovgu.de (Jan Moritz Joseph), bamberg@item.uni-bremen.de (Lennart Bamberg), imad.hajjar@st.ovgu.de (Imad Hajjar), rschmidt@item.uni-bremen.de (Robert Schmidt), thilo.pionteck@ovgu.de (Thilo Pionteck), agarcia@item.uni-bremen.de (Alberto García-Ortiz)

URL: www.iikt.ovgu.de (Jan Moritz Joseph), www.item.uni-bremen.de (Lennart Bamberg), www.iikt.ovgu.de (Imad Hajjar), www.item.uni-bremen.de (Robert Schmidt), www.iikt.ovgu.de (Thilo Pionteck), www.item.uni-bremen.de (Alberto García-Ortiz)

<sup>1</sup> Authors contributed equally.

## 1. Introduction

Packet-switching-based networks-on-chip (NoCs) are the most promising interconnect architecture for multi-core systems-on-chip (SoCs) [1]. However, the link energy consumption of a NoC has increased substantially with the ongoing scaling of technology [2]. A model to estimate the energy requirements at high abstraction levels is important to determine the required energy budget for the links and to evaluate the effect of high-level energy optimization approaches. Lower-level models are not well suited due to their long runtime. Furthermore, high-level optimization approaches have a significantly higher impact on the energy requirements than low-level ones [3].

However, the current link model [2, 4] is only valid for a sequential transmission of single packets, which is not the case when virtual channels are used. Virtual channels are a commonly implemented technique to increase the NoC performance [1], but they also lead to drastic increases in switching activities, as shown in this work. Since switching probabilities are directly proportional to energy consumption, neglecting the effect of virtual channels leads to a heavy underestimation of energy requirements.

Also, virtual channels have a major impact on coding, which is one of the most promising low-power approaches. Since the overhead of a coding approach must not cancel out the savings in the links, encoding is typically performed in an end-to-end manner [1, 2], in which data is encoded and decoded at the network interfaces of the source and destination, respectively, and not on link level. However, on an end-to-end basis the effect of virtual channels cannot yet be considered, and the energy reduction of most techniques vanishes when virtual channels are used. Hence, the majority of techniques are explicitly designed for NoCs without virtual channels [2, 4], even though virtual channels are commonly implemented in NoCs. The only technique which is applicable to virtual-channel-based NoCs is probability coding [5]. However, this method was only designed for the uncommon scenario that links permanently make use of non-prioritized virtual channels. Thus, no optimal coding approach can be identified yet, due to the lack of a pattern-dependent model for the energy consumption of links with virtual channels. Furthermore, such a model is required to design efficient coding approaches, as outlined in this work.

This work is an invited extension of Ref. [6]. In this initial work, *Bamberg* et al. proposed the first accurate, coding-aware, high-level model for the energy consumption of 2D and 3D links with virtual channels. In detail, the existing link energy model for 2D links [2, 4] and 3D links [7] is extended such that it enables estimating the switching characteristics not only for a sequential transmission of data packets (no virtual channels), but also for multiplexing individual flits of multiple packets (virtual channels). While the existing energy model likely results in implementing inefficient coding techniques and underestimates the link energy requirements by up to a factor of four, the model introduced precisely predicts the energy requirements of 2D and 3D links, as well as coding efficiencies. We achieve a power estimation error below 1% compared to results obtained by precise but computationally expensive bit-level simulation. Thereby, the proposed model reveals new options to reduce the link energy consumption using coding.

The present work provides the following major extensions over the initial conference publication [6]: (a) We make the aforementioned models available and usable for NoC simulations. Precise estimation of the energy budget is essential because links account for a large portion of the overall NoC energy consumption. Evidence for this is a 17% share in the Teraflop router [8] and a share of up to 53.9% in the NOSTRUM 8×8-NoC [9, Table 1, p. 4]. Our NoC simulator comprises a NoC model and an application model besides the proposed energy models. The simulator is implemented in C++ using the SystemC class library [10], including models on both the transaction level and the cycle-accurate level, while the energy models are implemented in Python. Our approach enables the separate consideration of TSV arrays and metal wires in NoCs, and it accounts for the application-dependent switching activity of individual links. This is orthogonal to the energy estimations provided by competing simulators, which do not implement such energy and application models. (b) We introduce the innovative feature of transmission matrices that are recorded during simulation. Thereby, parts of the architectural design can be modified post-simulation. For instance, various coding methods can be evaluated with only a single simulation run. (c) Our models are easily configurable using XML files, including many properties of the architecture of individual routers. This approach facilitates rapid prototyping during design space exploration. (d) We enable easy use of the simulator by providing a single point of entry and easily configurable simulation scripts. Summarizing the novelty of this paper: We improve state-of-the-art energy estimation for NoCs such as [11] by accounting for capacitance, separation between metal wires and TSV arrays, and coding. We demonstrate the applicability of our approach by comparing virtual-channel-based NoCs and NoCs without virtual channels in a realistic case study, and we show power and performance as well as the effects of coding.

The remainder of this work is structured as follows: Sec. 2 discusses related work. Sec. 3 presents general formulas to determine the link energy consumption employing the switching/bit properties. Sec. 4 reviews the concept of virtual channels and shows that they have a huge impact on the switching properties. Afterwards, our model to estimate the switching characteristics is presented in Sec. 5. The model is integrated into a NoC simulator, which is presented in Sec. 6. Further, we validate the model by means of simulation results in Sec. 7. In Sec. 8, a case study for a heterogeneous 3D vision SoC is presented, which shows that our model has to be used for a precise energy estimation as well as for the design of efficient coding architectures. Finally, a conclusion is drawn.

## 2. Related Work

Due to the increasing need for power-efficient systems, low-power link encoding has drawn a lot of attention in recent years (e.g. [2, 4, 12, 13]). To estimate the pattern-dependent link energy consumption, and thereby the coding efficiency, as a function of the network architecture and the application-specific data flow, NoC simulators are commonly used [14]. However, current link energy models do not consider the effect of virtual channels, even though they are usually implemented to increase the NoC performance. This leads to a lack of efficient coding approaches applicable to virtual-channel-based NoCs.

Generally, there exists a wide range of software for the simulation of NoCs at high abstraction levels, which can be divided into two classes: simulators which target a specific architecture or use case, and simulators which are universally applicable. Simulators of the first class are usually proposed as a supplement to novel NoC hardware designs. One example is [15], which is a simulator written for a specific 3D router architecture; in [16], a router architecture which accelerates data streams in NoCs is proposed together with a simulator implementing this specific router model. In [17], a NoC router is extended over multiple layers in a 3D chip and an application-specific simulator is proposed. However, all the aforementioned examples have limited applicability.

In the second category, simulators target as many architectures, designs and use cases as possible. There are two well-maintained and well-known simulators. Noxim [14] simulates NoCs at the cycle-accurate abstraction level and is implemented in C++ using SystemC. To cover multiple use cases, many parameters of the architecture can be freely defined (e.g. buffer depth, packet size, routing algorithm, network topology). The simulator returns performance and energy figures for the NoC. The power is calculated using a simple, cycle-accurate, event-based model. In Noxim, a fixed energy consumption is assigned to events. These are counted and their energy consumption is accumulated to a static value. Among others, one event is the transmission of a flit over a link. However, we will show in this work that the energy consumption of a link transmitting a flit strongly depends on the switching allocation associated with the virtual channels, and not only on the accumulated number of transmitted flits. The simulator BookSim 2.0 [18] is similar; it is also cycle-accurate, implemented in C++, and offers similar parameters to set. It does not offer detailed power analysis and only provides network statistics.

There are many works on power estimation in NoC simulations. ORION 3.0 [11] provides the most detailed models for the energy consumption of the router's components. Therefore, it is currently the state of the art for estimating router power. It also includes a basic power model for links, which is introduced in Eqs. 10 and 11 in Ref. [11]. It does not account for the effects of pattern-dependent coupling switching, which leads to a modeling error of up to 79.77% [19]. The authors themselves are aware of this limitation: They argue that their approach does not model power on flit level and that they "do not consider bit encodings in a flit, which can lead to significant errors in dynamic power estimation" [20, Sec. IV-C, p. 9]. Furthermore, our model allows analyzing any geometrical TSV or metal wire dimensions and array sizes and thus can be adapted to all technologies, including heterogeneous 3D integration. This is not accounted for by ORION 3.0. Within Ref. [20], the authors propose to use the rather old NoC simulator GARNET [21] from 2009 for full-system NoC simulation to overcome the limitations of ORION 3.0. This would allow modeling flit-level power. We also extend this approach, because the evaluation of different codings is possible post-simulation without the necessity of a slow full-system simulation. To summarize, this manuscript is the first work to account for physical effects in links at the flit level, including pattern-dependent coupling and coding.

To summarize, the proposed NoC simulator extends the existing solutions. It pursues a general approach and, as a unique feature, implements the proposed energy model to estimate the dynamic link energy. Furthermore, it is based on well-defined models, as proposed in [22], and offers a structured design process for 3D NoCs targeting heterogeneous 3D integrated circuits (ICs), as contributed in [23].

## 3. Data dependent link energy consumption

The mean, pattern dependent, energy consumption of an N-bit link can be precisely estimated using [7, 24]:

$$E = \frac{V_{dd}^2}{2} \left( \sum_{i=1}^{N} \mathbf{E} \{ \Delta b_i^2 \} C_{i,i} + \sum_{\substack{i,j=1 \\ i \neq j}}^{N} \mathbf{E} \{ \Delta b_i^2 - \Delta b_i \Delta b_j \} C_{i,j} \right). \tag{1}$$

Here,  $C_{i,i}$  is the ground capacitance of interconnect *i*, and  $C_{i,j}$  is the coupling capacitance between the interconnects *i* and *j*. Furthermore,  $\mathbf{E}$  is the expectation operator and  $\Delta b_i$  represents the switching of bit *i*, which is either 1 (0 to 1 transition), 0 (no transition), or -1 (1 to 0 transition). Thus,  $\mathbf{E}\{\Delta b_i^2\}$  is the self switching probability of interconnect *i*.

While the energy consumption due to the ground capacitance of an interconnect *i* is determined only by its self switching $\Delta b_i$, the energy consumption associated with a coupling capacitance $C_{i,j}$ is additionally affected by the switching $\Delta b_j$ on interconnect *j* (correlated switching). Compared to the scenario where only interconnect *i* toggles ($\Delta b_j = 0$), the contribution of $C_{i,j}$ to the energy consumption is doubled when interconnect *j* toggles in the opposite direction ($\Delta b_i \Delta b_j = -1$) and vanishes if it toggles in the same direction ($\Delta b_i \Delta b_j = 1$).
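The three cases above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `coupling_factor` is hypothetical and simply evaluates the weight $\Delta b_i^2 - \Delta b_i \Delta b_j$ that Eq. 1 assigns to $C_{i,j}$ for a single transition.

```python
def coupling_factor(db_i: int, db_j: int) -> int:
    """Weight of C_ij in Eq. 1 for one transition: db_i^2 - db_i * db_j."""
    for db in (db_i, db_j):
        if db not in (-1, 0, 1):
            raise ValueError("bit transitions must be -1, 0 or 1")
    return db_i * db_i - db_i * db_j


alone = coupling_factor(1, 0)      # only interconnect i toggles (reference case)
opposite = coupling_factor(1, -1)  # j toggles in the opposite direction
same = coupling_factor(1, 1)       # j toggles in the same direction
print(alone, opposite, same)       # prints: 1 2 0
```

Note that the weight is defined per ordered pair (i, j); the contribution seen from interconnect *j* is obtained by swapping the arguments.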

In modern links, the coupling capacitances dominate over the ground capacitances [25, 4]. Therefore, recent low-power data encoding approaches aim for an increase in the correlated switching ($\mathbf{E}\{\Delta b_i \Delta b_j\}$) and a decrease in the self switching ($\mathbf{E}\{\Delta b_i^2\}$). However, for 3D links composed of TSVs, the logical-bit probabilities also affect the energy consumption [7]. A TSV, its oxide liner and the conductive substrate form a metal-oxide-semiconductor (MOS) junction. Thus, due to the MOS effect, each TSV is surrounded by a depletion region, which is an insulating region within the conductive substrate. For the typical acceptor-doped (p-doped) substrate, an increase of the 1-bit probability ($\mathbf{E}\{b_i\}$) on a TSV enlarges the width of its surrounding depletion region. This further isolates the TSV from the conductive substrate, resulting in up to 40% lower capacitance values and consequently in reduced energy needs [7]. The exact relation between bit probability and capacitance is complex and consequently not suitable for high-level models. Thus, a simple linear model is used to estimate the capacitance values $C_{i,j}$ as a function of the bit probabilities [7]:

$$C_{i,j} = C_{\mathrm{T}0,i,j} + \Delta C_{\mathrm{T},i,j} \left(\mathbf{E}\{b_i\} + \mathbf{E}\{b_j\}\right), \tag{2}$$

where $C_{\mathrm{T}0,i,j}$ is the capacitance value for all 1-bit probabilities equal to zero, and $\Delta C_{\mathrm{T},i,j}$ is the derivative of the capacitance value with respect to the bit probability $\mathbf{E}\{b_i\}$ or $\mathbf{E}\{b_j\}$.

In summary, the energy consumption of 3D links, normalized by the technology factor $V_{dd}^2/2$, is estimated via:

$$\begin{aligned} E_{n,3D} = {} & \sum_{i} \mathbf{E} \{ \Delta b_i^2 \} \left( C_{\mathrm{T}0,i,i} + \Delta C_{\mathrm{T},i,i} \, 2 \mathbf{E} \{ b_i \} \right) \\ & + \sum_{i \neq j} \mathbf{E} \{ \Delta b_i^2 - \Delta b_i \Delta b_j \} \left( C_{\mathrm{T}0,i,j} + \Delta C_{\mathrm{T},i,j} (\mathbf{E} \{ b_i \} + \mathbf{E} \{ b_j \}) \right). \end{aligned} \tag{3}$$

For 2D links, composed of metal-wires, the capacitance quantities are independent of the pattern properties. Therefore, the energy consumption is estimated via:

$$E_{n,2D} = \sum_{i} \mathbf{E} \{\Delta b_i^2\} C_{\mathrm{M},i,i} + \sum_{i \neq j} \mathbf{E} \{\Delta b_i^2 - \Delta b_i \Delta b_j\} C_{\mathrm{M},i,j}. \tag{4}$$

The energy consumption can also be expressed using the Frobenius inner product ($\langle \cdot, \cdot \rangle$) of two matrices, which simplifies the formulas. For 2D links we obtain:

$$E_{n,2D} = \langle \mathbf{T}, \mathbf{C}_M \rangle. \tag{5}$$

Here, $\mathbf{C}_M$ is the capacitance matrix, with $C_{\mathrm{M},i,j}$ on entry (i, j). **T** represents the switching properties of the bits:

$$\mathbf{T} = \vec{t}_s \mathbf{1}_{1 \times N} - \mathbf{T}_c,\tag{6}$$

where the vector  $\vec{t}_s$  contains the self switching probabilities  $\mathbf{E}\{\Delta b_i^2\}$ . Matrix  $\mathbf{T_c}$  includes the mean correlated switching quantities with zeros on the diagonal and  $\mathbf{E}\{\Delta b_i \Delta b_j\}$  on entry (i, j).  $\mathbf{1}_{1 \times N}$  is a  $1 \times N$  matrix of ones.
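As an illustration of Eqs. 5 and 6, the following minimal sketch estimates **T** from a recorded sequence of bit patterns and evaluates the normalized 2D energy as a Frobenius inner product. The function names and the 2-bit capacitance values are assumptions made for this example only; they are not taken from the simulator.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def switching_matrix(patterns: Sequence[Sequence[int]]) -> Matrix:
    """Estimate T = t_s * 1_(1xN) - T_c (Eq. 6) from consecutive bit patterns."""
    if len(patterns) < 2:
        raise ValueError("need at least two patterns to observe a transition")
    n_bits = len(patterns[0])
    t = [[0.0] * n_bits for _ in range(n_bits)]
    for prev, curr in zip(patterns, patterns[1:]):
        delta = [c - p for p, c in zip(prev, curr)]  # each entry is -1, 0 or 1
        for i in range(n_bits):
            for j in range(n_bits):
                # T_c has a zero diagonal, so only off-diagonal entries subtract.
                correlated = 0 if i == j else delta[i] * delta[j]
                t[i][j] += delta[i] ** 2 - correlated
    transitions = len(patterns) - 1
    return [[value / transitions for value in row] for row in t]


def frobenius(a: Matrix, b: Matrix) -> float:
    """Frobenius inner product <A, B>: sum of the element-wise products."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


# Hypothetical 2-bit link: ground capacitance 1, coupling capacitance 2 (arbitrary units).
C_M = [[1.0, 2.0], [2.0, 1.0]]
same_direction = [[0, 0], [1, 1], [0, 0]]      # both wires always toggle together
opposite_direction = [[0, 1], [1, 0], [0, 1]]  # wires always toggle against each other

e_same = frobenius(switching_matrix(same_direction), C_M)
e_opposite = frobenius(switching_matrix(opposite_direction), C_M)
print(e_same, e_opposite)
```

In this toy case the coupling capacitance contributes nothing when both wires toggle in the same direction, while opposite toggling makes it the dominant term.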

For 3D links the matrix formulation is:

$$E_{n,3D} = \langle \mathbf{T}, \left( \mathbf{C}_{T0} + \mathbf{\Delta} \mathbf{C}_{T} \circ \left( \vec{p} \cdot \mathbf{1}_{1 \times N} + \mathbf{1}_{N \times 1} \cdot \vec{p}^{T} \right) \right) \rangle, \tag{7}$$

where  $\circ$  is the Hadamard operator and  $\vec{p}$  is the bit probability vector  $(p_i = \mathbf{E}\{b_i\})$  and  $\vec{p}^T$  its transpose.
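The 3D case can be sketched in the same way. The example below follows the element-wise form of Eqs. 2 and 3: it first builds the probability-dependent capacitance matrix and then sums the element-wise products with **T**. All names and numerical values are hypothetical and chosen for illustration; in particular, the negative entries of `DELTA_C_T` are an assumption that reflects the capacitance reduction with rising 1-bit probability described above.

```python
from typing import List

Matrix = List[List[float]]


def tsv_capacitance(c_t0: Matrix, delta_c_t: Matrix, p: List[float]) -> Matrix:
    """Linear model of Eq. 2: C[i][j] = C_T0[i][j] + dC_T[i][j] * (p[i] + p[j])."""
    n = len(p)
    return [[c_t0[i][j] + delta_c_t[i][j] * (p[i] + p[j]) for j in range(n)]
            for i in range(n)]


def energy_3d(t: Matrix, c_t0: Matrix, delta_c_t: Matrix, p: List[float]) -> float:
    """Normalized 3D link energy (Eq. 3) as a sum of element-wise products."""
    c = tsv_capacitance(c_t0, delta_c_t, p)
    n = len(p)
    return sum(t[i][j] * c[i][j] for i in range(n) for j in range(n))


# Hypothetical 2-TSV link; the values are illustrative, not taken from the paper.
C_T0 = [[10.0, 4.0], [4.0, 10.0]]
DELTA_C_T = [[-2.0, -1.0], [-1.0, -2.0]]  # assumed negative slope
T = [[0.5, 0.5], [0.5, 0.5]]              # identical switching activity in both cases

e_low_p = energy_3d(T, C_T0, DELTA_C_T, [0.25, 0.25])
e_high_p = energy_3d(T, C_T0, DELTA_C_T, [0.75, 0.75])
print(e_low_p, e_high_p)
```

With identical switching activity, the higher 1-bit probability yields the lower energy in this sketch, which mirrors the motivation for considering $\vec{p}$ in 3D links.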

In summary, to estimate link energy requirements, three parameters are needed: first, the capacitance matrices; second, the switching matrices $\mathbf{T}$; and third, the bit probabilities $\vec{p}$. While many works propose models to estimate the capacitance quantities at high abstraction levels (e.g. [25, 7, 26]), the estimation of bit properties ($\mathbf{T}$ and $\vec{p}$) at high abstraction levels is currently possible only if virtual channels are neglected [27, 2]. To consider the effect of virtual channels, costly bit-level simulations are currently required.

## 4. Virtual channel-based NoCs

In this section we discuss the effect of virtual channels on the link energy consumption. For this purpose, in Subsec. 4.1, we review the concept of virtual channels. Afterwards, in Subsec. 4.2, we show that they have a huge impact on the link energy consumption and heavily reduce the efficiency of existing low-power coding approaches. This also serves as a motivation for the present work.

#### 4.1. Concept of virtual channels

The main functionality of a NoC is to forward data/messages (in the form of packets) from a source processing element to a destination one. Thereby, packets may pass multiple routers (multi-hop transmission). Also, packets may compete for links if several of them traverse the network.

Consider, for example, the scenario where two packets A and B compete for a link. Without virtual channels, one packet (e.g. A) is granted the link/channel, while the second packet B is blocked until A is completely transmitted. This has two disadvantages. First, due to the commonly applied flit-level flow control [28], the next flit of the granted packet A might be blocked due to contention elsewhere in the network. In this scenario the link is idle even though the other packet B could make effective use of it. Second, blocked packets result in heavily unequal transmission times for messages. To mitigate this degradation and to provide quality of service (QoS), the bandwidth of a link is divided among different packets using virtual channels. With this technique, more than one buffer is associated with each input port, so that different packets are buffered simultaneously and interleaved. While the assignment of virtual channels (input buffers) is packet based, the arbitration for physical channel bandwidth is on a flit-by-flit basis [28].

Different techniques exist for this arbitration. The most common one is time multiplexing [29, 30]. Hereby, the available bandwidth of the link is equally partitioned among the virtual channels (fairness). For example, if three packets (A to C) simultaneously request the usage of one physical channel, the available link bandwidth is shared by transmitting $A_0B_0C_0A_1B_1C_1...A_mB_mC_m$ if no congestion occurs in the preceding paths. Here, the indices are the flit numbers of the packets containing *m* flits. Another common technique uses priorities [31]. Each virtual channel is associated with a different priority depending on the service class of the corresponding message. The transmission of packets with a lower priority is preempted if higher-priority ones are using the link. This guarantees quality of service for high-priority traffic at the cost of being less fair.
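The fair time-multiplexing order described above can be reproduced with a short sketch. The function `round_robin_interleave` is a hypothetical helper written for this illustration; it assumes that no congestion occurs and that all packets request the link simultaneously.

```python
from typing import Dict, List


def round_robin_interleave(packets: Dict[str, int]) -> List[str]:
    """Flit order on the link for fair time multiplexing.

    `packets` maps a packet name to its number of flits; dict order is the
    (assumed) order in which the virtual channels are served.
    """
    sent = {name: 0 for name in packets}
    order: List[str] = []
    while any(sent[name] < flits for name, flits in packets.items()):
        for name, flits in packets.items():
            if sent[name] < flits:  # skip packets that are already complete
                order.append(f"{name}{sent[name]}")
                sent[name] += 1
    return order


print(round_robin_interleave({"A": 2, "B": 2, "C": 2}))
# ['A0', 'B0', 'C0', 'A1', 'B1', 'C1']
```

Once a shorter packet has finished, the remaining packets share the link, so the tail of a longer packet is again transmitted sequentially.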

In summary, links with virtual channels not only transmit packets sequentially but also multiplex simultaneous transmissions of multiple packets. The probability of a multiplexed transmission is slightly lower with priority-based virtual channel arbitration than with time multiplexing.

#### 4.2. Effect of virtual channels on the energy consumption

As outlined in Sec. 3, the two bit-level properties that affect link power consumption are the switching and the bit probabilities. Obviously, the interleaved transmission of multiple data streams (multiplexing), caused by the use of virtual channels, has no impact on the bit probabilities. However, it does affect the switching properties, as these are determined by the bit-level deltas between consecutively transmitted pattern pairs.

For the sequential transmission of highly correlated data streams, as found in DSP applications, the switching properties (**T**-entries) are small compared to the transmission of uncorrelated data [27]. This results in relatively low energy requirements, as the energy consumption is directly proportional to the switching properties. However, when the data streams are multiplexed due to the use of virtual channels, this beneficial behavior is lost, as the individual messages are mutually uncorrelated. Consequently, the link energy consumption drastically increases when virtual channels are used. This is validated by the following analysis: we investigate the mean energy consumption per transmitted byte for the transmission of two data streams containing 100,000 Gaussian distributed 16-bit flits with a relative correlation $\rho$ of 0.99 and a standard deviation $\sigma$ of 256. Two physical link structures are considered: a 3D TSV array and a 2D metal-wire bus, both driven by commercial 40 nm inverters. To obtain the metal wire parasitics, a commercial wire tool is used. Hereby, the metal wire width and spacing are set to $0.3\,\mu\mathrm{m}$ and $0.6\,\mu\mathrm{m}$, respectively. The TSV parasitics are generated with the edge-effect-aware capacitance model from Ref. [25]. For the square TSV array, we consider a TSV radius and pitch of $2\,\mu\mathrm{m}$ and $8\,\mu\mathrm{m}$, respectively, which corresponds to the minimum TSV dimensions predicted by the International Technology Roadmap for Semiconductors (ITRS) for the time frame 2015–2018 [32]. The link length is set to $50\,\mu\mathrm{m}$ (3D) and $100\,\mu\mathrm{m}$ (2D). The driver parasitics are provided by the vendor.

Employing bit-level simulations, we analyze the energy consumption for various data stream multiplexing probabilities (mux. prob.), to cover all possible scenarios in virtual channel based NoCs. Mux. prob. defines the likelihood of a change in the active virtual channel, so that the next transmitted flit belongs to another data stream than the current one. The results are presented in Fig. 1-a. For no data stream multiplexing, the energy consumption of the 2D and 3D link is 17 and 28 fJ per transmitted byte, respectively. This represents the case where no virtual channels are used. When they are used with equal priorities, mux. prob. is maximized to 1, indicating a continuous flit-by-flit multiplexing. In this scenario the energy consumption of the links is approx. doubled (2D: 34 fJ; 3D: 50 fJ) compared to the scenario where virtual channels are unused. This shows the dramatic effect of virtual channels on the link energy consumption. For scenarios where virtual channels are sometimes used, or if priorities are assigned to the channels, mux. prob. lies between 0 and 1 and the degradation is smaller.
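The qualitative trend can be reproduced with a deliberately simplified sketch. It does not replicate the experiment above: instead of Gaussian data and a capacitance model, it uses two hypothetical, slowly counting 16-bit streams and only counts bit flips (self switching) for mux. prob. 0 and mux. prob. 1.

```python
from typing import List, Sequence


def toggles(flits: Sequence[int]) -> int:
    """Total number of bit flips between consecutive flits (Hamming distances)."""
    return sum(bin(a ^ b).count("1") for a, b in zip(flits, flits[1:]))


def interleave(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Flit-by-flit multiplexing A0 B0 A1 B1 ... (mux. prob. = 1)."""
    return [flit for pair in zip(a, b) for flit in pair]


# Two hypothetical, highly correlated 16-bit streams (slowly counting values).
stream_a = [0x0100 + k for k in range(8)]
stream_b = [0xF000 + k for k in range(8)]

sequential = toggles(stream_a + stream_b)               # no virtual channels
multiplexed = toggles(interleave(stream_a, stream_b))   # continuous multiplexing
print(sequential, multiplexed)
```

Although each stream on its own toggles only a few low-order bits per flit, every switch between the streams flips the differing high-order bits, so the multiplexed sequence toggles considerably more often.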

To illustrate the effect of virtual channels on the efficiency of existing low-power codes, we analyze the transmission of two random (uniformly distributed and uncorrelated) data streams, encoded with the classical invert technique [33].



Fig. 1: Effect of virtual channels: a) mean link energy consumption over the channel multiplexing probability for the transmission of correlated, Gaussian data streams; b) gain of classical invert-coding [33] over the multiplexing probability for the transmission of two completely random data streams.

The results for the 2D and the 3D link are shown in Fig. 1-b. When the two data streams are transmitted without virtual channels (i.e. mux. prob. = 0), the encoding technique leads to a reduction in the 2D/3D link energy consumption by approx. 15%. However, with increasing virtual channel usage (i.e. mux. prob.) the coding efficiency vanishes. Due to the added redundancy of the technique (invert-bit), the encoding approach can even increase the energy consumption by up to 6%.
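For readers unfamiliar with invert coding, the sketch below shows one common formulation, assuming that the classical invert technique of [33] refers to bus-invert coding: a word is sent inverted, together with a set invert bit, whenever more than half of the data lines would otherwise toggle. The function names are hypothetical, and the decision here considers the data lines only, which is a simplification.

```python
from typing import List, Sequence, Tuple


def bus_invert_encode(words: Sequence[int], width: int) -> List[Tuple[int, int]]:
    """Return (bus_value, invert_bit) pairs for a stream of `width`-bit words."""
    mask = (1 << width) - 1
    prev_bus = 0  # assumed initial state of the data lines
    encoded = []
    for word in words:
        distance = bin((word ^ prev_bus) & mask).count("1")
        if 2 * distance > width:  # more than half of the lines would toggle
            bus, invert = ~word & mask, 1
        else:
            bus, invert = word & mask, 0
        encoded.append((bus, invert))
        prev_bus = bus
    return encoded


def bus_invert_decode(encoded: Sequence[Tuple[int, int]], width: int) -> List[int]:
    """Undo the inversion wherever the invert bit is set."""
    mask = (1 << width) - 1
    return [(~bus & mask) if invert else bus for bus, invert in encoded]


words = [0x00, 0xFF, 0x0F]
encoded = bus_invert_encode(words, width=8)
print(encoded)
```

Because the decision depends on the previously transmitted pattern, an encoder placed at the network interface cannot know which pattern precedes a flit on a link once flits of other packets are interleaved, which is consistent with the vanishing gain observed above.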

Thus, the high-level model presented in this work is not only required to estimate the energy consumption, but also to derive new coding techniques which do not show this degradation.

## 5. Modeling approach

In this section we present our high-level model to estimate the switching ($\mathbf{T}$) and bit probability ($\vec{p}$) characteristics of links with virtual channels, transmitting up to *n* different data types ($D^1$ to $D^n$). Thereby, head flits form one data type. Thus, the number of different transmitted message types is n-1. For each individual data type, we can obtain the switching properties ($\mathbf{T}^1$ to $\mathbf{T}^n$) for a sequential transmission of the corresponding data stream, as well as the bit probabilities [27, 34]. For our approach, not only the bit probability vectors of the data streams ($\vec{p}^1$ to $\vec{p}^n$) are required, but also the bit probability matrices ($\mathbf{S}^1$ to $\mathbf{S}^n$), with

$$S^{x}_{i,j} = \mathbf{E}\{b_i^x \cdot b_j^x\}. \tag{8}$$

The diagonals of these matrices are equal to the bit probability vectors, since $\mathbf{E}\{b_i^x \cdot b_i^x\} = \mathbf{E}\{b_i^x\}$. The remaining entries of an **S**-matrix are equal to the probability that both bits *i* and *j* of a pattern of $D^x$ are logical 1. Please note that, due to possible spatial bit correlations (e.g. due to a normal distribution [27]), $\mathbf{E}\{b_i^x \cdot b_j^x\}$ is generally unequal to $\mathbf{E}\{b_i^x\} \cdot \mathbf{E}\{b_j^x\} = p_i^x \cdot p_j^x$.
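A minimal sketch of how an **S**-matrix could be estimated from sample patterns of one data type is given below. The helper name and the toy stream are assumptions for illustration; the stream is chosen such that both bits are always equal, which makes the spatial correlation visible.

```python
from typing import List, Sequence


def bit_probability_matrix(patterns: Sequence[Sequence[int]]) -> List[List[float]]:
    """Estimate S with S[i][j] = E{b_i * b_j} from observed bit patterns (Eq. 8)."""
    n_bits = len(patterns[0])
    s = [[0.0] * n_bits for _ in range(n_bits)]
    for pattern in patterns:
        for i in range(n_bits):
            for j in range(n_bits):
                s[i][j] += pattern[i] * pattern[j]
    return [[value / len(patterns) for value in row] for row in s]


# Hypothetical stream in which both bits are always equal (strong spatial correlation).
S = bit_probability_matrix([[1, 1], [0, 0], [1, 1], [0, 0]])
p = [S[i][i] for i in range(2)]  # the diagonal holds the bit probabilities
print(S[0][1], p[0] * p[1])      # joint probability vs. product of the marginals
```

Here the joint probability of both bits being 1 is 0.5, whereas the product of the individual bit probabilities is only 0.25, so the bit probability vector alone would not suffice.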

With the data-flow-independent **S**-matrices, we estimate the switching properties $\mathbf{T}^{x\to y}$ for the case that two data streams $D^x$ and $D^y$ are multiplexed. For this purpose, the mean self switching ($\mathbf{E}\{\Delta b_i^2\}^{x\to y}$) as well as the mean correlated switching ($\mathbf{E}\{\Delta b_i\Delta b_j\}^{x\to y}$ for $i \neq j$) are required. Both can be calculated via

$$\begin{aligned} \mathbf{E}\{\Delta b_{i}\Delta b_{j}\}^{x \to y} &= \mathbf{E}\{(b_{i}^{y} - b_{i}^{x})(b_{j}^{y} - b_{j}^{x})\} \\ &= \mathbf{E}\{b_{i}^{y}b_{j}^{y} + b_{i}^{x}b_{j}^{x} - b_{i}^{y}b_{j}^{x} - b_{i}^{x}b_{j}^{y}\} \\ &= S_{i,j}^{y} + S_{i,j}^{x} - S_{i,i}^{y}S_{j,j}^{x} - S_{i,i}^{x}S_{j,j}^{y}. \end{aligned} \tag{9}$$

For i = j, we obtain the self switching probabilities and for  $i \neq j$  the correlated switching properties. In Eq. 9, we exploit that the cross-correlation of two different data streams is zero, which results in  $\mathbf{E}\{b_i^y b_j^x\} = \mathbf{E}\{b_i^y\} \cdot \mathbf{E}\{b_j^x\} = S_{i,i}^y \cdot S_{j,j}^x$ .
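Eq. 9 translates directly into code. The sketch below combines it with the structure of Eq. 6 to obtain $\mathbf{T}^{x\to y}$ from two **S**-matrices; the function name is hypothetical, and the example values describe two uniformly random, spatially uncorrelated 2-bit streams.

```python
from typing import List

Matrix = List[List[float]]


def multiplexed_switching(s_x: Matrix, s_y: Matrix) -> Matrix:
    """Switching matrix T^{x->y} for a pattern of D^x followed by a pattern of D^y."""
    n_bits = len(s_x)

    def expected_product(i: int, j: int) -> float:
        # Eq. 9: E{db_i * db_j} for mutually uncorrelated streams x and y.
        return (s_y[i][j] + s_x[i][j]
                - s_y[i][i] * s_x[j][j]
                - s_x[i][i] * s_y[j][j])

    t = [[0.0] * n_bits for _ in range(n_bits)]
    for i in range(n_bits):
        self_switching = expected_product(i, i)
        for j in range(n_bits):
            correlated = 0.0 if i == j else expected_product(i, j)
            t[i][j] = self_switching - correlated  # structure of Eq. 6
    return t


# E{b_i} = 0.5 on the diagonal and E{b_i b_j} = 0.25 elsewhere.
S_RANDOM = [[0.5, 0.25], [0.25, 0.5]]
print(multiplexed_switching(S_RANDOM, S_RANDOM))
```

For such random streams every entry evaluates to 0.5: each bit toggles with probability 0.5 and the correlated switching is zero.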

Employing the resulting switching matrices for the scenarios of multiplexed data streams  $(\mathbf{T}^{\mathbf{x}\to\mathbf{y}}, \text{ with } x \neq y)$ , as well as the switching matrices for no multiplexing  $(\mathbf{T}^{\mathbf{x}} = \mathbf{T}^{\mathbf{x}\to\mathbf{x}})$ , we can determine the switching for a link and a given data flow:

$$\mathbf{T}_{link} = \sum_{x,y} (M_{x,y} + M_{x+n,y}) \mathbf{T}^{\mathbf{x} \to \mathbf{y}}, \qquad (10)$$

where **M** is a  $2n \times 2n$  matrix which contains the information about the data flow over the link. The (x, y)-entry  $M_{x,y}$  is equal to the probability of transmitting a pattern of data type y, after transmitting a pattern of data type x. Thus,  $M_{x,x}$ is equal to the probability of two subsequently transmitted patterns belonging to data type x (no multiplexing). Entry (x, x + n) is equal to the probability that the link holds a pattern of data type x (link is idle). Therefore, entry (x + n, y) is the probability of transmitting a pattern of data type y after being idle, holding a value of type x.
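To make the definition of **M** more concrete, the following sketch derives it from a per-cycle trace of a single link. This is an assumption-laden illustration and not the simulator's implementation: the trace format, the zero-based type indices, the skipping of leading idle cycles, and the counting of idle-to-idle cycles on entry (x + n, x + n) are all choices made for this example.

```python
from typing import List, Optional, Sequence


def data_flow_matrix(trace: Sequence[Optional[int]], n: int) -> List[List[float]]:
    """Estimate the 2n x 2n matrix M from a per-cycle link trace.

    trace[k] is the data type (0..n-1) sent in cycle k, or None if the link is idle.
    State x means "transmitting type x"; state x + n means "idle, holding type x".
    """
    active = [k for k, data_type in enumerate(trace) if data_type is not None]
    if not active:
        raise ValueError("trace contains no transmission")
    first = active[0]     # leading idle cycles are skipped: link content unknown
    held = trace[first]   # data type of the pattern currently on the wires
    prev_state = held
    counts = [[0] * (2 * n) for _ in range(2 * n)]
    total = 0
    for data_type in trace[first + 1:]:
        if data_type is None:
            state = held + n  # idle, the last pattern stays on the link
        else:
            state = held = data_type
        counts[prev_state][state] += 1
        total += 1
        prev_state = state
    if total == 0:
        raise ValueError("trace contains no transition")
    return [[c / total for c in row] for row in counts]


M = data_flow_matrix([0, None, None, 1], n=2)
print(M[0][2], M[2][2], M[2][1])  # transmit->idle, idle->idle, idle->transmit
```

In this sketch the entries are joint probabilities over all observed cycle pairs, so they sum to one.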

Analogously, the bit probability vector for the link can be calculated by

$$\vec{p}_{link} = \sum_{x,y} (M_{x,y} + M_{x,y+n} + M_{x+n,y}) \vec{p}^{y}. \tag{11}$$

Finally, by substituting  $\mathbf{T}_{link}$  and  $\vec{p}_{link}$  into Eq. 5 (for a 2D link) or Eq. 7 (for a 3D link) we can estimate the energy consumption of links in the presence of virtual channels.
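The weighting in Eqs. 10 and 11 can be sketched as follows. The code evaluates both sums literally as written; the function names, the zero-based indexing, and the numbers for the 1-bit example link are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def link_switching(m: Matrix, t_pairs: List[List[Matrix]], n: int) -> Matrix:
    """Eq. 10: T_link = sum_{x,y} (M[x][y] + M[x+n][y]) * T^{x->y}."""
    n_bits = len(t_pairs[0][0])
    t_link = [[0.0] * n_bits for _ in range(n_bits)]
    for x in range(n):
        for y in range(n):
            weight = m[x][y] + m[x + n][y]
            for i in range(n_bits):
                for j in range(n_bits):
                    t_link[i][j] += weight * t_pairs[x][y][i][j]
    return t_link


def link_bit_probabilities(m: Matrix, p: List[List[float]], n: int) -> List[float]:
    """Eq. 11: p_link = sum_{x,y} (M[x][y] + M[x][y+n] + M[x+n][y]) * p^y."""
    p_link = [0.0] * len(p[0])
    for x in range(n):
        for y in range(n):
            weight = m[x][y] + m[x][y + n] + m[x + n][y]
            for i, p_i in enumerate(p[y]):
                p_link[i] += weight * p_i
    return p_link


# Hypothetical 1-bit link, n = 2 data types, continuous flit-by-flit multiplexing.
n = 2
M = [[0.0, 0.5, 0.0, 0.0],
     [0.5, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
T_PAIRS = [[[[0.1]], [[0.4]]],  # T^{0->0}, T^{0->1}
           [[[0.4]], [[0.1]]]]  # T^{1->0}, T^{1->1}
P = [[0.2], [0.6]]              # p^0, p^1

print(link_switching(M, T_PAIRS, n), link_bit_probabilities(M, P, n))
```

With continuous multiplexing only the cross terms $\mathbf{T}^{0\to 1}$ and $\mathbf{T}^{1\to 0}$ contribute, so the link switching in this toy case is four times higher than for a purely sequential transmission of either stream.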

## 6. Simulator

In this section, we describe our simulator, which is capable of generating the data flow matrices **M**. First, we introduce its architecture. Second, we demonstrate the configuration options of the simulator. Third, we explain the generation of data flow matrices and other evaluation metrics. The simulator and its source code are publicly available on GitHub at https://github.com/jmjos/ratatoskr. Originally, it targets NoCs for 3D SoCs with technology heterogeneity, but it can also be used to model traditional 2D and homogeneous 3D NoCs.



Fig. 2: Components of the simulator

#### 6.1. Architecture

The simulator consists of three individual parts, as shown in Fig. 2: The *NoC simulator* is the core of the software; it simulates the NoC by means of hardware models for its components. The *application model* provides an implementation which injects synthetic or real-world-based traffic into the network. The *reporting tool* offers functions for the evaluation of the simulation results. Neither the application model nor the reporting tool is the focus of this publication, and we refer to the given references for details. We only provide a short description of these parts, while the NoC simulator is described in more detail below. The three parts are implemented in C++11 using the class library SystemC 2.3.1a [10], which provides the simulation kernel. In version 1.1.8, the simulator has approximately 7,700 lines of code. This code size is similar to that of the competitors (Noxim: 8,600 lines of code). Please note that BookSim 2.0 provides its own kernel and thus is naturally larger, with 25,000 lines of code.

The application model implements colored, statistical Petri nets with retention time on places, as published in [22]. This allows modeling a very wide range of applications, covering both real-world-based traffic streams and traditional, synthetic ones. The reporting tool, as proposed in [23], generates textual and graphical reports for each simulation run. Further, it offers a MySQL database to track events in the NoC simulator such as *flit send* or *routing calculation*. It also generates the data flow matrices, as explained later on.

The NoC simulator implements a hardware model consisting of PEs, NIs, routers and links. In contrast to the competitors, the model is well-defined, as published in [22]. The architecture, i.e., the structure of the components and their communication, is shown in Fig. 3. As already explained, the application model injects traffic into the network via PEs. Since the application is modeled on transaction level (left-hand side of the figure) and the hardware is modeled cycle-accurately (right-hand side), both parts are connected by a transaction level model (TLM) interconnect. It handles the mapping of places in the application's Petri net to PEs and vice versa. The PE is an abstract representation of processing cores, sensors or hardware accelerators connected to the NoC. PEs support virtual channels and are implemented in the class **ProcessingElementVC**. PEs process packets from the network via a receive-



Fig. 3: Modules of NoC and application simulator.

function. In the PEs' execute-function, events are triggered in the application model to capture the application's dynamic behavior. The simulator connects PEs with routers at the same location via a single NI. NIs serialize data from packets into flits using the network link bit width, and vice versa. The NIs' implementation in the class **NetworkInterfaceVC** consists of one function per direction to serialize and deserialize data. Flits are sent through the network via routers and links. The router model is implemented in the class **RouterVC**. It is a standard input-buffered router [35] that additionally supports non-purely synchronous communication. Routers have two main functions: In a receive-function, flits are stored in the correct buffers and flow control to upstream routers is handled. In the router's main thread, flits are processed by calculating a route, arbitrating virtual channels and sending flits. The router model has three stages (routing calculation, virtual channel allocation, sending) for head flits. Subsequent flits are transmitted immediately if downstream buffer space is available. Many parameters of the router, such as buffer depth, number of virtual channels, routing function and even the network topology (cf. [22]), can be configured at runtime using XML files, which avoids recompiling the simulator. As an example, the definition of a router and a PE model is shown in Listing 1. The router supports virtual channels, uses deterministic dimension-ordered routing, round-robin selection<sup>2</sup>, a fair arbiter as introduced in Sec. 4 and a clock speed of 1 GHz. The PEs have fewer options; they support virtual channels and are clocked at 500 MHz in this example. Finally, links connect routers unidirectionally, and their model is implemented in the class **Link**. Links report data transmissions in every clock cycle to the reporting tool for the generation of the data flow matrices $\mathbf{M}$. We verified the correctness of these matrices by comparison with raw data. Notably, the link itself does not require knowledge of the link bit width, since this information is already encapsulated

<sup>2</sup> Actually, the selection is not relevant for deterministic routing algorithms, which only return a single path.

Listing 1: Configuring node types via XML.



Fig. 4: Interplay of application model and link matrices.

into the flit generation in the NIs.
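The serialization step performed by the NIs can be sketched as follows. This is a minimal illustration in Python, not the code of the class NetworkInterfaceVC: the function names `serialize` and `deserialize`, the representation of a payload as a single integer, and the most-significant-flit-first order are assumptions made for this example.

```python
from typing import List


def serialize(payload: int, payload_bits: int, flit_width: int) -> List[int]:
    """Split a payload into flits of `flit_width` bits, most significant flit first."""
    if payload < 0 or payload >= (1 << payload_bits):
        raise ValueError("payload does not fit into payload_bits")
    n_flits = -(-payload_bits // flit_width)  # ceiling division
    mask = (1 << flit_width) - 1
    return [(payload >> (flit_width * i)) & mask for i in reversed(range(n_flits))]


def deserialize(flits: List[int], flit_width: int) -> int:
    """Reassemble the payload from flits produced by `serialize`."""
    value = 0
    for flit in flits:
        value = (value << flit_width) | flit
    return value


flits = serialize(0xABCD1234, payload_bits=32, flit_width=16)
print([hex(f) for f in flits])
print(hex(deserialize(flits, flit_width=16)))
```

Because only the NI needs the bit width to cut the payload, a link model can treat each flit as an opaque word, which matches the remark that the link does not need to know the link bit width.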

In general, the proposed architecture is similar to that of competing software. Its important distinguishing features are as follows: The proposed simulator is the only one with a link model to calculate the dynamic energy consumption of virtual channel-based links, as proposed in this publication. The number of parameters that can be set at runtime via XML files is comparatively high and allows for very flexible modeling of many architectures. The application model, as published in [22], is comprehensive and allows for generating real-world data traffic, which is not possible with the existing solutions: these only provide synthetic traffic patterns, which are on a different level of abstraction. Finally, the reporting tool enables more flexible reports, with an adjustable level of abstraction, than the competitors. This is exemplified in detail in the next section. To summarize, the proposed simulator offers more diverse features than the competing software and has an energy model of higher accuracy.

#### 6.2. Generation of the data flow matrices M

The **M**-entries depend on the application scenario and the NoC architecture (virtual channel count/arbitration, routing, etc.). A typical scenario for the generation of data flow matrices is shown in Fig. 4. In the upper part of the figure, an application is shown. It consists of a sender and two receivers; the data for each receiver have different pattern types, i.e. different switching (**T**) and bit probabilities ( $\vec{p}$ ). In the lower layer, a simple 2×2 NoC is shown with mesh topology and dimension order routing. For the sake of simplicity, only routers and links are shown; NIs and PEs are at the same positions as the routers. The sender is mapped to the upper right PE. The receivers are mapped to the two PEs on the left-hand side of the NoC.<sup>3</sup> In the application model, as proposed in [22], $n$ different pattern types are denoted by colors using the set  $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_n\}$ . In this example, only  $\sigma_1$  and  $\sigma_2$  are used. The data flow of both pattern types is shown in the figure in red and orange. The data flow matrix  $\mathbf{M}_1$  for the upper link is influenced by both pattern types; the data flow matrix  $\mathbf{M}_2$  is only influenced by type  $\sigma_2$ , since the data of the first flow do not traverse this link. If the two pattern types are in two different virtual channels and are transmitted simultaneously, there will be switching activity between the two types in the data flow matrix for this link. This example demonstrates the dependence of the data flow matrices on both NoC and application.

This method is superior to saving a whole protocol of the transmitted flit types (which we used to verify the implementation of the link model), since the latter would require memory that increases linearly with the number of simulated clock cycles, i.e. the trace of the link has a memory complexity of $O(t)$, with $t$ as simulation time. The effort to save the data flow matrices, however, is constant because the matrices are of size  $2n \times 2n$  for $n - 1$ transmitted data streams and do not increase their size with the simulation time, i.e. the generation of the matrices has a memory complexity of  $O(n^2)$ . Naturally, the execution time complexity is identical for both methods, since the whole simulation must be executed ($O(t)$).
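The constant-memory argument can be illustrated with a minimal Python sketch. It is not the simulator's implementation: it assumes a simplified matrix with one row and one column per flit type (index 0 stands for a hypothetical idle type) instead of the $2n \times 2n$ layout described above, and the name `accumulate_flow_matrix` is illustrative only.

```python
from typing import Iterable, List


def accumulate_flow_matrix(flit_types: Iterable[int], n_types: int) -> List[List[int]]:
    """Count transitions between consecutive flit types on one link.

    The input is consumed as a stream, so memory stays O(n_types^2)
    regardless of how many clock cycles are simulated.
    """
    counts = [[0] * n_types for _ in range(n_types)]
    prev = None
    for cur in flit_types:
        if prev is not None:
            counts[prev][cur] += 1  # row: previous type, column: current type
        prev = cur
    return counts


# Assumed encoding for this example: 0 = idle, 1 = sigma_1, 2 = sigma_2.
trace = [0, 1, 2, 1, 2, 2, 0]
for row in accumulate_flow_matrix(iter(trace), n_types=3):
    print(row)
```

Storing `trace` itself would grow with every simulated cycle, whereas the matrix returned here has a fixed number of entries.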

The reporting tool saves the matrices' contents and reports them both in human-readable textual form and as a .csv file for further processing. Please note that the data flow matrices allow for the calculation of further statistics besides the proposed energy consumption. For instance, the average idle/usage time of every link, and thus also of every router port, can be easily extracted. Therefore, the data flow matrices provide an innovative feature with genuine additional value in comparison to the existing NoC simulators.
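As a hedged illustration of such a derived statistic, the following Python sketch extracts the usage fraction of a link from a matrix of transition counts. The matrix layout (rows: previous flit type, columns: current flit type, index 0: idle) is an assumption of this example and not the format written by the reporting tool.

```python
from typing import List


def link_usage(counts: List[List[int]], idle_index: int = 0) -> float:
    """Fraction of observed cycles in which the link carried a non-idle flit."""
    total = sum(sum(row) for row in counts)
    if total == 0:
        return 0.0
    # Every transition into the idle column corresponds to one idle cycle.
    idle_cycles = sum(row[idle_index] for row in counts)
    return 1.0 - idle_cycles / total


example = [[0, 1, 0],
           [0, 0, 2],
           [1, 1, 1]]
print(f"usage: {link_usage(example):.3f}")
```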

# 7. Simulation results

#### 7.1. Model accuracy

In this section we investigate the accuracy of our approach. For this purpose, we analyze the transmission of 2–5 data types/streams for different mean multiplexing probabilities with Python. Furthermore, each simulation is executed 1,000 times for a flit width of 16 b, and 1,000 times for a flit width of 32 b. To cover the large space of data type combinations as well as possible, in

 $<sup>^{3}</sup>$ Actually, to comply with the model [22], two different places are required to send two different pattern types, which are mapped to the same PE. For the sake of simplicity, we depict a single place.



Fig. 5: Accuracy of our proposed high-level model compared to bit-level simulations: a) root mean square error (RMSE); b) maximum absolute error. Both quantities are normalized and given in percentage points (pp).

each run the synthetically generated data streams vary randomly. The pattern distribution of each single data stream is either uniform, normal (Gaussian), or log-normal. For the last two distributions, the standard deviation of the patterns is in the range from  $2^{N/10}$  to  $2^{N-1}$  and the relative pattern correlation is in the range from 0 to 1. We compare the switching properties (**T**) estimated with our proposed high-level model against the exact switching properties determined by means of bit-level simulations, for the transmission of 10,000 flits. We report the overall root-mean-square error (*RMSE*) as well as the maximum absolute error (*MAE*).
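The two reported error metrics can be computed as in the following Python sketch. It assumes that the switching properties are given as probabilities in the range from 0 to 1, so that a difference multiplied by 100 is a value in percentage points; the exact normalization used for Fig. 5 is not restated here, and the function names are illustrative.

```python
import math
from typing import Sequence


def rmse_pp(estimated: Sequence[float], exact: Sequence[float]) -> float:
    """Root-mean-square error between two sequences, in percentage points."""
    squared = [(e - x) ** 2 for e, x in zip(estimated, exact)]
    return 100.0 * math.sqrt(sum(squared) / len(squared))


def max_abs_error_pp(estimated: Sequence[float], exact: Sequence[float]) -> float:
    """Largest absolute deviation between two sequences, in percentage points."""
    return 100.0 * max(abs(e - x) for e, x in zip(estimated, exact))


# Hypothetical values, chosen only to exercise the functions.
model = [0.50, 0.26, 0.10]
bit_level = [0.49, 0.25, 0.12]
print(f"RMSE: {rmse_pp(model, bit_level):.2f} pp")
print(f"MAE:  {max_abs_error_pp(model, bit_level):.2f} pp")
```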

The results are presented in Fig. 5. They show that our approach enables an extremely accurate estimation of the switching characteristics, independent of the multiplexing probabilities, the number of multiplexed data streams, or the flit width. For all analyzed scenarios, the RMSE of our estimates is in the range of 0.6–0.8 percentage points (pp). The maximum error (MAE) for all 5,120,000 estimated switching properties does not exceed 2.8 pp. Although our model matches the bit-level simulations almost perfectly, its execution time is more than 2,000 times lower on an Intel i5-4690 machine with 16 GB of RAM, running Linux kernel 3.16. This speed-up will increase even further with an increasing pattern and/or link count. Thus, our model enables a precise prediction of the energy consumption of a full virtual channel-based NoC, containing multiple processing elements, within a tolerable time.

Furthermore, the experiment confirms that, as expected, the switching activities increase linearly with an increasing multiplexing probability (see also Fig. 1), while the number of multiplexed data streams does not affect the switching properties. Since switching is only determined by directly consecutive pattern pairs, a multiplexing of two data streams leads, on average, to the same energy consumption as a multiplexing of three or more data streams. Thus, without loss of generality, in the remainder of this section we restrict our analysis to scenarios where only two data types are multiplexed.

#### 7.2. Low-power coding

As outlined in Sec. 4, there is a strong need for coding techniques which reduce the energy consumption of links with virtual channels. The modeling technique presented in this work allows for a fast estimation of the efficiency of such low-power coding techniques. With a single simulation, the coding-independent data flow matrices  $\mathbf{M}$  are determined once. Afterwards, using Eq. 5–11, the actual link energy requirements can be determined for arbitrary data streams, and thus for different applied data encoding techniques. Furthermore, the high-level model presented in this work enables the design of new coding techniques which consider the effect of virtual channels. This approach is investigated in this subsection.

For this purpose, we consider the simultaneous transmission of 2 MB of data from two different sources over 16 b wide 2D and 3D links. For the physical media (2D and 3D links), we use the same structures as in Sec. 4. We investigate the dissipated energy per effectively transmitted byte over the mean multiplexing probability (mean( $M_{1,2}, M_{2,1}$ )); the energy is related to the effectively transmitted bytes to take possible bit overheads of the encoding techniques into account. For the data streams, we consider uniformly distributed patterns where the eight MSBs show a strong temporal correlation ( $\rho = 0.99$ ), and completely random (uncorrelated) patterns. For the uncorrelated data, we analyze the invert-coding, and for the correlated data a correlator-coding [24]. The second approach correlates (bit-wise XOR) every data word with the previous data word of the stream. Therefore, in the analyzed example, the high MSB correlation in combination with the inverting drivers leads to code word MSBs nearly stable on logical 1 [5]. The energy quantities are determined twice by means of Eq. 5–7: once using the high-level model presented in this work to estimate **T** &  $\vec{p}$ ; and once using the exact **T** &  $\vec{p}$  obtained by bit-level simulations.
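The two coding schemes can be sketched in Python as follows. This is an illustration under stated assumptions, not the encoders used for the experiments: the correlator starts from an all-zero previous word and does not model the inverting drivers, and "invert-coding" is assumed to refer to classic bus-invert coding [33], here in a simplified variant that ignores toggles of the invert line when deciding whether to invert.

```python
from typing import List, Tuple


def correlator_encode(words: List[int]) -> List[int]:
    """XOR every data word with the previous data word of the same stream."""
    prev, code = 0, []
    for w in words:
        code.append(w ^ prev)
        prev = w
    return code


def correlator_decode(code: List[int]) -> List[int]:
    """Invert `correlator_encode` by accumulating the XOR chain."""
    prev, words = 0, []
    for c in code:
        prev ^= c
        words.append(prev)
    return words


def bus_invert_encode(words: List[int], width: int) -> List[Tuple[int, int]]:
    """Return (bus word, invert bit) pairs; invert if more than width/2 bits would toggle."""
    mask = (1 << width) - 1
    prev, out = 0, []
    for w in words:
        toggles = bin((w ^ prev) & mask).count("1")
        tx, inv = ((~w) & mask, 1) if toggles > width // 2 else (w & mask, 0)
        out.append((tx, inv))
        prev = tx
    return out


correlated = [0xAB12, 0xAB34, 0xAB35]  # identical MSBs, varying LSBs
print([hex(c) for c in correlator_encode(correlated)])
print(bus_invert_encode([0x0000, 0xFFFF, 0xFFFE], width=16))
```

For words whose MSBs rarely change, the XOR output has MSBs that are almost always 0 before the drivers, which illustrates why the correlator mainly shapes the bit probabilities, while bus-invert coding only bounds the number of toggles between consecutive words on the bus.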

The resulting energy quantities are illustrated in Fig. 6. The markers indicate the energy quantities obtained for the exact bit properties, which are in perfect accordance with the energy quantities obtained with our high-level model (lines). Thus, our model allows for a fast and precise estimation of coding efficiencies. For example, the model precisely predicts the decreasing efficiency of the invert-coding with an increasing multiplexing probability as well as the increasing efficiency of the correlator-coding for the transmission of two correlated data streams. Thus, it reveals that for a high virtual channel usage a coding technique other than the invert-coding approach is required, while the correlator-coding performs well for this data flow scenario. Without our high-level model, it is not trivial for a designer to explain this observation, which complicates the design of new coding techniques for links with virtual channels. Our proposed model provides a clear answer. Invert-coding only affects the switching probabilities for the sequential (non-multiplexed) transmission of data streams. However, it does not affect the bit probability matrices **S**. Thus, according to our model, it does not decrease the energy consumption per clock cycle for a continuous data type multiplexing, and due to its induced overhead it even increases the energy consumption per effectively transmitted byte. In contrast, the correlator-coding leads to MSBs nearly stable on logical



Fig. 6: Effect of low-power coding techniques on the 2D/3D link energy consumption for the transmission of correlated and fully random data in the presence of virtual channels. Marks show results for exact bit-level simulations, corresponding lines are estimates of our proposed high-level model.

1. This increases the **S**-values, which reduces the  $\mathbf{T}^{x \to y}$  values as well as the TSV capacitance quantities via the MOS-effect.

In summary, our model reveals an important message for the design of low-power techniques: in order to obtain the most efficient coding approach for links with virtual channels, the technique must not only affect the switching activities of the single data streams  $\mathbf{E}\{\Delta b_i \Delta b_i\}$ , but also the bit probabilities  $\mathbf{E}\{b_i \cdot b_i\}$ .

# 8. Image processing case study

This section presents a case study with a common use-case of the model presented in this work. We consider a heterogeneous 3D Vision SoC with a NoC interconnect architecture. The NoC architecture has a flit width of 16 b with 1 head- and 31 body-flits (payload) per packet, supports up to four virtual channels per port, and has an input buffer depth of 4. Furthermore, for comparison, we also consider the same architecture without virtual channels.

The full 3D SoC consists of: one mixed signal (MS) layer which contains six CMOS image sensors (S1–S6) at the top; one underlying memory (MEM) layer; and at the bottom one digital layer containing the actual processors. In this case study we want to analyze, and optimize, the transmission of raw gray-scale image pixels (two per flit) from the sensors to the memory. Since the analyzed NoC uses XYZ routing [1], while images are always read from the memory,



Fig. 7: Part of the heterogeneous 3D Vision System-on-Chip which is analyzed in the case study. The system includes one mixed signal layer for the digitization of the sampled images and one memory layer to temporarily store the images. The components are connected via a 3D NoC.

we can analyze the traffic from the sensors to the memory without considering the traffic between the cores and the memory.

Thus, the structure sketched in Fig. 7 is analyzed. In total, seven different data/flit types are transmitted over the links: one for head-flits (h) and six for body-flits, one per source (s1–s6). Each of the six image sensors in the MS layer is connected to one of the routers R1–R6, which are connected by 2D links. Router R5 is connected via a 3D link with router R7 in the underlying MEM layer. Connected to R7 is a memory block to store the sensor data. Thus, over the links connecting  $R1 \rightarrow R2$ ,  $R3 \rightarrow R2$ ,  $R4 \rightarrow R5$  and  $R6 \rightarrow R5$  only head-flits and data stemming from one sensor are transmitted, resulting in unused virtual channels. Usage of virtual channels occurs in the links  $R2 \rightarrow R5$  and  $R5 \rightarrow R7$ , as they are required for the transmission of data from 3 and 6 different sensors, respectively. In this section the same physical link structures as in Sec. 4 and 7 are considered. To obtain the data flow matrices, we run a simulation for the virtual channel-based and the virtual channel-less 3D NoC with our extended simulator for a mean traffic injection rate of 20% per sensor. To allow for subsequent bit-level simulations (to obtain reference values), the simulator is temporarily modified such that it saves the whole protocol of the transmitted flits.

After the simulations, the energy quantities per transmitted packet are determined with Python, employing: first our proposed high-level model, second the standard high-level model (which neglects the effect of virtual channels), and third bit-level simulations. Thereby, we consider two separate traffic scenarios. In scenario 1 all six sensors capture road images with a resolution of  $512 \times 512$  pixels in daylight, and in scenario 2 at night. We choose these particular traffic scenarios as they result in relatively high errors for our approach, which assumes that the cross-correlation between the individual data streams is zero. This is not guaranteed if all sensors capture pictures of the same environment from different perspectives. The energy results with virtual channels are presented in the upper rows of Table 1. The NoC performance results, obtained by our simulator, are presented in the last row of the table. Energy consumption

Table 1: Link energy quantities and network performance with 4 virtual channels.

| Data | Energy per transmitted packet [pJ] | | |
|------|------------------------------------|---|---|
| | Bit-level sim. | Presented model | Standard model [2] |
| Uncoded | 4.18 | 4.15 | 2.40 |
| Gray | 4.11 (**-1.67 %**) | 4.09 (**-1.44 %**) | 2.23 (**-7.08 %**) |
| Corr | 2.69 (**-35.64 %**) | 2.68 (**-35.42 %**) | 2.71 (+12.92 %) |
| Avg. flit latency: 19.2 ns | | Avg. network latency: 105.4 ns | |

Table 2: Link energy quantities and network performance without virtual channels.

| Data | Energy per transmitted packet [pJ] | | |
|------|------------------------------------|---|---|
| | Bit-level sim. | Presented model | Standard model [2] |
| Uncoded | 2.39 | 2.40 | 2.40 |
| Gray | 2.22 (-7.11 %) | 2.23 (-7.08 %) | 2.23 (-7.08 %) |
| Corr | 2.70 (+12.94 %) | 2.71 (+12.92 %) | 2.71 (+12.92 %) |
| Avg. flit latency: 40.2 ns | | Avg. network latency: 193.3 ns | |

and network performance without virtual channels are shown in Table 2.

The simulation framework reveals that implementing virtual channels in the NoC almost doubles the network performance for the analyzed application, as the average flit and network latency decrease by 52.2% and 45.5%, respectively. However, this performance gain comes at the expense of a dramatic increase in the links' power consumption of 74.9% (besides the increased complexity of the NoC architecture). Generally, in all analyzed realistic NoC traffic scenarios, our proposed model precisely predicts the energy consumption (error below 1%, compared to results obtained by precise, but computationally expensive, bit-level simulation). In contrast, the previous high-level model, which neglects the effect of virtual channels, leads to an error of almost 50% for the virtual channel-based NoC, although in the analyzed NoC only two links use virtual channels. However, these two links show the highest energy consumption, and for them the traditional model leads to an underestimation of the energy consumption by more than a factor of four.
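The percentages quoted above follow directly from the uncoded rows and the latency rows of Table 1 and Table 2. The short Python sketch below reproduces this arithmetic; the helper name `rel_change` is introduced only for this example.

```python
def rel_change(new: float, old: float) -> float:
    """Relative change from `old` to `new` in percent."""
    return 100.0 * (new - old) / old


# With virtual channels (Table 1) versus without (Table 2).
flit_latency = rel_change(19.2, 40.2)      # average flit latency [ns]
network_latency = rel_change(105.4, 193.3)  # average network latency [ns]
energy = rel_change(4.18, 2.39)            # bit-level energy per packet [pJ], uncoded

# Presented model versus bit-level simulation, with virtual channels, uncoded.
model_error = rel_change(4.15, 4.18)

print(f"flit latency:    {flit_latency:+.1f} %")
print(f"network latency: {network_latency:+.1f} %")
print(f"link energy:     {energy:+.1f} %")
print(f"model error:     {model_error:+.2f} %")
```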

After the simulation, we analyze the integration of two overhead-free low-power coding approaches for the body-flits: correlator- and Gray-coding [24]. We analyze overhead-free, low-complexity, low-power codes, as an induced bit overhead would increase the buffer/memory cost. Additionally, both encoding techniques can be hidden in the AD converters of the sensors to mitigate the implementation overhead. Gray-encoding reduces the switching activities for a sequential transmission of the highly correlated pixels, while a correlator mainly affects the bit probabilities. As we know from the previous section, the correlator-coding shows good coding efficiency for multiplexed data streams, while the efficiency for no multiplexing is rather poor. The energy quantities for the transmission of the encoded data, instead of the RAW data, are also shown in Table 1 (with virtual channels) and in Table 2 (without virtual channels). The (estimated) energy reductions due to the coding approaches are shown in bold. Our presented model, in accordance with the bit-level simulations, indicates that the correlator leads to a far better coding efficiency (-36 % instead of -1 %). Thus, if virtual channels are integrated together with an end-to-end correlator coding, the increase in the links' energy consumption compared to the virtual channel-less system will be reduced from 74.9% to 12.5%.
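For reference, the standard binary-reflected Gray code can be written in a few lines of Python. This sketch only shows the code itself; how the pixels are encoded inside the sensors' AD converters is not modeled, and applying the code per pixel value is an assumption of this example.

```python
def gray_encode(value: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return value ^ (value >> 1)


def gray_decode(code: int) -> int:
    """Inverse of `gray_encode`."""
    value = 0
    while code:
        value ^= code
        code >>= 1
    return value


def toggles(a: int, b: int) -> int:
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


# Neighboring pixel values 127 and 128 differ in 8 bits in plain binary,
# but in only 1 bit after Gray encoding.
print(toggles(127, 128), toggles(gray_encode(127), gray_encode(128)))
```

Consecutive values always differ in exactly one bit after encoding, which is why Gray-coding helps for the sequential transmission of highly correlated pixels but offers little when unrelated streams are interleaved.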

In comparison, the standard model, which neglects virtual channels, erroneously predicts a much higher coding efficiency for the Gray-coding and even a significant increase in the energy consumption (negative coding gain) for the correlator-coding. This underlines that using a model that neglects virtual channels for a high-level estimation results in the implementation of inefficient coding techniques and a dramatic underestimation of the energy quantities.

# 9. Conclusion

This work presents a NoC simulator with the first model to precisely estimate the data-dependent dynamic energy consumption of 2D and 3D links even in the presence of virtual channels. The simulator is implemented in C++ and SystemC. It allows for an estimation of the NoC energy requirements at an early design stage. Furthermore, it enables the derivation and fast evaluation of low-power data encoding techniques for links with virtual channels. In combination with the proposed NoC simulator, a full design space exploration including link energy is possible for NoCs for the first time. The model shows negligible errors for realistic NoC traffic scenarios, with and without implemented coding techniques.

# Acknowledgment

This work is funded by the German Research Foundation (DFG) project GA 763/7-1 and PI 477/8-1.

# References

- [1] G. De Micheli, L. Benini, Networks on chips: technology and tools, Academic Press, 2006.
- [2] N. Jafarzadeh, M. Palesi, A. Khademzadeh, A. Afzali-Kusha, Data encoding techniques for reducing energy consumption in network-on-chip, IEEE Trans. Very Large Scale Integ. (VLSI) Syst. 22 (3) (2014) 675–685.
- [3] N. Eisley, L.-S. Peh, High-level power analysis for on-chip networks, in: Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Syst., 2004, pp. 104–115. doi:10.1145/1023833.1023849.
- [4] M. Palesi, G. Ascia, F. Fazzino, V. Catania, Data encoding schemes in networks on chip, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 30 (5) (2011) 774–786. doi:10.1109/TCAD.2010.2098590.

- [5] A. Garcia-Ortiz, L. S. Indrusiak, Practical and theoretical considerations on low-power probability-codes for networks-on-chip, in: Int. Workshop on Power and Timing Mod., Opt. and Sim., Springer, 2010, pp. 160–169.
- [6] L. Bamberg, J. M. Joseph, R. Schmidt, T. Pionteck, A. García-Ortiz, Coding-aware Link Energy Estimation for 2D and 3D Networks-on-Chip with Virtual Channels, International Symposium on Power and Timing Modeling, Optimization and Simulation.
- [7] L. Bamberg, A. Garcia-Ortiz, High-level energy estimation for submicrometric TSV arrays, IEEE Trans. Very Large Scale Integ. (VLSI) Syst. 25 (10) (2017) 2856–2866. doi:10.1109/TVLSI.2017.2713601.
- [8] A. B. Kahng, B. Li, L.-S. Peh, K. Samadi, Orion 2.0: A fast and accurate NoC power and area model for early-stage design space exploration, in: Proceedings of the Conference on Design, Automation and Test in Europe, DATE '09, European Design and Automation Association, Leuven, Belgium, 2009, pp. 423–428. URL http://dl.acm.org/citation.cfm?id=1874620.1874721
- [9] S. Penolazzi, A. Jantsch, A High Level Power Model for the Nostrum NoC, in: V. Muthukumar (Ed.), 9th EUROMICRO Conference on Digital System Design: Architectures, Methods and Tools, IEEE Computer Society, Los Alamitos, Calif., 2006, pp. 673–676. doi:10.1109/DSD.2006.9.
- [10] IEEE Standard for Standard SystemC Language Reference Manual. doi:10.1109/IEEESTD.2012.6134619.
- [11] A. B. Kahng, B. Lin, S. Nath, ORION3.0: a comprehensive NoC router estimation tool, IEEE Embedded Systems Letters 7 (2) (2015) 41–45.
- [12] A. García-Ortiz, L. Bamberg, A. Najafi, Low-Power Coding: Trends and New Challenges, Journal of Low Power Electronics 13 (3) (2017) 356–370. doi:10.1166/jolpe.2017.1507.
- [13] A. García-Ortiz, L. S. Indrusiak, Practical and Theoretical Considerations on Low-Power Probability-Codes for Networks-on-Chip, in: International Symposium on Power and Timing Modeling, Optimization and Simulation, 2011.
- [14] V. Catania, A. Mineo, S. Monteleone, M. Palesi, D. Patti, Cycle-Accurate Network on Chip Simulation with Noxim, ACM Transactions on Modeling and Computer Simulation 27 (1) (2016) 1–25. doi:10.1145/2953878.
- [15] C. H. Chao, K. Y. Jheng, H. Y. Wang, J. C. Wu, A. Y. Wu, Traffic- and Thermal-Aware Run-Time Thermal Management Scheme for 3D NoC Systems, in: Networks-on-Chip (NOCS), 2010 Fourth ACM/IEEE International Symposium on, 2010, pp. 223–230. doi:10.1109/NOCS.2010.32.

- [16] J. M. Joseph, C. Blochwitz, T. Pionteck, Adaptive allocation of default router paths in Network-on-Chips for latency reduction, in: International Conference on High Performance Computing & Simulation, IEEE, 2016. doi:10.1109/HPCSim.2016.7568328.
- [17] D. Park, S. Eachempati, R. Das, A. K. Mishra, Y. Xie, N. V., C. R. Das, MIRA: A Multi-layered On-Chip Interconnect Router Architecture, in: 35th International Symposium on Computer Architecture, IEEE, 2008. doi:10.1109/ISCA.2008.13.
- [18] N. Jiang, J. Balfour, D. U. Becker, B. Towles, W. J. Dally, G. Michelogiannakis, J. Kim, A detailed and flexible cycle-accurate Network-on-Chip simulator, in: International Symposium on Performance Analysis of Systems and Software, IEEE, 2013. doi:10.1109/ISPASS.2013.6557149.
- [19] L. Bamberg, A. García-Ortiz, High-Level Energy Estimation for Submicrometric TSV Arrays, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25 (10) (2017) 2856–2866. doi:10.1109/TVLSI.2017.2713601.
- [20] A. B. Kahng, B. Lin, S. Nath, Comprehensive Modeling Methodologies for NoC Router Estimation, Technical Report UCSD Technical Report CS2012-0989, University of California, San Diego, CA (September 2012).
- [21] N. Agarwal, T. Krishna, L. Peh, N. K. Jha, GARNET: A detailed on-chip network model inside a full-system simulator, in: 2009 IEEE International Symposium on Performance Analysis of Systems and Software, 2009, pp. 33–42. doi:10.1109/ISPASS.2009.4919636.
- [22] J. M. Joseph, L. Bamberg, G. Krell, I. Hajjar, A. García-Ortiz, T. Pionteck, Specification of Simulation Models for NoCs in Heterogeneous 3D SoCs, International Symposium on Reconfigurable Communication-centric Systems-on-Chip.
- [23] J. M. Joseph, S. Wrieden, C. Blochwitz, A. García-Ortiz, T. Pionteck, A simulation environment for design space exploration for asymmetric 3D-network-on-chip, in: Int. Symp. on Reconf. Commun.-centric Syst.-on-Chip, 2016, pp. 1–8. doi:10.1109/ReCoSoC.2016.7533908.
- [24] A. Garcia-Ortiz, L. Bamberg, A. Najafi, Low-power coding: trends and new challenges, Journal of Low Power Electron. 13 (3) (2017) 356–370. doi:10.1166/jolpe.2017.1507.
- [25] L. Bamberg, A. Najafi, A. García-Ortiz, Edge effects on the TSV array capacitances and their performance influence, Integration, the VLSI Journal.
- [26] C. Xu, H. Li, R. Suaya, K. Banerjee, Compact AC modeling and performance analysis of through-silicon vias in 3-D ICs, IEEE Trans. Electron Devices 57 (12) (2010) 3405–3417. doi:10.1109/TED.2010.2076382.

- [27] P. E. Landman, J. M. Rabaey, Architectural power analysis: The dual bit type method, IEEE Trans. Very Large Scale Integ. (VLSI) Syst. 3 (2) (1995) 173–187. doi:10.1109/92.386219.
- [28] W. J. Dally, Virtual-channel flow control, IEEE Trans. Parallel and Distributed Syst. 3 (2) (1992) 194–205. doi:10.1109/71.127260.
- [29] N. Kavaldjiev, G. J. M. Smit, P. G. Jansen, A virtual channel router for on-chip networks, in: IEEE Int. SoC Conf., IEEE, 2004, pp. 289–293.
- [30] T. Marescaux, A. Bartic, D. Verkest, S. Vernalde, R. Lauwereins, Interconnection networks enable fine-grain dynamic multi-tasking on fpgas, in: Int. Conf. on Field Programmable Logic and Applications, Springer, 2002, pp. 795–805.
- [31] E. Bolotin, I. Cidon, R. Ginosar, A. Kolodny, QNoC: QoS architecture and design process for network on chip, Journal of Syst. architecture 50 (2-3) (2004) 105–128.
- [32] International technology roadmap for semiconductors (ITRS), Semiconductor Industry Association.
- [33] M. R. Stan, W. P. Burleson, Bus-invert coding for low-power I/O, IEEE Trans. Very Large Scale Integ. (VLSI) Syst. 3 (1) (1995) 49–58.
- [34] A. García-Ortiz, D. Gregorek, C. Osewold, Analysis of bus-invert coding in the presence of correlations, in: Saudi Int. Electronics, Commun. and Photonics Conf., 2011, pp. 1–5. doi:10.1109/SIECPC.2011.5876944.
- [35] W. J. Dally, B. Towles, Principles and Practices of Interconnection Networks, Elsevier, 2004.