

# Model of Network-on-Chip routers and performance analysis

Youhui Zhang<sup>1a)</sup>, Xiaoguo Dong<sup>2</sup>, Siqing Gan<sup>2</sup>, and Weimin Zheng<sup>1</sup>

<sup>1</sup> Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China

<sup>2</sup> Department of Mathematical Science and Computing Technology, Central South University, Changsha, 410075, China

a) zyh02@tsinghua.edu.cn

**Abstract:** This paper presents a generic analytical performance model of a Network-on-Chip (NoC) router, which is then used to analyze the performance of a whole wormhole NoC. We focus on analyzing the various packet-blocking conditions at the router input queues in order to estimate the waiting time more accurately. Based on this estimate, two key NoC performance metrics, the buffer utilization and the packet transfer latency, are computed. The model gives more accurate results than a previous model: the error is 6.30% for the buffer utilization ratio and about 5.98% for the packet transfer latency.

**Keywords:** Network-on-Chip, Markov chain, performance analysis

**Classification:** Integrated circuits

#### References

- [1] [Online] http://nocs.stanford.edu/cgibin/trac.cgi/wiki/Resources/BookSim
- [2] O. Lysne, "Towards a generic analytical model of wormhole routing networks," *Microprocessors and Microsystems*, vol. 21, no. 7-8, pp. 491–498, 1998.
- [3] J. Hu, U. Y. Ogras, and R. Marculescu, "System-level buffer allocation for application-specific Networks-on-Chip router design," *IEEE Trans. Computer-Aided Design Integr. Circuits Syst.*, vol. 25, no. 12, pp. 2919–2933, 2006.
- [4] U. Y. Ogras and R. Marculescu, "Analytical router modeling for networks-on-chip performance analysis," *Proc. Design, Automation and Test in Europe Conference*, Acropolis, pp. 1096–1101, 2007.
- [5] S. Foroutan, Y. Thonnart, R. Hersemeule, and A. Jerraya, "An analytic method for evaluating network-on-chip performance," *Proc. Design, Automation and Test in Europe Conference (DATE'10)*, Dresden, pp. 1629–1632, 2010.
- [6] Y. Zhang, X. Dong, and W. Zheng, "A Performance Analytical Approach Based on Queuing Model for Network-on-Chip Router," Proc. of PAAP2010, 2010.




# 1 Introduction

Many design-space explorations of Networks-on-Chip (NoC) rely on simulation, which is time-consuming. As a result, system designers can evaluate only a limited number of design points and usually cannot reach an optimized result.

Therefore, many analytical models have been developed for NoC. This paper presents a generic analytical performance evaluation approach for NoC. Unlike other work, it considers the flow-control feedback probability between adjacent routers in detail.

In summary, we make the following contributions:

- 1) A general analytical model of wormhole routers is proposed, which supports arbitrary network topologies, deterministic routing algorithm, etc.
- 2) Computing methods for the buffer utilization ratio and the packet transfer latency of NoC are presented based on the above contribution.
- 3) The analysis accuracy is validated through comparisons with a meticulous cycle-accurate simulator, BookSim [1].

# 2 Previous work

NoC modeling is an important performance-analysis technique. For example, [2] introduces a probability model for wormhole networks with any topology; in contrast to [2], our work is based on a queuing model. [3] gives a performance model based on queuing theory, but it applies only to switched networks. [4] presents a performance analysis for general wormhole NoCs, but it does not consider the flow-control feedback mechanism, which is triggered when the input queue of the downstream router is full. [5], under the same assumption, uses numerical analysis and iterative computation to estimate performance. In contrast to these works, the flow-control feedback probability is one of the main focuses here.

Moreover, compared with our preliminary work [6], this paper extends to the buffer utilization and transfer latency based on the flow-control feedback.

# 3 Router modeling

#### 3.1 Router

We assume that a wormhole router contains w ports, including the port for the local processing element, and adopts a deterministic routing algorithm. Each port is of the single-channel structure associated with an input queue.

Before transfer, any packet is divided into small pieces, called flits. The header flit holds the destination information to set up the transfer channel for all subsequent flits of the same packet.

As in [4, 5], we introduce the following hypotheses.

- 1) Network traffic is generated from all nodes uniformly and follows the Poisson process<sup>1</sup>.
- 2) Any input queue has finite capacity, denoted by B.

The delay for a packet to cross a router is divided into two parts: the service time, T, and the waiting time. To compute T, the following parameters are introduced:

*P*: packet size (in flits). Usually, the router pipeline can process one flit per cycle if no waiting is considered.

 $H_s$: service time of the header flit, i.e., the ideal time for the header flit to go through the router (again excluding any waiting time). For any given router,  $H_s$  is a constant that depends only on the micro-architecture of the router. Then, we have:

$$T = H_s + P \tag{1}$$

Moreover, in later sections, the symbol (i, j) stands for Port j  $(1 \le j \le w)$  at Router i.

#### 3.2 The model

We analyze the average waiting time that a packet spends in the router.  $T_{i,j}$  denotes the average waiting time in (i, j), which is composed of three parts [4]:

- 1) Service time of the packets already waiting in the same queue;
- 2) The residual service time seen by an incoming packet;
- 3) The packets waiting in other buffers of the same router and served before the incoming packet.

In [4, 5], both Part 1 and 2 have been analyzed completely. But for Part 3, they do not consider the flow-control feedback. Therefore, we focus on this issue.

In detail, when a header flit intends to go to a specific output port, it has to compete with all other flits requesting the same direction. Moreover, another necessary condition for any winner to continue is that the input queue of the downstream router is not full; this mechanism is called the flow-control feedback.

Thus, transmitting a packet from (i, j) to (i + 1, k) involves two processes: *competition* and *flow-control*.

Because the routing algorithm is deterministic, the forwarding probabilities for a given packet can be determined. Let  $F_{i,j,k}$  be the probability that the header flit is transmitted from (i, j) to (i + 1, k),  $p_{i+1,k}$  the flow-control feedback probability<sup>2</sup> produced by (i + 1, k), and  $f_{i,j,k}$  the competition probability of the header flit. Then we have:

$$F_{i,j,k} = f_{i,j,k} \times (1 - p_{i+1,k}) \tag{2}$$



<sup>1</sup>It does not mean the whole traffic distribution is uniform; otherwise our model cannot support arbitrary network topologies.

<sup>2</sup>How to compute the probability has been proposed in [6] and we also introduce it in Appendix.



Let  $\lambda_{i,j,k}$  be the traffic rate from (i, j) to (i + 1, k). Then we get:

$$f_{i,j,k} = \frac{\lambda_{i,j,k}}{\sum_{l=1}^{w} \lambda_{i,l,k}} \tag{3}$$
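Eqs. (2) and (3) can be evaluated directly once the per-port traffic rates are known. The following Python sketch is illustrative only: the rate table `lam`, the feedback values `p_next` and the function names are assumptions made for this example and do not come from the experiments in this paper.

```python
def competition_probabilities(lam):
    """Eq. (3): f[j][k] = lam[j][k] / sum over l of lam[l][k].

    lam[j][k] is the traffic rate from input port j of router i to
    input port k of the downstream router (hypothetical layout).
    """
    ports = len(lam)
    outputs = len(lam[0])
    f = [[0.0] * outputs for _ in range(ports)]
    for k in range(outputs):
        total = sum(lam[l][k] for l in range(ports))
        for j in range(ports):
            # A direction that carries no traffic has no competition.
            f[j][k] = lam[j][k] / total if total > 0 else 0.0
    return f


def forwarding_probabilities(f, p_next):
    """Eq. (2): F[j][k] = f[j][k] * (1 - p_next[k])."""
    return [[f_jk * (1.0 - p_next[k]) for k, f_jk in enumerate(row)]
            for row in f]


# Made-up rates (packets per cycle) for a 3-port router.
lam = [[0.00, 0.02, 0.01],
       [0.03, 0.00, 0.01],
       [0.01, 0.02, 0.00]]
p_next = [0.1, 0.0, 0.2]  # assumed feedback probabilities of the downstream queues

f = competition_probabilities(lam)
F = forwarding_probabilities(f, p_next)
print(f[1][0], F[1][0])
```

For every direction k that carries traffic, the competition probabilities sum to one over the input ports, which is a convenient sanity check on the rate table.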

 $c_{i,j,q}$  denotes the probability that the header flits in (i, j) and (i, q) compete for the same input port of Router (i + 1). We have  $c_{i,j,q} = 1$  if j = q. If  $1 \le j, q \le w$  and  $j \ne q$ , we get

$$c_{i,j,q} = \sum_{k=1}^{w} F_{i,j,k} F_{i,q,k} = \sum_{k=1}^{w} f_{i,j,k} f_{i,q,k} (1 - p_{i+1,k})^2 \tag{4}$$
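As a small illustration of Eq. (4), the sketch below builds the matrix of  $c_{i,j,q}$  values for one router from a table of forwarding probabilities. The table `F` and the name `contention_matrix` are hypothetical and chosen only for this example.

```python
def contention_matrix(F):
    """Eq. (4): c[j][q] = sum over k of F[j][k] * F[q][k], with c[j][j] = 1."""
    w = len(F)
    c = [[0.0] * w for _ in range(w)]
    for j in range(w):
        for q in range(w):
            if j == q:
                c[j][q] = 1.0
            else:
                c[j][q] = sum(F[j][k] * F[q][k] for k in range(len(F[j])))
    return c


# Assumed forwarding probabilities F[j][k] for a 3-port router.
F = [[0.000, 0.5, 0.4],
     [0.675, 0.0, 0.4],
     [0.225, 0.5, 0.0]]
c = contention_matrix(F)
print(c[0][1], c[0][2], c[1][2])
```

The matrix is symmetric by construction, since the product  $F_{i,j,k} F_{i,q,k}$  does not depend on the order of j and q.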

Therefore, the blocking delay caused by packet competition and flow control is given by Eq. (5):

$$E(T)\sum_{q=1,q\neq j}^{w} c_{i,j,q}N_q = E(T)\sum_{q=1,q\neq j}^{w}\sum_{k=1}^{w} f_{i,j,k}f_{i,q,k}(1-p_{i+1,k})^2N_q \tag{5}$$

where  $N_q$  is the average number of packets waiting in (i, q) and E(T) is the mean service time, so that the average time an incoming packet waits for the packets queued in (i, q) is  $E(T)N_q$ .

Then, based on Eq. (5) and other computation from [4] (for Part 1 and 2), the waiting time of an incoming packet buffered in the input queue of (i, j) is:

$$T_{i,j} = E(T)N_j + \frac{1}{2}\lambda_{i,j}E(T^2) + E(T)\sum_{q=1,q\neq j}^{w}\sum_{k=1}^{w}f_{i,j,k}f_{i,q,k}(1-p_{i+1,k})^2N_q \tag{6}$$

where  $E(T^2)$  is the second moment of service time, the arrival traffic rate at (i, j) is denoted by  $\lambda_{i,j}$  and  $N_j$  is the average number of packets waiting at (i, j).
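To make the three terms of Eq. (6) concrete, the sketch below evaluates them for one port. All numbers are made up; in particular, the example assumes a constant service time so that  $E(T^2) = E(T)^2$ , which is an assumption of this illustration rather than a statement of the model.

```python
def waiting_time(j, lam, N, c, ET, ET2):
    """Eq. (6) for port j of one router.

    lam[j]  : arrival rate at port j
    N[q]    : average number of packets waiting at port q
    c[j][q] : contention probabilities of Eq. (4)
    ET, ET2 : first and second moments of the service time
    """
    same_queue = ET * N[j]                       # part 1
    residual = 0.5 * lam[j] * ET2                # part 2
    blocking = ET * sum(c[j][q] * N[q]           # part 3, Eq. (5)
                        for q in range(len(N)) if q != j)
    return same_queue + residual + blocking


lam = [0.01, 0.02, 0.01]           # assumed arrival rates (packets per cycle)
N = [0.1, 0.2, 0.1]                # assumed queue occupancies
c = [[1.0, 0.16, 0.25],
     [0.16, 1.0, 0.151875],
     [0.25, 0.151875, 1.0]]
ET = 16.0                          # e.g. H_s = 2 and P = 14 in Eq. (1)
ET2 = ET ** 2                      # constant service time (assumption)

T0 = waiting_time(0, lam, N, c, ET, ET2)
print(T0)
```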

# 4 NoC performance analysis

Eq. (6) provides the waiting time estimation, which depends on the network topology and traffic rates. Based on this method, we can not only effectively estimate the packet transfer latency, but also analyze the influence of key parameters (including the buffer size and number of pipeline stages). In this section, the proposed model is used to compute the buffer utilization and packet transfer latency of the entire network.

#### 4.1 The buffer utilization

For the micro-architecture design, the buffer size of a router is one of the major parameters, and optimizing it can improve the NoC performance significantly. We use Eq. (6) to compute the average buffer utilization at (i, j), which provides information about the distribution of traffic across the entire network.





Using Little's theorem (the long-term average number of customers in a stable system is equal to the arrival rate multiplied by the average time a customer spends in the system), we have

$$\begin{aligned} N_{j} &= \lambda_{i,j} T_{i,j} \\ &= \lambda_{i,j} E(T) N_{j} + \frac{1}{2} \lambda_{i,j}^{2} E(T^{2}) + \lambda_{i,j} E(T) \sum_{q=1, q \neq j}^{w} \sum_{k=1}^{w} f_{i,j,k} f_{i,q,k} (1 - p_{i+1,k})^{2} N_{q} \end{aligned} \tag{7}$$

N stands for the average number of packets waiting in the queue. So the final buffer utilization ratio is the ratio between N and B (the queue capacity).
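Eq. (7) is linear in the unknowns  $N_1, \dots, N_w$  of one router, so it can be solved as a small linear system. The sketch below is one possible way to do so, not the MATLAB code used for the results in this paper: it uses plain Gaussian elimination, made-up rates and contention probabilities, and again assumes  $E(T^2) = E(T)^2$ .

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-15:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for cc in range(col, n + 1):
                M[r][cc] -= factor * M[col][cc]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][cc] * x[cc] for cc in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def queue_occupancy(lam, c, ET, ET2):
    """Solve Eq. (7) for the occupancies N of one router.

    Moving every N term to the left gives (I - diag(lam * ET) * c) N = b,
    where c has ones on its diagonal and b[j] = 0.5 * lam[j]**2 * ET2.
    """
    w = len(lam)
    A = [[(1.0 if j == q else 0.0) - lam[j] * ET * c[j][q]
          for q in range(w)] for j in range(w)]
    b = [0.5 * lam[j] ** 2 * ET2 for j in range(w)]
    return solve_linear(A, b)


lam = [0.01, 0.02, 0.01]            # assumed arrival rates
c = [[1.0, 0.16, 0.25],             # assumed contention probabilities
     [0.16, 1.0, 0.151875],
     [0.25, 0.151875, 1.0]]
ET, ET2, B = 16.0, 256.0, 8         # assumed moments and queue capacity
N = queue_occupancy(lam, c, ET, ET2)
utilization = [n_j / B for n_j in N]  # ratio between N and B, as in the text
print(N, utilization)
```

Because the flow-control feedback probabilities themselves depend on the traffic, a full evaluation would recompute `c` from the appendix formulas before solving; the sketch treats `c` as given.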

#### 4.2 The average packet latency

The estimation of the latency for a particular network is performed based on the wormhole routing strategy. The delay at each router incorporates two terms: the waiting time in input queue  $T_{i,j}$  and the service time E(T). Delay from the source node s to the destination d is the sum of delays over all routers in the path  $\pi_{s,d}$  (the path from s to d).

By computing the delay from an arbitrary source to an arbitrary destination, the average packet latency, L, is given by the following expression:

$$L = \frac{1}{\sum_{\forall s,d} x_{s,d}} \sum_{\forall s,d} \sum_{(i,j)\in\pi_{s,d}} x_{s,d} (T_{i,j} + E(T)) \tag{8}$$

where  $x_{s,d}$  represents the traffic rate from s to d.
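Eq. (8) is a traffic-weighted average of per-path delays. A minimal sketch, with hypothetical flows, paths and waiting times that are not taken from the experiments, could look as follows.

```python
def average_latency(flows, paths, T, ET):
    """Eq. (8): traffic-weighted mean of the per-path delays.

    flows[(s, d)] : traffic rate x_{s,d}
    paths[(s, d)] : list of (router, port) pairs on the path from s to d
    T[(i, j)]     : waiting time at port j of router i, from Eq. (6)
    ET            : mean service time E(T)
    """
    total_rate = sum(flows.values())
    weighted = 0.0
    for (s, d), x in flows.items():
        path_delay = sum(T[hop] + ET for hop in paths[(s, d)])
        weighted += x * path_delay
    return weighted / total_rate


# Two made-up flows sharing their first two hops.
flows = {("A", "B"): 0.02, ("A", "C"): 0.01}
paths = {("A", "B"): [(0, 0), (1, 1)],
         ("A", "C"): [(0, 0), (1, 1), (2, 1)]}
T = {(0, 0): 3.0, (1, 1): 2.0, (2, 1): 1.0}

L = average_latency(flows, paths, T, ET=16.0)
print(L)
```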

# 5 Evaluation

Unlike previous works, which use their own custom simulators, we validate our model with a third-party NoC simulator, BookSim. BookSim supports a wide range of topologies and routing algorithms and allows the router's micro-architecture to be customized. It can provide various types of simulation results:

- 1) The transfer latency of packets is directly provided as one of the results;
- 2) At each simulation cycle, we record the number of flits in every input queue, from which the buffer utilization ratio can be obtained accurately.

We adopt the XY deterministic routing algorithm and a  $5 \times 5$  2D mesh network for simulation. The observed results are obtained by simulating  $2 \times 10^7$  cycles after a warm-up phase of  $2 \times 10^7$  cycles, and are then compared with the analytical results computed with MATLAB. We have modified the source code of BookSim to generate the traffic described in Section 3.1. The injection rate is specified in packets per cycle.
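For reference, XY routing sends a packet along the X dimension first and then along the Y dimension, which fixes the path  $\pi_{s,d}$  used in Eq. (8). The sketch below enumerates these paths on a  $5 \times 5$  mesh; the (x, y) node coordinates and the function name are conventions assumed for this example, not BookSim's interface.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst, moving along X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


MESH = 5
nodes = [(x, y) for x in range(MESH) for y in range(MESH)]
pairs = [(s, d) for s in nodes for d in nodes if s != d]

# Average number of routers a packet crosses when every
# source/destination pair is equally likely.
mean_routers = sum(len(xy_route(s, d)) for s, d in pairs) / len(pairs)
print(xy_route((0, 0), (2, 1)), mean_routers)
```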





#### 5.1 The average packet transfer latency

We focus on the transfer latency under given conditions, including the input buffer size (B), the service time<sup>3</sup> of the header flit ($H_s$, without contention) and the packet size (P). Fig. 1 shows that the latencies estimated by Eq. (8) follow the simulation results closely (the left part uses B = 3, $H_s$ = 2 and P = 14, while the right part uses $H_s$ = 6). In this experiment, the mean error of the average packet latency is 5.98%. For packet injection rates below 0.2, the relative error is within 4.38%, whereas the corresponding error in [4] is 5%.



Fig. 1. Packet transfer latencies

#### 5.2 The average buffer utilization

Here we validate the accuracy of Eq. (7) by computing the average number of flits in the input queue for each port of a router. For comparison, we also compute the corresponding values based on the model proposed in [4]. Both are compared with the simulation results.

We choose three different injection rates to compute the buffer utilization with the fixed buffer size (8), pipeline stage number (6) and packet size (14).

When the injection rate is 0.12, 0.16 and 0.20, the mean error of the model proposed in [4] is 6.94%, 12.94% and 14.91%, respectively, while that of our model is 5.18%, 5.45% and 8.28% (6.30% on average). Thus, our model performs better, especially under heavy traffic.
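As a quick check of the averages, using only the figures quoted above (the second value, for the model in [4], is derived here and is not stated in the original results):

$$\frac{5.18\% + 5.45\% + 8.28\%}{3} = \frac{18.91\%}{3} \approx 6.30\%, \qquad \frac{6.94\% + 12.94\% + 14.91\%}{3} = \frac{34.79\%}{3} \approx 11.60\%$$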

# 6 Conclusion

A router model for NoC performance analysis is presented, which focuses on the flow-control feedback probability. Methods for computing the buffer utilization and the transfer latency are also given. Test results show that the average error is 6.30% for the buffer utilization and 5.98% for the transfer latency. Compared with previous research, the model improves the accuracy.



 $<sup>^{3}</sup>$ The BookSim router contains several pipeline stages and the delay for each stage can be set manually as the input parameter. So, we can set Hs as any value while the delay of virtual-channel allocation is always zero because we only consider the single-channel structure here.



# Acknowledgments

This work is supported by the 863 R&D Program of China under Grant No. 2008AA01A204 and the NSF of China under Grant No. 60773147.

# Appendix

We consider the flow-control feedback probability  $p_{i+1,k}$  of the input queue at (i+1,k), which is produced by (i+1,k) with the traffic rate  $\sum_{j=1}^{w} f_{i,j,k}\lambda_{i,j,k}$ and the service rate  $\frac{1}{E(T)}$ .



Fig. 2. State transition diagram for M/D/1/B queue

We use a Markov chain to analyze the changes in the number of flits in the input queue; the state transition diagram for the queue is shown in Fig. 2. Its state transition matrix can be written as follows.

$$M = \begin{bmatrix} 1 - \alpha & \alpha & 0 & \cdots & 0 & 0 & 0 \\ \beta & \Gamma & \alpha & \cdots & 0 & 0 & 0 \\ 0 & \beta & \Gamma & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \Gamma & \alpha & 0 \\ 0 & 0 & 0 & \cdots & \beta & \Gamma & \alpha \\ 0 & 0 & 0 & \cdots & 0 & \beta & 1 - \beta \end{bmatrix} \tag{9}$$

where

$$\alpha = \left(\sum_{j=1}^{w} f_{i,j,k} \lambda_{i,j,k}\right) \left(1 - \frac{1}{E(T)}\right) \tag{10}$$

and

$$\beta = \left(1 - \sum_{j=1}^{w} f_{i,j,k} \lambda_{i,j,k}\right) \frac{1}{E(T)} \tag{11}$$

and

$$\Gamma = \left(\sum_{j=1}^{w} f_{i,j,k} \lambda_{i,j,k}\right) \frac{1}{E(T)} + \left(1 - \sum_{j=1}^{w} f_{i,j,k} \lambda_{i,j,k}\right) \left(1 - \frac{1}{E(T)}\right). \tag{12}$$
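The matrix of Eq. (9) can be built and checked numerically. The sketch below is only an illustration: the aggregated traffic rate `rate` (standing for the sum in Eqs. (10) to (12)), the values of `ET` and `B`, and the use of simple power iteration to approximate the equilibrium distribution are all assumptions of this example.

```python
def transition_matrix(rate, ET, B):
    """Build the (B + 1) x (B + 1) matrix of Eq. (9)."""
    mu = 1.0 / ET                                     # service rate
    alpha = rate * (1.0 - mu)                         # Eq. (10)
    beta = (1.0 - rate) * mu                          # Eq. (11)
    gamma = rate * mu + (1.0 - rate) * (1.0 - mu)     # Eq. (12), diagonal entry
    n = B + 1
    M = [[0.0] * n for _ in range(n)]
    for s in range(n):
        if s > 0:
            M[s][s - 1] = beta        # one flit leaves the queue
        if s < n - 1:
            M[s][s + 1] = alpha       # one flit joins the queue
        if s == 0:
            M[s][s] = 1.0 - alpha
        elif s == n - 1:
            M[s][s] = 1.0 - beta
        else:
            M[s][s] = gamma
    return M


def stationary(M, steps=5000):
    """Approximate the equilibrium distribution by repeated multiplication."""
    n = len(M)
    pi = [1.0 / n] * n
    for _ in range(steps):
        pi = [sum(pi[r] * M[r][col] for r in range(n)) for col in range(n)]
    return pi


rate, ET, B = 0.03, 16.0, 3           # made-up example values
M = transition_matrix(rate, ET, B)
pi = stationary(M)
print(pi)
```

Every row of the matrix sums to one because  $\alpha + \beta$  plus the diagonal entry of the interior rows equals 1, so the chain is a proper discrete-time birth-death process.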

According to the state transition diagram, we get the equilibrium distribution vector

$$S_{i+1,k} = [S_{i+1,k,0}, S_{i+1,k,1}, \dots, S_{i+1,k,B}]^T \tag{13}$$

and

$$\sum_{n=0}^{B} S_{i+1,k,n} = 1. \tag{14}$$



In Eq. (14),  $S_{i+1,k,n}$  is the probability that the input queue of (i + 1, k) holds *n* flits;  $S_{i+1,k,0}$  is the probability of an empty queue, and  $S_{i+1,k,B}$  is the probability of a full queue, i.e., the probability that (i + 1, k) generates the flow-control feedback.

The difference equations for the state transition distribution vector can be written as follows.

$$\alpha S_{i+1,k,0} - \beta S_{i+1,k,1} = 0 \tag{15}$$

$$\alpha S_{i+1,k,n-1} - (\alpha + \beta) S_{i+1,k,n} + \beta S_{i+1,k,n+1} = 0 \quad (0 < n < B). \tag{16}$$

Then, the solution of the above difference equations is

$$S_{i+1,k,n} = \left(\frac{\alpha}{\beta}\right)^n S_{i+1,k,0} \quad (0 \le n \le B). \tag{17}$$

We define the duty factor of the system as

$$\rho = \frac{\alpha}{\beta} = \frac{\left(\sum_{j=1}^{w} f_{i,j,k} \lambda_{i,j,k}\right) \left(1 - \frac{1}{E(T)}\right)}{\left(1 - \sum_{j=1}^{w} f_{i,j,k} \lambda_{i,j,k}\right) \frac{1}{E(T)}} \tag{18}$$

and use the constraint

$$\sum_{n=0}^{B} S_{i+1,k,n} = S_{i+1,k,0} \sum_{n=0}^{B} \rho^n = 1 \tag{19}$$

Now, we get that

$$S_{i+1,k,0} = \frac{1-\rho}{1-\rho^{B+1}} \tag{20}$$

and then

$$p_{i+1,k} = S_{i+1,k,B} = \rho^B \frac{1-\rho}{1-\rho^{B+1}}. \tag{21}$$

Eq. (21) reflects the relationship between the flow-control feedback probability and other parameters, including the traffic rate, the queue capacity, the router port-number and the average service time.
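The closed form of Eq. (21) is straightforward to evaluate. The sketch below does so for made-up parameter values and checks the result against the balance equations (15) and (16); the function name, the aggregated `rate` argument and the separate handling of  $\rho = 1$  (where the geometric sum in Eq. (19) equals B + 1) are additions made for this example.

```python
def feedback_probability(rate, ET, B):
    """Return (p, S, alpha, beta), where p is the full-queue probability of Eq. (21).

    rate : aggregated traffic rate offered to the queue (the sum in Eq. (18))
    ET   : mean service time E(T)
    B    : queue capacity
    """
    mu = 1.0 / ET
    alpha = rate * (1.0 - mu)
    beta = (1.0 - rate) * mu
    rho = alpha / beta                                # Eq. (18)
    if abs(rho - 1.0) < 1e-12:
        s0 = 1.0 / (B + 1)                            # limit of Eq. (20) as rho -> 1
    else:
        s0 = (1.0 - rho) / (1.0 - rho ** (B + 1))     # Eq. (20)
    S = [s0 * rho ** n for n in range(B + 1)]         # Eq. (17)
    return S[B], S, alpha, beta


p3, S, alpha, beta = feedback_probability(rate=0.03, ET=16.0, B=3)
p8 = feedback_probability(rate=0.03, ET=16.0, B=8)[0]
print(p3, p8)
```

With these example values the duty factor is below one, so enlarging the queue from 3 to 8 entries lowers the feedback probability, which matches the intuition that larger buffers block upstream routers less often.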

