

# **Design and implementation of a NoC router supporting multicast**

Guohai Zheng<sup>1</sup>, Huaxi Gu<sup>1a)</sup>, and Jian Zhu<sup>2,3</sup>

<sup>1</sup> State Key Laboratory of ISN, Shenzhen CU-Xidian Joint Center, Xidian University, Xi'an, China

<sup>2</sup> Key Laboratory of Network Coding Key Technology and Application, Shenzhen, China

<sup>3</sup> Shenzhen Research Institute, The Chinese University of Hong Kong

a) hxgu@xidian.edu.cn

**Abstract:** Multicast communication has become increasingly common and indispensable in Network-on-Chip (NoC) systems. The router is the key unit of an NoC, but few router implementations support multicast communication directly. In this letter, we design a tree-based multicast router using differentiated subnetworks. We also propose a new deadlock-free routing algorithm to overcome the complex deadlock problem that arises when an NoC carries multicast traffic. In addition, an RTL-level verification platform is established and synthesized for a Xilinx XC5VLX110T chip. The synthesis reports show that the multicast router consumes 21.45% less of the chip's storage resources and 3.82% less of the logic resources than a unicast router with the same configuration parameters. The maximum operating frequency is up to 158.203 MHz.

**Keywords:** multicast, deadlock-free, router, Network-on-Chip, FPGA

**Classification:** Integrated circuits

## References

- [1] J. Duato, S. Yalamanchili and L. Ni: Interconnection Networks: An Engineering Approach (Morgan Kaufmann Publishers, 2003).
- [2] N. E. Jerger, L. S. Peh and M. H. Lipasti: Proc. 35th International Symp. Computer Architecture (2008) 229.
- [3] A. Al-Dubai and I. Romdhani: Proc. Conf. PARELEC (2006) 245.
- [4] M. Daneshtalab, M. Ebrahimi, S. Mohammadi and A. Afzali-Kusha: IET, Computers and Digital Techniques 3 (2009) 430.
- [5] F. A. Samman, T. Hollstein and M. Glesner: IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 18 (2010) 1067.
- [6] W. Hu, H. Liu and B. Zhang: IEICE Electron. Express 8 [20] (2011) 1743.
- [7] J. H. Bahn, S. E. Lee and N. Bagherzadeh: Proc. 4th International Conf. Information Technology (2007) 1033.





## 1 Introduction

Multicast is one of the most important communication styles in an on-chip system [1]. For example, in a distributed shared-memory (DSM) system, the data in the shared cache should be consistent with the data in the private cache of each IP core. When the data in the shared cache is changed, a cache consistency request should be sent to the corresponding private caches. In such a system, multicast communication accounts for 12.4% of the traffic [2]. The simplest way to support multicast is to send multiple unicast copies of the message, one to each destination. Although this requires no modification to the traditional unicast NoC, it results in high bandwidth requirements. Supporting multicast communication at the hardware level is therefore essential.

To implement a multicast NoC, the works in [3, 4] adopt a path-based multicast mechanism. With this method, the router must be able to sort destinations, which leads to a large overhead in terms of hardware and delay. A tree-based multicast method is therefore more suitable. One of the most popular schemes is Virtual Circuit Tree Multicasting [2]. However, it has to maintain a large virtual circuit tree table and needs to send a setup packet to build a tree before data transmission, which incurs a large storage overhead and additional multicast latency. There are many other tree-based multicast schemes that are combined with wormhole switching [5, 6]. As far as we know, however, none of them has completely solved the complex multicast deadlock problem that comes with this mechanism. In this letter, we propose a deadlock-free multicast routing algorithm that employs a tree-based scheme with wormhole switching, which requires fewer hardware resources and achieves lower multicast latency.

## 2 Routing scheme

#### 2.1 Deadlock in tree-based multicast wormhole switching

When combined with wormhole switching, tree-based multicast routing suffers from a new kind of deadlock, the multicast deadlock. Fig. 1 (a) shows an example of a multicast deadlock between multicast packet A and multicast packet B. The multicast deadlock arises when the following two necessary and sufficient conditions are satisfied: 1) two or more multicast packets need to share some network resources during their transmissions in two or more transmission directions; 2) each packet occupies part of the shared resources until the end of its transmission. If either condition is not satisfied, the multicast deadlock will not occur.

#### 2.2 A novel deadlock-free multicast routing algorithm

To begin with, we assume that the topology of the network is a 2D mesh of size $m \times n$, where m stands for the maximum column number and n stands for the maximum row number. Each node has a label l, whose assignment function can be expressed in terms of the node's x and y coordinates as:

$$l(x,y) = \begin{cases} yn+x & \text{if y is even} \\ yn+n-x-1 & \text{if y is odd} \end{cases}$$
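The labelling function can be illustrated with a short sketch. It is only an illustration under one assumption: the `n` that appears in the formula is treated as the row width (the number of nodes per row), which is what makes the labels unique and consecutive along a snake-like path. For a square mesh this distinction does not matter. The function name `label` is introduced here for illustration.

```python
def label(x: int, y: int, n: int) -> int:
    """Snake-like label of node (x, y); n is assumed to be the row width."""
    if y % 2 == 0:
        return y * n + x          # even rows are numbered left to right
    return y * n + n - x - 1      # odd rows are numbered right to left


# Labels of a 4 x 4 mesh, listed row by row (row 0 first).
n = 4
grid = [[label(x, y, n) for x in range(n)] for y in range(n)]
for row in grid:
    print(row)
```

Consecutive labels always belong to physically adjacent nodes, so the labels trace a Hamiltonian path through the mesh.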







**Fig. 1.** (a) Multicast deadlock situation; (b) Multicast NoC system; (c) The base routing algorithm

To solve the above-mentioned deadlock problem, we propose a novel routing algorithm using a differentiated subnetwork strategy. The labels effectively divide the routers into two subnetworks, the up-subnetwork and the down-subnetwork, which are distinguished by links of different colors in Fig. 1 (b). Each router actually consists of two sub-routers: an up sub-router and a down sub-router. All the up sub-routers constitute the up-subnetwork, in which an up sub-router can only forward a packet to up sub-routers whose labels are larger than its own. The situation is reversed in the down-subnetwork. When a packet is injected into the network, it chooses a suitable subnetwork and then follows a base routing algorithm, which varies with the subnetwork and the row in which the packet is located. The differences between these variants are minor. As an example, the pseudo-code for a sub-router located in an even row of the up-subnetwork is shown in Fig. 1 (c).
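The label-based split can be sketched as follows. This is a hedged illustration, not the base routing of Fig. 1 (c), which is not reproduced in the text. It assumes that a packet uses the up-subnetwork for destinations whose labels are larger than the source label and the down-subnetwork for the others, and that `n` is the row width. The helper names `split_destinations` and `up_neighbors` are hypothetical.

```python
def label(x, y, n):
    """Snake-like node label; n is assumed to be the row width."""
    return y * n + x if y % 2 == 0 else y * n + n - x - 1


def split_destinations(src, dests, n):
    """Partition destinations by label relative to the source (assumed rule)."""
    s = label(src[0], src[1], n)
    up = [d for d in dests if label(d[0], d[1], n) > s]
    down = [d for d in dests if label(d[0], d[1], n) < s]
    return up, down


def up_neighbors(x, y, n, rows):
    """Mesh neighbours an up sub-router may forward to: larger labels only."""
    here = label(x, y, n)
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(cx, cy) for cx, cy in candidates
            if 0 <= cx < n and 0 <= cy < rows and label(cx, cy, n) > here]


# Source (1, 1) in a 4 x 4 mesh has label 6.
up, down = split_destinations((1, 1), [(0, 0), (3, 1), (0, 1), (2, 2)], 4)
print(up, down)
print(up_neighbors(1, 1, 4, 4))
```

Because every hop in the up-subnetwork strictly increases the label, a packet can never return to a node it has already visited there; the down-subnetwork mirrors this with decreasing labels.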

Although the differentiated subnetwork method is commonly used in path-based routing algorithms, there it only aims at solving the deadlock caused by irregular transmission paths. We use this method to solve the more complicated deadlock problem in tree-based multicast communication. Because the network resources are divided into two subsets, multicast packets in each subnetwork share fewer network resources in their transmission directions, so the possibility of deadlock is decreased. Moreover, when a deadlock is about to form, the related multicast packets become aware of it and turn to the only transmission path in the horizontal direction. Thus, the conditions for multicast deadlock are not satisfied, and the multicast routing algorithm is deadlock-free. In addition, when a multicast packet arrives, there are several possible transmission paths for it to choose from. It can choose a shortest path or another path according to the traffic condition. Therefore, the algorithm has a certain degree of adaptivity.





## 3 Router architecture

The router is the key unit of the NoC; it is responsible for implementing the functions and protocols of the on-chip network. As mentioned in Section 2, the router can be further divided into two sub-routers. This hierarchical architecture leads to lower complexity and less resource consumption. Fig. 2 shows the block diagram of our modularized sub-router architecture. It is composed of three Input modules, one Switch Allocator (SA), one Crossbar, and four Output modules.

Our multicast router is pipelined at the flit level. A multicast packet consists of k head flits followed by several body flits and one tail flit, where k is the number of destinations. The first 2 bits of a flit identify the flit type, and the next 2 bits give the ID (VCID) of the Virtual Channel (VC) in which the flit will be stored when it enters an input port. A head flit also contains a 4-bit source address field and a 4-bit destination address field.
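The flit layout can be made concrete with a small packing sketch. Several details are assumptions rather than facts from the design: the fields are placed from the most significant bit of a 32-bit flit, the address fields directly follow the VCID, and the type codes `HEAD`, `BODY`, and `TAIL` are hypothetical encodings.

```python
HEAD, BODY, TAIL = 0b00, 0b01, 0b10   # hypothetical flit type codes


def pack_head(vcid: int, src: int, dst: int) -> int:
    """Build a 32-bit head flit: type[31:30], vcid[29:28], src[27:24], dst[23:20]."""
    if not (0 <= vcid < 4 and 0 <= src < 16 and 0 <= dst < 16):
        raise ValueError("field out of range")
    return (HEAD << 30) | (vcid << 28) | (src << 24) | (dst << 20)


def unpack(flit: int) -> dict:
    """Decode the type and VCID of any flit, plus addresses for a head flit."""
    fields = {"type": (flit >> 30) & 0x3, "vcid": (flit >> 28) & 0x3}
    if fields["type"] == HEAD:
        fields["src"] = (flit >> 24) & 0xF
        fields["dst"] = (flit >> 20) & 0xF
    return fields


def head_flits(vcid: int, src: int, dests: list) -> list:
    """One head flit per destination, so k destinations give k head flits."""
    return [pack_head(vcid, src, d) for d in dests]


flit = pack_head(vcid=2, src=5, dst=9)
print(hex(flit), unpack(flit))
```

The 4-bit address fields can name at most 16 nodes, which matches a mesh of up to 16 routers such as a 4 x 4 network.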

When a head flit is injected into the router from the local Input port, the routing unit judges whether this flit belongs to this subnetwork. If not, the flit is discarded. If so, it is stored in the corresponding buffer (FIFO) according to its VCID. At the same time, the output direction to which the head flit will be forwarded is stored in an Output Port Table (OPT), and the routing unit sends a switching request to the SA. Once the request is granted, the head flit is forwarded to the output port through the Crossbar. In the output port, if there is an available VC in the input port of the next node, a new VCID is assigned to this head flit. The output data and the VC status of the next node are updated simultaneously. If no VC is available, the SA will not grant the request to this output port in the first place. There will be multiple head flits if the packet has multiple destinations. All those head flits are transferred separately and independently, and each of them leaves an output direction record in the input port and an input direction record in the output port. In this way, a tree with multiple branches is established.

**Fig. 2.** Sub-router architecture

Body flits follow the paths that are established by the head flits. A body flit is identified when it enters the router; it is then stored in the FIFO, and the output ports to which it will be forwarded are looked up. A switching request is sent to the SA as well. If the FIFO of the next node's input port is full, this SA request is masked. As a body flit may have multiple output directions, it can happen that only some of those directions are granted. In this case, the body flit is forwarded to the granted directions, and the directions that are not granted participate in the next round of arbitration. The data buffered in the FIFO is released only after all the directions have received a copy of this body flit. In the output port, the data is updated according to the record of input direction and VCID. A tail flit is transferred in the same way as a body flit. The only difference is that it releases all the direction records in the input port and output port.
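The partial-grant behaviour for a body flit can be sketched as a small behavioural model. It is a simplification and not the RTL: arbitration is reduced to a given list of granted directions per round, a masked request is modelled by the direction simply not appearing in that round, and the function name `forward_body_flit` is hypothetical.

```python
def forward_body_flit(required, grants_per_round):
    """Model one buffered body flit.

    required: output directions recorded for this packet.
    grants_per_round: for each arbitration round, the directions granted by the SA.
    Returns (released, sent), where sent lists the copies made in each round.
    """
    pending = set(required)
    sent = []
    for granted in grants_per_round:
        served = pending & set(granted)   # only still-pending directions get a copy
        sent.append(sorted(served))
        pending -= served                 # ungranted directions retry next round
        if not pending:
            break
    released = not pending                # FIFO slot is freed only when all are served
    return released, sent


released, sent = forward_body_flit({"E", "N", "L"}, [{"E"}, {"N", "W"}, {"L", "E"}])
print(released, sent)
```

In this run the flit needs three rounds, and the buffer entry is held until the last pending direction has received its copy.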

## 4 Result

In this section, we present a cycle-accurate simulation model of the multicast router to demonstrate the correctness of our design. The model was written in Verilog HDL, and the simulations were conducted on the ModelSim simulation platform. In our router configuration, every input port has 4 virtual channels, each virtual channel has a FIFO with a depth of 16, and the data width is 32 bits.
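As a quick check of the buffering implied by this configuration, and assuming that each FIFO entry holds one 32-bit flit, the buffer storage per input port is (simple arithmetic on the stated parameters, not a figure taken from the synthesis report):

$$4 \ \text{VCs} \times 16 \ \text{entries} \times 32 \ \text{bits} = 2048 \ \text{bits}$$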

A set of test vectors is used to verify the function of the multicast router. Fig. 3 (a) shows a case in which the router works under the worst-case operating condition. In this case, five packets are injected into the router from the five input ports at the same time, and each of them is sent to two or more different outputs. In this situation, the packets conflict heavily at every output port. The waveform shows that all of them are output correctly.

We also synthesize the router using the Xilinx ISE integrated development tools. Fig. 3 (b) gives some of the synthesis information on resource consumption and maximum frequency. To highlight the advantages of this design, we compare it with a unicast router that has the same configuration parameters. It can be seen that our multicast router uses fewer LUTs and registers and reaches a higher maximum frequency. This is mainly due to the idea of differentiated subnetworks, which simplifies the structure of the Crossbar and the SA. Moreover, because a new VCID is assigned in the output port, there is no need for the dedicated virtual channel allocator of a classic virtual channel router, which consumes a large amount of logic and storage resources. Thus, the critical path delay is reduced, which leads to a higher maximum frequency.

**Fig. 3.** (a) Simulation waveform of the worst-case test (signals: clk_sys, rst_n, data_in_e/w/s/n/l, data_out_e/w/s/n, and data_out_le/lw/ls/ln, with head, body, and tail flits marked); (b) Synthesis results

|                         | Unicast Router | Multicast Router |
|-------------------------|----------------|------------------|
| Used Registers          | 4125           | 3866             |
| Used LUTs               | 7754           | 5060             |
| Maximum Frequency (MHz) | 118.901        | 201.45           |

Target Device: XC5VLX110T-1ff1136

## 5 Conclusion

In this letter, we analyze the causes of the multicast deadlock and propose a new multicast routing algorithm. The algorithm uses the idea of differentiated subnetworks, in which the traffic is separated into an up direction and a down direction. In this way, our algorithm is deadlock-free and adaptive, and our router requires fewer hardware resources and lower multicast latency than state-of-the-art schemes.

In addition, simulations are conducted to verify the correctness of the multicast router's function, and an RTL-level model is established. The synthesis reports show that the router occupies fewer resources and has a higher maximum frequency than a unicast router with the same configuration parameters.

## Acknowledgments

This work is supported by the National Science Foundation of China under Grants No. 61070046 and 61334003, Shenzhen Research Funding No. JCYJ20130401171935815, the Fundamental Research Funds for the Central Universities under Grant No. K5051301003, and the 111 Project under Grant No. B08038.

