# **LETTER** Study-Based Error Recovery Scheme for Networks-on-Chip

Depeng JIN<sup>†</sup>, Member, Shijun LIN<sup>††a)</sup>, Nonmember, Li SU<sup>†</sup>, Member, and Lieguang ZENG<sup>†</sup>, Nonmember

**SUMMARY** Motivated by the different error characteristics of each path, we propose a study-based error recovery scheme for Networks-on-Chip (NoC). In this scheme, two study processes are first executed in turn to obtain the characteristics of the errors in every link; then, according to the study results and a selection rule that we derive, the scheme selects the better error recovery scheme for every path. Simulation results show that, compared with the traditional simple retransmission scheme and the hybrid single-error-correction, multi-error-retransmission scheme, the proposed scheme greatly improves the throughput and cuts down the energy consumption with little area increase.

key words: network-on-chip, system-on-chip, error recovery

### 1. Introduction

As process scaling continues, the power supply voltage and device $V_t$ decrease and signal integrity problems become serious, so wires and devices become more and more unreliable and random errors may occur [1]. Error resiliency is therefore necessary for Networks-on-Chip (NoC), but it should not incur too much energy consumption or area [1]–[4]. Some simple error recovery schemes, such as the simple retransmission (SR) scheme and the hybrid single-error-correction, multi-error-retransmission scheme (hereinafter referred to as the hybrid scheme), are proposed and analyzed in [1]. However, different paths of an on-chip network may have different error characteristics (such as the flit-error rate and the proportion of flits with multi-error) because 1) different paths may have different lengths; 2) the environments through which different paths pass may not be the same; for example, crosstalk and other noise may differ in each part of a chip. It is therefore difficult to achieve high communication and energy performance if the same error recovery scheme is adopted for all paths. Motivated by this, we propose a study-based error recovery scheme, which selects, in the network interface (NI), the better of the SR scheme and the hybrid scheme for every path according to its error characteristics obtained from study and a selection rule that we derive.

Manuscript received April 15, 2009.

Manuscript revised June 15, 2009.

a) E-mail: linsj05@mails.tsinghua.edu.cn

DOI: 10.1587/transinf.E92.D.2272

### 2. Study-Based Error Recovery Scheme

#### 2.1 Basic Definitions

In our scheme, we use static routing, that is, the routing paths are fixed. The reasons are as follows: 1) a static routing mechanism is much simpler than a dynamic one and is thus easier to implement and lower in cost, which is important in a resource-limited NoC; 2) since the traffic of an NoC is predictable, the on-chip network can reach relatively high performance when routing paths are reasonably allocated. We also assume that the path from IP (intellectual property) A to IP B is the same as the path from IP B to IP A (across the same routers and links), which is easy to implement with static routing. The error characteristics of the path from IP A to IP B and of the path from IP B to IP A are then nearly the same. As in many other NoCs, wormhole switching and fixed-length packets are used (although our scheme is not limited to them).

When a receiver receives a flit, it sends a nack or an ack signal back to the sender in a null packet with a special packet type, depending on whether or not the flit needs to be retransmitted. The sender distinguishes these signals by their arrival order; for example, the first ack denotes that the first flit has arrived at the receiver and does not need to be retransmitted. To reduce the retransmission cost, the retransmission flits are placed, in order, in the first several flits of a general packet, and the receiver identifies them by their positions in the packets and the arrival order of the packets. The packet header therefore includes five parts: 1) packet type; 2) the type of error recovery scheme used in the packet body; 3) source IP address; 4) destination IP address; 5) the number of retransmission flits in the packet body. Routing information is stored in the routing tables, and routers forward packets according to their routing tables and the source and destination IP addresses in the packet header. To protect the packet header against errors, the (n, 1) repetition code is used, where n is called "the repetition factor". In the packet body, the error detection code or error correction code is applied at flit level, that is, every flit is a code in the SR scheme or the hybrid scheme.
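The header protection described above can be illustrated with a minimal sketch of an (n, 1) repetition code. It assumes majority-vote decoding and an odd n, which is the usual way to decode a repetition code but is not spelled out in the text; the function names are hypothetical.

```python
def repeat_encode(bits, n):
    """Encode with an (n, 1) repetition code: every bit is sent n times."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n should be odd so that majority voting cannot tie")
    return [b for b in bits for _ in range(n)]


def repeat_decode(coded, n):
    """Majority-vote decoding (assumed decoder, not taken from the paper).

    Returns the decoded bits and the largest number of flipped copies
    seen in any single group of n copies.
    """
    decoded = []
    max_flips = 0
    for i in range(0, len(coded), n):
        group = coded[i:i + n]
        ones = sum(group)
        bit = 1 if ones > n // 2 else 0
        flips = n - ones if bit else ones
        decoded.append(bit)
        max_flips = max(max_flips, flips)
    return decoded, max_flips


header = [1, 0, 1, 1]
coded = repeat_encode(header, 5)
coded[0] ^= 1  # inject two bit errors into the copies of the first header bit
coded[1] ^= 1
decoded, max_flips = repeat_decode(coded, 5)
print(decoded == header, max_flips)
```

With n = 5, up to two corrupted copies per header bit are outvoted by the three correct ones; a third error in the same group would flip the decoded bit.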

#### 2.2 Process of the Study-Based Error Recovery Scheme

<sup>†</sup>The authors are with the Department of Electronic Engineering, Tsinghua University, Beijing 100084, China.

<sup>††</sup>The author is the corresponding author and is with the Department of Electronic Engineering, Tsinghua University, Beijing 100084, China.

After the chip is produced and before the on-chip network works normally, study processes 1 and 2 are executed one after the other. If the "study" signal (connected to a special I/O) is active and the "process-selection" signal (connected to a special I/O) is "0", study process 1 is executed; if the "study" signal is active and the "process-selection" signal is "1", study process 2 is executed.

In study process 1, every NI repeatedly sends multicast testing packets (packets with no data whose packet type is "00"; the repetition factor is set to more than 7 to avoid any possible errors in the links) to decide the least repetition factor (LRF). When a router receives a multicast testing packet, it checks the errors in the packet header and decides the LRF needed to avoid them. If the LRF is more than 3, for example 5, the router sends this information in a multicast report packet (packet type "11") to all the NIs and the other routers (the repetition code is also used to avoid any possible errors). The router then corrects the errors in the packet header and forwards the packet to all the output ports. At the same time, all the routers and NIs collect the report packets and set the maximum LRF as their final repetition factor. All packets after this process are packetized according to the final repetition factor.

In study process 2, every NI repeatedly sends known packets to the NIs with which it communicates. Every NI thus receives many packets from those NIs, and it compares the received packets with the original (known) packet to obtain the flit-error rate and the proportion of flits with multi-error in every path. Based on this information and the selection rule developed in Sect. 2.3, the NI selects the better of the SR scheme and the hybrid scheme for every path.
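The measurement made in study process 2 can be sketched as follows. This is an illustrative sketch only: it assumes flits are held as integers, that the receiver knows the expected flit for every received one, and that "multi-error" means two or more wrong bits in a flit; `estimate_p_q` is a hypothetical name.

```python
def estimate_p_q(received_flits, expected_flits):
    """Estimate the flit-error rate p and the multi-error proportion q of a path.

    p = errored flits / all flits
    q = flits with two or more wrong bits / errored flits
    """
    if len(received_flits) != len(expected_flits):
        raise ValueError("need one expected flit per received flit")
    errored = 0
    multi = 0
    for got, want in zip(received_flits, expected_flits):
        wrong_bits = bin(got ^ want).count("1")  # Hamming distance
        if wrong_bits >= 1:
            errored += 1
        if wrong_bits >= 2:
            multi += 1
    total = len(received_flits)
    p = errored / total if total else 0.0
    q = multi / errored if errored else 0.0
    return p, q


known = [0b1010] * 4
received = [0b1010, 0b1011, 0b0101, 0b1010]  # one single-bit and one multi-bit error
p, q = estimate_p_q(received, known)
print(p, q)
```

In practice many more flits than in this toy run would be needed before p and q are stable enough to drive the selection.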

#### 2.3 The Error Recovery Scheme Selection Rule

The main factors that affect the performance of error recovery schemes are the bits used for error recovery, which occupy one part of the network bandwidth and consume additional energy. When transporting the same information bits successfully, the less the total number of bits being transported is, the better the error recovery scheme is. Therefore, to select a better error recovery scheme, it is necessary to calculate the total number of bits being transported. Before the calculation, the following definitions are needed.

Definition 1: code length N is the number of bits in a code (a flit here).

Definition 2: coding efficiency k is defined as the ratio of the number of information bits in a code, L, to N. Specifically, $k_d$ and $L_d$ are respectively the coding efficiency of the error detection code and the number of information bits in a code in the simple retransmission scheme; $k_h$ and $L_h$ are respectively the total coding efficiency and the number of information bits in a code in the hybrid scheme. Obviously, $L_d = N \cdot k_d$ and $L_h = N \cdot k_h$.

Definition 3: code-error rate p (flit-error rate here) is the error probability of a data flit when transported through the on-chip network. In the simple retransmission scheme, it equals the probability of retransmission  $P_{sr}$ .

Definition 4: q is the proportion of flits with multi-error in the case of flit error. Then, in hybrid scheme, the probability of retransmission  $P_{hr}$  equals  $p \cdot q$ .

Definition 5: T is the total number of information bits in the source node; H and R are respectively the number of bits in the header of a packet and in a nack or ack packet; M is the number of flits in packet body.

Then, when a flit is transported to the destination, the average number of bits transported is $\frac{H}{M} + N + R$, where $\frac{H}{M}$ is the average number of bits in the packet header for the flit. Thus, in the simple retransmission scheme, in order to transport a flit successfully, the expectation of the total number of bits being transported, $E[T_{s\_c}]$, can be calculated by the following equation:

$$E[T_{s\_c}] = \left(\frac{H}{M} + N + R\right) + \left(\frac{H}{M} + N + R\right) \cdot P_{sr} + \left(\frac{H}{M} + N + R\right) \cdot P_{sr}^{2} + \dots = \frac{\left(\frac{H}{M} + N + R\right)}{1 - p} \quad (1)$$
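The closed form in Eq. (1) is the sum of a geometric series. As a short worked step, assuming that every transmission of the flit fails independently with the same probability $P_{sr} = p < 1$:

$$\sum_{i=0}^{\infty} P_{sr}^{i} = \frac{1}{1 - P_{sr}} = \frac{1}{1 - p}$$

Every term of the series costs the same $\frac{H}{M} + N + R$ bits, which gives the factor $\frac{1}{1 - p}$ in Eq. (1). For example, $p = 0.05$ corresponds to about 1.053 transmissions per delivered flit on average.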

Let  $T_s$  be the total number of bits being transported when all information bits in the source node have been successfully transported in simple retransmission scheme. Then, the expectation of  $T_s$ ,  $E[T_s]$ , can be calculated by the following equation:

$$E[T_s] = \frac{T}{L_d} \cdot E[T_{s\_c}] = \frac{T}{N \cdot k_d} \cdot \frac{\left(\frac{H}{M} + N + R\right)}{1 - p} \quad (2)$$
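Eq. (2) and its hybrid counterpart differ only in the coding efficiency and the retransmission probability, so both can be evaluated with one small function. The sketch below is illustrative; the function name and all numeric values are hypothetical and chosen only to exercise the formula.

```python
def expected_total_bits(T, N, k, H, M, R, p_retx):
    """Expected number of bits transported to deliver T information bits.

    T: information bits at the source      N: code (flit) length in bits
    k: coding efficiency (k_d or k_h)      H: header bits per packet
    M: flits per packet body               R: bits in an ack/nack packet
    p_retx: retransmission probability (p for SR, p*q for hybrid)
    """
    if not 0 <= p_retx < 1:
        raise ValueError("p_retx must be in [0, 1)")
    flits_needed = T / (N * k)        # T / L, with L = N * k
    bits_per_attempt = H / M + N + R  # header share + flit + ack/nack
    return flits_needed * bits_per_attempt / (1 - p_retx)


# Hypothetical parameters, not taken from the paper.
T, N, H, M, R = 10_000, 64, 16, 8, 8
k_d, k_h = 0.9, 0.8
p, q = 0.05, 0.06
e_sr = expected_total_bits(T, N, k_d, H, M, R, p)
e_hybrid = expected_total_bits(T, N, k_h, H, M, R, p * q)
print(f"E[T_s] = {e_sr:.0f} bits, E[T_h] = {e_hybrid:.0f} bits")
```

The terms T, N, H, M and R are common to both schemes, so only k and the retransmission probability decide which expectation is smaller.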

Similarly, let $T_h$ be the total number of bits being transported when all information bits in the source node have been successfully transported in the hybrid scheme. Since the retransmission probability and coding efficiency in the hybrid scheme are respectively $p \cdot q$ and $k_h$, the expectation of $T_h$, $E[T_h]$, equals $\frac{T}{N \cdot k_h} \cdot \frac{\left(\frac{H}{M} + N + R\right)}{1 - p \cdot q}$. When $E[T_s] < E[T_h]$, that is, $p < \frac{k_d - k_h}{k_d - k_h \cdot q}$, the simple retransmission scheme is the better choice; otherwise, the hybrid scheme is the better choice. $k_d$ and $k_h$ are determined by the error detection code and error correction code used in the SR and hybrid schemes; p and q are obtained from study process 2.
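The selection rule reduces to a single threshold test per path. The sketch below shows one way an NI could apply it; the function names, the destination labels and the numeric values are hypothetical, and it assumes $k_d > k_h$ and $0 \le q \le 1$ so that the denominator is positive.

```python
def sr_threshold(k_d, k_h, q):
    """Flit-error rate below which simple retransmission transports fewer bits."""
    return (k_d - k_h) / (k_d - k_h * q)


def select_scheme(p, q, k_d, k_h):
    """Return 'SR' when p is below the threshold, otherwise 'hybrid'."""
    return "SR" if p < sr_threshold(k_d, k_h, q) else "hybrid"


# Hypothetical study results per destination: (p, q) measured for that path.
study_results = {"IP3": (0.03, 0.05), "IP9": (0.20, 0.05)}
k_d, k_h = 0.9, 0.8  # hypothetical coding efficiencies

scheme_table = {
    dest: select_scheme(p, q, k_d, k_h) for dest, (p, q) in study_results.items()
}
print(scheme_table)
```

Because the threshold depends on q, two paths with the same flit-error rate can still end up with different schemes.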

### 3. NI and Router Design in Study-Based Error Recovery Scheme

Figure 1 shows the NI and router architecture that supports the study-based error recovery scheme. In the NI, a study-based arbiter carries out the study processes and stores the study results. Based on the study results, it tells the packetizer and depacketizer the repetition factor of the packet header, and selects the SR scheme or the hybrid scheme for every packet from the IP according to its destination IP address, the study results and the error recovery scheme selection rule. In the sender of the NI, when the SR scheme is selected, the error correction code encoder is disabled and the data flits from the error detection code encoder are sent directly to the packetizer; when the hybrid scheme is selected, the error correction code encoder is enabled and the data flits from the retransmission buffer go first through the error detection code encoder and then through the error correction code encoder before being packetized. In the receiver of the NI, the depacketizer depacketizes the packets from the router and sends them to the study-based arbiter, the error correction code decoder or the error detection code decoder according to the packet type and the type of error recovery scheme given in the packet header.

Fig. 1 NI and router design in study-based error recovery scheme.

In every input port of the router, a packet pre-processor is used to support the study-based scheme. In study process 1, when a packet pre-processor receives a testing packet, it corrects the errors in the packet header, reports the LRF by a multicast report packet when necessary, and sends the packet to all the output ports. It also collects the report packets from other routers: it stores the maximum LRF, corrects the errors in the packet header, and sends the report packet to all the output ports. In the other processes, when a packet pre-processor receives a packet, it corrects the errors in the packet header and sends the packet to the wormhole router module for forwarding.
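The sender-side data path of the NI described in this section can be sketched in a few lines. This is an illustration of the control flow only: the two encoders are toy stand-ins (a parity bit and a bit-tripling code) rather than the codes used in the paper, and all names are hypothetical.

```python
def parity_edc_encode(bits):
    """Toy stand-in for the error detection encoder: append an even-parity bit."""
    return bits + [sum(bits) % 2]


def triple_ecc_encode(bits):
    """Toy stand-in for the error correction encoder: repeat every bit 3 times."""
    return [b for b in bits for _ in range(3)]


def make_sender(scheme_by_dest, edc_encode, ecc_encode):
    """Build the encode step of an NI sender from a per-destination scheme table."""
    def encode_for(dest, data_bits):
        scheme = scheme_by_dest[dest]
        flit = edc_encode(list(data_bits))  # always add error detection
        if scheme == "hybrid":
            flit = ecc_encode(flit)         # correction encoder enabled only here
        return scheme, flit
    return encode_for


encode_for = make_sender({3: "SR", 7: "hybrid"}, parity_edc_encode, triple_ecc_encode)
print(encode_for(3, [1, 0, 1]))
print(encode_for(7, [1, 0, 1]))
```

The scheme returned alongside the flit corresponds to the header field that tells the receiver which decoder to use.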

### 4. Experimental Results and Conclusions

A 4×4 mesh NoC with 16 IPs, 16 NIs and 16 routers is used to study the throughput, energy and area of the SR scheme, the hybrid scheme and the study-based scheme. Here, throughput is defined as the average number of information bits that the on-chip network can handle every cycle per IP. We assume 8-flit packets, a 63-bit flit size, a CRC-4 error detection code and a (63, 57) Hamming error correction code. Then, in the SR scheme, the number of information bits in a code (flit) is 59; in the hybrid scheme, it is 53. We assume 3 virtual channels in the router, uniform Poisson traffic and a traffic load of 20 bits/IP/cycle. We assume that p in every link varies randomly from 2% to 6%, and q in every link varies randomly from 4% to 8%. We use the energy model proposed in [5] to estimate the energy, and we estimate the area based on the FPGA EP2S180F1508C5. The comparison results of the SR scheme, the hybrid scheme and the study-based scheme are shown in Table 1.

Table 1 Comparison results.

|                                    | SR scheme | Hybrid scheme | Study-based scheme |
|------------------------------------|-----------|---------------|--------------------|
| Throughput (bits/IP/cycle)         | 13.4      | 14.7          | 18.2               |
| Energy (J)                         | 1.51      | 1.34          | 0.98               |
| Area of a router and an NI (ALUTs) | 6954      | 8336          | 8547               |

From Table 1, we can see that the study-based scheme improves the throughput of the NoC (18.2 bits/IP/cycle versus 13.4 and 14.7) and cuts down the energy consumption (0.98 J versus 1.51 J and 1.34 J) with little area increase over the hybrid scheme (8547 versus 8336 ALUTs). The reason is that, based on the information obtained from the study processes and the selection rule developed in Sect. 2.3, the NI selects the error recovery scheme that needs fewer redundant bits for error recovery. Since redundant bits not only increase the communication energy consumption but also occupy part of the network bandwidth, the proposed scheme can improve the throughput and reduce the energy consumption.
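For the codes assumed in this section, the coding efficiencies and the threshold of the selection rule in Sect. 2.3 can be worked out directly. This is a sketch that uses only the stated 63-bit flit with 59 (SR) and 53 (hybrid) information bits; the rounding is ours.

$$k_d = \frac{59}{63} \approx 0.937, \quad k_h = \frac{53}{63} \approx 0.841, \quad \frac{k_d - k_h}{k_d - k_h \cdot q} = \frac{6}{59 - 53 \cdot q}$$

For q between 4% and 8%, this threshold lies between about 0.105 and 0.110. The rule compares it with the flit-error rate of a whole path as measured in study process 2, which may differ from the per-link values quoted above when a path crosses several links.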

## Acknowledgements

This work is partly supported by National Natural Science Fund (NNSF-90607009), partly supported by the National High Technology Research and Development Program (No.2008AA01Z107) and partly supported by the National Basic Research Program (No.2007CB310701).

### References

- [1] S. Murali, T. Theocharides, N. Vijaykrishnan, M.J. Irwin, L. Benini, and G. De Micheli, "Analysis of error recovery schemes for networks on chips," IEEE Des. Test Comput., vol.22, no.5, pp.434–442, 2005.
- [2] A. Dutta and N.A. Touba, "Reliable network-on-chip using a low cost unequal error protection code," 22nd IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems, 2007 (DFT '07), pp.3–11, Sept. 2007.
- [3] P. Huang, W. Fang, Y. Wang, and W. Hwang, "Low power and reliable interconnection with self-corrected green coding scheme for network-on-chip," Second ACM/IEEE International Symposium on Networks-on-Chip, 2008 (NoCS 2008), pp.77–83, April 2008.
- [4] J. Wang, H. Zeng, K. Huang, G. Zhang, and Y. Tang, "Zero-efficient buffer design for reliable network-on-chip in tiled chip-multi-processor," Design, Automation and Test in Europe, 2008 (DATE '08), pp.792–795, March 2008.
- [5] J. Hu and R. Marculescu, "Energy- and performance-aware mapping for regular NoC architectures," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol.24, no.4, pp.551–562, 2005.