# A Global Wire Planning Scheme for Network-on-Chip

J. Liu, L-R Zheng, D. Pamunuwa, H. Tenhunen

Laboratory of Electronics & Computer Systems (LECS), Royal Institute of Technology (KTH), Electrum 229, SE-164 40 Kista, Sweden

{jianliu, lrzheng, dinesh, hannu}@imit.kth.se

## Abstract

As technology scales down, the interconnect for on-chip global communication becomes the delay bottleneck. To provide well-controlled global wire delay and efficient global communication, a packet-switched Network-on-Chip (NoC) architecture has been proposed by different authors [1][2]. In this paper, the NoC system parameters constrained by the interconnections are studied. Predictions on scaled system parameters such as clock frequency, resource size, global communication bandwidth and inter-resource delay are made for future technologies. Based on these parameters, a global wire planning scheme is proposed.

## 1. Introduction

Interconnect has been the major design constraint in deep submicron (DSM) circuits. The downscaled wire size and increased aspect ratio, combined with higher signal speed, cause many signal integrity challenges and timing closure problems. Traditionally, these issues are tackled mainly from an electrical design point of view. Recent studies show that the problem can also be addressed with interconnect-centric system architectures [1][2]. One such emerging architecture is the Network-on-Chip (NoC).

The NoC architecture is a packet-switched network on a single chip [1][2]. It scales from a few dozen to several hundred or even thousands of resources. A resource may be a processor core, a DSP core, an FPGA block, a dedicated HW block, or a memory block. All inter-resource information is sent in packets over the network. The structured network wiring gives well-controlled electrical parameters and enables reuse of building blocks. Clearly, any topology that fully connects the resources can be used for the network.
However, a two-dimensional mesh topology turns out to be simple and effective [2][3]. Thus, the following study is based on this specific topology.

The NoC uses a backbone to provide a reliable and efficient communication platform for user-specified resources. The NoC backbone consists of resources and switches organized in a two-dimensional mesh, as shown in **Figure 1**. A data packet from one resource is first passed to the switch attached to the resource. The switch then routes the packet onto the appropriate link.

As the NoC is targeted at future DSM and nanometer technologies, the following questions are of interest: what is the appropriate size of each synchronous resource; how many resources can be integrated on one chip in future technologies; how fast can signals travel from one resource to another through the on-chip communication network; and how should the wires be planned to get an optimal data bandwidth with limited wire resources.

In this paper, we study the NoC system parameters constrained by the interconnections and answer the above questions. In Section 2, we use empirical rules to derive the gate delays for future DSM technologies, followed by an estimation of the maximum clock frequency and the corresponding resource size. In Section 3, the inter-resource delay is studied and a global wire planning scheme providing maximum bandwidth is proposed.

The NoC is a typical interconnect-centric architecture, which means that wire planning is the first design step. In this early planning stage, detailed system parameters for the wires are often unknown, making it impractical to consider layout-related properties such as 3D multilayer interconnections. Therefore, a simpler wire model is used below. When the planning is done and various requirements on the wires, such as delay and noise level, are determined, a dynamic interconnect model can be used to generate a wire structure meeting these requirements in later design phases.
One dynamic interconnect model using 3D capacitance, resistance and inductance is described in [4]. Similar CAD tools like Magma's FixedTiming [www.magma-da.com] are also emerging commercially.

**Figure 1.** The 2D-mesh backbone of the NoC, with switches (S) and resources (R).

## 2. NoC Interconnect Fabric Optimization

The performance of interconnections is a major concern in scaled technologies. Under scaling, the gate delay decreases. However, the global wires do not scale in length since they communicate signals across the chip. For these wires, the delay per unit length can be kept constant if optimal repeaters are used [5]. In NoC, we assume that all global wires are reserved for global communication and that semi-global/local wires are used within a resource.

### 2.1 Technology Scaling and Gate Delay

Since four is the typical average gate connectivity, the "fan-out-of-four inverter delay", or simply FO4, is a reasonable measure of gate delay. As the name suggests, an FO4 is the delay through an inverter driving four identical copies of itself. Ron Ho [5] pointed out that, historically, gates have scaled linearly with technology, and an accurate model of recent FO4 delays has been $360 \cdot L_{gate}$ ps at typical and $500 \cdot L_{gate}$ ps under worst-case environmental conditions, with $L_{gate}$ in µm. After studying today's existing nanometer-scale devices, he also predicts that this trend will continue for future generations of transistors, which means $500 \cdot L_{gate}$ ps is a lower limit for future FO4 delays. This model of gate delay is used later when estimating the clock cycle time and comparing it with wiring delays.

### 2.2 Clock Cycle Analysis

Resources in a NoC can run at different speeds. To study how the clock cycle within a NoC resource scales with the gate delay, we first examine the relationship between clock cycle and FO4 delay. The recent Pentium 4 microarchitecture and the aggressive Compaq/DEC Alpha chips have 14 to 16 FO4s per clock cycle.
Older processors, for example the Pentium Pro/II, run at 20 to 40 FO4s per clock cycle. This shows that the number of FO4s per clock cycle decreases as the technology scales down. Extrapolating historical data would lead to 6-8 FO4s per clock cycle within a few generations [5].

However, such fast-cycling machines pose many difficulties. With 6-8 FO4s per clock cycle, a clock skew of a few FO4s would be extremely hard to manage. Furthermore, generating a clock of 8 FO4s per cycle is difficult since the rise and fall of a clock waveform take more than 2 FO4s to transition fully. With these difficulties in mind, a clock cycle of 20 FO4s is projected for a cost-performance NoC resource and 10 FO4s for a high-performance one. Thus, with 0.05-µm technology, the clock cycle becomes $20 \cdot 500 \cdot 0.05 = 500$ ps for a cost-performance NoC resource, giving a clock frequency of 2 GHz. **Table 1** shows projected clock frequencies for different technologies.

|                  | 0.18-µm | 0.13-µm | 0.10-µm | 0.07-µm | 0.05-µm |
|------------------|---------|---------|---------|---------|---------|
| Cost Perf. (GHz) | 0.56    | 0.77    | 1.0     | 1.4     | 2.0     |
| High Perf. (GHz) | 1.1     | 1.5     | 2.0     | 2.9     | 4.0     |

**Table 1.** Projected clock frequencies for NoC resources under worst-case FO4 delays.

### 2.3 Synchronous NoC Resource Size Estimation

Given the projected clock cycle, the maximum size of a synchronous NoC resource is limited by the wiring delay, since in the worst case the clock signal must be able to traverse 2 resource edges within a clock cycle (assuming the resource is square), see **Figure 2**.

**Figure 2.** The worst-case delay in a NoC resource.

The wiring delay of a distributed RC line can be modeled as:

$$T_{wire} = 0.4rcl^2$$

Here $T_{wire}$ is the wiring delay, $l$ is the wire length, $r$ is the resistance per unit length and $c$ is the capacitance per unit length.
This is a very good approximation and is reported to be accurate to within 4% for a very wide range of $r$ and $c$ [6]. Knowing the clock cycle time and the RC delay model, the maximum resource size satisfies:

$$\mathrm{max\_wiring\_delay} < \mathrm{clock\_cycle}$$

$$\Rightarrow 0.4rc(2L)^2 < \mathrm{clock\_cycle}$$

Here, $L$ is the maximum resource edge length. The clock cycle estimation is described in the previous section, and well-founded predictions of wire resistance and capacitance for future technologies are available in a number of papers.

The RC model given above shows that the wiring delay grows quadratically with wire length. To reduce the delay for semi-global and global wires, a long line can be broken into shorter sections, with a repeater (an inverter) driving each section, see **Figure 3**. This makes the total wire delay equal to the number of repeated sections multiplied by the individual section delay:

$$T_{total} = k \cdot (T_{drv} + 0.4 \cdot rc(l/k)^{2})$$

Now, a first-order model of the driver (repeater), with lumped output resistance and input capacitance, gives the driver delay as:

$$T_{drv} = 0.7 \frac{R}{h} (hC_0 + hC_g + c\frac{l}{k}) + 0.7 r\frac{l}{k} hC_g$$

Here, $R$ is the resistance of a minimum sized inverter, $C_0$ and $C_g$ are the diffusion and gate capacitances of a minimum sized inverter, and $r$ and $c$ are the wire resistance and capacitance per unit length.

**Figure 3.** A long wire with k repeaters, each with a size of h times the minimum sized inverter.

The expression above for the total delay can be minimized, and the minimum delay per unit length can be shown to be $2.13\sqrt{rc \cdot FO1}$ [5][7], which is in ps/mm when $rc$ is expressed in ps/mm² and FO1 in ps. Here, FO1 stands for the fan-out-of-one delay and $1\,FO4 \approx 3\,FO1$. The time for a signal to traverse 2 resource edge lengths should be less than a clock cycle, giving the inequality $4.26 \cdot L \cdot \sqrt{rc \cdot FO1} < 1$ clock cycle.
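
As a rough numerical check of the inequality above, the following Python sketch evaluates the clock cycle and the largest edge length $L$ that satisfies it. It assumes the worst-case model FO4 = 500·$L_{gate}$ ps from Section 2.1, FO1 = FO4/3, and the semi-global $r$ and $c$ values that the paper lists in Table 2. The function and variable names are illustrative and not part of the original work.

```python
import math

# Semi-global wire parameters per technology node (key: gate length in nm),
# taken from Table 2 of the paper: (r in ohm/mm, c in fF/mm).
WIRE_PARAMS = {
    180: (107.0, 331.0),
    130: (185.0, 268.0),
    100: (317.0, 208.0),
    70: (611.0, 170.0),
    50: (1196.0, 155.0),
}


def fo4_ps(node_nm):
    """Worst-case FO4 delay in ps, using the 500 * L_gate (um) model."""
    return 500.0 * node_nm / 1000.0


def clock_cycle_ps(node_nm, fo4_per_cycle):
    """Clock cycle in ps for a given number of FO4 delays per cycle."""
    return fo4_per_cycle * fo4_ps(node_nm)


def max_resource_edge_mm(node_nm, fo4_per_cycle):
    """Largest edge L (mm) with 4.26 * L * sqrt(rc * FO1) < clock cycle."""
    r_ohm_mm, c_ff_mm = WIRE_PARAMS[node_nm]
    rc_ps_per_mm2 = r_ohm_mm * c_ff_mm * 1e-3  # 1 ohm * 1 fF = 1e-3 ps
    fo1_ps = fo4_ps(node_nm) / 3.0  # assumes 1 FO4 ~ 3 FO1
    delay_ps_per_mm = 2.13 * math.sqrt(rc_ps_per_mm2 * fo1_ps)
    # The signal must cross two edges (2 * L) within one clock cycle.
    return clock_cycle_ps(node_nm, fo4_per_cycle) / (2.0 * delay_ps_per_mm)


if __name__ == "__main__":
    for node in sorted(WIRE_PARAMS, reverse=True):
        for label, n_fo4 in (("high perf.", 10), ("cost perf.", 20)):
            cycle = clock_cycle_ps(node, n_fo4)
            edge = max_resource_edge_mm(node, n_fo4)
            print(f"{node} nm, {label}: {1e3 / cycle:.2f} GHz, "
                  f"L < {edge:.1f} mm")
```

Under these assumptions the sketch gives an edge length of about 1.5 mm at the 0.05-µm node and about 6.5 mm at the 0.18-µm node for a 10-FO4 cycle, in line with the values the paper reports, and doubling the cycle to 20 FO4s doubles the edge length.
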
Using the predicted future semi-global wire parameters provided in [7], as shown in **Table 2**, the maximum synchronous resource size and the number of resources on a single chip are calculated and listed in **Table 3**.

| Wire Type   | Parameter  | 0.18-µm | 0.13-µm | 0.10-µm | 0.07-µm | 0.05-µm |
|-------------|------------|---------|---------|---------|---------|---------|
| Semi-global | r (ohm/mm) | 107     | 185     | 317     | 611     | 1196    |
| Semi-global | c (fF/mm)  | 331     | 268     | 208     | 170     | 155     |

**Table 2.** Wire parameters for different technologies.

|                  | Technology             | 0.18-µm | 0.13-µm | 0.10-µm | 0.07-µm | 0.05-µm |
|------------------|------------------------|---------|---------|---------|---------|---------|
|                  | Chip Size (mm)         | 20      | 21      | 23      | 25      | 28      |
| High Performance | Max Resource Size (mm) | 6.5     | 4.7     | 3.5     | 2.4     | 1.5     |
| High Performance | Nr of Resources        | 9       | 20      | 42      | 112     | 350     |
| Cost Performance | Max Resource Size (mm) | 13      | 9.3     | 7.1     | 4.7     | 3.0     |
| Cost Performance | Nr of Resources        | 2       | 5       | 10      | 28      | 87      |

**Table 3.** Maximum resource size and number of resources on a single chip, with different technologies.

The resistance and capacitance used to calculate **Table 3** are for semi-global wires, since semi-global wires are normally used within a resource. Routing with global wires within a resource would allow a larger resource size, since global wires generally have lower resistance and therefore a smaller delay per unit length than semi-global wires. From the table, the maximum size of a synchronous high-performance resource is 1.5 mm in 0.05-µm technology. For a cost-performance resource with a cycle time of 20 FO4s, twice as long as the high-performance cycle time, the maximum resource size is also twice as large.

Note that the analysis above is valid for single wires. Crosstalk effects are not taken into consideration.
If many wires are in parallel and switch simultaneously, the delay will be higher for unfavorable switching patterns, requiring a smaller resource size. Therefore, the maximum resource size derived above should be seen as an upper bound.

## 3. Inter-Resource Delay and Bandwidth

### 3.1 Inter-Resource Delay

The inter-resource communication link will most likely consist of a large number of parallel wires, with uniform coupling over most of the wire length. For such closely coupled parallel wire structures, the crosstalk effects are considerable and cannot be neglected. Hence, the single-wire model used in the previous section is not valid here. Instead, the model shown in **Figure 4** is used. Each wire is modeled as a distributed RC line with total resistance $R$, total self-capacitance $C_s$, and total coupling capacitance $C_c$ uniformly distributed over the whole line.

**Figure 4.** Distributed RC lines with uniform coupling.

The effect of crosstalk on the delay depends on the switching pattern of the aggressor (adjacent) lines. Most often, static timing models that take crosstalk into account are based on a *switch factor*. To model the crosstalk effects, the coupling capacitance is multiplied by this switch factor, which takes a value between 0 and 2, for the best and worst case respectively. In **Figure 4**, suppose that the victim line in the middle switches up from zero to one. The switching pattern that gives rise to the worst-case delay on the victim line is then the one where the two aggressor lines switch down from one to zero (almost) simultaneously [6]. The worst-case delay is given by:

$$t_{0.5} = 0.7R_{drv}(C_s + 4.4C_c + C_{drv}) + R(0.4C_s + 1.5C_c + 0.7C_{drv})$$

Here, $t_{0.5}$ is the delay for the step response to reach the 50% point, $R_{drv}$ is the driver (minimum sized inverter) output resistance and $C_{drv}$ is the driver capacitance. Similar to the single-wire case, the second term in this expression grows quadratically with the wire length.
Inserting repeaters reduces the total wire delay. As shown in **Figure 5**, a long wire is broken into k sections, with an h-sized repeater driving each section. For each section, the driver has a lumped resistance of $R_{drv} / h$ and a capacitance of $h \cdot C_{drv}$, the wire has a distributed resistance of $R/k$ and a self-capacitance of $C_s / k$, and the mutual capacitance between two adjacent lines becomes $C_c / k$.

**Figure 5.** Insertion of repeaters in a long uniformly coupled RC line.

Applying the formula for worst-case delay to each section, the total wire delay becomes:

$$t_{0.5} = k \left[ 0.7 \frac{R_{drv}}{h} \left( \frac{C_s}{k} + hC_{drv} + 4.4 \frac{C_c}{k} \right) + \frac{R}{k} \left( 0.4 \frac{C_s}{k} + 1.5 \frac{C_c}{k} + 0.7 hC_{drv} \right) \right]$$

To obtain the optimal k and h values, the partial derivatives are set to zero, giving:

$$\begin{split} \frac{\partial t_{0.5}}{\partial k} &= 0 \Rightarrow k_{opt} = \sqrt{\frac{0.4RC_s + 1.5RC_c}{0.7R_{drv}C_{drv}}} \\ \frac{\partial t_{0.5}}{\partial h} &= 0 \Rightarrow h_{opt} = \sqrt{\frac{0.7R_{drv}C_s + 3.1R_{drv}C_c}{0.7RC_{drv}}} \end{split}$$

Now, the number of sections k must be a positive integer. Using the minimum sized inverter resistance and capacitance from [8], as shown in **Table 4**, the optimal k and h values are calculated and listed in **Table 5**. If the optimal k is not an integer, both of the two closest integers are used and the corresponding delays are compared in order to find the smallest delay.

|                       | 0.18-µm | 0.13-µm | 0.10-µm | 0.07-µm | 0.05-µm |
|-----------------------|---------|---------|---------|---------|---------|
| Inv. Resistance (ohm) | 9020    | 10560   | 11370   | 13710   | 15080   |
| Inv. Capacitance (fF) | 1.795   | 1.267   | 0.996   | 0.709   | 0.532   |

**Table 4.** Resistance and capacitance of minimum sized inverter for different technologies.
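
The repeater optimization described above can be sketched in Python as follows. This is a minimal illustration, not the authors' tool: the inverter values are the 0.05-µm entries of Table 4, while the global wire resistance and capacitances are hypothetical per-mm values chosen only for demonstration, since the paper does not list them. The function names are likewise assumptions.

```python
import math


def total_delay(k, h, R, Cs, Cc, R_drv, C_drv):
    """Worst-case 50% delay of a coupled line split into k repeated sections.

    R, Cs and Cc are totals for the whole line, R_drv and C_drv describe a
    minimum sized inverter, and h is the repeater size. SI units throughout.
    """
    driver = 0.7 * (R_drv / h) * (Cs / k + h * C_drv + 4.4 * Cc / k)
    wire = (R / k) * (0.4 * Cs / k + 1.5 * Cc / k + 0.7 * h * C_drv)
    return k * (driver + wire)


def optimal_k_h(R, Cs, Cc, R_drv, C_drv):
    """Real-valued optimum from setting both partial derivatives to zero."""
    k_opt = math.sqrt((0.4 * R * Cs + 1.5 * R * Cc) / (0.7 * R_drv * C_drv))
    # 0.7 * 4.4 = 3.08, which the text rounds to 3.1.
    h_opt = math.sqrt(R_drv * (0.7 * Cs + 3.08 * Cc) / (0.7 * R * C_drv))
    return k_opt, h_opt


def best_integer_k(R, Cs, Cc, R_drv, C_drv):
    """Compare the two integers closest to k_opt and keep the faster one."""
    k_opt, h_opt = optimal_k_h(R, Cs, Cc, R_drv, C_drv)
    candidates = {max(1, math.floor(k_opt)), max(1, math.ceil(k_opt))}
    return min(candidates,
               key=lambda k: total_delay(k, h_opt, R, Cs, Cc, R_drv, C_drv))


if __name__ == "__main__":
    # Minimum sized inverter at 0.05 um, from Table 4.
    R_drv, C_drv = 15080.0, 0.532e-15
    # Hypothetical 1 mm global wire (assumed values, not from the paper):
    # 460 ohm, 40 fF self-capacitance, 80 fF coupling capacitance.
    R, Cs, Cc = 460.0, 40e-15, 80e-15
    k_opt, h_opt = optimal_k_h(R, Cs, Cc, R_drv, C_drv)
    k_int = best_integer_k(R, Cs, Cc, R_drv, C_drv)
    delay = total_delay(k_int, h_opt, R, Cs, Cc, R_drv, C_drv)
    print(f"k_opt={k_opt:.2f}, h_opt={h_opt:.0f}, k={k_int}, "
          f"delay={delay * 1e12:.1f} ps/mm")
```

Because the total delay separates into terms that depend only on k and terms that depend only on h, the optimal h does not change when k is rounded to an integer, which is why the sketch reuses `h_opt` for both integer candidates.
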
From **Table 5**, we see that the optimal size of the repeaters is large and that the number of sections does not seem to be very significant for the delay. Increasing the number of repeaters gives only a marginal improvement in delay. This means that the trade-off between the number of repeaters and the delay should be considered.

| Technology                     | 0.18-µm | 0.13-µm | 0.10-µm | 0.07-µm | 0.05-µm |
|--------------------------------|---------|---------|---------|---------|---------|
| Optimal h                      | 322     | 296     | 226     | 187     | 154     |
| Optimal k (1/mm)               | 0.99    | 1.30    | 1.66    | 2.28    | 3.33    |
| Lower integer k (1/mm)         | 1       | 1       | 1       | 2       | 3       |
| Total delay at lower k (ps/mm) | 65.5    | 73.2    | 83.7    | 91.8    | 110     |
| Upper integer k (1/mm)         | 1       | 2       | 2       | 3       | 4       |
| Total delay at upper k (ps/mm) | 65.5    | 71.3    | 76.0    | 90.1    | 108     |

**Table 5.** Optimal size of the repeaters, h, optimal number of sections, k, the two closest integer values of k and the corresponding delay per unit length.

### 3.2 Inter-Resource Bandwidth Estimation

We have seen that repeater insertion can reduce the wire delay. However, repeaters tend to be area- and power-hungry, and repeaters for global wires require many via cuts from the upper-layer wires all the way down to the substrate, introducing considerable via resistance. Therefore, it is preferable to avoid repeaters in inter-resource communication.

The wire delay constrains the achievable inter-resource bandwidth and distance. To see how these quantities are related, we first assume that a good signal has a duration of at least $3t_r$, where $t_r$ is the time for a signal to rise from 10% to 90% of its final value. Usually, for RC delays, the 0-50% time is $t_{0.5} = 0.69\tau$ and $t_r = 2.2\tau$ [5], where $\tau$ is the RC time constant. Since $3t_r = 6.6\tau \approx 9.6\,t_{0.5}$, the bandwidth of a single wire is limited by approximately $\frac{1}{9t_{0.5}}$. **Figure 6** shows the allowed maximum length of a global wire at different bandwidths, with and without repeaters.
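
The bandwidth bound above can be turned into a small calculation. The Python sketch below uses the factor 9 from the text; the helper names and the `symbol_factor` parameter are illustrative assumptions and do not come from the paper.

```python
def max_bandwidth_bps(t05_s, symbol_factor=9.0):
    """Upper bound on per-wire bandwidth, 1 / (symbol_factor * t_0.5)."""
    return 1.0 / (symbol_factor * t05_s)


def max_delay_s(bandwidth_bps, symbol_factor=9.0):
    """Largest 50% delay that still supports the requested bandwidth."""
    return 1.0 / (symbol_factor * bandwidth_bps)


if __name__ == "__main__":
    # Ratio of the 3 * t_r signal duration to t_0.5 for a simple RC response.
    exact_factor = 3 * 2.2 / 0.69
    print(f"3 * t_r = {exact_factor:.2f} * t_0.5")
    # Delay budget implied by a target of 0.6 Gbps per wire.
    print(f"t_0.5 <= {max_delay_s(0.6e9) * 1e12:.0f} ps")
```

With the factor 9, a target of 0.6 Gbps per wire corresponds to a 50% delay budget of roughly 185 ps over the whole link.
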
Clearly, for the same technology and wire length, wires with repeaters can have higher bandwidth due to their lower propagation delay. For an inter-resource distance of 1.5 mm with 0.05-µm technology (assuming that the resources are close to each other and the inter-resource distance is therefore equal to the resource size), the bandwidth between two adjacent resources is estimated to be 0.6 Gbps per global wire without repeaters.

**Figure 6.** Maximum length of a global wire for different bandwidths and technologies, with and without repeaters.

## 4. Summary and Future Work

In this paper, we study the NoC system parameters constrained by the interconnections. Predictions are made on future technology feature size, clock speed in a synchronous resource, maximum NoC resource size, optimal global communication bandwidth and inter-resource distance. These quantities are closely related to each other. The technology determines the gate delay, which in turn determines the maximum clock frequency. The maximum resource size can then be derived from the obtained clock frequency and the semi-global wire delay. Finally, the global communication bandwidth is limited by the distance between resources and the global wire delay.

Based on these estimated quantities, this paper provides a global wire planning scheme for NoC that can be used as a guideline for NoC system architecture definition. This can be demonstrated with a numerical example: for a NoC in 50-nm technology, the clock frequency is estimated to be 4 GHz for a high-performance synchronous resource with an edge length of 1.5 mm. With an inter-resource distance of 1.5 mm, there is room for about 350 such resources on a single chip of 28×28 mm. The bandwidth between two adjacent resources is estimated to be 0.6 Gbps per global wire without using repeaters.

Future work involves global communication bandwidth optimization strategies under different constraints such as area and power consumption.
In addition, the roles of multilayer interconnection and real-world application integration in NoC are important and should be studied more closely.

## 5. References

- [1] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Öberg, M. Millberg, and D. Lindqvist, "Network on Chip: An Architecture for Billion Transistor Era", Proceedings of the IEEE NorChip Conference, November 2000.
- [2] W. J. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks", Design Automation Conference, Proceedings, 684-689, 2001.
- [3] E. Nilsson, "Design and Implementation of a Hot-potato Switch in Network on Chip", Master of Science thesis, Laboratory of Electronics and Computer Systems, Royal Institute of Technology (KTH), Sweden, June 2002.
- [4] L-R. Zheng and H. Tenhunen, "Design and Analysis of Power Integrity in Deep Submicron System-on-Chip Circuits", Analog Integrated Circuits and Signal Processing, 30, 15-29, 2002.
- [5] R. Ho, K. W. Mai and M. Horowitz, "The Future of Wires", Proceedings of the IEEE, vol. 89, no. 4, April 2001.
- [6] D. Pamunuwa, L-R. Zheng and H. Tenhunen, "Maximizing Throughput over Parallel Wire Structures in the Deep Submicro Regime", in manuscript, Laboratory of Electronics and Computer Systems, Royal Institute of Technology (KTH), Sweden.
- [7] H. Tenhunen, workshop "Systems on Chip, Systems in Package", ESSCIRC 2001, Villach, Austria, Sep 2001.
- [8] A. Maheshwari, S. Srinivasaraghavan and W. Burleson, "Quantifying the Impact of Current-Sensing on Interconnect Delay Trends", ASIC/SOC Conference, 15th Annual IEEE International, 461-465, 2002.