

# MSP based thermal-aware mapping approach for 3D Network-on-Chip under performance constraints

Gui Feng<sup>1</sup>, Fen Ge<sup>1a)</sup>, Ning Wu<sup>1</sup>, Lei Zhou<sup>2</sup>, and Jing Liu<sup>3</sup>

<sup>1</sup> College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China

<sup>2</sup> College of Information Engineering, Yangzhou University, Yangzhou 225009, China

<sup>3</sup> Nanjing Research Institute of Electronics Technology, Nanjing 211106, China

a) gefen@nuaa.edu.cn

**Abstract:** Three-dimensional Network-on-Chip (3D NoC) has been proposed as an effective architecture to optimize system performance. However, thermal issues pose significant challenges for 3D NoC due to its high power density. In this paper, we propose a 3D matrix synthesis problem (MSP) based thermal-aware mapping approach under performance constraints for the 3D NoC architecture to realize temperature equilibrium and achieve better performance. A genetic algorithm is used in the approach to obtain the optimal placements. Experimental results show that the proposed approach reduces temperature deviation by 45.3% on average compared with state-of-the-art thermal optimization approaches. Moreover, our approach achieves 9.43% power saving and 14.88% delay reduction.

**Keywords:** 3D Network-on-Chip, MSP, thermal, mapping

**Classification:** Electron devices, circuits, and systems

#### References

- [1] S. Borkar: ACM/IEEE Design Automation Conference (DAC) (2011) 214.
- [2] K. Puttaswamy and G. H. Loh: High-Performance 3D-Integrated Processors, Proc. of HPCA (2007) 193. DOI:10.1109/HPCA.2007.346197
- [3] C. Addo-Quaye: Proc. IEEE International SOC Conference (2005) 25. DOI: 10.1109/SOCC.2005.1554447
- [4] P. K. Hamedani, S. Hessabi, H. Sarbazi-Azad and N. E. Jerger: 2012 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) (2012). DOI:10.1109/PDP.2012.68
- [5] K. Manna, V. Choubey, S. Chattopadhyay and I. Sengupta: Proc. of the IEEE International Conference on Parallel, Distributed and Grid Computing (2014). DOI:10.1109/PDGC.2014.7030755
- [6] G. Feng, F. Ge, S. Yu and N. Wu: Proc. of the IEEE 10th International Conference on ASIC (2013). DOI:10.1109/ASICON.2013.6811834
- [7] C. C. N. Chu and D. F. Wong: IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 17 (1998) 1166. DOI:10.1109/43.736189
- [8] R. P. Dick, D. L. Rhodes and W. Wolf: Proc. of International Workshop on Hardware/Software Codesign (1998). DOI:10.1109/HSC.1998.666245

- [9] A. B. Kahng, B. Li, L. S. Peh and K. Samadi: IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 20 (2012) 191. DOI:10.1109/TVLSI.2010.2091686
- [10] L. Jain: NIRGAM Manual. A Simulator for NoC Interconnect Routing and Application Modeling (2007).
- [11] HotSpot 5.01, Thermal Modeling Tool for Integrated Circuits. http://lava.cs.virginia.edu/hotspot/

#### 1 Introduction

Compared with 2D implementations, three-dimensional Network-on-Chip (3D NoC) is an efficient solution to increase system scalability and alleviate the interconnect problem in large-scale integrated circuits [1]. However, one side effect of 3D designs is the increase in power density on parts of the chip due to the stacking of active power-dissipating devices, which results in thermal hotspots [2]. The temperature increase may limit the maximum operating frequency of the chip and thereby degrade system performance. One method to control the chip temperature is mapping. In NoC design, mapping determines the topological placement of IP cores under different design constraints, which has a great influence on overall system performance. Thus, using the mapping algorithm to handle the thermal problem is an efficient approach.

Several approaches to thermal-aware mapping in 3D NoC have been proposed in the literature. Addo-Quaye first addressed the thermal-aware mapping problem for 3D NoC [3]. In [4], three ILP-based thermal-aware mapping algorithms for 3D NoC are proposed to explore the thermal constraints and their effects on temperature and performance. However, that work only considers the peak temperature of the chip and ignores the distribution of temperature. A Kernighan-Lin partition based thermal-aware mapping is presented in [5], which aims to achieve a tradeoff between communication cost and thermal behavior of the NoC based system. Our prior work [6] takes the temperature deviation as the optimization goal to achieve temperature equilibrium, introducing a parameter *m* to adjust the number of cores used to calculate the deviation. However, this approach adds the complexity of choosing the optimal value of *m*. Moreover, the optimization goal is nonlinear, which slows down the solution as the number of IP cores increases.

Therefore, this paper introduces the matrix synthesis problem (MSP) to solve the thermal-aware mapping problem for the 3D NoC architecture. The proposed approach aims to achieve temperature equilibrium of the chip in a faster way while ensuring that the performance constraints are satisfied. In addition, a detailed comparison is carried out with various thermal-aware mapping algorithms.

#### 2 3D MSP based thermal model

In integrated circuit chips, temperature at a point depends upon the heat generated by the module located at that position and temperatures of neighboring modules.





Chu et al. [7] introduced the MSP to model the thermal placement problem for gate arrays. The MSP is to synthesize a matrix out of a given list of numbers such that no submatrix of a particular size has a large sum. Here, we model our NoC architecture as an $m \times n \times l$ matrix of the given temperatures and seek a placement such that the average value of all $t_1 \times t_2 \times t_3$ submatrices is minimized. As shown in Fig. 1, the temperature of each element in a $4 \times 4 \times 4$ 3D NoC can be represented by a nonnegative real number in a $4 \times 4 \times 4$ matrix. The $t_1 \times t_2 \times t_3$ submatrix (which we call an MSP cube) on the right of Fig. 1 is set to $2 \times 2 \times 3$ as an example. The parameters $t_1$, $t_2$ and $t_3$ account for the heat transfer ability in the three directions (x, y, z) of the 3D NoC. A larger $t_1$, $t_2$ or $t_3$ means better heat transfer along that direction, so more cores near a heat source are affected along it. The specific values of $t_1$, $t_2$ and $t_3$ can be chosen according to the requirements of the chip design.

Hence, the basis of the MSP based thermal problem is to find all the MSP cubes $M_i$ in a 3D NoC. Let $\sigma(M_i)$ be the sum of the temperatures in MSP cube $M_i$, and let $\mu(M) = \frac{1}{num} \sum_{i=1}^{num} \sigma(M_i)$ be the average value over all MSP cubes, where *num* is the number of MSP cubes in the 3D NoC. A larger $\sigma(M_i)$ indicates a hotter region of the chip. Our MSP based mapping method attempts to find a placement of the modules such that the temperature of the chip is balanced, by minimizing the average value of all MSP cubes, $\mu(M)$. Taking Fig. 1 as an example, $\sigma(M_i)$ can be calculated as follows:

$$\begin{aligned}
\sigma(M_i) ={}& Th_{i,j,k} + Th_{i+1,j,k} + Th_{i,j+1,k} + Th_{i+1,j+1,k} \\
&+ Th_{i,j,k+1} + Th_{i+1,j,k+1} + Th_{i,j+1,k+1} + Th_{i+1,j+1,k+1} \\
&+ Th_{i,j,k-1} + Th_{i+1,j,k-1} + Th_{i,j+1,k-1} + Th_{i+1,j+1,k-1}
\end{aligned}\tag{1}$$



Fig. 1. MSP cube model
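The definitions of $\sigma(M_i)$ and $\mu(M)$ can be illustrated with a minimal Python sketch. It assumes that MSP cubes are all $t_1 \times t_2 \times t_3$ windows lying fully inside the mesh (no wrap-around at the boundary), which the text does not state explicitly; the function names `msp_cube_sums` and `msp_average` are hypothetical.

```python
from itertools import product


def msp_cube_sums(temps, t1, t2, t3):
    """Return sigma(M_i) for every t1 x t2 x t3 cube inside the mesh.

    temps[x][y][z] is the temperature of the core at position (x, y, z).
    Assumption (not from the paper): cubes never wrap around the boundary.
    """
    m, n, l = len(temps), len(temps[0]), len(temps[0][0])
    if t1 > m or t2 > n or t3 > l:
        raise ValueError("MSP cube is larger than the mesh")
    sums = []
    # Slide the cube origin over every position where the cube still fits.
    for x0, y0, z0 in product(range(m - t1 + 1),
                              range(n - t2 + 1),
                              range(l - t3 + 1)):
        sums.append(sum(
            temps[x0 + dx][y0 + dy][z0 + dz]
            for dx, dy, dz in product(range(t1), range(t2), range(t3))
        ))
    return sums


def msp_average(temps, t1, t2, t3):
    """Return mu(M), the average of sigma(M_i) over all MSP cubes."""
    sums = msp_cube_sums(temps, t1, t2, t3)
    return sum(sums) / len(sums)
```

Under this assumption, a $4 \times 4 \times 4$ mesh with a $2 \times 2 \times 3$ cube yields $3 \times 3 \times 2 = 18$ cubes, each summing 12 temperatures as in equation (1).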

In equation (1), $Th_{i,j,k}$ is the temperature of the IP core at position (i, j, k) in the 3D NoC, which can be calculated as suggested in [4]:

$$Th_{i,j,k} = T_{Amb} + \sum_{m=1}^{k} \frac{R_{i,j,m}}{A} \times \sum_{s=m}^{n} \left(P_{i,j,s} + PR_{i,j,s}\right) \tag{2}$$

where $T_{Amb}$ is the ambient temperature, $R_{i,j,m}$ is the thermal resistance of the IP core at position (i, j, m), *A* is the area of an IP core, and *n* is the total number of layers. $P_{i,j,s}$ and $PR_{i,j,s}$ are the average power consumption of the IP core and of the router at position (i, j, s) in the 3D NoC. Details on the calculation of each parameter can be found in [6].
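As a worked illustration of equation (2), the following Python sketch evaluates the temperature of one core in a vertical stack at a fixed (i, j). The function name `core_temperature` and the list-based layout (index 0 holds layer 1) are assumptions made for the example, and the inner sum is read as covering both the core and router power of every layer from *m* to *n*.

```python
def core_temperature(k, t_amb, area, resistance, core_power, router_power):
    """Evaluate equation (2) for layer k (1-based) of one vertical stack.

    resistance[m-1], core_power[s-1] and router_power[s-1] hold R, P and PR
    for layers 1..n at a fixed (i, j). This layout is an assumption.
    """
    n = len(core_power)
    if not 1 <= k <= n:
        raise ValueError("layer index k must be in 1..n")
    temperature = t_amb
    for m in range(1, k + 1):
        # Power of layers m..n, i.e. the inner sum of equation (2).
        stacked_power = sum(core_power[s - 1] + router_power[s - 1]
                            for s in range(m, n + 1))
        temperature += resistance[m - 1] / area * stacked_power
    return temperature
```

For example, with two layers, `resistance=[2.0, 4.0]`, `area=2.0`, `core_power=[1.0, 3.0]`, `router_power=[0.5, 0.5]` and an ambient temperature of 25.0, the sketch gives 30.0 for layer 1 and 37.0 for layer 2 (illustrative numbers only, not taken from the experiments).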

#### 3 Problem formulation

In order to formulate the mapping problem, we need the following definitions.

**Definition 1:** A Core Communication Graph is a digraph denoted by CCG(C, A). Each vertex  $c_i \in C$  represents an IP core, and each edge  $a_{i,j} \in A$  represents the communication from IP  $c_i$  to IP  $c_j$ . The weight of the edge, denoted by  $b_{i,j}$ , represents the total volume of the communication.

**Definition 2**: An NoC Architecture Graph is a digraph denoted by  $NAG(\mathbb{R}, \mathbb{P})$ . Each vertex  $r_i \in \mathbb{R}$  represents a resource node in the architecture, and each edge  $p_{i,j} \in \mathbb{P}$  represents the communication path from resource node  $r_i$  to resource node  $r_j$ .

Using the above graph representations, the problem of balancing temperature across the chip under performance constraint can be formulated as follows:

Given a CCG(C, A) and NAG(R, P)

find a mapping function map() from CCG(C, A) to NAG(R, P), which minimizes:

$$\min\{\mu(M)\}\tag{3}$$

such that

$$c_i \neq c_j \in C \tag{4}$$

$$map(c_i) \neq map(c_j) \tag{5}$$

$$\lambda_{\rm L} \le {\rm BW}_{\rm L} \tag{6}$$

$$f \le \mathbf{D} \tag{7}$$

where $\lambda_{\rm L}$ is the load of a link and BW<sub>L</sub> is the bandwidth of the link. The performance constraints follow [4], where *f* is the communication cost, $f = \sum_{i=0}^{s} \sum_{j=0}^{s} b_{i,j} \times dist(c_i, c_j)$, and $dist(c_i, c_j)$ is the Manhattan distance between IP $c_i$ and IP $c_j$. *s* is the number of IP cores in an application. *D* is the average distance in the NoC topology, which can be calculated as $D = \frac{1}{3} \sum_{i=0}^{s} \sum_{j=0}^{s} b_{i,j} \times \left(d_1 - \frac{1}{d_1}\right) \times \left(d_2 - \frac{1}{d_2}\right) \times \left(d_3 - \frac{1}{d_3}\right)$, where $d_1, d_2, d_3$ are the dimensions of the 3D mesh.

Conditions (4) and (5) mean that each IP core is mapped to exactly one tile and no tile hosts more than one IP core. Constraint (6) guarantees that the load of any link does not exceed its bandwidth, while (7) ensures that the communication cost, which reflects the network delay, does not exceed the bound *D* derived from the average distance. Hence, the goal of the proposed approach, which maps CCG(C, A) to NAG(R, P), is to find a mapping that minimizes (3) while satisfying the performance constraints (6) and (7).
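The one-to-one conditions (4) and (5) and the cost constraint (7) can be checked with a short Python sketch. The data layout (a dictionary of communication volumes and a dictionary from core to tile coordinates) and all function names are hypothetical; the bound *D* is passed in as a parameter rather than computed, and the link-load check (6) is omitted because it depends on a routing model that is not specified here.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y, z) tile coordinates."""
    return sum(abs(p - q) for p, q in zip(a, b))


def communication_cost(volumes, mapping):
    """Communication cost f: sum of b_ij * dist over all CCG edges.

    volumes maps (ci, cj) to the communication volume b_ij;
    mapping maps each core to the (x, y, z) tile it is placed on.
    """
    return sum(b * manhattan(mapping[ci], mapping[cj])
               for (ci, cj), b in volumes.items())


def is_one_to_one(mapping):
    """Conditions (4)-(5): no two distinct cores share a tile."""
    tiles = list(mapping.values())
    return len(tiles) == len(set(tiles))


def satisfies_constraints(volumes, mapping, cost_bound):
    """Check (4), (5) and (7); cost_bound plays the role of D."""
    return (is_one_to_one(mapping)
            and communication_cost(volumes, mapping) <= cost_bound)
```

A mapping that violates either check would be rejected by the mapping algorithm before its thermal objective is considered.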

#### 4 The proposed thermal-aware mapping algorithm

In order to balance the temperature across the whole chip, we propose an MSP based thermal-aware mapping approach utilizing the genetic algorithm. The approach mainly consists of four phases, and the algorithm flow is shown in Fig. 2.







Fig. 2. Thermal-aware mapping algorithm flow

Application and NoC architecture information (including the core communication graph, performance constraints, etc.) is prepared for the mapping algorithm. The heat transfer parameter $(t_1, t_2, t_3)$ is chosen according to the application and NoC architecture. First, we initialize the population, which is generated randomly. Second, the fitness of each chromosome is evaluated. In this phase, the MSP cubes are traversed in the 3D NoC, and the sum of each MSP cube and the average value of all MSP cubes are calculated. Third, depending on the fitness, a new population is generated through three operations (selection, crossover and mutation). The above steps are executed repeatedly until the maximum generation is reached. Finally, the optimal mapping is reported. The details of the four phases are explained as follows.

**Initialize population**. A mapping of IP cores is represented by a chromosome, and a number of chromosomes compose the initial population. Given the number of layers of the 3D NoC (e.g. 2 layers), the chromosome is divided into the same number of parts (2 in this example), as shown in Fig. 3, and the placement of IP cores is generated randomly.
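One possible encoding is sketched below in Python. It assumes that a chromosome is a permutation of tile occupants with one gene per tile, stored layer by layer, and that gene values at or above the number of cores act as placeholders for empty tiles. These conventions and the function names are assumptions for illustration; Fig. 3 may use a different layout.

```python
import random


def random_chromosome(dims, rng):
    """Random placement: one gene per tile, genes stored layer by layer."""
    x_dim, y_dim, z_dim = dims
    genes = list(range(x_dim * y_dim * z_dim))
    rng.shuffle(genes)
    return genes


def split_by_layer(chromosome, dims):
    """Split the chromosome into one part per layer."""
    per_layer = dims[0] * dims[1]
    return [chromosome[i:i + per_layer]
            for i in range(0, len(chromosome), per_layer)]


def decode(chromosome, dims, num_cores):
    """Turn a chromosome into a {core: (x, y, z)} mapping."""
    x_dim, y_dim, _ = dims
    per_layer = x_dim * y_dim
    mapping = {}
    for index, gene in enumerate(chromosome):
        if gene >= num_cores:
            continue  # placeholder gene: this tile stays empty
        layer, offset = divmod(index, per_layer)
        row, col = divmod(offset, x_dim)
        mapping[gene] = (col, row, layer)
    return mapping
```

Because every gene appears exactly once, any chromosome of this form satisfies the one-to-one conditions (4) and (5) by construction.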



Fig. 3. GA encoding for 3D NoC

**Evaluate fitness**. Each chromosome in the population, encoded as in Fig. 3, is evaluated for its fitness. First, all MSP cubes in a solution are traversed and the temperature sum of each MSP cube is calculated. The reciprocal of the average temperature of all MSP cubes (equation (3)) is taken as the fitness. The higher the fitness of a chromosome, the more likely it is to be chosen as a parent. We then verify the performance constraints. If a chromosome cannot satisfy the constraints, its fitness is set to zero, which means it has no chance of becoming a parent and generating new chromosomes.
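The fitness rule described here can be written as a small Python sketch. The two callables `average_cube_temperature` (returning $\mu(M)$ for a chromosome) and `is_feasible` (checking the performance constraints) are hypothetical hooks supplied by the caller, not functions defined in the paper.

```python
def fitness(chromosome, average_cube_temperature, is_feasible):
    """Fitness of a chromosome: 1 / mu(M), or 0 if constraints are violated.

    average_cube_temperature and is_feasible are caller-supplied callables
    (hypothetical names) that take the chromosome as their only argument.
    """
    if not is_feasible(chromosome):
        return 0.0  # infeasible mappings can never be selected as parents
    return 1.0 / average_cube_temperature(chromosome)
```

Checking feasibility first avoids computing the thermal objective for mappings that will be discarded anyway.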

**Create new population**. Three operators (selection, crossover and mutation) are applied to generate a new population. The selection operator selects two parent chromosomes from the population with probability proportional to their fitness. The crossover operator randomly selects cross points in the parents to form the offspring. For mutation, we exchange two randomly selected genes in the new chromosome.
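The three operators can be sketched in Python as follows. Selection is fitness-proportional (roulette wheel) and mutation swaps two genes, as described above. For crossover, the text does not specify how duplicate genes are repaired after cutting, so this sketch swaps in order crossover (OX), a standard permutation-preserving operator; that choice, the function names, and the assumption that chromosomes are permutations of distinct genes are all illustrative.

```python
import random


def select_parent(population, fitnesses, rng):
    """Roulette-wheel selection: probability proportional to fitness."""
    if sum(fitnesses) <= 0:
        # No feasible chromosome yet: fall back to a uniform pick.
        return rng.choice(population)
    return rng.choices(population, weights=fitnesses, k=1)[0]


def order_crossover(parent_a, parent_b, rng):
    """Order crossover (OX), assumed here to keep the child a permutation."""
    n = len(parent_a)
    lo, hi = sorted(rng.sample(range(n + 1), 2))
    child = [None] * n
    child[lo:hi] = parent_a[lo:hi]          # copy a slice from parent A
    kept = set(parent_a[lo:hi])
    rest = iter(g for g in parent_b if g not in kept)
    for i in range(n):
        if not lo <= i < hi:
            child[i] = next(rest)           # fill the gaps in parent B order
    return child


def swap_mutation(chromosome, rng):
    """Exchange two randomly selected genes."""
    i, j = rng.sample(range(len(chromosome)), 2)
    mutated = list(chromosome)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated
```

Both `order_crossover` and `swap_mutation` return new lists, so parents remain unchanged and can be reused when building the next population.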

**Output optimal result**. The approach repeats the above two steps until no improvement in fitness is observed for a sufficient number of iterations, and then reports the best result as the optimal mapping for the 3D NoC.
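The outer loop of the algorithm can be sketched in Python as below. It stops either when the best fitness has not improved for `patience` generations or when `max_generations` is reached; both parameter names and their default values are assumptions, as are the caller-supplied callables `fitness_fn` and `next_generation`.

```python
def run_until_stagnation(initial_population, fitness_fn, next_generation,
                         patience=50, max_generations=1000):
    """Evolve a population and return the best (chromosome, fitness) seen.

    fitness_fn(chromosome) returns a number; next_generation(population,
    scores) returns the new population. Both are hypothetical hooks.
    """
    population = list(initial_population)
    best, best_fitness = None, float("-inf")
    stale = 0  # generations without improvement
    for _ in range(max_generations):
        scores = [fitness_fn(c) for c in population]
        top = max(range(len(population)), key=scores.__getitem__)
        if scores[top] > best_fitness:
            best, best_fitness = population[top], scores[top]
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
        population = next_generation(population, scores)
    return best, best_fitness
```

Tracking the best chromosome separately from the current population means the reported mapping is never lost if a later generation happens to be worse.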

#### 5 Experimental results

To verify the efficiency of the proposed mapping approach, several multimedia applications, including MWD, VOPD and the MPEG4 decoder, are taken as benchmarks. In addition, 16-core and 36-core graphs generated using TGFF [8] are used as benchmarks. The size of the 3D Mesh NoC architecture and the size of the MSP cube are summarized in Table I. All IP cores are assumed to have the same size. In this work, we use Orion [9] to simulate NoC power consumption and Nirgam [10] to simulate NoC delay. HotSpot [11] is used to obtain the temperature information of the NoC.

In these experiments, the proposed mapping algorithm is compared with the approaches in [3] and [4]. The authors of [3] present a thermal-aware and communication-aware hybrid optimization approach, which targets the optimal tradeoff between the two. The approach in [4] takes peak temperature as the optimization goal. For simplicity, we denote the methods in [3] and [4] as *Tlog* and *Tmax*, respectively, in the following comparison. The proposed approach is also compared with our prior work [6], which is denoted as *Tdev*. The value of *m* in [6] is set equal to the number of cores in an application and is not adjusted. The approach proposed in this paper is denoted as *Tmsp*.

Table I. Graph characteristics

| Graph         | Nodes | Edges | NoC size (X × Y × Z)  | MSP cube size ($t_1 \times t_2 \times t_3$)                                 |
|---------------|-------|-------|-----------------------|-----------------------------------------------------------------------------|
| MWD           | 12    | 13    | $2 \times 3 \times 2$ | $2 \times 2 \times 2$                                                       |
| VOPD          | 12    | 15    | $2 \times 3 \times 2$ | $2 \times 2 \times 2$                                                       |
| MPEG4 decoder | 12    | 26    | $2 \times 3 \times 2$ | $2 \times 2 \times 2$                                                       |
| Random16      | 14    | 19    | $2 \times 4 \times 2$ | $2 \times 2 \times 2$                                                       |
| Random36      | 25    | 33    | $3 \times 4 \times 3$ | $2 \times 2 \times 2$                                                       |





In Fig. 4, *Tlog*, *Tmax*, *Tdev* and *Tmsp* are compared in terms of chip temperature. Fig. 4(a) shows the comparison of peak temperature. The peak temperature of *Tmsp* is lower than that of *Tlog* and *Tmax* by 7.37% and 7.59% on average, respectively, while the peak temperatures of *Tdev* and *Tmsp* are nearly the same. The average temperature comparison is shown in Fig. 4(b). The four methods appear to have similar average temperatures; however, the average temperature of *Tmsp* is still lower than that of *Tlog* and *Tmax* by 3.51% and 4.9%, respectively. The comparison of temperature deviation in Fig. 4(c) shows how evenly the heat is distributed. *Tmsp* reduces the temperature deviation by 45.3% on average compared with *Tlog*. *Tmsp* also shows better results than *Tmax*; on Random16, its temperature deviation is lower than that of *Tmax*. *Tmsp* optimizes consistently across all benchmarks, whereas *Tdev* is less stable because the value of *m* is held constant.



Fig. 4. Temperature comparison with various thermal-aware mapping algorithms





In terms of performance, the power consumption and communication delay of the network are compared for the four approaches *Tlog*, *Tmax*, *Tdev* and *Tmsp*. Fig. 5(a) shows the power consumption comparison. The proposed approach *Tmsp* reduces power consumption by 9.43% and 10.01% compared with *Tlog* and *Tmax*, respectively. The comparison of network delay is shown in Fig. 5(b): *Tmsp* reduces network delay by 14.88% and 10.03% compared with *Tlog* and *Tmax*, respectively.

The execution time comparison between *Tmsp* and *Tdev* is shown in Table II. The CPU used in the experiment is an Intel Core i3 @ 3.3 GHz, with 3 GB of memory. *Tmsp* reaches the optimal result faster than *Tdev* even when the value of *m* in *Tdev* is held constant; if *Tdev* adjusts the value of *m*, it takes more time.

Table II. Execution time comparison

| Execution time (ms) | MWD  | VOPD | MPEG4 | Random16 | Random36 |
|---------------------|------|------|-------|----------|----------|
| Tmsp                | 1250 | 1391 | 1688  | 1688     | 4750     |
| Tdev                | 1407 | 2088 | 2781  | 1750     | 5516     |
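The relative time saving of *Tmsp* over *Tdev* can be derived from Table II with a few lines of Python. The execution times are copied from the table; the helper name `relative_saving` and the choice of *Tdev* as the baseline are assumptions made for this illustration.

```python
# Execution times in milliseconds, copied from Table II.
TMSP_MS = {"MWD": 1250, "VOPD": 1391, "MPEG4": 1688,
           "Random16": 1688, "Random36": 4750}
TDEV_MS = {"MWD": 1407, "VOPD": 2088, "MPEG4": 2781,
           "Random16": 1750, "Random36": 5516}


def relative_saving(new_ms, baseline_ms):
    """Percentage of the baseline time saved by the new method."""
    return (baseline_ms - new_ms) / baseline_ms * 100.0


savings = {name: relative_saving(TMSP_MS[name], TDEV_MS[name])
           for name in TMSP_MS}
for name, pct in savings.items():
    print(f"{name}: {pct:.2f}% less time than Tdev")
```

By this measure *Tmsp* is faster on every benchmark in Table II, with the smallest margin on Random16 and the largest on MPEG4.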



Fig. 5. Network performance comparison with various thermal-aware mapping algorithms: (a) network power consumption comparison, (b) network delay comparison





#### 6 Conclusion

In this paper, we propose an MSP based thermal-aware mapping approach that balances the temperature across the chip while optimizing performance. Experimental results show that, compared with state-of-the-art thermal optimization schemes, the proposed approach achieves a 45.3% reduction in temperature deviation, 9.43% power saving and 14.88% delay reduction.

#### Acknowledgments

This work is supported by the Natural Science Foundation of China under Grants 61106018 and 61376025, and by the Industry-academic Joint Technological Innovations Fund Project of Jiangsu under Grant BY2013003-11.

