Virtualizing network-on-chip resources in chip-multiprocessors

https://doi.org/10.1016/j.micpro.2010.10.001

Abstract

The number of cores on a single silicon chip is rapidly growing, and chips containing tens or even hundreds of identical cores are expected in the future. To take advantage of multicore chips, multiple applications will run simultaneously. As a consequence, the traffic interference between applications increases and the performance of individual applications can be seriously affected.

In this paper, we improve the performance of individual applications when several applications run simultaneously. Our proposal is based on the virtualization concept and allows us to reduce execution time and network latency by a significant percentage.

Introduction

In order to continue increasing computing speed, current semiconductor manufacturing techniques place multiple cores on the same chip. The microprocessor industry has chosen multicore chips instead of increasingly powerful uniprocessors [1]. Power usage, heat generation and cost are some of the main arguments against the latter option that motivate this decision. Although these cores do not necessarily run as fast as the highest-performing single-core processors, together they improve the overall performance. Therefore, chips containing tens or even hundreds of identical cores are expected in the future. Chip multiprocessors (CMPs) are an excellent example of these systems [2], [3], [4], [5], [6].

To take full advantage of CMPs, it is also expected that several applications will run simultaneously on such CMP systems. Moreover, as the number of cores increases, the number of applications running on the same CMP is also expected to increase. These applications can be of a diverse nature (e.g. computer vision, media processing, animation, simulations, data mining, etc.), making the traffic pattern completely unpredictable because program behavior varies greatly with different external inputs. As a consequence, the CMP load will also increase, which may affect the performance of individual applications.

Fig. 1 shows several performance metrics both when only one application runs in the system and when it shares the network resources with others, which are represented by a stress load in this test. In both cases the applications run with 16 threads, each one mapped onto one core of a 4 × 4 mesh CMP. We have chosen synthetic traffic with a 0.3 packet injection rate to represent the stress load. Finally, note that the results are normalized to the performance metrics obtained when the application runs alone.

In the figures, the execution time increases by approximately 37% for the Blackscholes application and 25% for Streamcluster. When we apply the stress traffic in the network, not only the execution time is affected. As we can see, the stress traffic also has a direct impact on the CMP resources as a whole. For instance, the total application cache misses increase by 31% for Blackscholes and 21% for Streamcluster when running in shared mode.

Therefore, it is necessary to improve the performance of individual applications when several applications run simultaneously in a CMP system. In this scenario, all CMP resources are shared among the applications. If this sharing is not managed efficiently, the performance of any individual application can be seriously affected. The work presented here highlights this problem and motivates the need for performance differentiation among applications.

Our proposal to avoid these problems is based on isolating the traffic of each application so that the CMP can guarantee the performance requirements of each application. This results in a partitioning of the CMP into several regions, each one composed of a subset of the available resources in the CMP system. Fig. 2 shows an example of applying the virtualization mechanism to a CMP, where four partitions have been created that only share off-chip memory. Notice that the partitions can be sized to satisfy each application's requirements and, therefore, it is possible to assign applications to partitions of different sizes. In this example, the CMP has been partitioned to allow four applications to run simultaneously, each one allocated to a different region with no interaction among them. Note that in this scenario it is also possible for more than four applications to run simultaneously in the CMP. In that case, several applications would be allocated to the same region, and only applications belonging to the same region could interfere with one another.
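As a purely illustrative aid (not a mechanism taken from the paper), the following C++ sketch describes the kind of static partitioning shown in Fig. 2 as software data: a 4 × 4 mesh divided into four disjoint 2 × 2 regions. The tile numbering and data structures are hypothetical.

```cpp
#include <vector>

// Illustrative tile numbering for a 4 x 4 mesh: tile = y * 4 + x.
constexpr int kMeshDim = 4;

// A region is simply the set of tiles reserved for one or more applications.
struct Region {
    int id;
    std::vector<int> tiles;   // tiles that belong exclusively to this region
};

// Example partitioning into four disjoint 2 x 2 regions, as in the Fig. 2
// example: only off-chip memory is shared, so traffic generated inside one
// region never needs to cross tiles belonging to another region.
std::vector<Region> MakeFourQuadrants() {
    std::vector<Region> regions;
    int next_id = 0;
    for (int by = 0; by < kMeshDim; by += 2) {
        for (int bx = 0; bx < kMeshDim; bx += 2) {
            Region r{next_id++, {}};
            for (int y = by; y < by + 2; ++y)
                for (int x = bx; x < bx + 2; ++x)
                    r.tiles.push_back(y * kMeshDim + x);
            regions.push_back(r);
        }
    }
    return regions;
}
```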

Of course, this technique involves several CMP components and different actions must be performed. The operating system, in close cooperation with a hypervisor, must analyze the requirements of each application. Based on these requirements and the set of available resources, it must decide whether the application can be run. If possible, a new partition is created for the application or the application is included in an existing one. The system should also enable dynamic reassignment of cores, caches and memory to different regions.
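One possible way to express this admission decision is sketched below in C++; the bookkeeping structures, function names and first-fit policy are assumptions made for illustration and do not represent the paper's actual hypervisor interface.

```cpp
#include <optional>
#include <vector>

// Hypothetical per-region bookkeeping the OS/hypervisor could maintain.
struct RegionState {
    int region_id;
    int total_cores;
    int free_cores;
};

// Decide where an application requesting `cores_needed` cores could run:
// first try to reuse an existing region with enough spare cores, otherwise
// carve a new region out of the pool of unassigned tiles. Returns the chosen
// region id, or std::nullopt if the application cannot be admitted yet.
std::optional<int> Admit(std::vector<RegionState>& regions,
                         int& unassigned_cores,
                         int cores_needed) {
    for (auto& r : regions) {
        if (r.free_cores >= cores_needed) {    // fits in an existing region
            r.free_cores -= cores_needed;
            return r.region_id;
        }
    }
    if (unassigned_cores >= cores_needed) {    // create a new region
        unassigned_cores -= cores_needed;
        regions.push_back({static_cast<int>(regions.size()),
                           cores_needed, 0});
        return regions.back().region_id;
    }
    return std::nullopt;                       // reject or queue for later
}
```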

Specifically, in this paper we focus on one of these components: the on-chip network. In CMPs, a high-performance on-chip interconnect is required to allow efficient communication among cores. Networks-on-chip (NoCs) can reduce transmission delays to acceptably low values and allow efficient communication among cores, cache levels and memory controllers. These NoCs are required to meet the challenges imposed by the most advanced chip technologies in order to become part of future CMP systems [7], [8], [9], [10], [11], [12].

This CMP component has a major impact on performance and is responsible for much of the miss latency in all our experiments. Moreover, we have observed in CMPs that latency greatly increases as the number of applications running together increases, and so the NoC has a large impact on application performance (final execution time). This is true even for multithreaded cores and applications with large miss rates. The performance of applications is mainly constrained by latency and, therefore, minimizing latency should be a priority for interconnects in such CMPs.

In this paper, we propose, discuss and evaluate a mechanism to isolate the traffic of different applications in order to reduce as much as possible the negative effect of traffic interference on application performance. Our proposal is based on the effective use of the logic-based distributed routing (LBDR) bits as a means to virtualize the NoC. We propose two ways of isolating the traffic of different applications in order to reduce or even eliminate the traffic interference. We show that enabling and enforcing a virtualization mechanism is much more effective for managing the resources than a baseline NoC scheme.
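For context, LBDR encodes each router's routing function with a small set of routing bits (Rne, Rnw, ..., Rsw) and connectivity bits (Cn, Ce, Cw, Cs). One natural way to confine an application's traffic to its region is to clear the connectivity bit of any port whose link crosses a region boundary, so the routing logic never selects it. The C++ sketch below reproduces the usual LBDR port-selection logic under that idea; the signal naming and mesh orientation are assumptions, and the exact configuration used in the paper is not shown here.

```cpp
// Per-router LBDR configuration bits.
struct LbdrBits {
    // Routing bits: Rne = 1 allows a packet whose destination lies to the
    // north-east to take the north output port (and turn east later);
    // the other seven bits are analogous.
    bool Rne, Rnw, Ren, Res, Rwn, Rws, Rse, Rsw;
    // Connectivity bits: a cleared bit means the corresponding output port
    // must not be used, e.g. because that link crosses a region boundary.
    bool Cn, Ce, Cw, Cs;
};

struct OutputPorts { bool north, east, west, south; };

// Candidate output ports for a packet at router (cx, cy) heading to router
// (dx, dy), following the usual LBDR port-selection logic (a sketch; the
// mesh orientation, with y growing southwards, is an assumption).
OutputPorts LbdrRoute(int cx, int cy, int dx, int dy, const LbdrBits& b) {
    const bool N = dy < cy, S = dy > cy, E = dx > cx, W = dx < cx;
    OutputPorts out{};
    out.north = b.Cn && ((N && !E && !W) || (N && E && b.Rne) || (N && W && b.Rnw));
    out.east  = b.Ce && ((E && !N && !S) || (E && N && b.Ren) || (E && S && b.Res));
    out.west  = b.Cw && ((W && !N && !S) || (W && N && b.Rwn) || (W && S && b.Rws));
    out.south = b.Cs && ((S && !E && !W) || (S && E && b.Rse) || (S && W && b.Rsw));
    return out;
}
```

With all connectivity bits set, the logic degenerates to the unrestricted routing function; clearing, for example, Ce on every router along a region's eastern edge keeps that region's packets from ever crossing that boundary.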

The structure of this paper is as follows: Section 2 presents the related work. In Section 3 we present our proposal for isolating the applications traffic in the NoC. Section 4 details the performance evaluation and Section 5 shows the obtained results. Finally, Section 6 presents conclusions and directions for future work.

Section snippets

Related work

As we have already shown in Section 1, if traffic isolation is not enforced by the NoC, traffic interference has negative effects on the performance of the applications. For this reason, traffic isolation is the main property a NoC must provide in order to minimize interference among the different applications. Under such isolation, traffic from one application is not allowed to affect other applications.

On the other hand, virtualization offers an opportunity for improving the

NoC virtualization

Generally, a CMP system consists of homogeneous nodes (tiles), where each one contains a processing element, cache memory and a local router that connects the node to its neighboring nodes, thus building the NoC. When a packet is generated in the local processor, it is sent to the router via a network interface. Then, the packet moves to the next router on its path, as determined by the routing algorithm, and the process is repeated until the packet arrives at its destination.
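To make this hop-by-hop process concrete, the following C++ sketch walks a packet across a mesh using plain XY (dimension-order) routing; the routing algorithm and names are chosen purely for illustration and are not taken from the paper.

```cpp
#include <cstdio>

// Hop-by-hop traversal of a packet across a mesh NoC, using XY
// (dimension-order) routing as an illustrative routing algorithm. Each loop
// iteration corresponds to one router forwarding the packet to a neighbouring
// router, until the router attached to the destination tile is reached and
// the packet is handed to the local network interface.
void TraverseMesh(int src_x, int src_y, int dst_x, int dst_y) {
    int x = src_x, y = src_y;
    while (x != dst_x || y != dst_y) {
        if (x != dst_x)
            x += (dst_x > x) ? 1 : -1;   // route along X first
        else
            y += (dst_y > y) ? 1 : -1;   // then along Y
        std::printf("hop to router (%d,%d)\n", x, y);
    }
}
```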

In this paper we focus on

Performance evaluation

In this section, we evaluate the behavior of our proposal using simulation. We describe the simulation environment (based on full-system simulation), the integration of the different tools to obtain the full-system simulator, the benchmarks used in our experiments and our experimental setup, followed by a detailed discussion of the results obtained.

Experimental results

In this section, we show the experimental results obtained in the evaluation process. We have considered static scenarios where three applications and the stress traffic are present in the system at the same time. In a more realistic scenario, the system would dynamically allocate the resources to the applications. When an application ends, the network resources it held would be freed, and so the system would reallocate them to new applications, maybe reallocating the still running

Conclusions and future work

This paper aims to improve the performance of applications that run simultaneously in a CMP. Applications share CMP resources, and the performance of an individual application can be seriously affected. If we focus only on the interconnection network, we can see that traffic interference degrades the network performance and, as a consequence, the CMP performance. To address this issue, the network needs a mechanism to isolate the traffic in order to reduce or even eliminate the

Acknowledgments

This work was supported by the Spanish MEC and MICINN, as well as European Commission FEDER funds, under Grants CSD2006-00046 and TIN2009-14475-C04. It was also partly supported by the Junta de Comunidades de Castilla-La Mancha under Grants PCC08-0078-9856 and POII10-0289-3724, and by the project NaNoC (project label 248972), which is funded by the European Commission within the FP7 Research Programme.

References (36)

  • Y. Zhu, Efficient processor allocation strategies for mesh-connected parallel computers, Journal of Parallel and Distributed Computing (1992).
  • V. Agarwal et al., Clock rate versus IPC: the end of the road for conventional microarchitectures.
  • Tilera Tile-Gx Product Brief, 2010. URL...
  • K. Krewell, Best Servers of 2004: Where Multicore is Norm, Microprocessor Report, 2005. URL...
  • J. Howard et al., A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS, in: International Solid-State...
  • S.R. Vangal et al., An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS, IEEE Journal of Solid-State Circuits (2008).
  • S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, et al., TILE64 processor: a 64-Core SoC with mesh...
  • P. Wielage et al., Networks on Silicon: Blessing or Nightmare?
  • W.J. Dally et al., Route Packets, Not Wires: On-Chip Interconnection Networks.
  • M. Sgroi et al., Addressing the System-on-a-Chip Interconnect Woes Through Communication-Based Design.
  • K. Goossens et al., Networks on Silicon: Combining Best-Effort and Guaranteed Services.
  • D. Wentzlaff et al., On-chip interconnection architecture of the tile processor, IEEE Micro (2007).
  • G.D. Micheli et al., Networks on Chips: Technology and Tools (Systems on Silicon) (2006).
  • D. Gupta et al., Enforcing Performance Isolation Across Virtual Machines in Xen.
  • V. Subramani et al., Selective buddy allocation for scheduling parallel jobs on clusters, in: Cluster (2002).
  • P.J. Chuang, N.F. Tzeng, An efficient submesh allocation strategy for mesh computer systems, in: ICDCS, 1991, pp....
  • K. Li, K.H. Cheng, A two dimensional buddy system for dynamic resource allocation in a partitionable mesh connected...
  • M.R. Marty, M.D. Hill, Virtual hierarchies to support server consolidation, in: ISCA, ACM, New York, NY, USA, 2007, pp....

Francisco Triviño received the MS degree in computer science from the University of Castilla-La Mancha, Spain, in 2008 and is currently working toward the PhD degree. He is currently a research assistant in the Research Group in High Performance Networks and Architectures, University of Castilla-La Mancha. His research interests include networks-on-chip and quality of service.

José L. Sánchez received the PhD degree from the Technical University of Valencia, Spain, in 1998. Since November 1986 he has been a member of the Computer Systems Department (formerly Computer Science Department) at the University of Castilla-La Mancha. He is currently an associate professor of computer architecture and technology. His research interests include multicomputer systems, quality of service in high-speed networks, interconnection networks, networks-on-chip, parallel algorithms and simulation.

Francisco J. Alfaro received the MS degree in computer science from the University of Murcia in 1995 and the PhD degree from the University of Castilla-La Mancha in 2003. He is currently an assistant professor of computer architecture and technology in the Computer Systems Department at the University of Castilla-La Mancha. His research interests include high-performance local area networks, QoS, design of high-performance routers, and design of on-chip interconnection networks for multicore systems.

José Flich received his MS and PhD degrees in Computer Science from the Technical University of Valencia (Universidad Politécnica de Valencia), Spain, in 1994 and 2001, respectively. He joined the Department of Computer Engineering (DISCA), Universidad Politécnica de Valencia, in 1998, where he is currently an Associate Professor of Computer Architecture and Technology. He has served as a Program Committee member for several conferences, including ICPP, IPDPS, HiPC, CAC, ICPADS, and ISCC. He is currently co-chair of the CAC and INA-OCMC workshops and vice-chair (high-performance networks track) of the EuroPar conference. His research interests are related to high-performance interconnection networks for multiprocessor systems, clusters of workstations, and networks-on-chip.
