On an efficient NoC multicasting scheme in support of multiple applications running on irregular sub-networks

https://doi.org/10.1016/j.micpro.2010.08.003

Abstract

When a number of applications run simultaneously on a many-core chip multiprocessor (CMP) connected through a network-on-chip (NoC), a significant amount of on-chip traffic is one-to-many (multicast) in nature. Moreover, when multiple applications are mapped onto an NoC architecture under traffic isolation constraints, the sub-networks that these applications are mapped onto tend to be irregular. In the literature, multicasting for irregular topologies is supported through either multiple unicasting or broadcasting, which, unfortunately, results in overly high power consumption and/or long network latency. To address this problem, a simple yet efficient hardware-based multicasting scheme is proposed in this paper. First, an irregular-oriented multicast strategy is proposed. Following this strategy, an irregular-oriented multicast routing algorithm can be designed based on any regular-mesh-based multicast routing algorithm. One such algorithm, namely Alternative Recursive Partitioning Multicasting (AL + RPM), is proposed based on RPM, which was originally designed for the regular mesh topology. The basic idea of AL + RPM is to find the output directions following the basic RPM algorithm and then decide, based on the shape of the sub-network, whether to replicate the packets to the original output directions or to the alternative (AL) output directions. The experimental results show that the proposed AL + RPM multicast algorithm consumes, on average, 14% and 20% less power than bLBDR (a broadcasting-based routing algorithm) and the multiple unicast scheme, respectively. In addition, AL + RPM has much lower network latency than these two approaches. Incorporating AL + RPM into a baseline router to support multicasting incurs a fairly modest area overhead of less than 5.5%.

Introduction

Advances in technology continue to drive the increase in transistor integration capacity. It is estimated that by 2015, there will be 100 billion transistors integrated on a 300 mm² die [1]. To exploit this large number of transistors, and in view of the pressing problem of high power consumption in ever bigger chips, the design paradigm is migrating to many-core architectures [1], [2]. Network-on-chip (NoC) [3] has been proposed as the mainstream on-chip network architecture to efficiently interconnect the large number of (16 or more) processing cores integrated on a many-core system. Some recent, high-profile examples include Intel’s Teraflop [4] and Tilera [5] chips, which feature many-core chip multiprocessor (CMP) architectures with 2D mesh topologies [13] for on-chip interconnect.

With the development of diverse applications and programming models on CMPs, one-to-many communication and one-to-all communication are becoming more common. For example, in CMPs with cache coherent shared memory systems, the cache coherence protocols exhibit one-to-many communication characteristics to keep the ordering of different requests or to invalidate shared data on different cache nodes [6]. In [7], it has been observed that 5–10% of the network traffic is one-to-many in nature, for workloads ranging from scientific to commercial, in communication traces of different cache coherence protocols and operand networks. Therefore, efficient support of one-to-many communication in CMPs, particularly hardware multicast support, will benefit a wide range of applications by boosting network performance while reducing power consumption. Unfortunately, to date, only a very limited number of chip router designs actually support multicasting [6], [7], [8].

In addition, the following issues make multicast support even more complicated. The first issue is topology irregularity. The large number of cores on a CMP unquestionably offers high parallelism in computation. To better utilize these vastly available computation resources, virtualization of the chip becomes a necessity [9], where resources can be distributed among different virtual machines [3]. Applying virtualization [8] at the NoC level basically allows a single NoC-based CMP to be shared by multiple applications, with each mapped to a different sub-network of the chip [10] either statically [11] or dynamically [12]. Fig. 1 shows an example with three applications arriving at 1 ms, 2 ms, and 3 ms. The three applications are allocated to three sub-networks, which may not have regular shapes (e.g., 2D mesh, torus). On the other hand, virtualization requires traffic isolation [8]; that is, communication between nodes in a virtualized region is limited to the sub-network only. Together, the irregular sub-network and traffic isolation requirements rule out regular 2D-mesh-oriented routing algorithms such as XY routing and odd–even routing [13].
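To make the conflict between traffic isolation and mesh-oriented routing concrete, the following is a minimal sketch, not taken from the paper. It assumes a sub-network is represented as a set of (x, y) tile coordinates and that XY routing moves along X first and then along Y; the L-shaped sub-network and the helper names `xy_route` and `stays_isolated` are hypothetical.

```python
def xy_route(src, dst):
    """Dimension-ordered XY routing: move along X first, then along Y.

    Returns the list of nodes visited, including src and dst.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def stays_isolated(path, subnet):
    """Traffic isolation holds only if every visited node is in the sub-network."""
    return all(node in subnet for node in path)


# Hypothetical L-shaped (irregular) sub-network of three tiles.
subnet = {(0, 0), (0, 1), (1, 1)}

path = xy_route((0, 0), (1, 1))       # [(0, 0), (1, 0), (1, 1)]
print(stays_isolated(path, subnet))   # False: (1, 0) lies outside the sub-network
```

In this assumed layout the reverse route from (1, 1) to (0, 0) happens to stay inside the sub-network, which illustrates that whether a mesh-oriented algorithm works depends on the shape of the region.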

The second issue is the unpredictability of application communication behavior. Different types of applications, such as desktop, server, and embedded applications, will be executed on general-purpose CMPs. It is impossible to pre-characterize the communication patterns among the cores inside a sub-network. As a result, customized NoC routing approaches (like the ones using routing tables [14]) may not be feasible.

Hence, it is important to design an efficient multicast mechanism that supports irregular topologies without the need for a routing table. In this paper, an irregular sub-network oriented multicast strategy is first proposed. Following this strategy, an irregular sub-network oriented multicast routing algorithm, namely Alternative Recursive Partitioning Multicast (AL + RPM), is developed based on RPM [13], an efficient deterministic multicast routing algorithm proposed for the regular mesh topology. To the best of our knowledge, our approach is the first multicast routing approach, as opposed to the broadcast-based one [8], that targets irregular sub-networks.

In the rest of the paper, Section 2 reviews the existing work on multicast routing schemes in NoCs. Section 3 presents the preliminaries. Section 4 describes the irregular sub-network oriented multicast routing strategy and algorithm. Section 5 reports the performance evaluation of AL + RPM. Finally, Section 6 concludes the paper.

Section snippets

Related work

Multicast communication has been extensively studied in computer networks and interconnection networks [13]. However, due to the power and area constraints pertaining to NoCs, supporting multicast in NoCs has a different set of requirements. In particular, an efficient multicasting approach for NoCs should result in low network latency, low power consumption, and low area overhead. A simple multicasting approach is to send a multicast packet as multiple unicast packets. However, such a scheme suffers from

Architecture and power models

The target NoC architecture is a tile-based NoC composed of N × N tiles interconnected by a 2-D mesh network. Each tile (or node, used interchangeably), indexed by its coordinate (x, y) or its ID xN + y, where 0 ≤ x ≤ N − 1 and 0 ≤ y ≤ N − 1, has one processing core and one router. Each router (shown in Fig. 2) connects to its local processing core and four neighbour tiles through bidirectional channels. A 5 × 5 crossbar switch is used as the switching fabric of the router. The arbitration unit arbitrates the
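The tile indexing described above can be illustrated with a short sketch. The ID formula xN + y and the coordinate bounds come from the text; the function names and the choice to return neighbours as a plain list are assumptions made for illustration only.

```python
def tile_id(x, y, n):
    """Map a coordinate (x, y) in an N x N mesh to the tile ID x*N + y."""
    if not (0 <= x < n and 0 <= y < n):
        raise ValueError("coordinate outside the N x N mesh")
    return x * n + y


def tile_coord(tid, n):
    """Inverse mapping: recover (x, y) from a tile ID."""
    if not (0 <= tid < n * n):
        raise ValueError("tile ID outside the N x N mesh")
    return divmod(tid, n)  # (tid // n, tid % n)


def mesh_neighbours(x, y, n):
    """Return the neighbour tiles of (x, y) that exist in the mesh (up to four)."""
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < n and 0 <= cy < n]


print(tile_id(2, 3, 4))                 # 11
print(tile_coord(11, 4))                # (2, 3)
print(len(mesh_neighbours(0, 0, 4)))    # 2: a corner tile has only two neighbours
```

Corner and edge tiles have fewer than four neighbours, so some ports of the 5 × 5 crossbar are simply unused at the mesh boundary.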

Motivation example and irregular sub-network oriented multicasting strategy

Before the proposed algorithms are described in detail, an example is given to explain the motivation. Fig. 6 shows an irregular sub-network composed of five nodes. A multicast packet is sent from the source node to two destination nodes. The dashed line represents the path if RPM [6] is used. However, since the sub-network is irregular, the dashed path cannot reach the destinations, i.e., at node 4, the packet cannot go West as the link to West is not available in this sub-network.
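The general idea stated in the abstract, keeping the output direction chosen by the regular-mesh algorithm when it is usable and otherwise switching to an alternative direction, can be sketched as follows. This is a simplified illustration and not the AL + RPM algorithm itself: it handles a single destination, takes the base algorithm's choice as the `preferred` argument rather than computing it with RPM, treats a link as available when the neighbouring tile belongs to the sub-network, and assumes East/West change x and North/South change y. All names are hypothetical.

```python
# Assumed direction convention: East/West change x, North/South change y.
DIRS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def minimal_dirs(cur, dst):
    """Directions that reduce the Manhattan distance from cur to dst."""
    dirs = []
    if dst[0] > cur[0]:
        dirs.append("E")
    if dst[0] < cur[0]:
        dirs.append("W")
    if dst[1] > cur[1]:
        dirs.append("N")
    if dst[1] < cur[1]:
        dirs.append("S")
    return dirs


def pick_output(cur, dst, preferred, subnet):
    """Keep the preferred direction if its link stays inside the sub-network;
    otherwise fall back to another minimal direction, or None if none exists."""
    def usable(d):
        dx, dy = DIRS[d]
        return (cur[0] + dx, cur[1] + dy) in subnet

    if usable(preferred):
        return preferred
    for d in minimal_dirs(cur, dst):
        if d != preferred and usable(d):
            return d
    return None


# Hypothetical irregular sub-network: the tile east of (0, 0) is missing.
subnet = {(0, 0), (0, 1), (1, 1)}
print(pick_output((0, 0), (1, 1), "E", subnet))  # N
```

A return value of None marks the case where no minimal alternative exists inside the sub-network; how such cases are handled is beyond what this sketch covers.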

Experiment settings

To evaluate the performance of the AL + RPM multicast routing algorithm, AL + RPM is simulated under traces from real applications and random traffic. The performance of AL + RPM in terms of power consumption (as defined in Section 3.1) and network latency is compared against bLBDR and multiple unicast. These multicast algorithms are implemented on the cycle-accurate simulator Noxim [25]. The power parameters are based on synthesis results obtained using Synopsys Physical Compiler with a TSMC 90 nm library.

Conclusion

In this paper, an irregular sub-network oriented multicast routing strategy was proposed. The basic idea of this routing strategy is that, if the output channel found by regular topology oriented multicast routing is not available, choose an alternative output channel which also leads to the minimal path to the destination. As a matter of fact, following this strategy, an irregular topology oriented multicast routing algorithm can be designed based on any regular mesh based multicast routing

Acknowledgement

This work has been supported by NSF under grant no. ECCS-0702168 and National Natural Science Foundation of China under grant no. 60873112.

Xiaohang Wang received the B.Eng. degree in communication and electronic engineering from Zhejiang University, China, in 2006. He is currently pursuing the Ph.D. degree in communication and electronic engineering at Zhejiang University, China. His research interests include compilers, parallel programming models, core-based digital SoC and NoC design and test.

References (25)

  • S. Borkar, Thousand core chips: a technology perspective, in: Proc. 44th Design Automation Conf., ACM, 2007, pp....
  • J.L. Manferdelli et al., Challenges and opportunities in many-core computing, Proc IEEE (2008)
  • J. Held, J. Bautista, S. Koehl, From a few cores to many: a tera-scale computing research review, Intel Research White...
  • Y. Hoskote et al., A 5-GHz mesh interconnect for a teraflops processor, IEEE Micro (2007)
  • D. Wentzlaff et al., On-chip interconnection architecture of the tile processor, IEEE Micro (2007)
  • L. Wang, Y. Jin, H. Kim, E.J. Kim, Recursive partitioning multicast: a bandwidth-efficient routing for on-chip, in:...
  • N.E. Jerger, L.S. Peh, M. Lipasti, Virtual circuit tree multicasting: a case for on-chip hardware multicast support,...
  • S. Rodrigo, J. Flich, J. Duato, Efficient unicast and multicast support for CMPs, in: Proc. 41st IEEE/ACM Int’l Symp....
  • A. Gavrilovska, S. Kumar, H. Raj, K. Schwan, V. Gupta, R. Nathuji, R. Niranjan, A. Ranadive, P. Saraiya,...
  • S. Murali, Designing Reliable and Efficient Networks on Chips (2009)
  • M.B. Taylor et al., The Raw microprocessor: a computational fabric for software circuits and general-purpose programs, IEEE Micro (2002)
  • C.L. Chou, R. Marculescu, User-aware dynamic task allocation in networks-on-chip, in: Proc. Conf. Design, Automation...

Dr. Mei Yang received her Ph.D. in Computer Science from the University of Texas at Dallas in Aug. 2003. She has been an assistant professor in the Department of Electrical and Computer Engineering, University of Nevada, Las Vegas since Aug. 2004. Her research interests include computer architectures, networking, and embedded systems.

Dr. Yingtao Jiang received his Ph.D. in Computer Science from the University of Texas at Dallas in Aug. 2001. He joined the Department of Electrical and Computer Engineering, University of Nevada, Las Vegas in Aug. 2001. He has been an associate professor since Aug. 2007. His research interests include algorithms, computer architectures, VLSI, networking, nano technologies, etc.

Peng Liu received the B.Eng. and M.Eng. degrees in optical engineering from Zhejiang University, in 1992 and 1996, respectively, and the Ph.D. degree in communication and electronic engineering from Zhejiang University, China, in 1999. He has been an Associate Professor with the Information Science and Electronic Engineering Department, Zhejiang University, Hangzhou, China, since 2002. His research focuses on embedded processors, multiprocessor systems-on-chip architectures, on-chip interconnection networks, real-time operating systems, compilers, and circuits for communications.
