RAFT: A router architecture with frequency tuning for on-chip networks

doi:10.1016/j.jpdc.2010.09.005

Journal of Parallel and Distributed Computing

Volume 71, Issue 5, May 2011, Pages 625-640

https://doi.org/10.1016/j.jpdc.2010.09.005

Abstract

With an increasing number of cores being integrated on a single die, Networks-on-Chip (NoCs) have become the de facto standard in providing scalable communication backbones for these multi-core chips. NoCs have a significant impact on the system’s performance, power and reliability. However, NoCs can be plagued by higher power consumption and degraded throughput if the network and router are not designed properly. Towards this end, this paper proposes a novel router architecture, where we tune the frequency of a router in response to network load to manage both performance and power. We propose three dynamic frequency tuning techniques, FreqBoost, FreqThrtl and FreqTune, targeted at congestion and power management in NoCs. We also propose and evaluate a novel fine-grained frequency tuning scheme where we vary the number of virtual channels in a router dynamically. As a further optimization to these schemes, we propose a frequency tuning scheme where we tune the frequency of the four ports of a mesh router separately from the local port. As enablers for these techniques, we exploit Dynamic Voltage and Frequency Scaling (DVFS) and the imbalance in a generic router pipeline through time stealing. We also evaluate and analyze the proposed schemes from the point of view of reliability against soft error vulnerability and provide guidelines for choosing the appropriate scheme when reliability is the prime design constraint.

Experiments using synthetic workloads on an 8 × 8 wormhole-switched mesh interconnect show that FreqBoost is a better choice for reducing average latency (maximum 40%), while FreqThrtl provides the maximum benefits in terms of power saving and energy delay product (EDP). The FreqTune scheme is a better candidate for optimizing both performance and power, achieving on average a 36% reduction in latency, 13% savings in power (up to 24% at high load), and 40% savings (up to 70% at high load) in EDP. With application benchmarks, we observe IPC improvement of up to 23% using our design. Our analysis shows FreqBoost to be the most robust of the three schemes when reliability is a concern.

Introduction

Power-aware Chip-Multiprocessor (CMP) design has become an important paradigm, at least on par with performance, in the nanometer regime. Since on-chip networks are expected to consume a significant part of the total chip power, design of energy-efficient processors along with the interconnection network provides a holistic approach to overall power conservation. It is predicted that NoC power can be a significant part of the entire chip power and can account for up to 40–60 W [2] with technology scaling for a mesh based network with 128 nodes. A few commercial designs also support this trend, where up to 28% of the entire chip power is devoted to the interconnect [10]. Thus, on-chip interconnects that can optimize both performance and power pose intriguing research challenges. This is evident from the large body of literature covering multiple facets of NoC design [14], [13], [25], [7], [23], [30].

Router frequency is one of the critical design parameters that directly affects both performance and power, albeit in a contradictory fashion. With a sophisticated design of a router pipeline, it is possible to increase the operating frequency [19], [24], but higher router frequency leads to higher power consumption. On the other hand, a mismatch between processor and router/network frequency can result in significant performance penalties [5]. Thus, a prudent control of the router frequency can help in optimizing both performance and power.

In this context, we propose a new design philosophy for NoCs where, unlike the traditional single-frequency routers, we propose dynamically modulating the frequencies of routers in a network for effectively managing network congestion and energy consumption. The proposed variable frequency router design, RAFT, uses the well-known DVFS technique and complements the use of DVFS in the processor cores for power management.

We motivate the fine balance that exists between power and performance in an on-chip network with a relative power-performance trade-off analysis with respect to offered network load. Fig. 1 shows the relative growth of network power versus network latency for an 8×8 mesh with a synthetic traffic mixture of Uniform Random, Transpose, Nearest-Neighbor and Self-Similar traffic (a detailed network configuration is given later in Table 3(a)). The bars indicate network latency/power normalized with respect to the network latency/power at no load (idle network). At low load, the network power consumption is low. However, network power grows at a much higher rate than network latency. For example, as shown in Fig. 1, the network power grows to 30× as the injection rate varies from 1% to 40%, whereas the network latency grows only 7×. We leverage our insights from these trends to optimize the network at low load for performance and at high load for power. An activity-based power-management technique, which was recently implemented in the Intel 80-core routers [10], [37], shares a similar view of optimizing the network’s power based on activity, albeit in a different fashion by clock-gating the idle ports.

Since performance and power are directly proportional to frequency, we dynamically modulate the router frequency in response to network load to facilitate these optimizations, and demonstrate the advantages at system level. Specifically, at low load we operate the routers at peak frequency. At high load, we dynamically determine the operating frequency of individual routers in the network. The dynamic schemes that determine the operating frequencies of the routers are designed to (a) reduce power consumption and (b) manage congestion in the network, by selectively stepping up and down the frequency of a subset of routers in the congested regions of a network. We propose a two-pronged approach to vary the baseline router frequency: clock scaling and time-stealing. We employ Dynamic Voltage and Frequency Scaling (DVFS) [31], [39] to scale the router clock frequency up and down below the nominal frequency by switching the operating voltage levels. The time-stealing technique is employed to boost the baseline router frequency by exploiting the timing imbalance between router pipeline stages, such that a router can operate at the average cycle time of all the pipeline stages rather than at the delay of the worst-case pipeline stage.
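The effect of time-stealing on the clock period can be illustrated with a short arithmetic sketch. The stage delays below are hypothetical values chosen only for illustration, and the function names `cycle_times` and `boost_factor` are not from the paper; the sketch simply contrasts a clock set by the worst-case stage with one set by the average stage delay.

```python
def cycle_times(stage_delays_ps):
    """Return (worst_case, average) cycle time in ps for a router pipeline."""
    worst_case = max(stage_delays_ps)
    average = sum(stage_delays_ps) / len(stage_delays_ps)
    return worst_case, average


def boost_factor(stage_delays_ps):
    """Frequency ratio of a time-stealing clock over a worst-case clock."""
    worst_case, average = cycle_times(stage_delays_ps)
    return worst_case / average


# Hypothetical two-stage pipeline delays (illustrative, not from the paper).
delays = [500, 300]
print(cycle_times(delays))   # worst-case vs. average cycle time
print(boost_factor(delays))  # how much faster the boosted clock could run
```

A perfectly balanced pipeline yields a boost factor of 1, so under this simplified model the gain from time-stealing comes entirely from stage imbalance.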

We explore three techniques for dynamic frequency tuning to simultaneously address power-performance trade-offs. The first technique, called FreqBoost, initially employs time-stealing to operate all routers at a boosted frequency. This helps in enhancing the performance at low load conditions, while slightly increasing the power consumption. However, as the network gets congested, power consumption becomes a key challenge. Hence, it throttles the frequency/voltage of selected routers using DVFS. The second mechanism, called FreqThrtl, initially operates all routers at the baseline frequency and selectively employs time-stealing and DVFS to either increase or decrease the frequency at the onset of congestion. This scheme, unlike FreqBoost, can modulate frequency of routers bi-directionally (higher or lower) and consequently can help reduce power and manage congestion at high load more effectively. Using this technique, the frequency of a congested router is boosted at the onset of congestion and the frequency of a router adjacent to this congested router is throttled. FreqTune is a hybrid of the above two schemes that dynamically switches between FreqBoost and FreqThrtl as the network load varies from low to high.
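As a rough illustration of how the three schemes differ, the following sketch maps a router's local view of congestion to a frequency level. It is a simplified reading of the description above, not the authors' implementation: the `Level` names, the rule that a router is throttled when a downstream neighbor is congested, the priority given to a router's own congestion, and the `load_threshold` value are all assumptions made for this example.

```python
from enum import Enum


class Level(Enum):
    THROTTLED = 0  # below baseline, reached via DVFS
    BASELINE = 1   # nominal router frequency
    BOOSTED = 2    # above baseline, reached via time-stealing


def freq_boost(congested, downstream_congested):
    # Assumption: every router starts boosted; a router feeding a
    # congested downstream neighbor is throttled.
    if downstream_congested and not congested:
        return Level.THROTTLED
    return Level.BOOSTED


def freq_thrtl(congested, downstream_congested):
    # Assumption: every router starts at baseline; a congested router is
    # boosted and a router adjacent to it is throttled.
    if congested:
        return Level.BOOSTED
    if downstream_congested:
        return Level.THROTTLED
    return Level.BASELINE


def freq_tune(congested, downstream_congested, load, load_threshold=0.5):
    # Hybrid: FreqBoost at low load, FreqThrtl at high load.
    # load_threshold is a hypothetical switching point.
    policy = freq_boost if load < load_threshold else freq_thrtl
    return policy(congested, downstream_congested)


print(freq_tune(False, False, load=0.1))  # low load, no congestion
print(freq_tune(True, False, load=0.8))   # high load, congested router
```

In this sketch the only difference between the two base schemes in the uncongested case is the starting level, which is what lets the hybrid favor latency at low load and power at high load.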

We evaluate the performance and power implications of the proposed techniques using a wormhole-switched mesh interconnect with synthetic and application benchmarks and compare them with respect to a baseline router/network. To emphasize the efficacy of our approach, we compare our results with adaptive routing and with a baseline design that employs time-stealing but no congestion management.

The novelty and the primary contributions of this paper are the following:

• Variable Frequency Router (RAFT) concept: We propose novel frequency tuning algorithms to reduce latency and power consumption in NoC by distributed throttling and boosting of router frequencies depending upon network load. To the best of our knowledge, this is the first work to propose a distributed congestion management scheme that is based on operating individual routers at different frequency levels. Our proposal leads to 36% reduction in latency at high load, 13.5% savings in power (up to 24% at high load) and average 40.5% reduction in energy delay product (EDP) (maximum 70% at high load). With application benchmarks, we achieve IPC improvements up to 23.1% using our schemes. Moreover, the power-performance benefits increase when these techniques are applied to large networks.

• Variable Frequency Router (RAFT) enablers: Our analysis corroborates the pipeline stage imbalance found in other routers proposed in [5], [27], [26], [19], [24]. While this imbalance can be removed using power optimizations such as variable voltage (supply/threshold) and gate sizing optimizations, we focus on time-stealing techniques to boost performance. Time-stealing in NoC routers can lead to a 25% reduction in zero-load latency. In addition, we use the well-known DVFS technique at the granularity of an individual router to tune the router frequency and power based on its load. We believe this is the first paper to apply time-stealing and DVFS techniques for performance and power management in on-chip routers.

• Performance and power benefits of RAFT: We demonstrate that the proposed techniques are not only much more effective in delivering better performance and reducing power consumption compared to the monolithic, single frequency design, but also can outperform other pure performance enhancement techniques such as using adaptive routing and simply increasing the operating frequency without any congestion management.

• A new flow-control mechanism in RAFT: We couple the idea of dynamic frequency tuning in RAFT with a new flow control scheme. Our proposed flow control scheme, where we slow down incoming flows into a congested router, creates a continuous flow of packets and offers the option of slowing instead of fully stopping a packet stream. Also, whereas regular back-pressure (credit-based flow control) is necessary for correct operation of the network, our proposed tuning-style back-pressure is an optimization which opens up a whole new avenue in power/performance management. We then opportunistically lower the voltage and frequency of the routers that should be slowed down to manage congestion, and this reduces power in the network. Thus, we show that slowing down a router leads not only to better congestion management but also to a reduction in network power.

• Reliability of RAFT against soft errors: We study the impact of our proposed techniques on the reliability of the routers. The buffers in the NoC router are known to be vulnerable to radiation-induced soft errors [12], [40]. Scaling the voltage and frequency of the routers further affects the raw soft error rate [40] of data in these buffer structures. We analyze in detail the impact on soft error reliability in the routers employing the three frequency tuning schemes and present guidelines in choosing the appropriate scheme when reliability is the prime design constraint.

The rest of this paper is organized as follows: the three performance and power management techniques are presented in Section 2. In Section 3, we elaborate on clock scaling and time stealing techniques for deploying these schemes, and the required hardware support. Section 4 discusses the experimental platform and results including reliability analysis of the proposed schemes. Prior work is discussed in Section 5, followed by the concluding remarks in Section 6.

Section snippets

Frequency tuning rationale

We use a congestion metric (buffer utilization) per port in a router to decide whether this port of the router is likely to get congested in the next few cycles, and if so, it signals the upstream router to throttle. The intuition behind such an approach comes from the fact that if a router is getting congested, it is due to the pressure from its neighboring routers. The congested router is unable to arbitrate and push out its flits fast enough compared to the rate of flit injection into its
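The per-port congestion check described above can be sketched as a simple threshold test on buffer utilization. The port names, the buffer capacity, and the `threshold` value here are hypothetical; the snippet does not state the actual threshold or buffer sizes, so this is only an illustration of the idea.

```python
def ports_to_throttle(occupied, capacity, threshold=0.6):
    """Return input ports whose buffer utilization exceeds the threshold.

    occupied: mapping of port name to number of occupied buffer slots.
    capacity: buffer slots available per port (hypothetical value).
    threshold: utilization above which the upstream router is signaled
               to throttle (hypothetical value).
    """
    return [port for port, used in occupied.items()
            if used / capacity > threshold]


# Hypothetical snapshot of one router's input buffers.
snapshot = {"N": 14, "E": 4, "S": 10, "W": 0}
print(ports_to_throttle(snapshot, capacity=16))
```

Each port returned would correspond to one upstream neighbor that receives a throttle signal.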

Router and network architecture

In this section, we discuss the enablers for our proposed techniques: on-chip DVFS in routers and time-stealing. We then describe the architectural modifications to the router design for supporting frequency scaling and time stealing techniques, their hardware implementation and the corresponding overheads.

Experimental platform

We use a 64-node network as our experimental platform with the network laid out as an 8×8 2D-mesh. We use a cycle-accurate simulator for our simulations and model a detailed on-chip network. For on-chip routers we model a state-of-the-art two-stage router pipeline based on [27]. The base case router has 5 physical channels (PCs) including the local PE-to-router port and 4 virtual channels (VCs) multiplexed on to each PC. A message (packet) consists of six 128-bit flits and we use a buffer depth

Related work

We summarize the prior work in two sub-sections: on-chip networks and frequency scaling techniques.

Conclusions

NoC power is a concern not only in current systems but is also going to dominate design decisions in future systems, since the network’s power is likely to increase super-linearly with the number of cores [2]. Additionally, performance implications of NoCs are likely to influence CMP/SoC design decisions as well. Towards this end, we propose a variable frequency router architecture, called RAFT, for dynamically controlling the performance and power behavior of on-chip interconnects by effective

Acknowledgments

We would like to thank the anonymous reviewers for their reviews and comments in improving this paper. This work is supported in part by National Science Foundation (NSF) grants CCF-0702617, CNS-0916887 and CCF-0903432.

Asit K. Mishra received the B.Tech. degree in electrical engineering from National Institute of Technology, Rourkela, in 2006 and is currently a Ph.D. candidate in the Department of Computer Science and Engineering at Pennsylvania State University. His research interests include chip-multiprocessors, NoCs and emerging memory technologies.

References (40)

K. Unlu et al., Neutron-induced soft error rate measurements in semiconductor memories, Nuclear Instruments and Methods in Physics Research (2007)
C. Bienia, S. Kumar, J.P. Singh, K. Li, The PARSEC benchmark suite: characterization and architectural implications,...
S. Borkar, Networks for multi-core chips: a contrarian view, in: Special Session at ISLPED,...
S. Borkar, Design challenges of technology scaling, IEEE Micro (1999)
D. Brooks, M. Martonosi, Dynamic thermal management for high-performance microprocessors, in: 7th Intl. Symp. High...
R. Das, A.K. Mishra, C. Nicopoulus, D. Park, V. Narayanan, R. Iyer, et al., Performance and power optimization through...
M. Galles, Scalable pipelined interconnect for distributed endpoint routing: the SGI SPIDER chip, in: Symposium on High...
P. Gratz, B. Grot, S. Keckler, Regional congestion awareness for load balance in networks-on-chip, in: Proceedings of...
M. Hashimoto et al., Statistical analysis of clock skew variation in H-tree structure
K. Hausman, G. Gaudenzi, J. Mosley, S. Tempest, US patent 4978927—programmable voltage controlled ring oscillator,...
Yatin Hoskote, Sriram Vangal, Arvind Singh, Nitin Borkar, Shekhar Borkar, A 5-GHz mesh interconnect for a teraflops...
J. Howard, A 48-core IA-32 message-passing processor with DVFS in 45 nm CMOS
International Technology Roadmap for Semiconductors, ITRS,...
J. Kim, W.J. Dally, S. Scott, D. Abts, Technology-driven, highly-scalable dragonfly topology, in: 35th International...
M.M. Kim, J.D. Davis, M. Oskin, T. Austin, Polymorphic on-chip networks, in: Proc. of the 35th International Symposium...
W. Kim, M.S. Gupta, G.Y. Wei, D. Brooks, System level analysis of fast, per-core DVFS using on-chip switching...
E.J. Kim et al., A holistic approach to designing energy-efficient cluster interconnects, IEEE Transactions on Computers (2005)
J. Kim, D. Park, C. Nicopolous, N. Vijaykrishnan, C.R. Das, Design and analysis of an NoC architecture from...
J. Kim, D. Park, T. Theocharides, N. Vijaykrishnan, C.R. Das, A low latency router supporting adaptivity for on-chip...
A. Kumar, P. Kundu, A. Singh, L.S. Peh, N.K. Jha, A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch...


Aditya Yanamandra is an engineer with the Virtual Platform Center of Expertise (VP COE) in Intel. His research interests include reliability of on-chip interconnection networks. He received his Ph.D. in computer science and engineering from the Pennsylvania State University (2010) and his Bachelor of Technology in computer science and engineering from IIT Madras (2005).

Reetuparna Das is a research scientist at Intel Labs. Her research interests include computer architecture, especially interconnection networks. She has a B.S. degree in computer engineering from the National Institute of Technology, India and Ph.D. degree in computer science and engineering from Pennsylvania State University.

Soumya Eachempati is a component design engineer in Intel Corporation. She received her Ph.D. in Computer Science and Engineering from Pennsylvania State University and obtained her BTech from Indian Institute of Technology, Madras in 2005. Her research interests include application of emerging interconnect technologies on the die and on-chip interconnection networks.

Ravi Iyer is a Principal Engineer in Intel Labs. He directs research on SoC and CMP architectures. His research interests are cache/memory hierarchies, small core architectures, accelerators, fabrics, emerging workloads and performance analysis. Ravi has published over 100 papers in conferences and journals. He has also filed 30+ patent applications. Ravi received his Ph.D. from Texas A&M University in 1999.

N. Vijaykrishnan received the Ph.D. degree from the University of South Florida in 1998. He is a professor in the Department of Computer Science and Engineering, Pennsylvania State University, University Park. His research interests include the areas of energy-aware reliable systems, nano/VLSI systems, FPGA design, functional verification and computer architecture.

Chita R. Das received the Ph.D. degree in computer science from the University of Louisiana, Lafayette, in 1986. Since 1986, he has been with the Pennsylvania State University where he is currently a distinguished professor in the Department of Computer Science and Engineering. His research interests include parallel and distributed computing, performance evaluation and fault-tolerant computing.

^☆: Our original work on frequency tuning routers appeared in “A Case for Dynamic Frequency Tuning in On-Chip Networks”, Asit K. Mishra, Reetuparna Das, Soumya Eachempati, Ravishankar Iyer, Narayanan Vijaykrishnan, Chita R. Das, MICRO, 2009, pp 292–303. This journal version augments the analysis on reliability issues due to DVFS in routers and two novel optimizations (nearest-neighbor aware frequency tuning and fine-grained frequency control of routers) to further improve the performance-power envelope of NoCs.
