RAFT: A router architecture with frequency tuning for on-chip networks☆
Introduction
Power-aware Chip-Multiprocessor (CMP) design has become an important paradigm, at least on par with performance, in the nanometer regime. Since on-chip networks are expected to consume a significant part of the total chip power, designing energy-efficient processors together with the interconnection network provides a holistic approach to overall power conservation. Network-on-chip (NoC) power is predicted to account for up to 40–60 W [2] with technology scaling for a mesh-based network with 128 nodes. A few commercial designs also support this trend, where up to 28% of the entire chip power is devoted to the interconnect [10]. Thus, on-chip interconnects that can optimize both performance and power pose intriguing research challenges. This is evident from the large body of literature covering multiple facets of NoC design [14], [13], [25], [7], [23], [30].
Router frequency is one of the critical design parameters that directly affects both performance and power, albeit in a contradictory fashion. With a sophisticated design of a router pipeline, it is possible to increase the operating frequency [19], [24], but higher router frequency leads to higher power consumption. On the other hand, a mismatch between processor and router/network frequency can result in significant performance penalties [5]. Thus, a prudent control of the router frequency can help in optimizing both performance and power.
In this context, we propose a new design philosophy for NoCs in which, unlike traditional single-frequency routers, the frequencies of the routers in a network are dynamically modulated to manage network congestion and energy consumption effectively. The proposed variable-frequency router design, RAFT, uses the well-known DVFS technique and complements the use of DVFS in the processor cores for power management.
We motivate the fine balance that exists between power and performance in an on-chip network with a relative power-performance trade-off analysis with respect to offered network load. Fig. 1 shows the relative growth of network power versus network latency for an 8×8 mesh with a synthetic traffic mixture of Uniform Random, Transpose, Nearest-Neighbor and Self-Similar traffic (the detailed network configuration is given later in Table 3(a)). The bars indicate network latency and power, each normalized to its value at no load (idle network). At low load, the network power consumption is low. However, network power grows much faster than network latency as the load increases. For example, as shown in Fig. 1, the network power grows 30× as the injection rate varies from 1% to 40%, whereas the network latency grows only 7×. We leverage these trends to optimize the network for performance at low load and for power at high load. An activity-based power-management technique, which was recently implemented in the Intel 80-core routers [10], [37], shares a similar view of optimizing the network’s power based on activity, albeit in a different fashion, by clock-gating the idle ports.
Since performance and power are directly proportional to frequency, we dynamically modulate the router frequency in response to network load to facilitate these optimizations, and demonstrate the advantages at the system level. Specifically, at low load we operate the routers at peak frequency. At high load, we dynamically determine the operating frequency of individual routers in the network. The dynamic schemes that determine the operating frequencies of the routers are designed to (a) reduce power consumption and (b) manage congestion in the network, by selectively stepping the frequency of a subset of routers in the congested regions of the network up and down. We propose a two-pronged approach to vary the baseline router frequency: clock scaling and time-stealing. We employ Dynamic Voltage and Frequency Scaling (DVFS) [31], [39] to scale the router clock frequency up and down below the nominal frequency by switching the operating voltage levels. The time-stealing technique is employed to boost the baseline router frequency by exploiting the timing imbalance between router pipeline stages, such that a router can operate at the average cycle time of all the pipeline stages rather than at the delay of the worst-case pipeline stage.
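The effect of time-stealing on the clock period can be illustrated with a small calculation. The following is a minimal sketch, not the paper's implementation: the stage delays are hypothetical values chosen only for illustration, and the function names are assumptions introduced here.

```python
def worst_case_period(stage_delays_ps):
    """Clock period without time-stealing: set by the slowest pipeline stage."""
    return max(stage_delays_ps)


def time_stealing_period(stage_delays_ps):
    """Clock period with ideal time-stealing: the average of all stage delays."""
    return sum(stage_delays_ps) / len(stage_delays_ps)


def frequency_ghz(period_ps):
    """Convert a clock period in picoseconds to a frequency in GHz."""
    return 1000.0 / period_ps


# Hypothetical two-stage router pipeline with unbalanced stages (delays in ps).
stages = [600.0, 400.0]

base_period = worst_case_period(stages)        # limited by the 600 ps stage
boosted_period = time_stealing_period(stages)  # average of 600 ps and 400 ps

print(f"baseline: {frequency_ghz(base_period):.2f} GHz")
print(f"boosted:  {frequency_ghz(boosted_period):.2f} GHz")
print(f"speed-up: {base_period / boosted_period:.2f}x")
```

The sketch assumes that slack can be borrowed perfectly across stages; a real design would also have to account for clock skew and the overhead of the time-stealing circuitry, so the achievable boost would be smaller than this ideal figure.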
We explore three techniques for dynamic frequency tuning to simultaneously address power-performance trade-offs. The first technique, called FreqBoost, initially employs time-stealing to operate all routers at a boosted frequency. This helps in enhancing the performance at low load conditions, while slightly increasing the power consumption. However, as the network gets congested, power consumption becomes a key challenge. Hence, it throttles the frequency/voltage of selected routers using DVFS. The second mechanism, called FreqThrtl, initially operates all routers at the baseline frequency and selectively employs time-stealing and DVFS to either increase or decrease the frequency at the onset of congestion. This scheme, unlike FreqBoost, can modulate frequency of routers bi-directionally (higher or lower) and consequently can help reduce power and manage congestion at high load more effectively. Using this technique, the frequency of a congested router is boosted at the onset of congestion and the frequency of a router adjacent to this congested router is throttled. FreqTune is a hybrid of the above two schemes that dynamically switches between FreqBoost and FreqThrtl as the network load varies from low to high.
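The bi-directional behavior of FreqThrtl described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: it decides per router rather than per port, it throttles every mesh neighbor of a congested router, and the threshold value, the mode names and the function `freq_thrtl_step` are hypothetical.

```python
BOOST, BASE, THROTTLE = "boost", "base", "throttle"


def neighbors(x, y, width, height):
    """Yield the coordinates of the mesh neighbors that exist for router (x, y)."""
    for dx, dy in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield nx, ny


def freq_thrtl_step(utilization, threshold=0.6):
    """Return a frequency mode for every router in a 2D mesh.

    utilization[y][x] is the buffer utilization (0.0 to 1.0) of router (x, y).
    The threshold is a hypothetical congestion trigger, not a value from the paper.
    """
    height, width = len(utilization), len(utilization[0])
    modes = [[BASE] * width for _ in range(height)]

    congested = [(x, y) for y in range(height) for x in range(width)
                 if utilization[y][x] >= threshold]

    # Congested routers are boosted so that they drain their buffers faster.
    for x, y in congested:
        modes[y][x] = BOOST

    # Non-congested neighbors are throttled to reduce the pressure they apply.
    for x, y in congested:
        for nx, ny in neighbors(x, y, width, height):
            if modes[ny][nx] != BOOST:
                modes[ny][nx] = THROTTLE

    return modes


# Hypothetical 3x3 mesh in which only the center router is congested.
util = [
    [0.1, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.1],
]
for row in freq_thrtl_step(util):
    print(row)
```

In this example the center router is boosted, its four neighbors are throttled, and the corner routers stay at the baseline frequency.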
We evaluate the performance and power implications of the proposed techniques using a wormhole-switched mesh interconnect with synthetic and application benchmarks and compare them with respect to a baseline router/network. To emphasize the efficacy of our approach, we compare our results with adaptive routing and with a baseline design that employs time-stealing but no congestion management.
The novelty and the primary contributions of this paper are the following:
• Variable Frequency Router (RAFT) concept: We propose novel frequency tuning algorithms to reduce latency and power consumption in NoCs by distributed throttling and boosting of router frequencies depending upon network load. To the best of our knowledge, this is the first work to propose a distributed congestion management scheme that is based on operating individual routers at different frequency levels. Our proposal leads to a 36% reduction in latency at high load, 13.5% savings in power (up to 24% at high load) and an average 40.5% reduction in energy-delay product (EDP) (maximum 70% at high load). With application benchmarks, we achieve IPC improvements of up to 23.1% using our schemes. Moreover, the power-performance benefits increase when these techniques are applied to large networks.
• Variable Frequency Router (RAFT) enablers: Our analysis corroborates the pipeline stage imbalance found in other routers proposed in [5], [27], [26], [19], [24]. While this imbalance can be removed using power optimizations such as variable voltage (supply/threshold) and gate sizing optimizations, we focus on time-stealing techniques to boost performance. Time-stealing in NoC routers can lead to a 25% reduction in zero-load latency. In addition, we use the well-known DVFS technique at the granularity of an individual router to tune the router frequency and power based on its load. We believe this is the first paper to apply time-stealing and DVFS techniques for performance and power management in on-chip routers.
• Performance and power benefits of RAFT: We demonstrate that the proposed techniques are not only much more effective in delivering better performance and reducing power consumption compared to the monolithic, single frequency design, but also can outperform other pure performance enhancement techniques such as using adaptive routing and simply increasing the operating frequency without any congestion management.
• A new flow-control mechanism in RAFT: We couple the idea of dynamic frequency tuning in RAFT with a new flow control scheme. Our proposed flow control scheme, in which we slow down incoming flows into a congested router, creates a continuous flow of packets and offers the option of slowing instead of fully stopping a packet stream. Also, whereas regular back-pressure (credit-based flow control) is necessary for correct operation of the network, our proposed tuning-style back-pressure is an optimization that opens up a whole new avenue in power/performance management. We then opportunistically lower the voltage and frequency of the routers that should be slowed down to manage congestion, which reduces power in the network. Thus, we show that slowing down a router leads not only to better congestion management but also to a reduction in network power.
• Reliability of RAFT against soft errors: We study the impact of our proposed techniques on the reliability of the routers. The buffers in the NoC router are known to be vulnerable to radiation-induced soft errors [12], [40]. Scaling the voltage and frequency of the routers further affects the raw soft error rate [40] of data in these buffer structures. We analyze in detail the impact on soft error reliability in the routers employing the three frequency tuning schemes and present guidelines in choosing the appropriate scheme when reliability is the prime design constraint.
The rest of this paper is organized as follows: the three performance and power management techniques are presented in Section 2. In Section 3, we elaborate on clock scaling and time stealing techniques for deploying these schemes, and the required hardware support. Section 4 discusses the experimental platform and results including reliability analysis of the proposed schemes. Prior work is discussed in Section 5, followed by the concluding remarks in Section 6.
Frequency tuning rationale
We use a congestion metric (buffer utilization) per port in a router to decide whether this port of the router is likely to get congested in the next few cycles, and if so, it signals the upstream router to throttle. The intuition behind such an approach comes from the fact that if a router is getting congested, it is due to the pressure from its neighboring routers. The congested router is unable to arbitrate and push out its flits fast enough compared to the rate of flit injection into its
Router and network architecture
In this section, we discuss the enablers for our proposed techniques: on-chip DVFS in routers and time-stealing. We then describe the architectural modifications to the router design for supporting frequency scaling and time stealing techniques, their hardware implementation and the corresponding overheads.
Experimental platform
We use a 64-node network as our experimental platform with the network laid out as an 8×8 2D-mesh. We use a cycle-accurate simulator for our simulations and model a detailed on-chip network. For on-chip routers we model a state-of-the-art two-stage router pipeline based on [27]. The base case router has 5 physical channels (PCs) including the local PE-to-router port and 4 virtual channels (VCs) multiplexed on to each PC. A message (packet) consists of six 128-bit flits and we use a buffer depth
Related work
We summarize the prior work in two sub-sections: on-chip networks and frequency scaling techniques.
Conclusions
NoC power is a concern not only in current systems but is also going to dominate design decisions in future systems, since the network’s power is likely to increase super-linearly with the number of cores [2]. Additionally, performance implications of NoCs are likely to influence CMP/SoC design decisions as well. Towards this end, we propose a variable frequency router architecture, called RAFT, for dynamically controlling the performance and power behavior of on-chip interconnects by effective
Acknowledgments
We would like to thank the anonymous reviewers for their reviews and comments in improving this paper. This work is supported in part by National Science Foundation (NSF) grants CCF-0702617, CNS-0916887 and CCF-0903432.
References (40)
- et al., Neutron-induced soft error rate measurements in semiconductor memories, Nuclear Instruments and Methods in Physics Research (2007)
- C. Bienia, S. Kumar, J.P. Singh, K. Li, The PARSEC benchmark suite: characterization and architectural implications,...
- S. Borkar, Networks for multi-core chips: a contrarian view, in: Special Session at ISLPED,...
- Design challenges of technology scaling, IEEE Micro (1999)
- D. Brooks, M. Martonosi, Dynamic thermal management for high-performance microprocessors, in: 7th Intl. Symp. High...
- R. Das, A.K. Mishra, C. Nicopoulus, D. Park, V. Narayanan, R. Iyer, et al., Performance and power optimization through...
- M. Galles, Scalable pipelined interconnect for distributed endpoint routing: the SGI SPIDER chip, in: Symposium on High...
- P. Gratz, B. Grot, S. Keckler, Regional congestion awareness for load balance in networks-on-chip, in: Proceedings of...
- et al., Statistical analysis of clock skew variation in -tree structure
- K. Hausman, G. Gaudenzi, J. Mosley, S. Tempest, US patent 4978927: programmable voltage controlled ring oscillator,...
- A 48-core IA-32 message-passing processor with DVFS in 45 nm CMOS
- A holistic approach to designing energy-efficient cluster interconnects, IEEE Transactions on Computers
Asit K. Mishra received the B.Tech. degree in electrical engineering from National Institute of Technology, Rourkela, in 2006 and is currently a Ph.D. candidate in the Department of Computer Science and Engineering at Pennsylvania State University. His research interests include chip-multiprocessors, NoCs and emerging memory technologies.
Aditya Yanamandra is an engineer with the Virtual Platform Center of Expertise (VP COE) in Intel. His research interests include reliability of on-chip interconnection networks. He received his Ph.D. in computer science and engineering from the Pennsylvania State University (2010) and his Bachelor of Technology in computer science and engineering from IIT Madras (2005).
Reetuparna Das is a research scientist at Intel Labs. Her research interests include computer architecture, especially interconnection networks. She has a B.S. degree in computer engineering from the National Institute of Technology, India and Ph.D. degree in computer science and engineering from Pennsylvania State University.
Soumya Eachempati is a component design engineer in Intel Corporation. She received her Ph.D. in Computer Science and Engineering from Pennsylvania State University and obtained her BTech from Indian Institute of Technology, Madras in 2005. Her research interests include application of emerging interconnect technologies on the die and on-chip interconnection networks.
Ravi Iyer is a Principal Engineer in Intel Labs. He directs research on SoC and CMP architectures. His research interests are cache/memory hierarchies, small core architectures, accelerators, fabrics, emerging workloads and performance analysis. Ravi has published over 100 papers in conferences and journals. He has also filed 30+ patent applications. Ravi received his Ph.D. from Texas A&M University in 1999.
N. Vijaykrishnan received the Ph.D. degree from the University of South Florida in 1998. He is a professor in the Department of Computer Science and Engineering, Pennsylvania State University, University Park. His research interests include the areas of energy-aware reliable systems, nano/VLSI systems, FPGA design, functional verification and computer architecture.
Chita R. Das received the Ph.D. degree in computer science from the University of Louisiana, Lafayette, in 1986. Since 1986, he has been with the Pennsylvania State University where he is currently a distinguished professor in the Department of Computer Science and Engineering. His research interests include parallel and distributed computing, performance evaluation and fault-tolerant computing.
☆ Our original work on frequency tuning routers appeared in “A Case for Dynamic Frequency Tuning in On-Chip Networks”, Asit K. Mishra, Reetuparna Das, Soumya Eachempati, Ravishankar Iyer, Narayanan Vijaykrishnan, Chita R. Das, MICRO, 2009, pp. 292–303. This journal version augments the analysis on reliability issues due to DVFS in routers and two novel optimizations (nearest-neighbor aware frequency tuning and fine-grained frequency control of routers) to further improve the performance-power envelope of NoCs.