Chapter Six - Approximate communication for energy-efficient network-on-chip
Introduction
Approximation, which trades output accuracy for gains in performance and energy efficiency, has gained wide recognition as a solution for energy-efficient hardware design [1]. Approximate designs rely on the ability of applications to tolerate computation on noisy or erroneous data, or imprecision in the computation results. A large number of applications in machine learning, search, scientific computing, and multimedia are inherently tolerant of approximation [2]. Because inexactness is acceptable, these applications permit approximate data in storage, computation, or transmission. Their error tolerance motivates approximate hardware designs that achieve high performance and energy efficiency.
Approximate computing, as an emerging performance-efficient paradigm, is now widely used in computer architecture design, for example in approximate memory systems [3, 4], value approximation in CPU-based [5, 6] and GPU-based systems [7, 8], relaxed synchronization [9], and resilience-aware circuit clocking schemes [10]. Compute-based approximation techniques use inexact compute units [11–19] or neural network models [20, 21] for code acceleration. Memory-based techniques exploit data similarity across memory hierarchies to achieve larger capacity, energy efficiency, or lifetime optimization. A significant portion of research on hardware approximation techniques has focused either on the computation units, for accelerated inexact execution, or on the storage hierarchy (cache or DRAM), for memory with low area and power overhead.
Approximate communication techniques also deserve attention. With increasing on-chip core counts, the network-on-chip (NoC) has emerged as the most capable approach to on-chip communication in large-scale parallel systems. It connects varied on-chip components, such as cores, caches, and memory controllers. It also enables the communication needed to exchange data between parallel threads and to maintain data coherence. However, NoCs consume a significant amount of power in modern chip multiprocessors (CMPs) [22], and energy efficiency has been a primary concern in NoC design [23]. Reducing NoC power while increasing performance is essential for scaling up to larger chip multiprocessor systems. By relaxing accuracy in exchange for performance improvement and energy savings, approximate techniques are a promising direction for energy-efficient NoC design.
This chapter, with a focus on the approximate communication design for energy-efficient NoC, mainly conducts exploratory research in the following three aspects:
First, a dynamic traffic regulation scheme is proposed for approximate communication in NoC. Network congestion is one of the main factors that affect transmission delay, and different traffic flows have different impacts on network congestion. This method designs an approximation-based traffic regulation structure in the network interface, which reduces the amount of injected data through data approximation and can regulate the injection rate of each node. In addition, it designs a dynamic traffic regulation algorithm that adjusts the injection rate of each node according to the impact of its traffic flow on network congestion, thereby improving NoC performance. Experiments on the PARSEC benchmarks show that this method reduces the average transmission delay by 30.9%, reduces application execution time by 15.8%, and saves 24.4% of dynamic power, all within a 10% quality loss.
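The regulation idea described above can be illustrated with a minimal Python sketch. It is not the chapter's algorithm: the `congestion_share` metric, the thresholds, the step size, and the cap on approximation are all assumptions chosen for illustration. The sketch only shows the general shape of the loop, in which nodes that contribute more to congestion are asked to approximate more of their data and therefore inject fewer flits.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    node_id: int
    # Hypothetical metric in [0, 1]: how much this node's traffic
    # contributes to currently congested links.
    congestion_share: float
    # Percentage of approximable payload removed at the network interface.
    approx_percent: int = 0


def regulate(nodes, step=10, max_percent=50, high=0.3, low=0.1):
    """Run one regulation epoch (all parameter values are assumptions).

    Nodes above the `high` threshold approximate more aggressively;
    nodes below the `low` threshold relax back toward exact transmission.
    """
    for node in nodes:
        if node.congestion_share > high:
            node.approx_percent = min(max_percent, node.approx_percent + step)
        elif node.congestion_share < low:
            node.approx_percent = max(0, node.approx_percent - step)
    return nodes


def injected_flits(payload_flits, approx_percent):
    """Flits injected after approximation; at least one flit is always sent."""
    kept = (payload_flits * (100 - approx_percent) + 99) // 100  # ceiling
    return max(1, kept)
```

Capping `approx_percent` is one simple way to bound quality loss; a real design would tie the cap to a measured application-level error rather than to a fixed constant.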
Second, a performance optimization method for bufferless NoC based on approximate communication is proposed. By removing the buffers, the bufferless NoC reduces power consumption and area overhead, but it also increases transmission delay and decreases network throughput. Performance analysis shows that, in a retransmission-based bufferless NoC, packet retransmission is a key factor limiting performance. To address this, the method designs a new bufferless NoC architecture that reduces packet retransmission through lossy transmission. It also proposes an approximate packet codec that approximates the missing data. Thus, this method improves the performance of bufferless NoC with extremely low quality loss. Experiments on the PARSEC benchmarks show that, compared with the existing bufferless NoC, this design reduces retransmissions by 83.6%, reduces transmission delay by 46.7%, increases network throughput by 92%, and achieves a 1.2× application speedup, while maintaining low application error.
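To make the idea of approximating missing data concrete, the following Python sketch fills dropped values at the receiver instead of requesting a retransmission. The chapter does not specify the codec here, so the neighbour-averaging rule and the function name `reconstruct` are assumptions made purely for illustration; dropped words are represented as `None`.

```python
def reconstruct(words):
    """Fill dropped entries (None) from the nearest received neighbours.

    Assumed rule: use the mean of the closest received value on each side,
    or copy the single available neighbour at the edges of the packet.
    """
    known = [i for i, w in enumerate(words) if w is not None]
    if not known:
        # Nothing to interpolate from; a real design would need a fallback
        # such as retransmission over a reliable path.
        raise ValueError("no received words to approximate from")

    out = list(words)
    for i, w in enumerate(words):
        if w is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = words[right]
        elif right is None:
            out[i] = words[left]
        else:
            out[i] = (words[left] + words[right]) / 2
    return out
```

Such a scheme only makes sense for data the application has marked as error-tolerant and whose neighbouring values are correlated; control and address information would still need exact delivery.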
Third, an NoC energy optimization method based on multiplane network design is proposed. The NoC performance optimization usually leads to an increase in area overhead and affects the energy consumption of NoC. In order to reduce the energy consumption of NoC, this method designs a two-plane network structure which includes a lossy subnetwork and a lossless subnetwork. Based on lossy transmission, the lossy subnetwork realizes a lightweight, low-delay, bufferless architecture design. In addition, based on the multiplane transmission design, this method speeds up part of the data transfer and achieves transmission quality control. Thus, this method improves NoC performance while reducing NoC area overhead and power consumption. Based on the PARSEC benchmark experiments, the results show that compared with the single-plane NoC, this method reduces the transmission delay by 41.9%, and saves 48.6% of the NoC area overhead and 25.7% of the NoC power consumption under the same throughput.
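The two-plane organization can be sketched as a per-packet plane selection at the network interface. The Python code below is a simplified illustration under stated assumptions: the `is_approximable` flag, the running counters, and the `max_lossy_fraction` budget are hypothetical stand-ins for whatever quality-control mechanism the actual design uses.

```python
from enum import Enum


class Plane(Enum):
    LOSSLESS = "lossless"
    LOSSY = "lossy"


def select_plane(is_approximable, lossy_sent, total_sent, max_lossy_fraction=0.5):
    """Choose a subnetwork for the next packet (illustrative policy only).

    Exact data always uses the lossless plane. Approximable data uses the
    lightweight lossy plane until the running share of lossy packets
    reaches the assumed quality budget.
    """
    if not is_approximable:
        return Plane.LOSSLESS
    if total_sent > 0 and lossy_sent / total_sent >= max_lossy_fraction:
        return Plane.LOSSLESS
    return Plane.LOSSY
```

For example, with the default budget, an approximable packet is steered to the lossy plane when 4 of the last 10 packets were lossy, and to the lossless plane when 5 of 10 were.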
The rest of the chapter is organized as follows. Section 2 details the related work. Section 3 presents the approximation-based dynamic traffic regulation design. Section 4 explains the approximate bufferless NoC implementation. Section 5 presents the design of the approximate multiplane NoC.
Related work
Recent studies have applied approximate computing to NoC architecture design for applications that tolerate inaccurate outputs [24–29]. These works explore the performance and energy-efficiency benefits of approximation in reducing communication bottlenecks through two techniques: communication reduction and dynamic power management. APPROX-NoC [25], DAPPER [24], and DEC-NoC [26] belong to the former. APPROX-NoC reduces injected flits by
Approximation-based dynamic traffic regulation
Different traffic flows have different impacts on network congestion. For example, Fig. 1 shows the network transmission status at a given time. Packets from nodes 1, 2, and 6 contribute to the network congestion, while packets from node 5 do not. Therefore, controlling the packets injected from nodes 1, 2, and 6 relieves congestion more effectively. However, the transmission state in an NoC can be very complicated, since each router is likely to communicate with others. Its complexity
Approximate bufferless network-on-chip
The NoC, serving as an effective interconnection fabric, connects many on-chip components. It provides better scalability and higher bandwidth than traditional interconnects such as the bus and crossbar [48–50]. However, NoCs consume a significant amount of power in CMPs: for example, 40% of the tile power consumption in the 16-tile MIT RAW chip [51], 28% in the 80-tile Intel TeraFLOPS chip [22], and 19% in the 36-tile SCORPIO chip [52]. Buffers consume a large portion of network
Approximate multiplane network-on-chip
Reducing the power of the NoC while increasing performance is essential for scaling up to larger systems for future CMP designs. Minimizing power consumption requires more efficient use of network resources. Multiplane NoCs have shown their efficiency in total bandwidth usage [23, 61]. Furthermore, multiplane NoCs can be designed with heterogeneous physical subnetworks; as a result, messages are injected into different subnetworks to satisfy different transmission properties. For many
Ling Wang received the B.S. degree in monitoring and control technology from the Harbin University of Science and Technology, China, in 2010, and the M.S. degree in biomedical engineering from the Harbin Institute of Technology, China, in 2012, where he also received the Ph.D. degree in computer applied technology in 2021. He is currently a lecturer at Xidian University. His research interests include high-performance many-core architecture, network-on-chip, and AI accelerators.
References (62)
- et al., Energy-aware hybrid precision selection framework for mobile GPUs, Comput. Graph. (2013)
- A survey of techniques for approximate computing, ACM Comput. Surv. (2016)
- et al., Quality-aware data allocation in approximate DRAM
- et al., Approximate storage in solid-state memories, ACM Trans. Comput. Syst. (2014)
- et al., Characterizing application memory error vulnerability to optimize datacenter cost via heterogeneous-reliability memory
- et al., Load value approximation
- et al., General-purpose code acceleration with limited-precision analog computation, ACM SIGARCH Comput. Archit. News (2014)
- et al., RFVP: rollback-free value prediction with safe-to-approximate loads, ACM Trans. Archit. Code Opt. (2016)
- et al., Branch and data herding: reducing control and memory divergence for error-tolerant GPU applications, IEEE Trans. Multimedia (2013)
- et al., Programming with relaxed synchronization
- Resilience-aware frequency tuning for neural-network-based approximate computing chips, IEEE Trans. Very Large Scale Integr. (VLSI) Syst.
- Architecture support for disciplined approximate programming
- Quality programmable vector processors for approximate computing
- ProACt: a processor for high performance on-demand approximate computing
- Evaluation of variable bit-width units in a RISC-V processor for approximate computing
- ARGA: approximate reuse for GPGPU acceleration
- RMAC: runtime configurable floating point multiplier for approximate computing
- Truncated SIMD multiplier architecture for approximate computing in low-power programmable processors, IEEE Access
- SEDA: single exact dual approximate adders for approximate processors
- SNNAP: approximate computing on programmable SoCs via neural acceleration
- Neural acceleration for general-purpose approximate programs, Commun. ACM
- A 5-GHz mesh interconnect for a teraflops processor, IEEE Micro
- The runahead network-on-chip
- DAPPER: data aware approximate NoC for GPGPU architectures
- APPROX-NoC: a data approximation framework for network-on-chip architectures
- DEC-NoC: an approximate framework based on dynamic error control with applications to energy-efficient NoCs
- AxNoC: low-power approximate network-on-chips using critical-path isolation
- Improving energy consumption of NoC based architectures through approximate communication
- Approximate wireless networks-on-chip
- Approximate communication: techniques for reducing communication bottlenecks in large-scale parallel systems, ACM Comput. Surv.
- Approximate communication strategies for energy-efficient and high performance NoC: opportunities and challenges
Xiaohang Wang received the B.Eng. and Ph.D. degrees in communication and electronic engineering from Zhejiang University, in 2006 and 2011, respectively. He is currently an Associate Professor with the South China University of Technology. His research interests include many-core architecture, power efficient architectures, optimal control, and NoC-based systems.