Energy and memory-aware software pipelining streaming applications on NoC-based MPSoCs

doi:10.1016/j.future.2020.04.028

Future Generation Computer Systems

Volume 111, October 2020, Pages 1-16

https://doi.org/10.1016/j.future.2020.04.028

Highlights

• Energy-aware scheduling of CTGs on NoC-based MPSoCs.
• A novel task mapping and NLP-based DVFS algorithm.
• A novel memory and energy-aware retiming approach.
• The results show that the proposed approach performs better than the state of the art.

Abstract

In this article, we explore the problem of energy-aware scheduling of real-time applications modelled by conditional task graphs on NoC-based MPSoCs such that the total energy consumption is minimized. We propose a novel energy and memory-aware retiming conditional task graph (EMRCTG) approach that integrates task-level coarse-grained software pipelining with Dynamic Voltage and Frequency Scaling (DVFS). Our approach not only optimizes energy consumption but also ensures that memory capacity constraints are satisfied. EMRCTG has two phases. In the first phase, we map tasks to processors, transform intra-period data dependencies into inter-period data dependencies, and generate a schedule with a Non-Linear Programming (NLP)-based algorithm, assuming infinite memory capacity. The NLP-based algorithm assigns a continuous frequency and voltage to each task and each communication, and a polynomial-time heuristic then transforms the continuous frequencies and voltages into discrete ones. We analyse the memory consumption of the generated schedule and initiate the second phase, schedule repair, if the memory capacity constraints are violated. The schedule repair phase finds a set of nodes such that reducing their retiming values satisfies the memory capacity constraints.

We compare our approach against two existing approaches, GeneS and JCCTS. GeneS is a genetic algorithm that first transforms the dependent task set into an independent task set and then collectively performs task mapping, ordering and voltage scaling. JCCTS is a mixed integer linear programming based approach that optimally removes inter-processor communication overhead. Our experimental results show that, compared to GeneS, our approach obtains an improvement in the range of 1.6 to 18 percent, with an average improvement of 11 percent. Compared to JCCTS, our approach achieves an improvement in the range of 9 to 42 percent, with an average improvement of 26 percent.

Introduction

Modern embedded systems such as driverless cars and robots require powerful and energy-efficient hardware due to their complex functions. MPSoC is an ideal architecture for these systems due to its high performance and low power dissipation. Examples of commercial MPSoCs include the Samsung Exynos 5422 SoC [1] and Zynq UltraScale MPSoC devices [2]. The Samsung Exynos 5422 SoC powers the famous Samsung Galaxy smartphone series. Zynq UltraScale MPSoC devices have been used in robots. Modern MPSoCs have a large number of processors; for example, the Tilera Tile64 MPSoC [3] consists of 64 processors. The number of processors on MPSoCs is expected to grow [4], and according to the International Technology Roadmap for Semiconductors (ITRS), MPSoCs will integrate thousands of processors [5] by 2025. Therefore, traditional bus-based on-chip communication is no longer feasible due to its poor scalability. NoC-based communication provides significant improvement in terms of flexibility, scalability and performance over hierarchical (e.g., Advanced Micro-controller Bus Architecture and STBus) and traditional bus structures [6].

Surveillance digital video recorders and internet video conferences are examples of real-time streaming applications. When such applications are executed on MPSoCs, both energy consumption and time performance need to be considered. Energy consumption is one of the major performance metrics of embedded systems; therefore, energy efficiency is a critical issue in such systems. In order to solve this problem, we need to consider several issues. Firstly, real-time applications such as streaming applications can be modelled by a periodic conditional dependent task model because these applications execute repeatedly to service a data stream [7]. Streaming applications are computationally intensive as they service a continuous stream of data, so they are suitable for execution on MPSoCs. To maximally utilize the multi-processor architecture of MPSoCs, techniques are required that can increase the degree of parallelism of streaming applications [7]. In this article, we explore task-level software pipelining to maximize the degree of parallelism of the periodic dependent conditional task set. Secondly, one of the key challenges in optimizing a streaming application on an MPSoC is to generate a schedule that can satisfy all the real-time requests by maximally utilizing the MPSoC resources. In this paper, we therefore focus on developing a scheduling approach for MPSoCs. Thirdly, to improve energy efficiency we apply Dynamic Voltage and Frequency Scaling (DVFS). DVFS reduces energy consumption by lowering the voltage and frequency of a processor when it is underutilized. Many multi-core processors such as the ARM 11 MPCore [8] support voltage scaling and provide multiple voltage levels for energy optimization. In addition to processors, NoC communication links and routers also consume a large amount of on-chip energy. For the Alpha 21364 processor [9], out of 125 W total on-chip power consumption, 23 W (roughly 20%) is consumed by NoC routers and links, and out of these 23 W, the NoC links consume 58% of the power. Therefore, just like processors, if links support DVFS the energy consumption can be optimized by scaling the link voltages and frequencies.
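The energy argument for DVFS can be illustrated with a minimal sketch. It assumes the common CMOS dynamic-power model $P = C_{eff} V^{2} f$ and, purely for illustration, a supply voltage that scales linearly with frequency. Neither the model constants nor the function names come from the paper; its actual energy model is the one defined in Section 2.

```python
def dynamic_energy(cycles, freq_hz, volt, c_eff=1e-9):
    """Dynamic energy (J) of executing `cycles` clock cycles at (freq_hz, volt).

    Assumed model: P = c_eff * V^2 * f, hence E = P * t = c_eff * V^2 * cycles.
    c_eff is a hypothetical effective switched capacitance.
    """
    exec_time = cycles / freq_hz
    power = c_eff * volt ** 2 * freq_hz
    return power * exec_time


def scaled_energy(cycles, f_max, v_max, deadline, c_eff=1e-9):
    """Energy when the task is slowed down to finish exactly at `deadline` (s).

    Assumption (illustrative only): voltage scales linearly with frequency.
    """
    f_needed = cycles / deadline
    if f_needed > f_max:
        raise ValueError("deadline cannot be met even at f_max")
    volt = v_max * f_needed / f_max
    return dynamic_energy(cycles, f_needed, volt, c_eff)


# A task of 3e6 cycles on a 1 GHz / 1.2 V processor needs 3 ms at full speed.
full = dynamic_energy(3e6, 1e9, 1.2)
# With 3 ms of slack (deadline 6 ms) it can run at half the frequency.
slow = scaled_energy(3e6, 1e9, 1.2, deadline=6e-3)
print(f"full speed: {full:.3e} J, scaled: {slow:.3e} J")
```

Under these assumptions, halving the frequency also halves the voltage, so the dynamic energy drops to a quarter. The same reasoning applies to voltage-scalable NoC links, which is why slack left unused in a schedule is a lost opportunity for energy savings.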

DVFS is amongst the most effective system-level energy optimization techniques. Hence, many DVFS-based scheduling approaches have been proposed. Amongst the earliest works that apply DVFS, Aydin et al. developed an algorithm with $O(n^{2} \log n)$ complexity to calculate the voltage levels (i.e., speeds) for tasks and used the Earliest Deadline First (EDF) strategy to obtain a feasible task schedule for these optimal voltage levels [10]. In another investigation, Aydin et al. [11] addressed energy-aware scheduling for periodic tasks, computed the optimal speed using the Dynamic Reclaiming Algorithm (DRA), and efficiently utilized the slack while meeting the task deadlines. Tosun [12] mapped periodic tasks on a heterogeneous MPSoC system using ILP to minimize the computational energy consumption. The author also developed two heuristics that deploy the EDF strategy for energy-aware task scheduling. Kumar and Vidyarthi [13] integrated voltage assignment and task mapping within a single optimization loop using a genetic algorithm (GA). This approach explored the solution space for a near-optimal solution and achieved 59.4% energy savings compared to Genetic Algorithm-Struggle (GA-ST). Recently, Dziurzanski and Singh suggested a feedback control task scheduling scheme called Admission Control Algorithm (ACA) that performs schedulability analysis while determining the tasks expected to violate the deadline constraints [14]. Although the scheduling approaches presented in [10], [11], [12], [13], [14] efficiently perform energy-aware task scheduling on multiprocessor systems, these studies consider only tasks without precedence constraints, i.e., independent task models.

There have been many studies that investigate the problem of DVFS-based energy-aware scheduling of tasks with precedence constraints on MPSoCs. For example, Singh et al. [15] design a DVFS-based scheduling approach for streaming applications. Their approach consists of an off-line analysis that, under the worst-case execution times of tasks, determines the tasks whose execution speed can be slowed down, and an on-line analysis that makes use of the slack arising from tasks that complete before their worst-case execution times. Liu et al. [16] design an energy-efficient scheduling approach for real-time streaming applications on cluster heterogeneous MPSoCs. They first derive an initial task mapping based on the first fit decreasing heuristic and then remap a subset of tasks to unused clusters to further reduce the energy consumption. Wang et al. formulated a scheduling problem as an Integer Linear Program (ILP) and considered a homogeneous MPSoC architecture in order to reduce both the computation and communication energy consumption of streaming applications. This formulation obtains an optimal solution with minimum schedule length, while DVFS minimizes the wasted slack in the schedule [17]. Similarly, Huang et al. [18] used an ILP formulation to reduce the energy consumption of the processors and NoC links. The authors also developed a heuristic algorithm called Simulated Annealing with Timing Adjustment (SA-TA) to minimize the execution time while achieving the global optimum under tight timing constraints. Chen et al. [19] applied Mixed Integer Linear Programming (MILP) on a NoC-based MPSoC architecture and developed a scheduling algorithm that generates a non-preemptive schedule and assigns a discrete voltage level to each task to reduce the energy consumption. The surveys [20], [21], [22] and [23] discuss in detail the scheduling of tasks with precedence constraints on multi-processor architectures.

In all these approaches it is assumed that only processors are voltage scalable. Therefore, the DVFS approaches allocate all the slack to tasks only. Andrei et al. [24] and [25] show that if, like the processors, the communication architecture is voltage scalable, more energy can be saved by sharing the available slack between communications and tasks. In [24] and [25], Andrei et al. propose NLP-based and MILP-based DVFS algorithms for a task set with precedence constraints on a heterogeneous MPSoC. Their approach shares the available slack between task and communication nodes such that the total energy consumption is minimized. Li and Wu [26] propose a task mapping, scheduling and DVFS algorithm for a task set with precedence constraints on a homogeneous NoC-based MPSoC model with voltage-scalable links and processors. Their approach has two steps. In the first step, a quadratic programming based mapping algorithm maps tasks to processors such that the total weighted communication distance is minimized. In the second step, a GA assigns voltages and frequencies to tasks and communications. Ali et al. [27] develop a Contention-aware Integrated Task Mapping and Voltage Assignment (CITM-VA) approach for static energy management that schedules the tasks based on the Earliest Latest Finish Time First (ELFTF) strategy. The authors assign discrete voltage and frequency levels to both the processors and the NoC links using a GA.

The approaches discussed so far schedule a set of tasks with precedence constraints (also called a task graph, TG) on a multi-core architecture. This model is a special case of a task set with conditional precedence constraints (also called a conditional task graph, CTG). Scheduling approaches designed for CTGs are also applicable to TGs because all TGs are CTGs. The converse may not hold for approaches designed for TGs because not all CTGs are TGs. A few approaches have been proposed for scheduling CTGs on multi-processor architectures with the objective of minimizing energy consumption. For instance, the work of Xie and Wolf [28] is one of the earliest investigations on the scheduling of tasks with conditional precedence constraints on multiprocessor computing architectures. Shin and Kim [29] presented a scenario-based static Non-Linear Programming (NLP) algorithm that assigns a speed to each task depending on the scenario to reduce the overall energy consumption. Wu et al. [30] developed an approach that uses a schedule table generated by the approach of Eles et al. [31] to determine the available slack and assigns a voltage to each task using a heuristic. Tariq et al. [32] scheduled conditional tasks with precedence constraints on homogeneous MPSoCs for energy optimization and formulated the scheduling problem as an NLP. The authors further extended their work on CTGs and developed an Iterative Offline Energy-aware Task and Communication Scheduling (IOETCS) algorithm that performs voltage scaling and scheduling in an integrated manner. This approach uses the Earliest Successor-Tree-Consistent Deadline First algorithm to generate an initial task schedule and then assigns discrete voltage levels to the tasks using either a heuristic-based algorithm or ILP [33]. One of the major drawbacks of these approaches is that they may not be able to fully utilize the MPSoC resources because the intra-period data dependencies between tasks limit the degree of parallelism in a streaming application. The degree of parallelism can be maximized through software pipelining or retiming.

Retiming reschedules a parent task a few periods ahead of its child task so that the data needed by the child task is available at the start of the period. Consequently, the start time of the child task is not constrained by the finish time of the parent task. In simple words, retiming converts the CTG into an independent task model by transforming intra-period data dependencies into inter-period data dependencies. Integrating retiming with DVFS can significantly reduce energy consumption because there are no intra-period data dependencies between tasks, and the slack that is otherwise wasted due to these dependencies or due to inter-processor communication overhead is utilized for energy optimization. Therefore, pipelining-based loop scheduling approaches [34], [35], [36], [37], [38], [39], [40] and [41] have been proposed to minimize the schedule makespan or improve system performance. A few approaches focus on optimizing energy consumption by integrating DVFS with software pipelining. Kim et al. [42] propose a pipelining-based power reduction technique to optimize energy consumption in uniprocessor systems. The approach in [42] focuses only on uniprocessor systems and cannot be applied directly to multiprocessor systems. Shao et al. [43] propose a loop scheduling approach on a multi-processor platform and optimize the energy consumption by integrating DVFS with pipelining. The loop optimization approach proposed in [43] is based on instruction-level pipelining and therefore cannot be applied to a periodic task set.
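The effect of retiming described above can be sketched in a few lines. This is a minimal illustration, not the paper's MRTCCTG algorithm: it assumes a plain DAG (conditional edges and processor mapping are ignored), and it assumes an edge $(u, v)$ becomes inter-period whenever the retiming value $R(u)$ exceeds $R(v)$ by at least one period. The function name `retiming_values` is hypothetical.

```python
from collections import defaultdict


def retiming_values(edges):
    """Assign each node the number of periods it is moved ahead.

    Assumption: sinks keep R = 0 and every parent gets R >= R(child) + 1,
    i.e. R(v) is the longest path length from v to a sink. The input must
    be acyclic; a cycle would cause unbounded recursion.
    """
    succ = defaultdict(list)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        nodes.update((u, v))

    memo = {}

    def r(v):
        if v not in memo:
            memo[v] = max((r(c) + 1 for c in succ[v]), default=0)
        return memo[v]

    return {v: r(v) for v in nodes}


# Diamond-shaped task graph: v1 feeds v2 and v3, which both feed v4.
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v3", "v4")]
R = retiming_values(edges)
prologue_periods = max(R.values())
print(R, prologue_periods)
```

With these values every parent runs at least one period ahead of its children, so within a single period all four tasks are independent. The price is a prologue of `max(R)` periods before the pipeline is full, plus buffering for the data produced ahead of time.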

Our work is closely related to [7] and [44]. In [7] and [44], Wang et al. propose approaches to schedule Directed Acyclic Graph (DAG) based TGs on multi-processor systems and optimize the energy consumption by integrating task-level coarse-grained software pipelining with DVFS. Wang et al. [7] use coarse-grained software pipelining to optimally remove inter-processor communication overhead. They propose Mixed Integer Linear Programming (MILP)-based algorithms to optimally regroup tasks and communications from different periods. Their MILP-based algorithm, called Joint Computation and Communication Task Scheduling (JCCTS), aims to remove all communication overheads from the schedule such that the latency overhead is minimized. They have shown that JCCTS can significantly improve energy consumption when combined with a DVFS technique. Wang et al. [44] combine coarse-grained software pipelining with DVFS to optimize the energy consumption and transform the dependent task model into an independent task model with an algorithm called RDAG. They also propose an algorithm called GeneS that solves the problem of task mapping, ordering and voltage assignment in an integrated manner. GeneS is a genetic algorithm that searches the mapping space for a mapping that minimizes energy consumption. The main objective of these approaches is to minimize the prologue latency or retiming delay; they neglect the memory overhead incurred due to retiming.

Large buffers are required by streaming applications that run on MPSoCs to store intermediate processing results, and consequently the total size of the buffer arrays accounts for a significant portion of the memory footprint of the application binary [45]. Retiming adds further memory overhead, which significantly increases the probability of violating memory capacity constraints. Wang et al. [45] propose a MILP-based algorithm called Memory-Aware Optimal Task Scheduling (MAOTS) and a heuristic algorithm called Heuristic Memory-Aware Task Scheduling (HMATS). The objective of both algorithms is to regroup tasks and communications such that the inter-processor communication overhead is reduced and the memory overhead is minimized. Although both MAOTS and HMATS try to reduce the memory footprint of retiming, they are designed for TGs and minimize the schedule makespan, not the energy consumption of CTGs.

In this work, we investigate the problem of scheduling and optimizing the total energy consumption of a set of periodic tasks and communications with conditional precedence constraints, a common period and individual deadlines less than or equal to the period on a NoC-based MPSoC by integrating retiming with DVFS. We make the following major contributions:

1. We propose a novel mapping algorithm called $BU$ that aims to balance the workload across all the processors.
2. We propose a mapping-aware retiming algorithm, MRTCCTG, that aims to minimize wasted slack by transforming intra-period dependencies.
3. We propose a novel DVFS algorithm that uses an NLP-based algorithm to assign continuous voltages and frequencies to tasks and communications. The DVFS algorithm then uses a heuristic to map the continuous frequencies and voltages to discrete frequencies and voltages.
4. Our approach integrates retiming with DVFS for real-time applications modelled by Conditional Task Graphs (CTGs). Our approach ensures that the memory overhead incurred due to the initial retiming does not violate the memory capacity constraints.
5. Our experimental results show that, compared to GeneS, our approach obtains an improvement in the range of 1.6 to 18 percent, with an average improvement of 11 percent. Compared to JCCTS, our approach achieves an improvement in the range of 9 to 42 percent, with an average improvement of 26 percent.
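The continuous-to-discrete step in the third contribution can be illustrated with a small sketch. The paper's own heuristic is not reproduced in this excerpt, so the code below shows only the simplest deadline-safe choice, which is an assumption on our part: round each continuous frequency up to the next level the hardware supports. The function name and the normalized frequency levels are hypothetical.

```python
import bisect


def round_up_frequency(f_cont, levels):
    """Return the lowest available discrete level that is >= f_cont.

    Rounding up never lengthens execution time, so a schedule that was
    feasible with the continuous frequency stays feasible (at the cost of
    somewhat higher energy than the continuous optimum).
    """
    levels = sorted(levels)
    i = bisect.bisect_left(levels, f_cont)
    if i == len(levels):
        raise ValueError("required frequency exceeds the highest level")
    return levels[i]


# Hypothetical normalized frequency levels of a processor or link.
levels = [0.25, 0.5, 0.75, 1.0]
continuous = {"v1": 0.55, "v2": 0.5, "v3": 0.9}
discrete = {n: round_up_frequency(f, levels) for n, f in continuous.items()}
print(discrete)
```

A more refined heuristic could round some nodes down and compensate by rounding others up, as long as all deadlines still hold; that trade-off is where the energy savings of a dedicated discretization algorithm would come from.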

The rest of the paper is organized as follows. In Section 2 we present the formal definition of retiming and discuss the application, system, and energy models that we use in simulations. In Section 3 we discuss our energy-aware scheduling approach. In Section 4 we present the discrete DVFS algorithm. In Section 5 we present our novel retiming function and discuss in detail our memory analysis approach. In Section 6 we explain our energy and memory-aware retiming approach. Experimental results are presented in Section 7, followed by the conclusion of the paper in Section 8.

Section snippets

Models and definitions

In this section we discuss the application, system, and energy models used in the simulations. In this paper, we use the terms tile and processor interchangeably.

Energy-aware task mapping and scheduling algorithms

In this section we discuss our task mapping and initial scheduling algorithms. The task mapping specifies the processor on which each task executes, and the initial schedule specifies the order in which tasks and communications will execute. In Section 4 we will discuss the DVFS algorithm that assigns a start time and an execution time to each task and communication. This section and Section 4 discuss the algorithms for constructing a schedule for one period. Since we consider a periodic application the

Discrete frequency assignment

In order to schedule tasks and communications in a unified way, we first transform a CTG $G$ into an extended CTG by adding an additional node in $G$ for every edge $(v_{i}, v_{j}) \in G$ whose head node $v_{i}$ and tail node $v_{j}$ are mapped on different processors. We refer to these additional nodes as communication nodes. The original nodes in $G$ are kept unchanged and are referred to as task nodes. Specifically, for each edge $(v_{i}, v_{j}) \in G$ whose head and tail node are mapped on different processors, we add a
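The construction of the extended CTG can be sketched as follows. The snippet above is cut off before it states how a communication node is wired in, so the sketch assumes the natural reading: an edge whose endpoints are mapped to different processors is split into two edges through a new communication node. The function name, the node naming scheme and the example mapping are hypothetical.

```python
def extend_ctg(edges, mapping):
    """Insert a communication node on every inter-processor edge.

    edges:   list of (u, v) task-node pairs of the original graph G
    mapping: dict task node -> processor (tile) id
    Returns (new_edges, comm_nodes). Task nodes are left unchanged.
    Assumption: edge (u, v) is replaced by (u, c) and (c, v).
    """
    new_edges = []
    comm_nodes = []
    for u, v in edges:
        if mapping[u] != mapping[v]:
            c = f"c_{u}_{v}"  # hypothetical naming for the new node
            comm_nodes.append(c)
            new_edges.extend([(u, c), (c, v)])
        else:
            # Same processor: no NoC transfer, keep the edge as is.
            new_edges.append((u, v))
    return new_edges, comm_nodes


edges = [("v1", "v2"), ("v1", "v3")]
mapping = {"v1": 0, "v2": 0, "v3": 1}
new_edges, comm_nodes = extend_ctg(edges, mapping)
print(new_edges, comm_nodes)
```

Treating communications as ordinary nodes is what lets a single scheduler and a single DVFS formulation assign start times, frequencies and voltages to tasks and NoC transfers alike.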

Mapping-aware retiming CTG

Fig. 2(d) shows a schedule of the CTG in Fig. 2(a) on the MPSoC in Fig. 1(b). In this example the execution time of every task node at the maximum processor frequency is 3 time units, and the execution time of every communication node is 1 time unit except for communication node $v_{10}$, whose execution time is 2 time units. The period is 15 time units. Fig. 2(d) shows that there is a lot of wasted slack in the schedule. This wasted slack can be utilized through retiming, as demonstrated in Fig. 2. In Fig. 2(e) node $v_{1}$

Energy and memory-aware retiming

The schedule is infeasible if it violates memory capacity constraints. Although MRTCCTG reduces the wasted slack, it is not memory-aware. Consequently, the retimed schedule may violate memory capacity bounds. Next we describe our energy and memory-aware retiming CTG (EMRCTG) approach. It has two main phases, which we explain in the following.

Phase 1: We first map the tasks of the CTG to processors. Next we retime the CTG by MRTCCTG, construct the retimed graph $G_{R}$ and generate the schedule $\pi$.
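Because Phase 1 retimes without regard to memory, its schedule has to be checked against the memory capacity of each processor before it can be accepted. The sketch below shows one way such a check could look. It rests on a simplifying assumption that is not taken from the paper's memory analysis: an edge $(u, v)$ whose producer runs $R(u) - R(v)$ periods ahead keeps that many copies of its data buffered on the consumer's processor. All names are hypothetical.

```python
from collections import defaultdict


def memory_violations(edges, R, mapping, capacity):
    """Return {processor: excess} for processors over their capacity.

    edges:    list of (u, v, data_size)
    R:        dict node -> retiming value (periods moved ahead)
    mapping:  dict node -> processor id
    capacity: dict processor id -> available buffer memory
    Assumption: each edge buffers (R[u] - R[v]) copies at v's processor.
    """
    used = defaultdict(int)
    for u, v, size in edges:
        copies = max(R[u] - R[v], 0)
        used[mapping[v]] += copies * size
    return {p: used[p] - capacity[p] for p in used if used[p] > capacity[p]}


edges = [("a", "b", 4), ("b", "c", 2)]
mapping = {"a": 0, "b": 1, "c": 1}
capacity = {0: 8, 1: 5}

# Fully retimed graph: processor 1 must buffer 4 + 2 = 6 units.
before = memory_violations(edges, {"a": 2, "b": 1, "c": 0}, mapping, capacity)
# Reducing the retiming value of node a removes one buffered copy.
after = memory_violations(edges, {"a": 1, "b": 1, "c": 0}, mapping, capacity)
print(before, after)
```

In this toy example lowering the retiming value of one node clears the violation, but it also turns the edge from `a` to `b` back into an intra-period dependency, which reintroduces slack constraints. Choosing which nodes to reduce is therefore a trade-off between memory and energy.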

Performance evaluation

We compare our approach with two approaches, JCCTS [7] and GeneS [44]. The task assignment and schedule for JCCTS are generated by the Communication-Aware Task Scheduling Algorithm (CATS) [57]. CATS is a communication-aware task mapping and scheduling approach that can optimize the energy consumption of both tasks and communications. It is shown in [7] that JCCTS can significantly reduce the energy consumption of the schedule generated by CATS by totally removing the communication overhead and use the

Conclusions

In this paper we have explored the problem of energy-efficient scheduling of real-time applications modelled by CTGs on NoC based MPSoC. We have proposed an approach called EMRCTG that retimes tasks and communications such that total energy consumption is minimized and memory capacity constraints are satisfied. We have proposed a novel scheduling algorithm that balances the workload across all the processors and prioritizes nodes with shorter deadlines by scheduling them earlier in time than

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Umair Ullah Tariq received the master’s degree from the University of Engineering and Technology, Taxila, Pakistan, and the Ph.D. degree in computer science and engineering from the University of New South Wales, Sydney, NSW, Australia. He is currently a Lecturer with the Department of Electrical Engineering, COMSATS Institute of Information Technology, Abbotabad, Pakistan. He has published several research papers in prominent conferences and journals. His current research interests include

References (64)

Engel M. et al. A radical approach to network-on-chip operating systems
Liu D. et al. Energy-efficient mapping of real-time streaming applications on cluster heterogeneous MPSoCs
Sahu P.K. et al. A survey on application mapping strategies for network-on-chip design. J. Syst. Archit. (2013)
Li D. et al. Energy-efficient contention-aware application mapping and scheduling on NoC-based MPSoCs. J. Parallel Distrib. Comput. (2016)
Zhuge Q. et al. Timing optimization via nest-loop pipelining considering code size. Microprocess. Microsyst. (2008)
Mobile processor Exynos 5 Octa (5422),...
Zynq UltraScale+ MPSoCs, https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html. (Accessed 4...
Bell S. et al. Tile64-processor: A 64-core SoC with mesh interconnect
Guindani G. et al. Achieving QoS in NoC-based MPSoCs through dynamic frequency scaling
Lee H.G. et al. On-chip communication architecture exploration: A quantitative evaluation of point-to-point, bus, and network-on-chip approaches. ACM Trans. Des. Autom. Electron. Syst. (2007)
Wang Y. et al. Optimally removing intercore communication overhead for streaming applications on MPSoCs. IEEE Trans. Comput. (2013)
Hirata K. et al. ARM MPCore; the streamlined and scalable ARM11 processor core
Mukherjee S.S. et al. The Alpha 21364 network architecture
Aydin H. et al. Determining optimal processor speeds for periodic real-time tasks with different power characteristics
Aydin H. et al. Power-aware scheduling for periodic real-time tasks. IEEE Trans. Comput. (2004)
Tosun S. Energy- and reliability-aware task scheduling onto heterogeneous MPSoC architectures. J. Supercomput. (2012)
Kumar N. et al. A GA based energy aware scheduler for DVFS enabled multicore systems. Computing (2017)
Dziurzanski P. et al. Feedback-based admission control for firm real-time task allocation with dynamic voltage and frequency scaling. Computers (2018)
A.K. Singh, A. Das, A. Kumar, Energy optimization by exploiting execution slacks in streaming applications on...
Wang Y. et al. Optimal task scheduling by removing inter-core communication overhead for streaming applications on MPSoC
Huang J. et al. Energy-aware task allocation for network-on-chip based heterogeneous multiprocessor systems
Chen Y.-J. et al. Processors allocation for MPSoCs with single ISA heterogeneous multi-core architecture. IEEE Access (2017)
Mittal S. A survey of techniques for improving energy efficiency in embedded computing systems. Int. J. Comput. Aided Eng. Technol. (2014)
Bambagini M. et al. Energy-aware scheduling for real-time systems: a survey. ACM Trans. Embedded Comput. Syst. (2016)
Singh A.K. et al. A survey and comparative study of hard and soft real-time dynamic resource allocation strategies for multi-/many-core systems. ACM Comput. Surv. (2017)
Andrei A. et al. Simultaneous communication and processor voltage scaling for dynamic and leakage energy reduction in time-constrained systems
Andrei A. et al. Energy optimization of multiprocessor systems on chip by voltage selection. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. (2007)
Ali H. et al. Contention & energy-aware real-time task mapping on NoC based heterogeneous MPSoCs. IEEE Access (2018)
Xie Y. et al. Allocation and scheduling of conditional task graph in hardware/software co-synthesis
Shin D. et al. Power-aware scheduling of conditional task graphs in real-time multiprocessor systems
Wu D. et al. Scheduling and mapping of conditional task graph for the synthesis of low power embedded systems. IEE Proc. E (2003)
Eles P. et al. Scheduling of conditional process graphs for the synthesis of embedded systems

Hui Wu received BE and ME from Huazhong University of Science and Technology and PhD from National University of Singapore. His early career was mainly focused on CNC systems. He was the chief software architect and developer of Aerospace I CNC System, vice director of Numerical Control Institute of Huazhong University of Science and Technology between 1992 and 1995, and a co-founder of Wuhan Huazhong Numerical Control Co. Ltd. He is currently a lecturer at University of New South Wales, Sydney, Australia. His research areas include embedded systems, parallel and distributed systems, and wireless sensor networks.
