Energy and memory-aware software pipelining streaming applications on NoC-based MPSoCs
Introduction
Modern embedded systems such as driverless cars and robots require powerful and energy-efficient hardware to support their complex functions. The Multiprocessor System-on-Chip (MPSoC) is an ideal architecture for these systems due to its high performance and low power dissipation. Examples of commercial MPSoCs include the Samsung Exynos 5422 SoC [1] and Zynq UltraScale MPSoC devices [2]. The Samsung Exynos 5422 SoC powers the Samsung Galaxy smartphone series, and Zynq UltraScale MPSoC devices have been used in robots. Modern MPSoCs have a large number of processors; for example, the Tilera Tile64 MPSoC [3] consists of 64 processors. The number of processors on MPSoCs is expected to grow [4] and, according to the International Technology Roadmap for Semiconductors (ITRS), MPSoCs will integrate thousands of processors [5] by 2025. Therefore, traditional bus-based on-chip communication is no longer feasible due to its poor scalability. Network-on-Chip (NoC) based communication provides significant improvement in terms of flexibility, scalability and performance over hierarchical (e.g., Advanced Microcontroller Bus Architecture and STBus) and traditional bus structures [6].
Surveillance digital video recorders and internet video conferencing are examples of real-time streaming applications. When such applications are executed on MPSoCs, both energy consumption and timing performance need to be considered. Energy consumption is one of the major performance metrics of an embedded system, so energy efficiency is a critical issue. To address it, we need to consider several issues. Firstly, real-time applications such as streaming applications can be modelled by a periodic task model with conditional dependencies, because these applications execute repeatedly to service a data stream [7]. Streaming applications are computationally intensive, as they service a continuous stream of data, and are therefore well suited to execution on MPSoCs. To fully utilize the multi-processor architecture of MPSoCs, techniques are required that increase the degree of parallelism of streaming applications [7]. In this article, we explore task-level software pipelining to maximize the degree of parallelism of the periodic dependent conditional task set. Secondly, one of the key challenges in optimizing a streaming application on an MPSoC is to generate a schedule that satisfies all the real-time requirements while maximally utilizing the MPSoC resources. In this paper, we therefore focus on developing a scheduling approach for MPSoCs. Thirdly, energy efficiency can be improved by applying Dynamic Voltage and Frequency Scaling (DVFS). DVFS reduces energy consumption by lowering the voltage and frequency of a processor when it is underutilized. Many multi-core processors, such as the ARM11 MPCore [8], support voltage scaling and provide multiple voltage levels for energy optimization. In addition to processors, NoC communication links and routers also consume a large amount of on-chip energy. For the Alpha 21364 processor [9], out of 125 W total on-chip power consumption, 23 W (roughly 20%) is consumed by NoC routers and links, and of those 23 W, the NoC links consume 58%. Therefore, if links support DVFS just as processors do, the energy consumption can be further reduced by scaling the link voltages and frequencies.
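The energy benefit of DVFS can be illustrated with the standard CMOS dynamic-power model, in which power is proportional to the square of the supply voltage times the clock frequency. The sketch below is a minimal illustration under that model only. The effective capacitance, the two voltage/frequency levels, the cycle count, the deadline and the name `dynamic_energy` are assumptions chosen for the example; they are not taken from the energy model used in this paper.

```python
def dynamic_energy(cycles, voltage, frequency, c_eff=1e-9):
    """Dynamic energy in joules for executing `cycles` clock cycles.

    Uses P = c_eff * V^2 * f and t = cycles / f, so E = c_eff * V^2 * cycles.
    c_eff is a hypothetical effective switched capacitance in farads.
    """
    power = c_eff * voltage ** 2 * frequency   # watts
    exec_time = cycles / frequency             # seconds
    return power * exec_time


# Hypothetical task and two hypothetical (voltage, frequency) levels.
CYCLES = 3_000_000
DEADLINE = 0.010            # seconds
FAST = (1.2, 600e6)         # finishes early and leaves slack unused
SLOW = (0.9, 300e6)         # stretches the task into the available slack

for name, (volt, freq) in (("fast", FAST), ("slow", SLOW)):
    t = CYCLES / freq
    e = dynamic_energy(CYCLES, volt, freq)
    print(f"{name}: time={t * 1e3:.1f} ms, energy={e * 1e3:.3f} mJ, "
          f"meets deadline={t <= DEADLINE}")
```

Under these assumed numbers both levels meet the deadline, and the slower level needs (0.9/1.2)^2, or about 56%, of the dynamic energy of the faster one. The same reasoning applies to a voltage-scalable link if a communication is treated like a task with its own slack.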
DVFS is amongst the most effective system-level energy optimization techniques. Hence, many DVFS-based scheduling approaches have been proposed. Amongst the earliest works that apply DVFS, Aydin et al. developed an algorithm to calculate the voltage levels, i.e. speeds, for tasks and used the Earliest Deadline First (EDF) strategy to obtain a feasible task schedule for these optimal voltage levels [10]. In another investigation, Aydin et al. [11] addressed energy-aware scheduling for periodic tasks, computed the optimal speed using the Dynamic Reclaiming Algorithm (DRA) and efficiently utilized the slack while meeting the task deadlines. Tosun [12] mapped periodic tasks on a heterogeneous MPSoC system using ILP to minimize the computational energy consumption. The author also developed two heuristics that deploy the EDF strategy for energy-aware task scheduling. Kumar and Vidyarthi [13] integrated voltage assignment and task mapping within a single optimization loop using a Genetic Algorithm (GA). This approach explored the solution space for a near-optimal solution and achieved 59.4% energy savings compared to Genetic Algorithm-Struggle (GA-ST). Recently, Dziurzanski and Singh suggested a feedback control task scheduling scheme called Admission Control Algorithm (ACA), which performs schedulability analysis to determine the tasks expected to violate their deadline constraints [14]. Although the scheduling approaches presented in [10], [11], [12], [13], [14] efficiently perform energy-aware task scheduling on multiprocessor systems, they consider tasks without precedence constraints, i.e. independent task models.
Many studies investigate the problem of DVFS-based energy-aware scheduling of tasks with precedence constraints on MPSoCs. For example, Singh et al. [15] design a DVFS-based scheduling approach for streaming applications. Their approach consists of an off-line analysis that, under worst-case execution times, determines the tasks whose execution speed can be slowed down, and an on-line analysis that makes use of the slack arising from tasks that complete before their worst-case execution times. Lui et al. [16] design an energy-efficient scheduling approach for real-time streaming applications on cluster heterogeneous MPSoCs. They first derive an initial task mapping based on the first-fit decreasing heuristic and then remap a subset of tasks to unused clusters to further reduce the energy consumption. Wang et al. [17] formulated the scheduling problem as an Integer Linear Program (ILP) for a homogeneous MPSoC architecture in order to reduce both the computation and communication energy consumption of streaming applications. This formulation obtains an optimal solution with minimum schedule length, while DVFS minimizes the wasted slack in the schedule. Similarly, Huang et al. [18] used an ILP formulation to reduce the energy consumption of the processors and NoC links. The authors also developed a heuristic algorithm called Simulated Annealing with Timing Adjustment (SA-TA) to minimize the execution time while achieving a global optimum under tight timing constraints. Chen et al. [19] applied Mixed Integer Linear Programming (MILP) to a NoC-based MPSoC architecture and developed a scheduling algorithm that generates a non-preemptive schedule and assigns a discrete voltage level to each task to reduce the energy consumption. The surveys [20], [21], [22] and [23] discuss in detail the scheduling of tasks with precedence constraints on multi-processor architectures.
All these approaches assume that only processors are voltage scalable. Therefore, the DVFS approaches allocate all the slack to tasks only. Andrei et al. [24], [25] show that if the communication architecture is voltage scalable like the processors, more energy can be saved by sharing the available slack between communications and tasks. They propose NLP-based and MILP-based DVFS algorithms for a task set with precedence constraints on heterogeneous MPSoCs. Their approach shares the available slack between task and communication nodes such that the total energy consumption is minimized. Li and Wu [26] propose a task mapping, scheduling and DVFS algorithm for a task set with precedence constraints on a homogeneous NoC-based MPSoC model with voltage-scalable links and processors. Their approach has two steps. In the first step, a quadratic-programming-based mapping algorithm maps tasks to processors such that the total weighted communication distance is minimized. In the second step, a GA assigns voltages and frequencies to tasks and communications. Ali et al. [27] develop a Contention-aware Integrated Task Mapping and Voltage Assignment (CITM-VA) approach for static energy management that schedules the tasks based on the Earliest Latest Finish Time First (ELFTF) strategy. The authors assign discrete voltage and frequency levels to both the processors and the NoC links using GA.
The approaches discussed so far schedule sets of tasks with precedence constraints (also called task graphs, TGs) on multi-core architectures. This model is a special case of a task set with conditional precedence constraints (also called conditional task graphs, CTGs). Scheduling approaches designed for CTGs are also applicable to TGs because every TG is a CTG. The converse does not hold, because not every CTG is a TG. A few approaches have been proposed for scheduling CTGs on multi-processor architectures with the objective of minimizing energy consumption. For instance, the work of Xie and Wolf [28] is one of the earliest investigations on the scheduling of tasks with conditional precedence constraints on multiprocessor computing architectures. Shin and Kim [29] presented a scenario-based static Non-Linear Programming (NLP) algorithm that assigns a speed to each task depending on the scenario in order to reduce the overall energy consumption. Wu et al. [30] developed an approach that uses a schedule table generated by the approach of Eles et al. [31] to determine the available slack and assigns a voltage to each task using a heuristic. Tariq et al. [32] scheduled conditional tasks with precedence constraints on homogeneous MPSoCs for energy optimization and formulated the scheduling problem as an NLP. The authors further extended their work on CTGs and developed an Iterative Offline Energy-aware Task and Communication Scheduling (IOETCS) algorithm to perform voltage scaling and scheduling in an integrated manner. This approach uses the Earliest Successor-Tree-Consistent Deadline First algorithm to generate an initial task schedule and then assigns discrete voltage levels to the tasks using either a heuristic-based algorithm or ILP [33].
One of the major drawbacks of these approaches is that they may not be able to fully utilize the MPSoC resources because the intra-period data dependencies between tasks limit the degree of parallelism in a streaming application. The degree of parallelism can be maximized through software pipelining or retiming.
Retiming reschedules a parent task a few periods ahead of its child task so that the data needed by the child task is available at the start of the period. Consequently, the start time of the child task is not constrained by the finish time of the parent task. In simple words, retiming converts the CTG into an independent task model by transforming intra-period data dependencies into inter-period data dependencies. Integrating retiming with DVFS can significantly reduce energy consumption: once the intra-period data dependencies between tasks are removed, the slack that is otherwise wasted due to these dependencies or due to inter-processor communication overhead can be used for energy optimization. Several pipelining-based loop scheduling approaches [34], [35], [36], [37], [38], [39], [40] and [41] have been proposed to minimize the schedule makespan or improve system performance. A few approaches focus on optimizing energy consumption by integrating DVFS with software pipelining. Kim et al. [42] propose a pipelining-based power reduction technique to optimize energy consumption. Their approach focuses only on uniprocessor systems and cannot be applied directly to multiprocessor systems. Shao et al. [43] propose a loop scheduling approach on a multi-processor platform and optimize the energy consumption by integrating DVFS with pipelining. The loop optimization approach proposed in [43] is based on instruction-level pipelining and therefore cannot be applied to a periodic task set.
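To make the transformation concrete, the following sketch assigns each task a retiming value equal to the length of the longest path from that task to a sink, so that every parent is shifted at least one period ahead of each of its children. This is one simple way to remove all intra-period dependencies and is offered only as an illustration. It is not the retiming algorithm proposed in this paper, nor the RDAG algorithm of [44], and the example graph and the names `retime_all_edges` and `prologue_periods` are hypothetical.

```python
from functools import lru_cache


def retime_all_edges(nodes, edges):
    """Return {node: r}, where r is the number of periods the node is shifted ahead.

    r(sink) = 0 and r(n) = 1 + max(r(child)), i.e. the longest path to a sink.
    With these values every edge (u, v) satisfies r(u) > r(v).
    """
    children = {n: [] for n in nodes}
    for parent, child in edges:
        children[parent].append(child)

    @lru_cache(maxsize=None)
    def r(node):
        if not children[node]:
            return 0
        return 1 + max(r(c) for c in children[node])

    return {n: r(n) for n in nodes}


# Hypothetical diamond-shaped task graph: A feeds B and C, which both feed D.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]

retiming = retime_all_edges(nodes, edges)

# An edge remains an intra-period dependency only if parent and child
# carry the same retiming value.
intra_period = [(u, v) for u, v in edges if retiming[u] == retiming[v]]

# Number of warm-up periods needed before the steady-state schedule starts.
prologue_periods = max(retiming.values())

print(retiming, intra_period, prologue_periods)
```

In this example no intra-period dependency survives, at the cost of a two-period prologue. Retiming every edge is the most aggressive choice; approaches that aim to limit prologue latency or memory retime only the edges that actually cause wasted slack.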
Our work is closely related to [7] and [44]. Wang et al. [7], [44] propose approaches to schedule Directed Acyclic Graph (DAG) based TGs on multi-processor systems and optimize the energy consumption by integrating task-level coarse-grained software pipelining with DVFS. Wang et al. [7] use coarse-grained software pipelining to optimally remove inter-processor communication overhead. They propose Mixed Integer Linear Programming (MILP) based algorithms to optimally regroup tasks and communications from different periods. Their MILP-based algorithm, called Joint Computation and Communication Task Scheduling (JCCTS), aims to remove all communication overheads from the schedule such that the latency overhead is minimized. They show that JCCTS can significantly reduce energy consumption when combined with a DVFS technique. Wang et al. [44] combine coarse-grained software pipelining with DVFS to optimize the energy consumption, and transform the dependent task model into an independent task model with an algorithm called RDAG. They also propose an algorithm called GeneS that solves the problem of task mapping, ordering and voltage assignment in an integrated manner. GeneS is a genetic algorithm that searches the mapping space for a mapping that minimizes energy consumption. The main objective of these approaches is to minimize the prologue latency or retiming delay; they neglect the memory overhead incurred due to retiming.
Streaming applications that run on MPSoCs require large buffers to store intermediate processing results; consequently, the total size of the buffer arrays accounts for a significant portion of the application's binary memory footprint [45]. Retiming adds further memory overhead, which significantly increases the probability of violating memory capacity constraints. Wang et al. [45] propose a MILP-based algorithm called Memory-Aware Optimal Task Scheduling (MAOTS) and a heuristic algorithm called Heuristic Memory-Aware Task Scheduling (HMATS). The objective of both algorithms is to regroup tasks and communications such that the inter-processor communication overhead is reduced and the memory overhead is minimized. Although both MAOTS and HMATS try to reduce the memory footprint of retiming, they are designed for TGs and minimize the schedule makespan, not the energy consumption of CTGs.
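The interaction between retiming and memory capacity can be sketched with a simple accounting model. The code below assumes, purely for illustration, that an edge whose parent runs `d` periods ahead of its child must keep `d` additional copies of its buffer alive on the child's processor. This cost model, the buffer sizes, the capacities and the names `retiming_memory_overhead` and `violating_processors` are assumptions, not the memory analysis of this paper or of [45].

```python
def retiming_memory_overhead(edges, retiming, buffer_size, mapping):
    """Extra buffer units per processor caused by retiming.

    Assumed model: edge (u, v) with d = r(u) - r(v) holds d additional
    copies of its buffer on the processor that runs the child v.
    """
    overhead = {}
    for u, v in edges:
        d = retiming[u] - retiming[v]
        proc = mapping[v]
        overhead[proc] = overhead.get(proc, 0) + d * buffer_size[(u, v)]
    return overhead


def violating_processors(overhead, base_usage, capacity):
    """Processors whose base usage plus retiming overhead exceeds capacity."""
    return [p for p in sorted(capacity)
            if base_usage.get(p, 0) + overhead.get(p, 0) > capacity[p]]


# Hypothetical retimed diamond graph mapped onto two processors (sizes in KB).
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
retiming = {"A": 2, "B": 1, "C": 1, "D": 0}
buffer_size = {("A", "B"): 4, ("A", "C"): 2, ("B", "D"): 3, ("C", "D"): 1}
mapping = {"A": 0, "B": 0, "C": 1, "D": 1}

overhead = retiming_memory_overhead(edges, retiming, buffer_size, mapping)
violations = violating_processors(overhead,
                                  base_usage={0: 10, 1: 10},
                                  capacity={0: 16, 1: 15})
print(overhead, violations)
```

With these assumed numbers processor 1 exceeds its capacity, so a memory-aware approach would have to reduce the retiming of some edge that ends on that processor, trading recovered slack for a feasible memory footprint.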
In this work, we investigate the problem of scheduling a set of periodic tasks and communications with conditional precedence constraints, a common period and individual deadlines less than or equal to the period on NoC-based MPSoCs, and of optimizing their total energy consumption by integrating retiming with DVFS. We make the following major contributions:
1. We propose a novel mapping algorithm that aims to balance the workload across all the processors.
2. We propose a mapping-aware retiming algorithm, MRTCCTG, that aims to minimize wasted slack by transforming intra-period dependencies into inter-period dependencies.
3. We propose a novel DVFS algorithm that uses an NLP-based algorithm to assign continuous voltages and frequencies to tasks and communications, and then uses a heuristic algorithm to map the continuous frequencies and voltages to discrete frequencies and voltages.
4. Our approach integrates retiming with DVFS for real-time applications modelled by Conditional Task Graphs (CTGs). It ensures that the memory overhead incurred due to the initial retiming does not violate the memory capacity constraints.
5. Our experimental results show that, compared to GeneS, our approach obtains an improvement in the range of 1.6 to 18 percent, with an average improvement of 11 percent. Compared to JCCTS, our approach achieves an improvement in the range of 9 to 42 percent, with an average improvement of 26 percent.
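The second stage of the DVFS algorithm in the third contribution, mapping a continuous frequency to a discrete level, can be illustrated with a conservative rule: pick the lowest available level whose frequency is at least the continuous value, so that the execution time never grows beyond what the continuous solution allowed. This rule is an assumption made for the sketch and is not necessarily the heuristic developed in Section 4; the level table and the name `to_discrete_level` are hypothetical.

```python
import bisect


def to_discrete_level(f_cont, levels):
    """Return the lowest (frequency, voltage) level with frequency >= f_cont.

    `levels` must be sorted by ascending frequency. Rounding up keeps the
    task no slower than in the continuous solution, so timing still holds.
    """
    freqs = [f for f, _ in levels]
    i = bisect.bisect_left(freqs, f_cont)
    if i == len(levels):
        raise ValueError("required frequency exceeds the highest available level")
    return levels[i]


# Hypothetical discrete (frequency in Hz, voltage in V) levels.
LEVELS = [(200e6, 0.8), (400e6, 1.0), (600e6, 1.2)]

# A continuous solution of 350 MHz is rounded up to the 400 MHz level.
chosen = to_discrete_level(350e6, LEVELS)
print(chosen)
```

Rounding up always wastes some of the slack found by the continuous solution. A less conservative heuristic could round some tasks down and compensate by rounding others up, provided every deadline is still met.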
The rest of the paper is organized as follows. In Section 2 we present the formal definition of retiming and discuss the application, system, and energy models that we use in simulations. In Section 3 we discuss our energy-aware scheduling approach. In Section 4 we present the discrete DVFS algorithm. In Section 5 we present our novel retiming function and discuss in detail our memory analysis approach. In Section 6 we explain our energy and memory-aware retiming approach. Experimental results are presented in Section 7, followed by the conclusion of the paper in Section 8.
Models and definitions
In this section we discuss the application, system, and energy models used in the simulations. In this paper, we use the terms tile and processor interchangeably.
Energy-aware task mapping and scheduling algorithms
In this section we discuss our task mapping and initial scheduling algorithms. The task mapping specifies the processor on which each task executes, and the initial schedule specifies the order in which tasks and communications will execute. In Section 4 we will discuss the DVFS algorithm that assigns a start time and an execution time to each task and communication. This section and Section 4 discuss the algorithms for constructing a schedule for one period. Since we consider a periodic application the
Discrete frequency assignment
In order to schedule tasks and communications in a unified way, we first transform a CTG into an extended CTG by adding an additional node for every edge whose head node and tail node are mapped on different processors. We refer to these additional nodes as communication nodes. The original nodes are kept unchanged and are referred to as task nodes. Specifically, for each edge whose head and tail node are mapped on different processors, we add a
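The construction of the extended CTG described in this section can be sketched as follows. The sketch assumes that an edge crossing processors is replaced by two edges that pass through the new communication node, and it ignores edge conditions; neither detail is confirmed by the text above. The function name `extend_ctg`, the `c_<tail>_<head>` naming scheme and the example graph are hypothetical.

```python
def extend_ctg(nodes, edges, mapping):
    """Insert a communication node on every edge that crosses processors.

    Returns (ext_nodes, ext_edges), where ext_nodes maps each node name to
    "task" or "comm". Edges between tasks on the same processor are kept.
    """
    ext_nodes = {n: "task" for n in nodes}
    ext_edges = []
    for u, v in edges:
        if mapping[u] != mapping[v]:
            comm = f"c_{u}_{v}"              # hypothetical naming scheme
            ext_nodes[comm] = "comm"
            ext_edges.append((u, comm))
            ext_edges.append((comm, v))
        else:
            ext_edges.append((u, v))
    return ext_nodes, ext_edges


# Hypothetical graph: A and B share processor 0, C runs on processor 1.
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("A", "C")]
mapping = {"A": 0, "B": 0, "C": 1}

ext_nodes, ext_edges = extend_ctg(nodes, edges, mapping)
print(ext_nodes)
print(ext_edges)
```

Treating each inter-processor transfer as a node of its own lets one scheduler and one DVFS formulation handle tasks and communications uniformly, since both then have a start time, an execution time and a frequency.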
Mapping-aware retiming CTG
Fig. 2(d) shows a schedule of the CTG in Fig. 2(a) on the MPSoC in Fig. 1(b). In this example the execution time of every task node at maximum processor frequency is 3 time units, and the execution time of every communication node is 1 time unit, except for one communication node whose execution time is 2 time units. The period is 15 time units. Fig. 2(d) shows that there is a large amount of wasted slack in the schedule. This wasted slack can be utilized through retiming, as demonstrated in Fig. 2. In Fig. 2(e) node
Energy and memory-aware retiming
The schedule is infeasible if it violates memory capacity constraints. Although MRTCCTG reduces the wasted slack, it is not memory-aware. Consequently, the retimed schedule may violate memory capacity bounds. Next we describe our energy and memory-aware retiming CTG (EMRCTG) approach. It has two main phases, which we explain in the following.
Phase 1: We first map the tasks of the CTG to processors. Next we retime the CTG by MRTCCTG, construct the retimed graph and generate the schedule.
Performance evaluation
We compare our approach with two approaches, JCCTS [7] and GeneS [44]. The task assignment and schedule for JCCTS are generated by the Communication-Aware Task Scheduling Algorithm (CATS) [57]. CATS is a communication-aware task mapping and scheduling approach that can optimize the energy consumption of both tasks and communications. It is shown in [7] that JCCTS can significantly reduce the energy consumption of the schedule generated by CATS by totally removing the communication overhead and use the
Conclusions
In this paper we have explored the problem of energy-efficient scheduling of real-time applications modelled by CTGs on NoC-based MPSoCs. We have proposed an approach called EMRCTG that retimes tasks and communications such that total energy consumption is minimized and memory capacity constraints are satisfied. We have proposed a novel scheduling algorithm that balances the workload across all the processors and prioritizes nodes with shorter deadlines by scheduling them earlier in time than
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (64)
- et al., A radical approach to network-on-chip operating systems
- et al., Energy-efficient mapping of real-time streaming applications on cluster heterogeneous MPSoCs
- et al., A survey on application mapping strategies for network-on-chip design, J. Syst. Archit. (2013)
- et al., Energy-efficient contention-aware application mapping and scheduling on NoC-based MPSoCs, J. Parallel Distrib. Comput. (2016)
- et al., Timing optimization via nest-loop pipelining considering code size, Microprocess. Microsyst. (2008)
- Mobile processor Exynos 5 Octa (5422),...
- Zynq UltraScale+ MPSoCs, https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html. (Accessed 4...
- et al., Tile64-processor: A 64-core SoC with mesh interconnect
- et al., Achieving QoS in NoC-based MPSoCs through dynamic frequency scaling
- et al., On-chip communication architecture exploration: A quantitative evaluation of point-to-point, bus, and network-on-chip approaches, ACM Trans. Des. Autom. Electron. Syst. (2007)
- Optimally removing intercore communication overhead for streaming applications on MPSoCs, IEEE Trans. Comput.
- ARM MPCore; the streamlined and scalable ARM11 processor core
- The Alpha 21364 network architecture
- Determining optimal processor speeds for periodic real-time tasks with different power characteristics
- Power-aware scheduling for periodic real-time tasks, IEEE Trans. Comput.
- Energy- and reliability-aware task scheduling onto heterogeneous MPSoC architectures, J. Supercomput.
- A GA based energy aware scheduler for DVFS enabled multicore systems, Computing
- Feedback-based admission control for firm real-time task allocation with dynamic voltage and frequency scaling, Computers
- Optimal task scheduling by removing inter-core communication overhead for streaming applications on MPSoC
- Energy-aware task allocation for network-on-chip based heterogeneous multiprocessor systems
- Processors allocation for MPSoCs with single ISA heterogeneous multi-core architecture, IEEE Access
- A survey of techniques for improving energy efficiency in embedded computing systems, Int. J. Comput. Aided Eng. Technol.
- Energy-aware scheduling for real-time systems: a survey, ACM Trans. Embedded Comput. Syst.
- A survey and comparative study of hard and soft real-time dynamic resource allocation strategies for multi-/many-core systems, ACM Comput. Surv.
- Simultaneous communication and processor voltage scaling for dynamic and leakage energy reduction in time-constrained systems
- Energy optimization of multiprocessor systems on chip by voltage selection, IEEE Trans. Very Large Scale Integr. (VLSI) Syst.
- Contention & energy-aware real-time task mapping on NoC based heterogeneous MPSoCs, IEEE Access
- Allocation and scheduling of conditional task graph in hardware/software co-synthesis
- Power-aware scheduling of conditional task graphs in real-time multiprocessor systems
- Scheduling and mapping of conditional task graph for the synthesis of low power embedded systems, IEE Proc. E
- Scheduling of conditional process graphs for the synthesis of embedded systems
Umair Ullah Tariq received the master’s degree from the University of Engineering and Technology, Taxila, Pakistan, and the Ph.D. degree in computer science and engineering from the University of New South Wales, Sydney, NSW, Australia. He is currently a Lecturer with the Department of Electrical Engineering, COMSATS Institute of Information Technology, Abbotabad, Pakistan. He has published several research papers in prominent conferences and journals. His current research interests include energy-aware scheduling, digital image processing, and computer vision.
Hui Wu received the BE and ME degrees from Huazhong University of Science and Technology and the PhD degree from the National University of Singapore. His early career was mainly focused on CNC systems. He was the chief software architect and developer of the Aerospace I CNC System, vice director of the Numerical Control Institute of Huazhong University of Science and Technology between 1992 and 1995, and a co-founder of Wuhan Huazhong Numerical Control Co. Ltd. He is currently a lecturer at the University of New South Wales, Sydney, Australia. His research areas include embedded systems, parallel and distributed systems, and wireless sensor networks.