Three-phase time-aware energy minimization with DVFS and unrolling for Chip Multiprocessors

https://doi.org/10.1016/j.sysarc.2012.07.001

Abstract

Energy consumption has been one of the most critical issues in the Chip Multiprocessor (CMP). Using the Dynamic Voltage and Frequency Scaling (DVFS), a CMP system can achieve a balance between the performance and the energy-efficiency. In this paper, we propose a three-phase discrete DVFS algorithm for a CMP system dedicated to applications where the period of the applications’ task graph is smaller than the deadline of tasks. In these applications, multiple task graphs are unrolled and then concatenated together to form a new task graph. The proposed DVFS algorithm is applied to the newly formed task graph to stretch tasks’ execution time, lower operating frequencies of processors and achieve the system power efficiency. Experimental results show that the proposed algorithm reduces the energy dissipation by 25% on average, compared to previous DVFS approaches.

Introduction

Computer architecture has evolved from single core to multi-core. Energy dissipation in a Chip Multiprocessor (CMP) system is becoming a major concern, especially for handset/embedded devices with CMP systems. These devices are usually battery-powered, and their sizes are quite small. Higher power consumption produces more heat, and accumulated heat degrades device reliability. Moreover, in order to extend the battery lifetime, the power consumption of a CMP system must be kept low. Therefore, low-power design is essential for battery-powered CMP systems. In addition to dynamic power management (DPM) [1], another effective mechanism to reduce power consumption is Dynamic Voltage and Frequency Scaling (DVFS) [2], [3], which achieves power savings by dynamically adjusting the supply voltage and operating frequency [4] of the processing elements/cores, subject to data dependencies and timing constraints of the system. For the sake of convenience, core, processor, and processing element (PE) are used interchangeably in this paper.
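As a rough illustration of why lowering the supply voltage and frequency saves energy, the sketch below uses the textbook CMOS dynamic-energy model (energy per cycle proportional to the square of the supply voltage). This model, the operating points in `LEVELS`, and the capacitance `C_EFF` are assumptions made for illustration only; none of them is taken from this paper.

```python
# Hypothetical (voltage in V, frequency in Hz) operating points; not from the paper.
LEVELS = [(0.75, 150e6), (1.0, 400e6), (1.3, 600e6), (1.6, 800e6)]
C_EFF = 1e-9  # assumed effective switched capacitance per cycle, in farads


def exec_time(cycles, freq):
    """Execution time in seconds for a task of `cycles` clock cycles."""
    return cycles / freq


def dynamic_energy(cycles, volt):
    """Textbook dynamic energy: C_eff * V^2 per cycle (leakage ignored)."""
    return C_EFF * volt ** 2 * cycles


def lowest_energy_level(cycles, deadline):
    """Return the feasible (volt, freq) with the least energy, or None."""
    feasible = [(v, f) for v, f in LEVELS if exec_time(cycles, f) <= deadline]
    if not feasible:
        return None
    return min(feasible, key=lambda vf: dynamic_energy(cycles, vf[0]))
```

Under this model, a looser deadline lets the task run at a lower voltage level, so the same number of cycles costs less energy; a deadline that no level can meet yields `None`.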

Early DVFS research [5] focused on a uniprocessor system executing independent tasks. To handle applications on multiprocessor systems, several DVFS algorithms [6], [7], [8], [9], [10], [11] have been proposed recently to deal with task stretching and scheduling under data dependency and timing constraints. DVFS algorithms for a CMP system are usually developed based on the task graph. A heuristic DVFS algorithm that searches for the critical path in a task graph was presented in [6]. It uniformly stretches the execution time of each task on the critical path until no task on that path can be stretched further. After that, the timing constraints for the task graph are updated and a new critical path is searched for. If no task on the new critical path can be stretched further, the algorithm terminates. Otherwise, the above procedure is applied to the new critical path. However, the algorithm is suboptimal for applications where different tasks have different stretching ability, or where they are executed on heterogeneous distributed architectures, in which different processors have different voltage scaling characteristics. Luo et al. [7] and Schmitz et al. [12] proposed two DVFS algorithms to overcome these drawbacks. In these algorithms, the concept of energy gradient is used for distributing slacks among tasks. The slack of a task is always allocated to the processor that has the largest energy gradient among all processors in the system. In their work, DVFS is applied with the assumption that the supply voltage of the PE can be adjusted continuously. Zhang et al. [8] formulated the DVFS problem as an Integer Linear Programming (ILP) problem, which is efficiently solved by approximation. However, the power profile (available voltage levels and corresponding operating frequencies) of the PE is not considered in this work.
In addition, these algorithms share a common limitation: they only look at one period (Tperiod) of the task graph, i.e., a directed acyclic graph (DAG). In a real-life application, the deadline of a task may be greater than Tperiod of the DAG. Consider a sensing and data processing application implemented on a CMP system that consists of a sensor and 2 PEs. The sensor acquires data and feeds them into PE1 for task1; PE1’s output feeds PE2 for task2. If the deadlines of the tasks are 3 and 4 ms, respectively, and the sensor sampling interval is 1 ms, we need to finish the whole task graph in 1 ms, assuming no delayed processing. Hence, Tperiod is 1 ms and the deadlines of both task1 (3 ms) and task2 (4 ms) are larger than the task graph period (1 ms).
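The arithmetic of this example can be sketched as follows. The function names are hypothetical, and the sketch assumes that the k-th instance of a task is released at k times the period with its deadline measured from that release; this assumption is only used to illustrate why several periods of the task graph overlap in time.

```python
import math

T_PERIOD_MS = 1                          # sensor sampling interval from the example
DEADLINES_MS = {"task1": 3, "task2": 4}  # relative deadlines from the example


def absolute_deadline(task, k, period=T_PERIOD_MS):
    """Deadline of the k-th instance, assuming release at k * period."""
    return k * period + DEADLINES_MS[task]


def overlapping_periods(period=T_PERIOD_MS):
    """Number of task-graph periods spanned by the longest deadline."""
    return math.ceil(max(DEADLINES_MS.values()) / period)
```

With a 1 ms period and a longest deadline of 4 ms, four consecutive instances of the task graph have deadlines that are still open at the same time, which is the slack an algorithm limited to one period cannot see.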

The authors of [13] proposed a DVFS algorithm based on task graph unrolling. They formulated the DVFS problem as an ILP problem, and then solved the ILP to obtain the optimal scheme to stretch tasks’ execution times, lower the operating frequencies of processors and reduce energy dissipation. However, the ILP-based approach suffers from high computational complexity. Furthermore, the power profile of a processor is not considered in their formulation. The authors of [14] used nonlinear programming and mixed integer linear programming to minimize energy consumption. They only focused on the voltage selection problem, and the approach is very complicated. In [15], the authors combined a hardware-controlled energy management approach, DVFS, with an earliest deadline first method to reduce energy consumption. However, their approach can be systematically improved.

DVFS has proved to be a powerful approach to reducing energy consumption [16], [17], [18], [3]. Effective exploitation of task slacks is the key to reducing the power of a system. Unrolling task graphs clusters task slacks together, which provides a good opportunity for exploiting DVFS to reduce the energy consumption. In this paper, we propose a three-phase discrete DVFS algorithm that considers the power profiles of PEs and uses an unrolled task graph for task scheduling. We first present a new discrete DVFS algorithm that takes into account the power profiles of PEs. Furthermore, we propose a three-phase DVFS algorithm that achieves better energy saving by clustering task slacks via task graph unrolling. In the first phase, we use a task-scheduling heuristic to assign tasks to PEs. In the second phase, the proposed discrete DVFS algorithm is applied to the given task graph for only one period [19]. In the last phase, the task graphs resulting from the first two phases are unrolled and chained together to obtain a new task graph. In the new task graph, new task slacks are generated so that the discrete DVFS algorithm can be applied again to further reduce the energy consumption.
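The unrolling step of the last phase can be pictured with the minimal sketch below. It is not the authors' procedure: the function `unroll`, its arguments, and in particular the chaining rule (instance k of a task precedes instance k + 1 of the same task) are assumptions introduced only to show how copies of one period can be concatenated into a single larger DAG.

```python
def unroll(deadlines, edges, period, copies):
    """Unroll a periodic task graph `copies` times and chain the copies.

    deadlines: {task: relative deadline}; edges: [(src, dst)] within one period.
    Returns ({(task, k): absolute deadline}, [((src, k), (dst, k2)), ...]).
    """
    nodes = {}
    new_edges = []
    for k in range(copies):
        for task, dl in deadlines.items():
            # Assumed release time of the k-th copy is k * period.
            nodes[(task, k)] = k * period + dl
        # Precedence constraints inside the k-th copy are kept as they were.
        new_edges += [((u, k), (v, k)) for u, v in edges]
        # Assumed chaining rule: instance k of a task precedes instance k + 1.
        if k + 1 < copies:
            new_edges += [((t, k), (t, k + 1)) for t in deadlines]
    return nodes, new_edges
```

For example, `unroll({"a": 3, "b": 4}, [("a", "b")], period=1, copies=3)` produces six task instances whose absolute deadlines extend beyond any single period, so a slack analysis on the combined graph can consider all of them together.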

Experimental results show that the proposed algorithm reduces the energy dissipation by 25% on average, compared to the existing approaches. In addition to achieving more energy savings, our proposed algorithm also reduces the number of idle intervals of the PEs.

The rest of this paper is organized as follows. In Section 2, a motivational example demonstrates that the overall energy dissipation can be further reduced by applying DVFS to an unrolled task graph. Section 3 formally formulates the problem. In Section 4, the proposed algorithm is explained in detail. Sections 5 and 6 give the experimental results and conclusions, respectively.

Section snippets

A motivational example

In this section, we use a simple example to show how effectively combining DVFS and task graph unrolling saves energy. We consider a DVFS-enabled processor similar to Intel’s XScale processor [20]. Fig. 1 shows a task graph with 5 tasks and their data dependencies. The period of the task graph is 2.6 ms. There are two identical processing elements in the system: PE0 and PE1. The deadline, the worst-case execution time (WCET) and the corresponding energy dissipation are given in Table 1.

Problem formulation and assumptions

The application tasks and their precedence constraints are usually modeled as a directed acyclic graph (DAG), i.e., the task graph. In a DAG $G(V,E)$, node $v_i \in V$ denotes a task and edge $e_{i,j} = (v_i, v_j) \in E$ denotes a precedence constraint and data dependency between tasks $v_i$ and $v_j$. Each task $v_i$ is associated with a deadline $dl_i$, by which the task must finish its execution. $dl_i$ can be larger than $T$, the period of the task graph.
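To make the task-graph model concrete, the sketch below stores a DAG as per-task WCETs, deadlines and an edge list, and computes each task's slack with a standard forward/backward pass (latest finish time minus earliest finish time). The function `slacks` and its data layout are hypothetical, and the sketch deliberately ignores PE assignment and contention, so it shows only the general idea of slack in a DAG and not the paper's algorithm.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def slacks(wcet, deadline, edges):
    """Per-task slack = latest finish - earliest finish, ignoring PE contention."""
    preds = {v: set() for v in wcet}
    succs = {v: set() for v in wcet}
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)
    # Predecessors always appear before their successors in this order.
    order = list(TopologicalSorter(preds).static_order())
    earliest = {}
    for v in order:  # forward pass: earliest finish times
        start = max((earliest[p] for p in preds[v]), default=0)
        earliest[v] = start + wcet[v]
    latest = {}
    for v in reversed(order):  # backward pass: latest finish times
        bound = min((latest[s] - wcet[s] for s in succs[v]), default=deadline[v])
        latest[v] = min(deadline[v], bound)
    return {v: latest[v] - earliest[v] for v in wcet}
```

A task with positive slack can have its execution time stretched by that amount without violating its own deadline or delaying any successor past its latest start time.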

In this paper, we assume that the target processor has homogeneous

Proposed DVFS algorithm

In this section, we propose a three-phase algorithm that saves more energy by unrolling the task graph. In the first phase, we use a task-scheduling heuristic to assign tasks to PEs. In the second phase, the proposed discrete DVFS algorithm is applied to the given task graph for only one period. In the last phase, the task graphs resulting from the first two phases are unrolled and chained together to obtain a new task graph. In the new task graph, new task slacks are

Experimental results

In this section, we show that the proposed three-phase DVFS algorithm achieves a larger energy reduction than the approach in which the DVFS technique is limited to one period of the task graph. In our experimental setup, we consider a DVFS-enabled processor similar to Intel’s XScale processor [20]. There are two PEs in the system. Each PE has identical voltage levels and frequency levels as shown in Table 2.

Thirteen task graphs are generated using TGFF [26], as shown in Fig. 6. The

Conclusion

In this paper, we proposed a three-phase DVFS algorithm for a CMP system. This algorithm is dedicated to applications where the deadline of a task is larger than one period of the application’s task graph. In the first phase, we use a task-scheduling heuristic to assign tasks to PEs. In the second phase, the proposed DVFS algorithm is used, limited to one period of the task graph. Since the deadline of a task is larger than one period of the task graph in these applications, we unroll

Acknowledgements

This work was supported in part by the NSF CNS-1249223, NSFC 61071061; NSFC 61170077, SZ-HK Innovation Circle Proj. ZYB200907060012A, NSF GD:10351806001000000, S & T Proj. of SZ JC200903120046A.

References (26)

  • Q. Qiu, S. Liu, Q. Wu, Task merging for dynamic power management of cyclic applications in real-time multi-processor...
  • M. Weiser, B. Welch, A. Demers, S. Shenker, Scheduling for reduced CPU energy, in: USENIX Symposium on Operating...
  • S. Liu, Q. Wu, Q. Qiu, An adaptive scheduling and voltage/frequency selection algorithm for real-time energy harvesting...
  • T.D. Burd et al., A dynamic voltage scaled microprocessor system, IEEE Journal of Solid-State Circuits (2000)
  • F. Yao, A. Demers, S. Shenker, A scheduling model for reduced CPU energy, in: IEEE Symposium on Foundations of Comp....
  • J. Luo, N.K. Jha, Static and dynamic variable voltage scheduling algorithms for real-time heterogeneous distributed...
  • M.T. Schmitz, B.M. Al-Hashimi, Considering power variations of DVS processing elements for energy minimization in...
  • Y. Zhang, X. Hu, D.Z. Chen, Task scheduling and voltage selection for energy minimization, in: Proc. of Design...
  • M. Qiu et al., Dynamic and leakage energy minimization with soft real-time loop scheduling and voltage assignment, IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2010)
  • M. Qiu et al., Cost minimization while satisfying hard/soft timing constraints for heterogeneous embedded systems, ACM Transactions on Design Automation of Electronic Systems (TODAES) (2009)
  • Z.-L. Zong et al., EAD and PEBD: two energy-aware duplication scheduling algorithms for parallel tasks on homogeneous clusters, IEEE Transactions on Computers (2011)
  • J. Luo, N.K. Jha, Power-profile driven variable voltage scaling for heterogeneous distributed real-time embedded...
  • K. Srinivasan, K. Chatha, An ILP formulation for system level throughput and power optimization in multiprocessor SoC...

    Meikang Qiu received the B.E. and M.E. degrees from Shanghai Jiao Tong University, China. He received the M.S. and Ph.D. degrees of Computer Science from University of Texas at Dallas in 2003 and 2007, respectively. He had worked at Chinese Helicopter R&D Institute and IBM. Currently, he is an assistant professor of ECE at University of Kentucky. He is an IEEE Senior member and has published more than 140 papers, including 50+ journals. He is the recipient of the ACM Transactions on Design Automation of Electronic Systems (TODAES) 2011 Best Paper Award. He also received four other Best Paper Awards (IEEE ICESS’12, IEEE/ACM GreenCom’10, IEEE CSE’10, and IEEE EUC’09) and one Best Paper Nomination. His paper about cloud computing has been ranked as the most downloaded paper of JPDC in 2012. He also holds 2 patents and has published 3 books. His research has been supported by NSF, ONR, and Air Force. He has also been awarded Naval Summer Faculty 2012 and SFFP Air Force summer faculty 2009. He has been on various chairs and TPC members for many international conferences. He served as the Program Chair of IEEE EmbeddCom’09 and EM-Com’09. His research interests include embedded systems, computer security, and wireless sensor networks.

    Zhong Ming is a professor at College of Computer and Software Engineering of Shenzhen University. He is a member of a council and senior member of China Computer Federation. His major research interests are software engineering and embedded systems. He led two projects of National Natural Science Foundation, and two projects of Natural Science Foundation of Guangdong province, China.

    Jiayin Li received the B.E. and M.E. degrees from Huazhong University of Science and Technology (HUST), China, in 2002 and 2006, respectively. He obtained Ph.D. degree from the Department of Electrical and Computer Engineering (ECE), University of Kentucky in May 2012. His research interests include software/hardware co-design for embedded system and high performance computing.

    Shaobo Liu received the B.S. degree in material science and engineering from Wuhan University of Technology, Wuhan, China, in 2001, the M.S. degree in electrical engineering from Zhejiang University, Hangzhou, China, in 2004, and the Ph.D. degree in electrical and computer engineering from State University of New York, Binghamton, in 2010. He is currently with Marvell Semiconductor, Inc., Marlborough, MA. His research interests include power/thermal analysis and optimization, leakage estimation and minimization, energy harvesting system design, and energy aware computing.

    Bin Wang obtained his B.S. from Zhejiang University in 1992, M.S. from University of Louisville in 1994, and Ph.D. from Ohio State University in 2000. He is a professor at Computer Science department in Wright State University. He obtained US Department of Energy Early Career Award in 2003. His research interests include wireless sensor networks, Communication, and network security.

    Zhonghai Lu received the B.Sc. degree in Radio & Electronics from Beijing Normal University, Beijing, China, in 1989, the M.Sc. degree in System-on-Chip Design and the Ph.D. degree in Electronic and Computer Systems Design from KTH Royal Institute of Technology, Stockholm, Sweden, in 2002 and 2007, respectively. From 1989 to 2000, he was an Engineer in the area of electronic and embedded systems. He is currently an Associate Professor with the Department of Electronic Systems, School for Information and Communication Technology, KTH. His research interests include computer and communication system architectures, cyber-physical systems, and performance analysis. He has published over 100 papers in these areas.
