An efficient and comprehensive scheduler on Asymmetric Multicore Architecture systems

https://doi.org/10.1016/j.sysarc.2013.05.006

Abstract

Several studies have shown that Asymmetric Multicore Processor (AMP) systems, which are composed of processors with different hardware characteristics, deliver better performance and power efficiency than homogeneous systems. With Moore's law still holding, growing core counts also lead to non-uniform memory access (NUMA) designs. Existing schedulers assume that the underlying architecture is homogeneous and may therefore be poorly suited to AMP and NUMA systems: they neither exploit the asymmetry of the hardware nor improve memory utilization by avoiding data starvation among competing processes. In this paper we propose a new scheduler, the NUMA-aware Scheduler, to accommodate the next generation of AMP architectures in terms of both architectural asymmetry and process starvation. Experimental results with the PARSEC benchmarks show an average speedup of 1.36 over the default Linux scheduler, demonstrating that the proposed technique is promising compared with prior studies.

Introduction

With advances in microprocessor technology, accelerated in recent years by the development of multicore, manycore, and embedded systems, processors have evolved to include ever more processing units – hundreds to thousands of cores – on a single die. Such processors are widely exploited in High Performance Computing, where parallel processor architectures are harnessed together with other technologies and techniques to achieve high performance.

The recently introduced Asymmetric Multicore Processor (AMP) system is composed of cores with different characteristics, e.g., clock speed, cache capacity, power consumption, occupied die area, and execution-pipeline complexity, which may or may not share the same Instruction Set Architecture (ISA); the shared-ISA case is also known as a single-ISA heterogeneous multicore [1], [3]. An AMP system typically contains a few powerful, effective cores and a larger number of slower cores with lower power consumption [1], [3]. A common strategy is to use the few powerful out-of-order cores, with higher clock speeds and large caches, for throughput-oriented and single-threaded sequential applications, and the slower but more power-efficient cores for parallel execution. Major manufacturers such as IBM, AMD, and Intel have adopted this idea, combining 32- or 64-bit x86 or Power cores with capable graphics processing units (GPUs) or Synergistic Processor Elements (SPEs) on a single silicon die, e.g., IBM's Cell processor [21], AMD's APU [20], and Intel's Larrabee [19]. Prior studies show that a typical AMP system offers significant energy benefits and occupies a smaller die area while maximizing power efficiency [1], [2], [8]. At the same time, as core counts grow, memory access time becomes variable and depends on the relative location of a processor, which characterizes the system as a Non-Uniform Memory Access (NUMA) architecture [17]. With the rapid growth in the number of cores in computing systems, the volume of memory requests issued by processor cores also increases, aggravating memory starvation.

This limit on concurrent memory accesses decreases the performance of modern multicore systems and can starve several processors at the same time. NUMA systems address this problem by providing separate memory for each processor, which tends to lift performance when several processors attempt to access memory simultaneously. Unfortunately, current OS schedulers assume that the underlying hardware is homogeneous; AMP systems and NUMA architectures are neither considered nor handled jointly. Take the Linux 2.6 Completely Fair Scheduler (CFS) as an example: it manages runnable processes with a red–black tree rather than a run queue per processor. The main idea of CFS is to share processor time fairly among tasks; for instance, in a system with n runnable processes, each should receive 1/n of the processor time within a small scheduling period. Since the cores of an AMP system have different capabilities, 1/n of the time on a fast core and 1/n of the time on a slow core represent completely different amounts of work, so the scheduler should take the AMP architecture into account. In another direction, a NUMA architecture can reduce contention for memory accesses between processes by dividing memory into multiple nodes connected by high-speed interconnects, e.g., Intel's QuickPath Interconnect (QPI) and AMD's HyperTransport (HT). As core counts grow and the resulting NUMA topologies become larger, combining NUMA awareness with adaptive handling of AMP hardware can lower memory resource contention and avoid data starvation, but only if the scheduler also considers the NUMA architecture. Based on this trend, we believe that AMP and NUMA are essential to the next generation of hand-held device architectures. To make the OS work well with AMP and NUMA, we propose a new scheduling policy, the NUMA-aware Scheduler for Asymmetric Multicore Processors, that supports both architectures. The proposed policy has two components. The first is an asymmetry-aware scheduling policy that dynamically places suitable processes on a specific type of core; the second is a NUMA-aware scheduling policy that estimates the current performance degradation caused by resource contention and minimizes it through thread migration and memory management. A user-space sketch of the core-placement idea is given below.
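As a concrete illustration of placement by core type, the following sketch pins a process to an assumed set of fast (or slow) cores through the standard Linux affinity interface. The core IDs and the fast/slow split are illustrative assumptions, not the configuration used in this paper, and the actual scheduler performs this placement inside the kernel rather than from user space.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Illustrative core IDs: which cores are "fast" or "slow" is an assumption,
 * not the configuration used in the paper. */
static const int fast_cores[] = { 0, 1, 2, 3 };
static const int slow_cores[] = { 4, 5, 6, 7 };

/* Pin the calling process to the given list of cores. */
static int pin_to_cores(const int *cores, int n)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < n; i++)
        CPU_SET(cores[i], &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* 0 = current process */
}

int main(int argc, char **argv)
{
    /* A real asymmetry-aware policy would pick the core type from a speedup
     * estimate; here the choice is simply taken from the command line. */
    int slow = (argc > 1 && strcmp(argv[1], "slow") == 0);
    const int *cores = slow ? slow_cores : fast_cores;

    if (pin_to_cores(cores, 4) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pid %d pinned to %s cores\n", (int)getpid(), slow ? "slow" : "fast");
    return 0;
}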

The proposed NUMA-aware Scheduler for AMP is implemented in Linux CentOS release 6.0 and evaluated on an 8-core, 32 GB Dell PowerEdge R910 system. Using performance counters, we independently modulated the CPU frequency as a performance-asymmetry factor and exploited the NUMA memory space to avoid resource contention. Compared with the Linux CFS scheduler on the PARSEC benchmarks, the proposed scheduler improves performance by a factor of 1.36×.

The remainder of this paper is organized as follows. Section 2 presents related work, and Section 3 gives an overview of the NUMA-aware Scheduler for Asymmetric Multicore Processors. Section 4 discusses its design, and Section 5 presents the evaluation. Finally, Section 6 summarizes our findings and offers remarks and topics for future research.

Related work

There are several references in the literature showing the energy benefits of asymmetric multicore architectures [1], [2], [8]. The study in [1] showed that this architecture can achieve a large energy reduction with only a small performance penalty. To accommodate the heterogeneity of Asymmetric Multicore Processors, several works [1], [2], [3], [4], [5], [6], [7], [8], [9] have discussed scheduling algorithms. Some of them considered the load balancing policy and

Proposed scheduler

The purpose of process scheduling is to order independent processes optimally according to given parameters and then execute them. In the proposed NUMA-aware Scheduler for AMP, the scheduling policy ranks processes according to two metrics, the Online AMP Speedup Factor and the Resource Contention Degradation Factor, to determine how well suited they are to run on a certain type of core, faster or slower, or in a certain domain. In order to derive these two metrics for a process, we
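A minimal user-space sketch of this ranking step is shown below. The metric values, the number of fast-core slots, and the contention threshold are placeholders invented for illustration; the snippet only demonstrates how two per-process metrics could drive fast/slow core assignment and NUMA-node separation, not how the paper actually derives the metrics.

#include <stdio.h>
#include <stdlib.h>

struct proc_info {
    int    pid;
    double speedup;      /* Online AMP Speedup Factor (assumed fast vs. slow ratio) */
    double degradation;  /* Resource Contention Degradation Factor (assumed)        */
};

/* Sort by speedup factor, highest first. */
static int by_speedup_desc(const void *a, const void *b)
{
    double sa = ((const struct proc_info *)a)->speedup;
    double sb = ((const struct proc_info *)b)->speedup;
    return (sa < sb) - (sa > sb);
}

int main(void)
{
    /* Placeholder processes and metric values. */
    struct proc_info procs[] = {
        { 101, 1.9, 0.10 }, { 102, 1.2, 0.45 },
        { 103, 1.7, 0.30 }, { 104, 1.1, 0.05 },
    };
    const int nprocs = 4, fast_slots = 2;
    const double degr_limit = 0.4;   /* assumed contention threshold */

    qsort(procs, nprocs, sizeof(procs[0]), by_speedup_desc);

    for (int i = 0; i < nprocs; i++) {
        const char *core = (i < fast_slots) ? "fast" : "slow";
        const char *node = (procs[i].degradation > degr_limit)
                               ? "separate NUMA node" : "shared NUMA node";
        printf("pid %d -> %s core, %s\n", procs[i].pid, core, node);
    }
    return 0;
}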

Evaluation

To evaluate the accuracy of the two proposed metrics as well as the performance of the NUMA-aware AMP Scheduler, we use the PARSEC benchmark suite. The asymmetric architecture was emulated on a Dell PowerEdge R910 server running CentOS Linux release 6.0 (Linux 2.6.32). The frequency of three CPUs was reduced by half, yielding a configuration with 4 faster cores and 12 slower cores that conforms to a typical AMP system and emulates future generations of asymmetric architectures. A sketch of this frequency-capping step is given below.
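Such a frequency reduction can be reproduced on a stock Linux machine through the cpufreq sysfs interface; the sketch below caps an assumed set of 12 cores to half of an assumed 2.0 GHz nominal frequency. The core numbers, the target frequency, and the requirement for root privileges are assumptions about the test machine rather than details taken from the paper.

#include <stdio.h>

/* Write a maximum frequency (in kHz) for one core via the Linux cpufreq
 * sysfs interface; requires root and a cpufreq-enabled kernel. */
static int cap_core_freq(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%ld\n", khz);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Cap cores 4..15 (an assumed layout matching a 4-fast/12-slow split)
     * to 1,000,000 kHz = 1.0 GHz, i.e. half of an assumed 2.0 GHz nominal. */
    for (int cpu = 4; cpu < 16; cpu++)
        cap_core_freq(cpu, 1000000);
    return 0;
}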

Summary and conclusions

An AMP system is a newly introduced class of computing system. In order to allow the operating system to understand the underlying heterogeneous architecture, we proposed an asymmetry-aware scheduling policy. We introduced a new metric, the Online AMP Speedup Factor, to decide which runnable processes should exploit the high efficiency of the fast cores. The Online AMP Speedup Factor took the characteristics of a process and its thread-level parallelism into account, and the experimental results showed the metric was

References (22)

  • Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen, Single-ISA Heterogeneous...
  • Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen, Single-ISA heterogeneous...
  • Tong Li, Dan Baumberger, David A. Koufaty, Scott Hahn, Efficient operating system scheduling for performance-asymmetric...
  • Alexandra Fedorova et al., Maximizing power efficiency with asymmetric multicore systems, Communications of the ACM (2009)
  • Daniel Shelepov et al., HASS: a scheduler for heterogeneous multicore systems, Operating Systems Review (2009)
  • Juan Carlos Saez, Manuel Prieto, Alexandra Fedorova, Sergey Blagodurov, A comprehensive scheduler for asymmetric...
  • Felipe L. Madruga, Henrique C. Freitas, Philippe O.A. Navaux, Parallel shared-memory workloads performance on...
  • Vishal Gupta, Ripal Nathuji, Analyzing performance asymmetric multicore processors for latency sensitive datacenter...
  • Lina Sawalha, Sonya Wolff, Monte P. Tull, Ronald D. Barnes, Phase-guided scheduling on single-ISA heterogeneous...
  • R. Yang, J. Antony, A.P. Rendell, A simple performance model for multithreaded applications executing on non-uniform...
  • Sergey Blagodurov, Sergey Zhuravlev, Alexandra Fedorova, Ali Kamali, A case for NUMA-aware contention management on...
Jiun-Hung Ding received a BS in Industrial Engineering and Management from National Chiao Tung University in 2004, and an MS in Computer Science from National Tsing Hua University in 2006. His research interests include embedded systems, hardware–software codesign, multi-core optimization, and parallel computing.

Ya-Ting Chang received a BS in Mathematics from National Cheng Kung University in 2010, and an MS in Computer Science from National Tsing Hua University in 2012. Her research interests include parallel processing and multi-core embedded systems.

Zhou-dong Guo received a BA in Computer Science and Technology from Zhejiang University in 2011, and is now studying for an MS at National Tsing Hua University. As a student of Professor Chung, he is doing research in the area of system software and embedded systems.

Kuan-Ching Li is currently a Professor in the Department of Computer Science and Information Engineering at Providence University, Taiwan. He received a PhD and an MS in Electrical Engineering and a Licenciatura in Mathematics from the University of Sao Paulo, Brazil. He was a chair in 2009 and has been Special Associate to the University President since 2010. He has served on a number of journal editorial boards and guest editorships, and has held many international conference chairmanship positions as steering committee, advisory committee, general and program chair, and program committee member. His research interests include networked computing, parallel software design, and performance evaluation and benchmarking. He is a Senior Member of the IEEE and a Fellow of the IET.

Yeh-Ching Chung received a BS in Information Engineering from Chung Yuan Christian University in 1983, and an MS and PhD in Computer and Information Science from Syracuse University in 1988 and 1992, respectively. He joined the Department of Information Engineering at Feng Chia University as an Associate Professor in 1992 and became a Full Professor in 1999. From 1998 to 2001, he was Chairman of the Department. In 2002, he joined the Department of Computer Science at National Tsing Hua University as a Full Professor. His research interests include parallel and distributed processing, cluster systems, grid computing, multi-core tool chain design, and multi-core embedded systems. He is a Member of the IEEE Computer Society and the ACM.
