
Future Generation Computer Systems

Volume 30, January 2014, Pages 229-241

Improving execution unit occupancy on SMT-based processors through hardware-aware thread scheduling

https://doi.org/10.1016/j.future.2013.06.015

Highlights

  • We present WorkOver to improve thread-scheduling for better performance.

  • We use performance counters to profile integer- and floating-point threads.

  • Threads are scheduled according to hardware execution unit availability.

  • WorkOver optimizes unit occupancy on AMD Bulldozer and IBM POWER7 processors.

  • We measured up to 20% speedup using SPEC CPU and SciMark 2.0.

Abstract

Modern processor architectures are increasingly complex and heterogeneous, often requiring software solutions tailored to the specific hardware characteristics of each processor model. In this article, we address this problem by targeting two processors featuring Simultaneous MultiThreading (SMT) to improve the occupancy of their internal execution units through a sustained stream of instructions coming from more than one thread. We target the AMD Bulldozer and IBM POWER7 processors as case studies for specific hardware-oriented performance optimizations that increase the variety of instructions sent to each core to maximize the occupancy of all its execution units. WorkOver, presented in this article, increases the performance of floating point-intensive workloads on Linux-based operating systems by improving thread scheduling. WorkOver is a user-space monitoring tool that automatically identifies FPU-intensive threads and schedules them in a more efficient way without requiring any patches or modifications at the kernel level. Our measurements using standard benchmark suites show that speedups of up to 20% can be achieved by simply allowing WorkOver to monitor applications and schedule their threads, without any modification of the workload.

Introduction

Since the power wall [1] prevents hardware manufacturers from increasing the processor's clock frequency, modern CPUs embed several cores to increase computational power through parallelism. Recent trends show that hardware manufacturers increasingly favor asymmetric and heterogeneous designs over symmetric and homogeneous ones. Indeed, current state-of-the-art processors have very complex architectures featuring multiple internal components, such as multiple cache levels shared among different cores, Non-Uniform Memory Access (NUMA) [2] controllers and hyperlinks, Simultaneous MultiThreading (SMT) support with several Processing Units (PUs) per core, or ad hoc dedicated units. As a consequence, it is increasingly difficult for software developers to fully exploit the underlying hardware's computational power, as optimal software configurations vary with the hardware platform, the application's software architecture, and the type of workload.

The Operating System (OS) kernel and scheduler try to optimize the performance of applications depending on the available hardware resources. To this end, OS schedulers rely on a limited set of performance indicators (such as the number of cores, CPU time, and memory usage) to drive their optimization strategies. This is often not enough for multithreaded applications running on modern systems, where the complexity and specific characteristics of the underlying hardware architecture require additional information to achieve good runtime performance through efficient scheduling.

As a case study, in this article we focus on two of these modern architectures and present a specific, hardware-aware optimization tool based on (1) an automated workload analysis technique relying on a set of performance metrics that are currently not used by common OS schedulers, and (2) a hardware-aware scheduler that makes scheduling decisions based on monitored hardware resource usage. Our goal is to use a controller-based approach to profile the workload of multithreaded and multi-process applications and to improve the efficiency with which they share heterogeneous resources.

We focus on two modern micro-architectures that implement very different SMT solutions: the AMD Bulldozer and IBM POWER7 processors. These architectures are good representatives of modern hardware platforms with specific characteristics that cannot easily be exploited by non-hardware-aware approaches. One of the peculiar characteristics of the Bulldozer architecture is its asymmetric SMT implementation for integer and floating point units: each Floating Point processing Unit (FPU) is shared by the two PUs within the same core, so two threads may contend for the same FPU (while integer units are available on a per-PU basis). The IBM POWER7 architecture is based on a more aggressive implementation of SMT, where instructions coming from up to four threads can be scheduled simultaneously to improve the occupancy of the available execution units on each core. Since each core features two integer and four floating point units, only a proper scheduling of integer- and floating point-intensive threads can take advantage of this improved SMT; otherwise, these hardware layouts can have a negative impact on the performance of FPU-intensive workloads.
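To make the consequences of these layouts concrete, the sketch below (in C++, the language WorkOver is written in) illustrates the general placement idea: threads already classified as FPU-intensive are spread across distinct cores before any PUs that share a core, and hence its FPU on Bulldozer, are reused. The ThreadInfo structure, the place function, and the heuristic itself are illustrative assumptions for this example and do not reproduce WorkOver's actual policy.

    // Illustrative sketch only: spread FPU-intensive threads across cores first,
    // so that no two of them end up on PUs sharing the same FPU.
    // Thread classification and the core/PU lists are assumed to come from a profiler
    // and a topology library; all names here are hypothetical.
    #include <cstddef>
    #include <utility>
    #include <vector>

    struct ThreadInfo {
        int tid;        // OS thread identifier
        bool fpu_heavy; // classified as floating point-intensive by the profiler
    };

    // Each core is described by the list of PU (logical CPU) identifiers it contains.
    using Core = std::vector<int>;

    // Returns (tid, pu) bindings: FPU-heavy threads get the first PU of distinct
    // cores; the remaining threads share the leftover PUs round-robin.
    std::vector<std::pair<int, int>>
    place(const std::vector<ThreadInfo>& threads, const std::vector<Core>& cores) {
        std::vector<std::pair<int, int>> plan;   // (tid, pu) bindings
        std::vector<int> spare_pus;              // PUs not claimed in the first pass
        std::vector<ThreadInfo> others;          // threads placed in a second pass
        std::size_t next_core = 0;

        // First pass: one FPU-heavy thread per core, on that core's first PU.
        for (const auto& t : threads) {
            if (t.fpu_heavy && next_core < cores.size()) {
                plan.emplace_back(t.tid, cores[next_core].front());
                spare_pus.insert(spare_pus.end(),
                                 cores[next_core].begin() + 1, cores[next_core].end());
                ++next_core;
            } else {
                others.push_back(t);
            }
        }
        // PUs of cores that received no FPU-heavy thread are also spare.
        for (std::size_t c = next_core; c < cores.size(); ++c)
            spare_pus.insert(spare_pus.end(), cores[c].begin(), cores[c].end());

        // Second pass: all remaining threads share the spare PUs round-robin.
        for (std::size_t i = 0; i < others.size() && !spare_pus.empty(); ++i)
            plan.emplace_back(others[i].tid, spare_pus[i % spare_pus.size()]);

        return plan;
    }

On a Bulldozer module this keeps at most one FPU-intensive thread per shared FPU; on POWER7 it leaves the remaining PUs of each core to integer-intensive threads, which can occupy the otherwise idle integer units.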

Our approach, named WorkOver (after Workload Overseer), is a Linux daemon that interacts with the OS scheduler to improve the thread scheduling of floating point-intensive workloads on SMT processors by taking into account how hardware execution units are organized into cores and PUs.

WorkOver runs in user-space and is based on performance metrics commonly available without any modification of the OS kernel or the monitored applications. Our workload profiling approach relies on hardware performance counters to detect which threads perform floating point-intensive computations. Our performance optimization is based on improved thread scheduling: the most FPU-intensive threads are pinned to PUs of different cores to reduce contention on shared execution units. In this way, WorkOver provides a transparent bottom-up optimization mechanism, based on (1) automatic workload profiling at runtime through performance counters and (2) hardware-aware dynamic allocation of resources. No further intervention is required: neither the running application (the workload) nor the OS scheduler needs to be modified. The tool is a system-wide user-mode daemon that collects information and applies optimization policies to the threads spawned by applications (processes) that have been started with a special command.
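WorkOver's own implementation is not reproduced on this page; the following sketch only shows, under stated assumptions, the two standard Linux mechanisms such a user-space monitor can build on: per-thread hardware counters opened through the perf_event_open system call and thread pinning through sched_setaffinity. The selected event (PERF_COUNT_HW_INSTRUCTIONS) is a placeholder, since counting floating point operations requires model-specific raw event codes on both Bulldozer and POWER7, and the helper names are hypothetical.

    // Minimal sketch of the Linux building blocks a user-space monitor can use:
    // per-thread hardware counters (perf_event_open) and thread pinning
    // (sched_setaffinity). Error handling is reduced to the essentials.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sched.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstring>

    // Open a counting event attached to thread `tid`, measured on any CPU.
    // NOTE: PERF_COUNT_HW_INSTRUCTIONS is only a placeholder; FPU-operation
    // counters are model-specific raw events (PERF_TYPE_RAW).
    static int open_counter(pid_t tid) {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        return static_cast<int>(
            syscall(SYS_perf_event_open, &attr, tid, /*cpu=*/-1, /*group=*/-1, 0));
    }

    // Read the current counter value; returns 0 on error.
    static std::uint64_t read_counter(int fd) {
        std::uint64_t value = 0;
        if (read(fd, &value, sizeof(value)) != (ssize_t)sizeof(value)) return 0;
        return value;
    }

    // Pin a single thread (on Linux, `tid` as returned by gettid) to one PU.
    static int pin_thread(pid_t tid, int pu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(pu, &set);
        return sched_setaffinity(tid, sizeof(set), &set);
    }

A monitor built on these calls would periodically sample the counter of every thread of the managed processes, rank the threads by floating point intensity, and pin them according to a placement plan such as the one sketched above.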

This article extends our work presented in  [3] by generalizing the approach from a specific CPU model to generic SMT processors and by using two completely different hardware architectures and OSs to validate our generalized approach.

Section snippets

Motivation and approach

Many scientific applications make heavy use of floating point-intensive computations. Consider a scenario in which a multithreaded application performs floating point-intensive computations with variable intensity in all or a subset of its threads. A common OS scheduler would assign FPU-intensive threads to the available SMT units for execution, as it would do for any other application. The scheduler takes metrics such as CPU time consumption into account. However, prevailing schedulers

AMD Bulldozer processors

This section describes in detail the internal architecture of the AMD Bulldozer processor family.

The processor embeds multiple cores (referred to as processing modules in the AMD documentation). Each core features a front-end to fetch and decode instructions, caches (a larger L3 cache is shared by all the modules being part of the same CPU), a branch prediction unit, out-of-order instruction schedulers, and integer and

Requirements and dependencies

To verify and translate our ideas into practice, we extended our Java low-level monitoring library (described in [14]) into WorkOver, a tool for Linux-based systems written in C++. WorkOver uses hwloc for inspecting the underlying hardware configuration and for dynamically extracting information on its topology to enable automatic hardware-awareness. Additional information not provided by hwloc about the number and
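As an illustration of the kind of topology query hwloc makes possible (and not of WorkOver's actual code), the following sketch lists, for each core, the PUs it contains, i.e. the logical processors that share that core's execution units:

    // List each core and the PUs it contains, using the hwloc C API from C++.
    #include <hwloc.h>
    #include <cstdio>

    int main() {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        int n_cores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        for (int c = 0; c < n_cores; ++c) {
            hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, c);
            std::printf("core %d: PUs", c);
            // Iterate over the PUs contained in this core's cpuset.
            hwloc_obj_t pu = nullptr;
            while ((pu = hwloc_get_next_obj_inside_cpuset_by_type(
                        topo, core->cpuset, HWLOC_OBJ_PU, pu)) != nullptr)
                std::printf(" %u", pu->os_index);
            std::printf("\n");
        }
        hwloc_topology_destroy(topo);
        return 0;
    }

The os_index values printed here are the logical CPU numbers that can be passed to affinity calls such as sched_setaffinity or to hwloc's own binding functions.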

Testing environment

Experiments are performed on two different machines. The first one (henceforth referred to as AMD-Bull) is a 4 CPU Dell PowerEdge M915 with 128 GB of RAM. Each CPU is an AMD 6282SE 2.6 GHz processor with 8 cores including 2 PUs each, for a total of 16 PUs. This machine features 8 NUMA nodes with 2 nodes per CPU.

Related work

Hardware performance counters are a widely used instrument for real-time profiling of different computational workloads. Counters have been used for memory optimization [18], for the identification of hardware characteristics [19], for application characterization [20], for security [21], for data-race detection [22], etc.

The idea of exploiting counters for the development of hardware-aware scheduling policies has already been discussed in related research: in [23], for instance,

Conclusion and future work

Modern micro-architectures are increasingly complex and heterogeneous, with a growing adoption of SMT and out-of-order execution to provide a sustained stream of instructions that keeps all the available processor execution units busy. In this article, we present a case study for performance optimizations targeting shared hardware resources such as the ones found on the AMD Bulldozer and IBM POWER7 processors. In our experiments we show that a scheduler not aware of the underlying hardware

Acknowledgments

The research presented in this article has been supported by the Swiss National Science Foundation (Sinergia project CRSI22_127386) and by the European Commission (Seventh Framework Programme grant 287746).


References (36)

  • N. Min-Allah et al., Power efficient rate monotonic scheduling for multi-core systems, J. Parallel Distrib. Comput. (2012)

  • M.W. Krentel, Libmonitor: a tool for first-party monitoring, Parallel Comput. (2013)

  • D. Patterson, The trouble with multi-core, IEEE Spectr. (2010)

  • M. Herlihy et al., The Art of Multiprocessor Programming (2008)

  • A. Peternier, D. Ansaloni, D. Bonetta, C. Pautasso, W. Binder, Hardware-aware thread scheduling: the case of asymmetric...

  • Z. Majo et al., Matching memory access patterns and data placement for NUMA systems

  • D. Durand et al., Impact of memory contention on dynamic scheduling on NUMA multiprocessors, IEEE Trans. Parallel Distrib. Syst. (1996)

  • Z. Majo, T.R. Gross, A template library to integrate thread scheduling and locality management for NUMA...

  • P.J. Nistler et al., Power efficient scheduling for hard real-time systems on a multiprocessor platform

  • J.-J. Chen, Multiprocessor energy-efficient scheduling for real-time tasks with different power characteristics

  • X. Zhao et al., Fine-grained per-core frequency scheduling for power-efficient multicore execution

  • G. Anselmi et al., IBM POWER 750 and 755 Technical Overview and Introduction, REDP-4638-00 (2010)

  • J. Abeles et al., Performance Guide for HPC Applications on IBM POWER 755 (2010)

  • J. Du, N. Sehrawat, W. Zwaenepoel, Performance profiling of virtual machines, in: Proc. of the 7th ACM SIGPLAN/SIGOPS...

  • A. Peternier, D. Bonetta, W. Binder, C. Pautasso, Overseer: low-level hardware monitoring and management for Java, in:...

  • C. Su, D. Li, D. Nikolopoulos, M. Grove, K.W. Cameron, B.R. de Supinski, Critical path-based thread placement for NUMA...

  • S. Blagodurov, S. Zhuravlev, M. Dashti, A. Fedorova, A case for NUMA-aware contention management on multicore systems,...

  • M.M. Tikir, J.K. Hollingsworth, Using hardware counters to automatically improve memory performance, in: Proc. of the...

Achille Peternier is a Post-Doctoral researcher at the Faculty of Informatics at the University of Lugano, Switzerland. He obtained his diploma in IT-Sciences and Mathematical Methods (IMM) at the University of Lausanne (UNIL) and completed his Ph.D. at the Ecole Polytechnique Fédérale de Lausanne (EPFL). He is the author of several scientific publications on topics such as Multicore, Hardware Performance Tuning, Computer Graphics, Virtual Reality, and Web Services. More information at http://www.peternier.com.

Danilo Ansaloni is a Ph.D. candidate at the Faculty of Informatics, University of Lugano, Switzerland, where he is working with Prof. Walter Binder. His research interests include dynamic program analysis, parallel computing, and programming languages.

Daniele Bonetta holds a M.Sc. in Computer Science from the University of Pisa, Italy. He is a Ph.D. candidate at the Faculty of Informatics of the University of Lugano (Switzerland), where he works in the team of Prof. Cesare Pautasso. His research interests include high-performance computing, parallel programming, and Web engineering.

Cesare Pautasso is an assistant professor at the Faculty of Informatics at the University of Lugano, Switzerland. Previously he was a researcher at the IBM Zurich Research Lab and a senior researcher at ETH Zurich. He completed his graduate studies with a Ph.D. from ETH Zurich in 2004. His research group focuses on building experimental systems to explore the intersection of model-driven software composition techniques, business process modeling languages, and autonomic/Cloud computing. You can find more information at http://www.pautasso.info and follow him on @pautasso.

Walter Binder is an associate professor at the Faculty of Informatics, University of Lugano, Switzerland. He holds a M.Sc., a Ph.D., and a venia docendi from the Vienna University of Technology, Austria. Before joining the University of Lugano, he was a post-doctoral researcher at the Artificial Intelligence Laboratory, Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland. His main research interests are in the areas of program analysis, virtual machines, parallel programming, and Cloud computing.
