Elsevier

Parallel Computing

Volume 75, July 2018, Pages 11-27

Analyzing performance variation of task schedulers with TaskInsight

https://doi.org/10.1016/j.parco.2018.02.003

Highlights

  • Provide a methodology to quantify the performance behavior of task-based schedulers.

  • Classify data reuse of the tasks through the execution of the application.

  • Distinguish categories of applications sensitive and non-sensitive to the schedulers.

  • Analyze the schedulers based on the effects on private caches (temporal locality).

  • Analyze the schedulers based on the utilization of shared caches (spatial locality).

Abstract

Recent scheduling heuristics for task-based applications have managed to improve performance by taking into account memory-related properties such as data locality and cache sharing. However, there is still a general lack of tools that can provide insights into why, and where, different schedulers result in different memory behavior, and how this is related to the applications’ performance.

To address this we present TaskInsight, a technique to characterize the memory behavior of different task schedulers through the analysis of data reuse across tasks. TaskInsight provides high-level, quantitative information that can be correlated with tasks’ performance variation over time to understand data reuse through the caches due to scheduling choices. TaskInsight can be used to diagnose and identify which scheduling decisions affected performance, when they were taken, and why the performance changed, in both single- and multi-threaded executions.

We demonstrate how TaskInsight can diagnose cases where poor scheduling caused over 40% performance difference on average (and up to 7x slowdowns) across the Montblanc benchmarks due to changes in the tasks’ data reuse through the private and shared caches. This flexible insight is key for optimization in many contexts, including data locality, throughput, memory footprint or even energy efficiency.

Introduction

Scheduling tasks in task-based applications has become significantly more difficult due to overall system complexity and, in particular, deep, shared memory hierarchies. Typical approaches for optimizing scheduling algorithms consist of either providing an interactive visualization of the execution trace [1], [2] or simulating the tasks’ execution to evaluate the overall scheduling policy in a controlled environment [3], [4]. The developer then has to analyze the resulting profiling information, deduce whether the scheduler behaves as expected, and qualitatively compare different schedulers.

Poor scheduling decisions often cause performance variations across tasks of the same type, which makes it hard to identify the root cause from the overall schedule. Scheduling strategies are sometimes implemented to be aware of these variations and ensure good load-balancing. However, understanding the underlying causes of individual tasks’ performance anomalies as well as the snowball effect of the dynamic scheduler is still an open question.

The effects of poor scheduling decisions can be most easily seen in idle execution time due to load imbalance from the inability to prioritize tasks on the critical path or appropriately map tasks to processors. However, scheduler decisions also impact data locality in the cache hierarchy by changing the order of producer and consumer tasks. The result of these decisions is performance variation across tasks, which can only be understood by analyzing how the tasks share data and how the schedule affects that sharing.

Generally, task-based application developers blame this performance degradation on data locality and attempt to characterize their workload based on data reuse without considering the dynamic interaction between the scheduler and the caches [5], [6]. This is simply because there has been no way to date to obtain precise information on how the data was reused through the execution of an application, such as how long it remained in the caches, and how the scheduling decisions influenced this reuse. Without an automatic tool capable of providing insight as to whether and where the scheduler misbehaved, the programmer must rely on intuition to understand and adjust the scheduler for improved performance.

In this paper, we present TaskInsight, a new methodology to characterize, in a quantifiable way, the scheduling process in the context of one of the most important performance-related characteristics: how the schedule affects data reuse between tasks through the cache hierarchy over time. We show how the reuse of data throughout the execution can provide insights into the performance of the scheduler, regardless of whether it is optimized for data locality, bandwidth, memory footprint, etc. Further, TaskInsight can interface directly with the task-based runtime system to provide this information both to the programmer and the scheduler.

Scheduler optimization is a notoriously difficult problem as past decisions affect choices and performance in the future, making it hard to explain performance without a detailed view across the program. Previous work [7] has shown the effects of data reuse distances on performance degradation. Those results were based on aggregated statistics and do not provide the necessary detail to manually (developer) or automatically (runtime system) adjust the schedule to improve performance or locality.

In order to understand the performance of a particular schedule, and thereby the scheduler itself, it is necessary to address three critical questions:

  • (Q1) What scheduling decisions influenced the performance of the execution?

  • (Q2) When did those decisions happen?

  • (Q3) Why did those decisions affect the performance?

Answering these questions is vital for dynamic scheduling strategies that adjust their decisions in real time based on how tasks use the hardware resources. Scheduling decisions need to take into account the individual task performance to optimize the overall application, which is nearly impossible without answers to the above questions. The TaskInsight methodology shows how data reuse between tasks can provide key information for answering these questions, by enabling us to quantify their effects over time, and thereby exposing the interactions between the tasks’ performance and their schedule. We make the following contributions:

  • 1.

    A novel classification of the data of each task based on when the data is used over time. This classification is able to expose different memory behaviors inherent to the schedule.

  • 2.

    A new analysis of schedulers based on their effects on the applications’ temporal locality, by connecting our classification to the measured performance results and statistics from the private caches.

  • 3.

    A new technique to analyze schedulers based on the preservation of spatial locality of the data through time, by linking our classification with performance results and statistics from the shared caches.

We first investigate the impact of scheduling on memory behavior and performance by looking at its effects on a range of benchmark application configurations and schedulers (Section 2). From this study, we distinguish categories for benchmarks that are sensitive and non-sensitive to scheduling by looking at the performance differences (in task cycles, i.e., not including scheduler overhead) and L2/L3 cache misses among the benchmarks.

We then select a representative configuration of the benchmark Cholesky Factorization to present our tool, demonstrating how the overall performance of an application changes when executing with different schedules due to an increase in last-level cache misses (Section 3). This example motivates TaskInsight’s data classification technique, as it enables us to clearly differentiate the schedules in terms of their data reuse patterns, using a data reuse graph as in [8] (Section 4).

With this motivation, we extend the analysis and show how to connect the TaskInsight classification to changes in data reuse, changes in cache misses and changes in performance during the execution: first from the perspective of the private caches (temporal locality on a single-threaded execution, Section 5) and later including the shared caches (spatial locality on a multi-threaded run, Section 6). This complete analysis is then used to demonstrate how TaskInsight enables us to understand other behaviors across the benchmarks and schedulers.

Section snippets

Motivation

It is well known that cache optimization is crucial for performance, but real-world applications expose different sensitivities to changes in memory behavior. Furthermore, task-based applications can vary wildly in their behavior based on several factors such as the size of the input problem (total data), the number of tasks they spawn (chunk size), how they distribute work among those tasks (parallelism), and how many tasks they can run in parallel (dependencies). Each of these factors can

Motivating example

As we have seen in the previous section, different applications expose a variety of sensitivities to scheduling. The primary goal of TaskInsight is to be able to detect why there was a performance difference and when the scheduling decisions that led to that change happened. To understand how we can gain insight into these questions, we begin by looking at single-threaded runs of the Cholesky factorization benchmark with a 32MB input matrix (256 × 256 block size), which enables us to hold one

Through the data-reuse glass

Tasks operate mostly on their own private data sets, but, over time, parts of tasks’ data may be reused by later tasks. This is typical of producer-consumer tasks and tasks which operate on a shared input. As a result, a portion of the data set is shared between multiple tasks. If the scheduler can arrange to execute the tasks close enough together in time, it will increase the chance that the shared data remains in the cache, and thereby improve performance through temporal locality.
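The cross-task data sharing described above can be illustrated with a minimal sketch (the task names, trace format, and function below are hypothetical, not the paper's implementation): each task's accessed cache lines are split into data the task touches first versus data already touched by earlier tasks in the schedule.

```python
# Minimal sketch: classify each task's accessed cache lines as new
# (first touched by this task) or reused (touched by an earlier task
# in the schedule). Illustrative only, not TaskInsight's actual code.

def classify_task_data(task_traces):
    """task_traces: list of (task_id, set_of_cache_lines) in schedule order.
    Returns {task_id: (num_new_lines, num_reused_lines)}."""
    seen = set()          # cache lines touched by any earlier task
    result = {}
    for task_id, lines in task_traces:
        reused = lines & seen        # data shared with earlier tasks
        new = lines - seen           # data first touched by this task
        result[task_id] = (len(new), len(reused))
        seen |= lines
    return result

# Example: a producer touches lines {0,1,2}; a consumer reuses {1,2}.
trace = [("producer", {0, 1, 2}), ("consumer", {1, 2, 3})]
print(classify_task_data(trace))   # {'producer': (3, 0), 'consumer': (1, 2)}
```

Under this view, a schedule that runs the consumer soon after the producer makes the reused lines likely cache hits; delaying it turns the same reuses into misses.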

To

Analyzing performance

The classification described in the previous section allows us to characterize the impact of different schedules on memory behavior in a hardware-agnostic manner. However, comparing the relative differences between these metrics is not enough to predict how they will affect performance. To accomplish this TaskInsight combines the data reuse classification (new, last, etc.) with performance measurements to explain changes in performance due to changes in data access.
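As a rough illustration of combining the classification with measured performance (the task names, thresholds, and helper function are invented for this example), per-task hardware counter data can be joined with the reuse fractions to flag tasks whose slowdown coincides with a drop in reused (cache-warm) data:

```python
# Illustrative sketch (assumed data, not the paper's tool): flag tasks
# that are slow AND see little reused data, i.e. candidates where the
# schedule broke data reuse through the caches.

def flag_slow_low_reuse(reuse_frac, cycles, cycle_thresh, reuse_thresh):
    """reuse_frac: {task: fraction of its data reused from earlier tasks}.
    cycles: {task: measured task cycles from performance counters}.
    Returns tasks exceeding cycle_thresh with reuse below reuse_thresh."""
    return sorted(t for t in reuse_frac
                  if cycles[t] > cycle_thresh and reuse_frac[t] < reuse_thresh)

reuse = {"t0": 0.0, "t1": 0.8, "t2": 0.1}
cyc   = {"t0": 900, "t1": 300, "t2": 1000}
print(flag_slow_low_reuse(reuse, cyc, 500, 0.5))  # ['t0', 't2']
```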

We will first use TaskInsight

Multi-threaded executions

The previous section showed how TaskInsight can quantitatively characterize different schedules with regards to their temporal data locality and data reuse through the caches for single-threaded applications.

When running multi-threaded applications, the complexity of the analysis is significantly increased by multiple schedules executing in parallel across the cores and by the effects of shared caches, which can cause the schedules to interfere with each other. The shared cache effects can

Detecting problems in other benchmarks

In the previous sections we analyzed how tasks’ data reuse changes led to performance variation in the Cholesky factorization. In this section we show how to use the same TaskInsight analysis to characterize the memory behavior of the other benchmarks shown in Fig. 1 and understand problems that caused performance variation. For this we have selected three of the scheduling-sensitive benchmarks: fft (solving a Fast Fourier Transform), reduction (non-trivial computation over vectors and

Implementation

Fig. 14 shows an overview of how TaskInsight combines memory accesses and hardware performance counter information through a profiler, an instrumentation library and an analysis tool. Independent of the compiler’s optimization, TaskInsight profiles the execution of the applications and analyzes the collected results. The memory access profiling tool uses Pin [13] to sample the tasks’ memory accesses. While profiling, an address map is created between each accessed address (at a cacheline
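A minimal sketch of such a cacheline-granularity address map (the 64-byte line size and the representation below are assumptions for illustration, not TaskInsight's actual format): each sampled byte address is collapsed to its cache line, and each line records the ordered tasks that touched it.

```python
# Sketch: map sampled (task, address) accesses to cache lines, keeping
# the order of tasks per line so cross-task reuses can be recovered.
# Hypothetical layout; 64-byte lines assumed.

CACHE_LINE_BYTES = 64

def build_address_map(accesses):
    """accesses: iterable of (task_id, byte_address) in program order.
    Returns {cache_line_number: [task_ids, consecutive repeats collapsed]}."""
    amap = {}
    for task_id, addr in accesses:
        line = addr // CACHE_LINE_BYTES
        tasks = amap.setdefault(line, [])
        if not tasks or tasks[-1] != task_id:
            tasks.append(task_id)
    return amap

# Two tasks touching the same 64-byte line produce a reuse edge A -> B.
acc = [("A", 0x1000), ("A", 0x1008), ("B", 0x1010)]
print(build_address_map(acc))  # {64: ['A', 'B']}
```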

Related work

Previous work has proposed different ways to diagnose scheduling anomalies by either interactively visualizing information [1], [2], [15] or by simulating the task execution in order to provide a deterministic behavior of the scheduler [3], [4] without evaluating the performance behavior as a result of how memory is used. Significant work has been done to study the locality as a metric to characterize the workload of an application [5], [6] without considering the scheduling decisions taken as

Conclusion

In this work we presented TaskInsight, a methodology that provides high-level, quantifiable information that ties task scheduling decisions to how tasks reuse data and the resulting task performance. By combining schedule independent memory access profiling (to classify how data is reused between tasks) and schedule specific hardware performance counter data (to determine performance on a given system) we are able to identify which scheduling decisions impact performance, when they happen, and

Acknowledgments

This work was supported by the Swedish Research Council (grant no. FFL12-0051), the Swedish Foundation for Strategic Research [project FFL12-0051] and carried out within the Linnaeus Centre of Excellence UPMARC, Uppsala Programming for Multicore Architectures Research Center. This paper was also published with the support of the HiPEAC network that received funding from the European Union’s Horizon 2020 research and innovation programme [grant agreement no. 687698].

References (16)

  • A. Drebes et al.

    Interactive visualization of cross-layer performance anomalies in dynamic task-parallel applications and systems

    2016 IEEE International Symposium on Performance Analysis of Systems and Software, Uppsala, Sweden, April 17–19, 2016

    (2016)
  • R. Bell et al.

    ParaProf: a portable, extensible, and scalable tool for parallel performance profile analysis

  • L. Stanisic et al.

    Faithful performance prediction of a dynamic task-based runtime system for heterogeneous multi-core architectures

    Concurr. Comput.

    (2015)
  • K. Chronaki et al.

    Criticality-aware dynamic task scheduling for heterogeneous architectures

  • J. Weinberg et al.

    Quantifying locality in the memory access patterns of HPC applications

    Proceedings of the 2005 ACM/IEEE Conference on Supercomputing

    (2005)
  • R. Cheveresan et al.

    Characteristics of workloads used in high performance and technical computing

  • M. Pericàs et al.

    Analysis of data reuse in task-parallel runtimes

  • G. Ceballos, E. Hagersten, D. Black-Schaffer, Formalizing Data Locality in Task Parallel Applications, Springer...
