Graph-based multi-core higher-order time integration of linear autonomous partial differential equations

https://doi.org/10.1016/j.jocs.2021.101349

Highlights

  • A multi-core realization of cache-aware matrix-polynomial time integration has been developed.

  • Parallelization is non-trivial due to dependencies among the subtasks executed in parallel; we represent these dependencies as a graph using the Threading Building Blocks (TBB) library.

  • Results include various cache- and NUMA-domain oriented graph schedulings.

  • Rather than using artificially generated matrices, the results also include a concrete application.

  • Studies are conducted on different architectures.

  • We achieve an up to 60% reduction in wall clock time compared to a conventional, state-of-the-art non-cache-aware approach.

Abstract

Modern high-performance computing (HPC) systems rely on increasingly complex nodes with a steadily growing number of cores and matching deep memory hierarchies. To fully exploit such nodes, algorithms must be explicitly designed for these features. In this work we address this challenge for a widely used class of application kernels: polynomial-based time integration of linear autonomous partial differential equations.

We build on prior work [1] of a cache-aware, yet sequential solution and provide an innovative way to parallelize it, while addressing cache-awareness across a large number of cores. For this, we introduce a dependency graph driven view of the algorithm and then use both static graph partitioning and dynamic scheduling to efficiently map the execution to the underlying platform. We implement our approach on top of the widely available Intel Threading Building Blocks (TBB) library, although the concepts are programming model agnostic and can apply to any task-driven parallel programming approach.

We demonstrate the performance of our approach for a 2nd, 4th and 6th order time integration of the linear advection equation on three different architectures with widely varying memory systems and achieve an up to 60% reduction of wall clock time compared to a conventional, state-of-the-art non-cache-aware approach.

Introduction

Solving time-dependent partial differential equations (PDEs) is one of the most common application kernels on high-performance computing (HPC) systems. The nodes used in these systems, however, feature increasingly higher core counts as well as increasingly complex and deep memory hierarchies in order to keep those cores fed with data. This has two major consequences for software intended to exploit these systems: (1) the need to explicitly address and exploit the memory hierarchy, and (2) the need to expose sufficient parallelism to utilize the cores in these highly parallel systems. To address these challenges, new mathematical concepts and optimized implementations on HPC systems are needed, in particular for central kernels such as the time integration of PDEs.

One particularly promising and recent approach targets new methods working along the time dimension for linear autonomous PDEs. Such PDEs occur frequently, e.g., in full waveform inversion problems [2] or as part of splitting methods that treat non-linear parts separately [3], and are given in the form $\frac{\partial U(t)}{\partial t} = \mathcal{L}\,U(t)$, with $\mathcal{L}$ the autonomous linear operator and $U(t)$ denoting the spatial solution at time $t$. After a discretization in space, yielding a matrix $L$ and a vector $U(t)$, the main computational work of an explicit time integration method is given by multiple sparse matrix-vector multiplications (SpMv) of the form $L^i U$, $i \in \mathbb{N}_0^+$, for particular discretization methods.
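
For concreteness, one standard way such a polynomial integrator arises is by truncating the Taylor series of the matrix exponential that solves the semi-discrete system exactly (a generic sketch; the paper's particular polynomial formulation is given in Section 3):

$$U(t + \Delta t) = e^{\Delta t\, L}\, U(t) \approx \sum_{i=0}^{N} \frac{(\Delta t)^i}{i!}\, L^i\, U(t)$$

An order-$N$ step thus reduces to $N$ successive SpMvs $v_i = L\, v_{i-1}$ with $v_0 = U(t)$.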

In our prior work [1], we introduced a cache-blocking approach across the time dimension (polynomial terms), which provides substantial performance gains over a non-cache-aware approach, based on a sequential implementation. This approach extends ideas from traditional cache-blocking for SpMv and temporal tiling for stencil computations to achieve cache-blocking over the multiple successive SpMvs occurring in matrix-polynomial evaluation. While this addresses the first of the challenges noted above, the second (the need to exploit all cores in such a system) is not yet addressed.
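
To make these data dependencies concrete, the following minimal sketch (our illustration, not the paper's code; the CSR structure and names are hypothetical) computes which rows of $v_{i-1}$ a given row block of $v_i = L\, v_{i-1}$ depends on, i.e., the footprint that any cache-blocking across polynomial terms must respect:

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Minimal CSR matrix structure (illustrative names only).
struct Csr {
    std::vector<std::size_t> row_ptr; // size n_rows + 1
    std::vector<std::size_t> col_idx; // column index of each nonzero
};

// Rows of v_{i-1} needed to compute rows [first, last) of v_i = L * v_{i-1}:
// the union of the column indices appearing in those rows of L.
std::set<std::size_t> footprint(const Csr& L, std::size_t first, std::size_t last) {
    std::set<std::size_t> needed;
    for (std::size_t r = first; r < last; ++r)
        for (std::size_t nz = L.row_ptr[r]; nz < L.row_ptr[r + 1]; ++nz)
            needed.insert(L.col_idx[nz]);
    return needed;
}
```

Applying this recursively shows that a fixed block of $v_i$ depends on a growing region of $v_0$; for banded matrices the growth per term equals the bandwidth, which is what keeps a block-wise traversal of the polynomial terms feasible.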

In this work, we extend our cache-aware scheme to exploit multiple cores on a single socket or node, adapting the cache-awareness to this parallel setup and addressing the resulting challenges around data dependencies. We approach this by using an explicit graph representation of all data dependencies, rather than relying on more common, yet static, SPMD (Single Program Multiple Data) approaches. We use this graph formulation, in combination with graph partitioning approaches, to drive a dynamic parallel execution schedule that enforces only the actual algorithmic dependencies while at the same time achieving a suitable load balance.

One of the key challenges with this approach is the efficient execution of such a set of tasks connected by dependency graphs. This requires an efficient and low-overhead scheduler in the form of a shared memory runtime system. In this work, we rely on Intel's Threading Building Blocks [4], which provides such an abstraction, but the results are generally applicable to other tasking runtimes as well, such as OpenMP superscalar [5], Legion [6], or Quark [7].
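
To give a flavor of the programming model, the following self-contained sketch expresses such a dependency graph with TBB's flow graph interface. The per-(term, block) task layout and the banded three-neighbor dependency pattern are simplifying assumptions for illustration; the actual implementation derives its edges from the matrix structure and partitioning:

```cpp
#include <tbb/flow_graph.h>
#include <cstdio>
#include <memory>
#include <vector>

int main() {
    using namespace tbb::flow;
    using node_t = continue_node<continue_msg>;
    const int terms = 3, blocks = 4; // illustrative sizes

    graph g;
    // One task per (polynomial term i, row block b); in the real algorithm
    // the body computes block b of v_{i+1} = L * v_i instead of printing.
    std::vector<std::vector<std::unique_ptr<node_t>>> task(terms);
    for (int i = 0; i < terms; ++i)
        for (int b = 0; b < blocks; ++b)
            task[i].push_back(std::make_unique<node_t>(g,
                [i, b](const continue_msg&) {
                    std::printf("block %d of term %d\n", b, i + 1);
                }));

    // Edges encode the data dependencies. Assuming a banded matrix, block b
    // of term i needs blocks b-1, b, b+1 of term i-1; a general sparse L
    // yields edges derived from its actual nonzero pattern.
    for (int i = 1; i < terms; ++i)
        for (int b = 0; b < blocks; ++b)
            for (int s = b - 1; s <= b + 1; ++s)
                if (s >= 0 && s < blocks)
                    make_edge(*task[i - 1][s], *task[i][b]);

    // Start the first term; the TBB scheduler then runs each remaining task
    // as soon as all of its predecessors have signaled completion.
    for (int b = 0; b < blocks; ++b)
        task[0][b]->try_put(continue_msg());
    g.wait_for_all();
    return 0;
}
```

Each `continue_node` fires only after all of its predecessors have completed, so the runtime enforces exactly the algorithmic dependencies while remaining free to choose any compatible execution order.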

In particular, we make the following contributions.

  • We provide a dependency graph formulation of a sequential cache-aware matrix exponentiation algorithm.

  • Combining static graph partitioning with dynamic task scheduling, we use the dependency graph to derive a parallel implementation of the approach.

  • We implement our approach using Intel's Threading Building Blocks, making use of various features of this library.

  • We evaluate our implementation using both a synthetic test suite and a realistic use case on three different platforms from different vendors with widely varying hardware characteristics.

Our results show, in a realistic simulation setting for a matrix-polynomial-based time integration of the linear advection equation, significant reductions in wall clock time of up to 60% compared to the standard, non-cache-aware approach.

Overall, with our work, we enable users to further exploit the potential of the cache hierarchy as well as all available computational cores of modern parallel processor architectures, a critical precondition to efficient usage of large-scale HPC resources.

The remainder of this work is organized as follows. Related work is reviewed in Section 2. In Section 3 we provide an application example and the particular form of time integration used in this work. The sequential algorithm on which our work is based is briefly described in Section 4. This is followed by the parallelization using a graph-based formulation, a graph partitioning approach and the realization with Intel's TBB in Section 5. In Section 6 we describe our experimental setup, and in Sections 7 and 8 we present the results for the synthetic test suite and the real-world advection use case, respectively. We conclude with a brief discussion of limitations in Section 9 and concluding remarks in Section 10.

The intention of this paper is not to provide an out-of-the-box tool but to explore the potential of cache-aware graph algorithms for matrix-polynomial evaluations. However, for the sake of reproducibility, we have made the source code available [8].

Section snippets

Related work

In recent years the computational performance of modern computer architectures has grown much faster than the latency and bandwidth of memory accesses, leading to the so-called memory wall [9]. This has amplified research on cache-efficient techniques to overcome this bottleneck, especially in the field of PDE solvers. Moreover, the increasing degree of hardware parallelism has raised interest in using graph theory to improve load balancing and locality in parallel applications. In the

Advection equation and polynomial-based time integration

We start by introducing the application example which we use as a representative case study. Based on this, a discretization in space is briefly discussed, followed by the particular formulation of the time integration which forms the basis of our work.
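
For orientation, the constant-coefficient linear advection equation is an instance of the generic form from the introduction (assuming a constant velocity field $\mathbf{a}$; the paper's exact setup follows in this section):

$$\frac{\partial U(t)}{\partial t} = -\,\mathbf{a} \cdot \nabla U(t), \qquad \text{i.e.,} \quad \mathcal{L} = -\,\mathbf{a} \cdot \nabla$$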

Cache-aware matrix polynomial evaluation

This section briefly recaps the preceding work on a single-core implementation, see [1]. As our parallelization shares many concepts with this earlier work, we briefly summarize the principle of cache-aware matrix polynomials and, in particular, one implementation, the Backward Blocking Method. Along the way we already hint at possible starting points for the parallelization.

Cache-aware parallelization based on graphs

In this section we present our parallel approach for cache-aware matrix polynomials which makes use of the shared cache(s) of modern HPC architectures.

First we give a high-level description of the concept of explicit task-based dependency graphs for parallel, cache-aware matrix-polynomial evaluations in Section 5.1. For readers mainly interested in the general graph-theoretical perspective on parallel matrix polynomials, reading Sections 5.1.1 and 5.1.2 should provide a sufficient understanding

Evaluation setup

To study the effects of different memory architectures we evaluate our approaches on three systems, which offer quite different memory systems and which we name according to their processor type: Intel XeonBronze (Intel Skylake), Intel(R) Core(TM) i5-8265U (Intel Whiskey Lake) and AMD Ryzen Threadripper 2990WX (AMD Zen+ architecture). An overview of the system specifications is given in Table 1. Note that the AMD Ryzen Threadripper 2990WX memory system is divided into four NUMA domains with

Results on a synthetic benchmark

To evaluate the general performance characteristics of our implementation, we first run a synthetic benchmark, M4v, with a wide range of different input test cases, which we refer to as the Matrix Test Suite. We created this benchmark and test suite to highlight the effects of graph partitioning and execution order as well as the influence of different memory hierarchies.

Results using the linear advection use case

To investigate the impact of our work in a realistic setting and use case, we implemented it as part of a polynomial time integration for the linear advection equation.

Discussion: limitations and future work

While the above results are, in many cases, very encouraging, they also clearly lag behind expectations in others, especially in terms of scalability on NUMA systems like the AMD and in handling large matrices of 3D grids.

In the following, we will discuss potential sources of these observations and briefly point out possible future work towards further optimizations, whose realization goes beyond the scope of this work. In Huber et al. [1], we already determined that the success of our approach

Conclusions

In this paper we investigated the potential of parallel cache-aware algorithms to reduce the wall clock time of higher-order time integration based on matrix-polynomials for linear autonomous PDEs.

We developed a parallel algorithm using an explicit task graph approach and used this to extract and enforce only the algorithmically needed dependencies, leading to a maximal amount of parallelism. We then used static graph partitioning combined with the Backward Blocking Method (BBM) [1] traversal

Authors’ contribution

Dominik Huber: conceptualization, methodology, software, investigation, data curation, writing – original draft, writing – review & editing, visualization. Martin Schreiber: conceptualization, methodology, software, data curation, writing – original draft, writing – review & editing, supervision, project administration, funding acquisition. Martin Schulz: resources, data curation, writing – review & editing, funding acquisition.

Conflict of interest

None declared.

Declaration of Competing Interest

The authors report no declarations of interest.

Acknowledgements

Martin Schreiber gratefully acknowledges KONWIHR funding as part of the project “Parallel in Time Integration with Rational Approximations targeting Weather and Climate Simulations” (https://www.konwihr.de/konwihr-projects/parallel-in-time-integration-with-rational-approximations-targeting-weather-and-climate-simulations/).


References (19)

  • C. Yount et al., Multi-level spatial and temporal tiling for efficient HPC stencil computation on many-core processors with large shared caches, Future Gener. Comput. Syst. (2019).

  • D. Huber et al., Cache-aware matrix polynomials.

  • J. Virieux et al., An introduction to full waveform inversion, Encyclopedia of Exploration Geophysics (2014).

  • D. Ruprecht et al., Spectral deferred corrections with fast-wave slow-wave splitting, SIAM J. Sci. Comput. (2016).

  • M. Voss et al., Pro TBB: C++ Parallel Programming with Threading Building Blocks (2019).

  • J.M. Perez et al., A Flexible and Portable Programming Model for SMP and Multi-Cores, Tech. Rep. 3 (2007).

  • M. Bauer et al., Legion: expressing locality and independence with logical regions, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC'12 (2012).

  • A. YarKhan, Dynamic Task Execution on Shared and Distributed Memory Architectures (Ph.D. thesis) (2012).

  • D. Huber, M. Schreiber, source code, accessed 2020-12-27.
There are more references available in the full text version of this article.

Dominik Huber is currently conducting his Master's studies at the Technical University of Munich. He received his Bachelor's degree in 2019 and researches and develops new hardware-aware algorithms for time integration methods on high-performance computing systems targeting the enhancement of computational fluid dynamics.

Martin Schreiber is a senior lecturer and researcher at the Technical University of Munich (TUM). He received his doctoral degree from TUM in 2014, was appointed proleptic lecturer at the University of Exeter in 2015 and returned to TUM in 2018. His research areas are often highly interdisciplinary and connect applied mathematics and high-performance computing. His major research impacts have been so far on dynamically adaptive meshes and novel time integration methods targeting climate and weather simulations (https://www.martin-schreiber.info/).

Martin Schulz is a Full Professor and holds the Chair for Computer Architecture and Parallel Systems at the Technical University of Munich (TUM), which he joined in 2017. He earned his Doctorate in Computer Science in 2001 from TUM. Martin has published over 200 peer-reviewed papers and currently serves as the chair of the MPI Forum, the standardization body for the Message Passing Interface. Martin was a recipient of the IEEE/ACM Gordon Bell Award in 2006 and an R&D 100 award in 2011.
