Graph-based multi-core higher-order time integration of linear autonomous partial differential equations
Introduction
Solving time-dependent partial differential equations (PDEs) is one of the most common application kernels on high-performance computing (HPC) systems. The nodes used in these systems, however, feature increasingly high core counts as well as increasingly complex and deep memory hierarchies in order to keep those cores fed with data. This has two major consequences for software intended to exploit these systems: (1) the need to explicitly address and exploit the memory hierarchy, and (2) the need to expose sufficient parallelism to utilize the cores in these highly parallel systems. In order to address these challenges, new mathematical concepts and optimized implementations are needed, in particular for central kernels such as the time integration of PDEs.
One particularly promising and recent approach targets new methods working along the time dimension for linear autonomous PDEs. Such PDEs occur frequently, e.g., in full waveform inversion problems [2] or as part of splitting methods that incorporate non-linear parts in a separate way [3], and are given in the form $\partial u(t)/\partial t = \mathcal{L}(u(t))$, with the autonomous linear operator $\mathcal{L}$ and $u(t)$ denoting the spatial solution at time $t$. Performing a discretization in space, resulting in the matrix $L$ and the vector $U(t)$, the main computational work of an explicit time integration method is then given by multiple sparse matrix-vector multiplications (SpMv) of the form $LU$, for particular discretization methods.
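To make this computational pattern concrete, the following is a minimal sketch (our own illustration, not the paper's implementation) of one polynomial time step for the discretized system: a truncated Taylor approximation of $\exp(\Delta t\, L)\,U$, evaluated with one matrix-vector product per polynomial term. The `advection_matrix` helper and all names are hypothetical, and a dense matrix stands in for the sparse operator.

```python
import numpy as np

def advection_matrix(n, dx):
    """Periodic second-order central-difference discretization of -d/dx (unit velocity)."""
    L = np.zeros((n, n))
    for i in range(n):
        L[i, (i - 1) % n] = 1.0 / (2 * dx)
        L[i, (i + 1) % n] = -1.0 / (2 * dx)
    return L

def poly_timestep(L, u, dt, degree):
    """Approximate u(t + dt) = exp(dt*L) u(t) by the truncated Taylor polynomial
    sum_{k=0}^{degree} (dt*L)^k / k! u(t), using one matrix-vector product per term."""
    result = u.copy()
    term = u.copy()
    for k in range(1, degree + 1):
        term = dt * (L @ term) / k   # one (Sp)Mv per polynomial term
        result += term
    return result
```

Each additional polynomial degree costs exactly one further product with $L$; it is this chain of successive products that the cache-aware schemes discussed below reorganize.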
In our prior work [1], we introduced a cache-blocking approach across the time dimension (polynomial terms), which provides substantial performance gains compared to a non-cache-aware approach based on a sequential implementation. This approach extends ideas from traditional cache-blocking for SpMv and temporal tiling for stencil computations to achieve cache-blocking across the multiple successive SpMvs occurring in matrix-polynomial evaluation. While this addresses the first challenge observed above, the second — the need to exploit all cores in such a system — is not yet addressed.
In this work, we extend our cache-aware scheme to exploit multiple cores on a single socket or node, adjusting the cache-awareness to this parallel setup and addressing the resulting challenges concerning data dependencies. We approach this by using an explicit graph representation of all data dependencies, rather than relying on more common, yet static SPMD (Single Program Multiple Data) approaches. We use this graph formulation, in combination with graph partitioning approaches, to drive a dynamic parallel execution schedule that only enforces actual algorithmic dependencies while at the same time achieving a suitable load balance.
One of the key challenges with this approach is the efficient execution of such a set of tasks connected by dependency graphs. This requires an efficient and low-overhead scheduler in the form of a shared memory runtime system. In this work, we rely on Intel's Threading Building Blocks [4], which provides such an abstraction, but the results are generally applicable to other tasking runtimes as well, such as OpenMP superscalar [5], Legion [6], or Quark [7].
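To illustrate what such a dependency-driven runtime does, here is a language-neutral sketch using Python's standard library (not TBB's actual API): a scheduler that submits a task to a thread pool as soon as its last prerequisite completes, so only genuine dependencies serialize execution. All names are our own illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import threading

def run_task_graph(tasks, deps, num_workers=4):
    """Execute `tasks` (name -> callable) respecting `deps` (name -> set of
    prerequisite names). A task is submitted to the pool the moment its last
    prerequisite finishes -- the execution model a tasking runtime provides.
    Assumes the dependency graph is acyclic."""
    remaining = {t: len(deps.get(t, ())) for t in tasks}
    dependents = defaultdict(list)
    for t, ds in deps.items():
        for d in ds:
            dependents[d].append(t)

    lock = threading.Lock()
    done = threading.Event()
    finished = []

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        def run(name):
            tasks[name]()                     # execute the task body
            with lock:                        # then release its dependents
                finished.append(name)
                ready = [s for s in dependents[name]
                         if (remaining.__setitem__(s, remaining[s] - 1)
                             or remaining[s] == 0)]
                if len(finished) == len(tasks):
                    done.set()
            for r in ready:
                pool.submit(run, r)

        for root in [t for t, c in remaining.items() if c == 0]:
            pool.submit(run, root)
        done.wait()
    return finished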
In particular, we make the following contributions.
- We provide a dependency graph formulation of a sequential cache-aware matrix exponentiation algorithm.
- Using static graph partitioning combined with dynamic task scheduling, we use the dependency graph to derive a parallel implementation of the approach.
- We implement our approach using Intel's Threading Building Blocks, exploiting various features of this library.
- We evaluate our implementation using both a synthetic test suite and a realistic use case on three platforms from different vendors with widely varying hardware characteristics.
Our results show, in a realistic simulation setting for a matrix-polynomial based time integration of the linear advection equation, significant reductions of the wall clock time of up to 60% compared to the standard, non-cache-aware approach.
Overall, with our work, we enable users to further exploit the potential of the cache hierarchy as well as all available computational cores of modern parallel processor architectures, a critical precondition to efficient usage of large-scale HPC resources.
The remainder of this work is organized as follows. Related work is discussed in Section 2. In Section 3 we provide an application example and the particular form of time integration used in this work. The sequential algorithm on which our work is based is briefly described in Section 4. This is followed by the parallelization using a graph-based formulation, a graph partitioning approach, and the realization with Intel's TBB in Section 5. In Section 6 we describe our experimental setup, and in Sections 7 and 8 we present the results for the synthetic test suite and the real-world advection use case, respectively. We conclude with a brief discussion of limitations in Section 9 and concluding remarks in Section 10.
The intention of this paper is not to provide an out-of-the-box tool but to explore the potential of cache-aware graph algorithms for matrix-polynomial evaluations. However, for the sake of reproducibility, we have made the source code available [8].
Section snippets
Related work
In recent years, the computational performance of modern computer architectures has grown much faster than the latency and bandwidth of memory accesses, leading to the so-called memory wall [9]. This has intensified research on cache-efficient techniques to overcome this bottleneck, especially in the field of PDE solvers. Moreover, the increasing degree of hardware parallelism has raised interest in using graph theory to improve load balancing and locality in parallel applications. In the …
Advection equation and polynomial-based time integration
We start by introducing our application example which we use as a representative case study. Based on this, a discretization in space is briefly discussed, followed by the particular formulation of the time integration which sets up the basis of our work.
Cache-aware matrix polynomial evaluation
This section briefly recaps the preceding work on a single-core implementation, see [1]. As our parallelization shares many concepts with this earlier work, we briefly summarize the principle of cache-aware matrix polynomials and, in particular, one implementation, the Backward Blocking Method. Along the way, we already hint at possible starting points for the parallelization.
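The Backward Blocking Method itself is detailed in [1]; as a simplified sketch of the underlying idea (our own illustration, not the exact method), consider an operator whose block row $j$ only touches vector blocks $j-1$ to $j+1$. Then the successive SpMvs can be interleaved as a wavefront over blocks, so that all polynomial levels of a block region are computed while that region is cache-resident:

```python
import numpy as np

def blocked_power(L, u, degree, block_size):
    """Compute L^degree @ u with a wavefront over vector blocks, assuming block
    bandwidth 1 (block row j of L only reads vector blocks j-1, j, j+1).
    When input block j is touched, block j-k of level k has all of its
    level-(k-1) dependencies available, so all levels are advanced together."""
    n = len(u)
    nb = n // block_size
    levels = [u.copy()] + [np.zeros(n) for _ in range(degree)]

    def compute_block(k, b):
        rows = slice(b * block_size, (b + 1) * block_size)
        levels[k][rows] = L[rows, :] @ levels[k - 1]

    for j in range(nb + degree):          # diagonal wavefront sweep
        for k in range(1, degree + 1):
            b = j - k
            if 0 <= b < nb:
                compute_block(k, b)
    return levels[degree]
```

The point of the sketch is the traversal order, not the arithmetic: every block of every polynomial level is computed exactly once, but grouped so that the data for one block region is reused across all levels instead of streaming the full vector once per SpMv.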
Cache-aware parallelization based on graphs
In this section we present our parallel approach for cache-aware matrix polynomials which makes use of the shared cache(s) of modern HPC architectures.
First, we give a high-level description of the concept of explicit task-based dependency graphs for parallel, cache-aware matrix-polynomial evaluations in Section 5.1. For readers mainly interested in the general graph-theoretical perspective on parallel matrix polynomials, reading Sections 5.1.1 and 5.1.2 should provide a sufficient understanding …
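As a rough illustration of such a graph formulation (a hypothetical sketch under the same block-bandwidth-1 assumption as above, not the paper's actual construction), the dependency graph can be built with one node per (polynomial level, vector block) pair, and a simple static partition can keep all levels of a block range on one worker; a real implementation would use a graph partitioner such as METIS instead of this contiguous split:

```python
def build_task_graph(num_blocks, degree):
    """Nodes are (level, block); node (k, b) depends on (k-1, b-1..b+1),
    mirroring the access pattern of one blocked SpMv with block bandwidth 1."""
    deps = {}
    for k in range(1, degree + 1):
        for b in range(num_blocks):
            deps[(k, b)] = {(k - 1, nb) for nb in (b - 1, b, b + 1)
                            if 0 <= nb < num_blocks}
    return deps

def partition_by_block(deps, num_parts, num_blocks):
    """Static partition: contiguous block ranges are assigned to the same part,
    so all polynomial levels of a block stay with one worker and its cache."""
    return {(k, b): min(b * num_parts // num_blocks, num_parts - 1)
            for (k, b) in deps}
```

Feeding such a graph to a dependency-driven scheduler enforces only the edges that actually exist, so blocks far apart in the domain can be processed fully in parallel.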
Evaluation setup
To study the effects of different memory architectures, we evaluate our approaches on three systems, which offer quite different memory systems and which we name according to their processor type: Intel Xeon Bronze (Intel Skylake), Intel(R) Core(TM) i5-8265U (Intel Whiskey Lake), and AMD Ryzen Threadripper 2990WX (AMD Zen+ architecture). An overview of the system specifications is given in Table 1. Note that the AMD Ryzen Threadripper 2990WX memory system is divided into four NUMA domains with …
Results on a synthetic benchmark
To evaluate the general performance characteristics of our implementation we first run a synthetic benchmark with a wide range of different input test cases, which we refer to as the Matrix Test Suite. We created this benchmark and test suite to highlight the effects of graph partitioning and execution order as well as the influences of different memory hierarchies.
Results using the linear advection use case
To investigate the impact of our work in a realistic setting and use case, we implemented it as part of a polynomial time integration for the linear advection equation.
Discussion: limitations and future work
While the above results are, in many cases, very encouraging, they also clearly lag behind expectations in others, especially in terms of scalability on NUMA systems such as the AMD Threadripper and in handling the large matrices arising from 3D grids.
In the following, we discuss potential sources of these observations and briefly point out possible future work towards further optimizations, whose realization goes beyond the scope of this work. In Huber et al. [1], we already determined that the success of our approach …
Conclusions
In this paper we investigated the potential of parallel cache-aware algorithms to reduce the wall clock time of higher-order time integration based on matrix-polynomials for linear autonomous PDEs.
We developed a parallel algorithm using an explicit task graph approach and used this to extract and enforce only the algorithmically required dependencies, maximizing the available parallelism. We then used static graph partitioning combined with the Backward Blocking Method (BBM) [1] traversal …
Authors’ contribution
Dominik Huber: conceptualization, methodology, software, investigation, data curation, writing – original draft, writing – review & editing, visualization. Martin Schreiber: conceptualization, methodology, software, data curation, writing – original draft, writing – review & editing, supervision, project administration, funding acquisition. Martin Schulz: resources, data curation, writing – review & editing, funding acquisition.
Conflict of interest
None declared.
Declaration of Competing Interest
The authors report no declarations of interest.
Acknowledgements
Martin Schreiber gratefully acknowledges KONWIHR funding as part of the project “Parallel in Time Integration with Rational Approximations targeting Weather and Climate Simulations” (https://www.konwihr.de/konwihr-projects/parallel-in-time-integration-with-rational-approximations-targeting-weather-and-climate-simulations/).
References (19)
- et al., Multi-level spatial and temporal tiling for efficient HPC stencil computation on many-core processors with large shared caches, Future Gen. Comput. Syst. (2019)
- et al., Cache-aware matrix polynomials
- et al., An introduction to full waveform inversion, Encyclopedia of Exploration Geophysics (2014)
- et al., Spectral deferred corrections with fast-wave slow-wave splitting, SIAM J. Sci. Comput. (2016)
- et al., Pro TBB: C++ Parallel Programming with Threading Building Blocks (2019)
- et al., A Flexible and Portable Programming Model for SMP and Multi-Cores, Tech. Rep. 3 (2007)
- et al., Legion: expressing locality and independence with logical regions, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC'12 (2012)
- Dynamic Task Execution on Shared and Distributed Memory Architectures, Ph.D. thesis (2012)
- D. Huber, M. Schreiber, accessed 2020-12-27 [link]
Dominik Huber is currently conducting his Master's studies at the Technical University of Munich. He received his Bachelor's degree in 2019 and researches and develops new hardware-aware algorithms for time integration methods on high-performance computing systems targeting the enhancement of computational fluid dynamics.
Martin Schreiber is a senior lecturer and researcher at the Technical University of Munich (TUM). He received his doctoral degree from TUM in 2014, was appointed proleptic lecturer at the University of Exeter in 2015 and returned to TUM in 2018. His research areas are often highly interdisciplinary and connect applied mathematics and high-performance computing. His major research impacts have been so far on dynamically adaptive meshes and novel time integration methods targeting climate and weather simulations (https://www.martin-schreiber.info/).
Martin Schulz is a Full Professor and holds the Chair for Computer Architecture and Parallel Systems at the Technical University of Munich (TUM), which he joined in 2017. He earned his Doctorate in Computer Science in 2001 from TUM. Martin has published over 200 peer-reviewed papers and currently serves as the chair of the MPI Forum, the standardization body for the Message Passing Interface. Martin was a recipient of the IEEE/ACM Gordon Bell Award in 2006 and an R&D 100 award in 2011.