
1 Introduction

Computational science faces a variety of challenges stemming from the massive increase in parallelism in state-of-the-art high-performance computing systems. One important requirement is the development of new and inherently parallel numerical methods. Parallel-in-time integration has been identified as a promising direction of research for the parallelisation of the solution of initial value problems [6].

Probably the most widely studied and used parallel-in-time method is Parareal [13], but see Gander’s overview for a discussion of a variety of other methods [7]. Parareal iterates between an expensive fine integrator run in parallel and a cheap coarse method which runs in serial and propagates corrections forward in time. While the unavoidable serial part limits parallel efficiency according to Amdahl’s law, some of its cost can be hidden by using a so-called pipelined implementation [2, 14]. Pipelining reduces the effective cost of the serial correction step in Parareal and therefore improves speedup. Even more optimisation is possible by using an event-based approach [5], but this requires a suitable execution framework that is not available on all machines.

Pipelining happens automatically when implementing Parareal in MPI, but it is not straightforward in OpenMP, and so far no shared memory version of Parareal with pipelining has been described. Previous studies almost exclusively used MPI to implement Parareal, and the very few using OpenMP considered only the non-pipelined version [11]. However, using shared memory can have advantages, since it avoids, for example, the need to allocate buffers for message passing. The obvious disadvantage is that OpenMP is limited to a single shared memory node. Since convergence of Parareal tends to deteriorate if too many parallel time slices are computed [8], and given the trend towards “fat” compute nodes with large numbers of cores, shared memory Parareal might nevertheless be an attractive choice. It could be useful, e.g., for simulations of power grids and other applications where comparatively small systems of differential(-algebraic) equations have to be solved faster than real-time [12]. This paper introduces an OpenMP-based version of pipelined Parareal and compares it to a standard MPI-based implementation. The implementation relies on standard features like parallelised loops, following a fork-join paradigm, and leaves the investigation of more recent OpenMP features providing task-based parallelism [3] for future work.

2 The Parareal Parallel-in-Time Integration Method

The starting point for Parareal is an initial value problem

$$\begin{aligned} \dot{\mathbf {q}} = \mathbf {f} (\mathbf {q}(t),t), \ \mathbf {q}(0) = \mathbf {q}_{0}, \ t \in [0, T], \end{aligned}$$
(1)

which, in the numerical example below, arises from the spatial discretisation of a PDE (“method-of-lines”) with \(\mathbf {q} \in \mathbb {R}^{N_\mathrm{dof}}\) being a vector containing all degrees-of-freedom. Let \(\mathcal {F}_{\delta t}\) denote a numerical procedure for the approximate solution of (1), for example a Runge-Kutta method. Denote further by \(\mathbf {q} = \mathcal {F}_{\delta t}\left( \mathbf {\tilde{q}}, t_{2}, t_{1} \right) \) the result of approximately integrating (1) forward in time from some starting value \(\mathbf {\tilde{q}}\) at a time \(t_{1}\) to a time \(t_{2} > t_{1}\) using \(\mathcal {F}_{\delta t}\).

To parallelise the numerical integration of (1), decompose [0, T] into time-slices \([t_{p}, t_{p+1}]\), \(p=0, \ldots , P-1\) where P is the number of cores, equal to the number of processes (MPI) or threads (OpenMP). For simplicity, assume here that all time slices have the same length and that the whole interval [0, T] is covered with a single set of time slices so that no restarting or “sliding window” is required [17].
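For illustration, the propagator signature and the uniform decomposition of [0, T] can be written in Fortran as follows. This is a minimal sketch with hypothetical names, not an excerpt from the benchmarked code [16].

  module parareal_setup
    implicit none
    integer, parameter :: dp = kind(1.0d0)

    abstract interface
       ! q2 = prop(q1, t2, t1): integrate (1) approximately from t1 to t2 > t1,
       ! starting from the value q1; stands for either the fine or the coarse method.
       function prop(q1, t2, t1) result(q2)
         import :: dp
         real(dp), intent(in) :: q1(:)
         real(dp), intent(in) :: t2, t1
         real(dp)             :: q2(size(q1))
       end function prop
    end interface

  contains

    ! Boundaries t_p = p*T/P of P time slices of equal length.
    pure function slice_boundaries(t_final, nslices) result(t)
      real(dp), intent(in) :: t_final
      integer,  intent(in) :: nslices
      real(dp)             :: t(0:nslices)
      integer :: p
      do p = 0, nslices
         t(p) = real(p, dp) * t_final / real(nslices, dp)
      end do
    end function slice_boundaries

  end module parareal_setup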

Parareal needs a second time integrator, denoted \(\mathcal {G}_{\Delta t}\) and commonly referred to as the “coarse propagator”, which has to be cheap to compute but can be much less accurate. Parareal begins with a prediction step, computing rough guesses of the starting value \(\mathbf {q}^0_{p}\) for each time slice by running the coarse method once. Here, subscript p indicates an approximation of the solution at time \(t_p\). It then computes the iteration

$$\begin{aligned} \mathbf {q}^{k}_{p+1} = \mathcal {G}_{\Delta t}(\mathbf {q}^{k}_{p}, t_{p+1}, t_{p}) + \mathcal {F}_{\delta t}(\mathbf {q}^{k-1}_{p}, t_{p+1}, t_{p}) - \mathcal {G}_{\Delta t}(\mathbf {q}^{k-1}_{p}, t_{p+1}, t_{p}) \end{aligned}$$
(2)

concurrently on each time slice for \(p=0,\ldots ,P-1\), \(k=1,\ldots ,K\). Because the computationally expensive evaluation of the fine propagator can be parallelised across time slices, iteration (2) can run in less wall clock time than running \(\mathcal {F}_{\delta t}\) serially – provided the coarse method is cheap enough and the number of required iterations K is small.
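To make the structure of the method concrete, the following self-contained Fortran sketch implements the prediction step and iteration (2) for the scalar test equation \(\dot{q} = \lambda q\), with forward Euler used for both propagators. This is an illustration only; the benchmarked code described in Sect. 4 solves (5) with different integrators. Parallelising the loop over the fine propagator with OMP PARALLEL DO already yields the non-pipelined variant discussed below.

  program parareal_serial_sketch
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: nslices = 8, niter = 4, nfine = 100
    real(dp), parameter :: lambda = -1.0_dp, t_final = 1.0_dp
    real(dp) :: q(0:nslices)        ! q(p): current approximation at time t_p
    real(dp) :: qf(0:nslices-1)     ! fine value F(q^{k-1}_p) over slice p
    real(dp) :: qc(0:nslices-1)     ! coarse value G(q^{k-1}_p) over slice p
    real(dp) :: gnew, dts
    integer  :: p, k

    dts  = t_final / nslices
    q(0) = 1.0_dp

    ! Prediction step: one coarse sweep provides the starting values q^0_p.
    do p = 0, nslices-1
       qc(p)  = euler(q(p), dts, 1)       ! G over slice p, kept for iteration 1
       q(p+1) = qc(p)
    end do

    ! Parareal iteration (2)
    do k = 1, niter
       ! Fine propagator: independent across slices, hence parallelisable
       ! (an OMP PARALLEL DO around this loop gives the non-pipelined variant).
       do p = 0, nslices-1
          qf(p) = euler(q(p), dts, nfine)
       end do
       ! Correction sweep, serial in p:
       ! q^k_{p+1} = G(q^k_p) + F(q^{k-1}_p) - G(q^{k-1}_p)
       do p = 0, nslices-1
          gnew   = euler(q(p), dts, 1)
          q(p+1) = gnew + qf(p) - qc(p)
          qc(p)  = gnew
       end do
    end do

    print '(a,es13.5)', 'Parareal solution: ', q(nslices)
    print '(a,es13.5)', 'Exact solution:    ', exp(lambda*t_final)

  contains

    ! Forward Euler with nsteps steps for q' = lambda*q over a slice of length dt.
    pure function euler(q0, dt, nsteps) result(qe)
      real(dp), intent(in) :: q0, dt
      integer,  intent(in) :: nsteps
      real(dp) :: qe
      integer  :: i
      qe = q0
      do i = 1, nsteps
         qe = qe + (dt/nsteps)*lambda*qe
      end do
    end function euler

  end program parareal_serial_sketch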

The expected performance of Parareal can be described by a simple theoretical model [14]. Denote by \(c_\mathrm{c}\) the cost of integrating over one time slice using \(\mathcal {G}_{\Delta t}\) and by \(c_\mathrm{f}\) the cost when using \(\mathcal {F}_{\delta t}\). Because all time slices are assumed to consist of the same number of steps and an explicit method is used here, it can be assumed that \(c_\mathrm{f}\) and \(c_\mathrm{c}\) are identical for all time slices. Neglecting overhead, speedup of Parareal using K iterations against running the fine method in serial is approximately

$$\begin{aligned} s_\mathrm{np}(P) = \frac{ P c_\mathrm{f} }{ \left( 1 + K \right) P c_\mathrm{c} + K c_\mathrm{f}} = \frac{1}{ \left( 1 + K \right) \frac{c_\mathrm{c}}{c_\mathrm{f}} + \frac{K}{P}}. \end{aligned}$$
(3)
Fig. 1. Execution diagram for Parareal without (left) and with (right) pipelining. Red blocks correspond to \(c_{\text {f}}\), the time needed to run the fine integrator \(\mathcal {F}_{\delta t}\) over one time slice \([t_{p}, t_{p+1}]\), while blue blocks correspond to \(c_{\text {c}}\), the time required by the coarse method \(\mathcal {G}_{\Delta t}\). Pipelining allows some of the cost of the coarse method to be hidden. While it comes naturally when using MPI to implement Parareal (there, Thread would refer to a Process), using simple loop-based parallelism with OpenMP results in the non-pipelined version shown on the left. (Color figure online)

It is possible to hide some of the cost of the coarse propagator; the name pipelining has been coined for this approach [14]. Figure 1 sketches the execution diagrams of a non-pipelined (left) and a pipelined (right) implementation for four time slices. As can be seen, pipelining reduces the effective cost of the coarse correction step in each iteration from \(P \times c_{\text {c}}\) to \(c_{\text {c}}\) – note, however, that the initial prediction step still has cost \(P \times c_\mathrm{c}\) as before. For pipelined Parareal, estimate (3) changes to

$$\begin{aligned} s_\mathrm{p}(P) = \frac{P c_\mathrm{f}}{P c_\mathrm{c} + K c_\mathrm{c} + K c_\mathrm{f}} = \frac{1}{\left( 1 + \frac{K}{P} \right) \frac{c_\mathrm{c}}{c_\mathrm{f}} + \frac{K}{P}}. \end{aligned}$$
(4)

Because \(K/P \ll K\), the pipelined version allows for better speedup, that is \(s_\mathrm{np}(P) \le s_\mathrm{p}(P)\). However, because pipelining only hides cost from the coarse integrator, the effect is smaller when the coarse method is very cheap and \(c_\mathrm{c}/c_\mathrm{f} \ll 1\). In that case, the term K/P, which is not affected by pipelining, dominates the estimate.
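For illustration, inserting the values used in the benchmarks of Sect. 4 – a cost ratio of \(c_\mathrm{c}/c_\mathrm{f} \approx 1/40\), \(K=4\) iterations and \(P=24\) time slices – into (3) and (4) gives

$$\begin{aligned} s_\mathrm{np}(24) \approx \frac{1}{5 \cdot \tfrac{1}{40} + \tfrac{4}{24}} \approx 3.4, \qquad s_\mathrm{p}(24) \approx \frac{1}{\tfrac{7}{6} \cdot \tfrac{1}{40} + \tfrac{4}{24}} \approx 5.1, \end{aligned}$$

so that, for these parameters, pipelining raises the attainable speedup from about 3.4 to about 5.1, while both values remain below the bound \(P/K = 6\) that is approached as \(c_\mathrm{c}/c_\mathrm{f} \rightarrow 0\).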

3 Pipelined Parareal in OpenMP

The implementation of Parareal with pipelining in OpenMP is sketched in Algorithm 1. A description of how Parareal is implemented using MPI is available in the literature [1] and is therefore not repeated here.

Threads are spawned by the OMP PARALLEL directive in line 1.2 and terminated by OMP END PARALLEL in line 1.37. Manual synchronisation is required; to this end, P OpenMP locks are created using OMP_INIT_LOCK in line 1.4, one for each thread. During the fine integrator and update step, these locks are set and unset using OMP_SET_LOCK and OMP_UNSET_LOCK to protect buffers during writes and to avoid race conditions.

The algorithm consists of the following parts (a compact code sketch of the same structure is given after the listing):

  • Prediction step: lines 1.5–1.13. Each thread computes its own coarse prediction of its starting value \(\mathbf {q}^{0}_{p}\) in a parallelised loop. The coarse value \(\mathcal {G}_{\Delta t}(\mathbf {q}^{0}_p, t_{p+1}, t_p)\) is also computed and stored for use in the first iteration. The later the time slice (indicated by a higher thread number p), the more coarse steps the thread has to compute and thus the larger its workload (cf. Fig. 1). Therefore, at the end of the coarse prediction loop, the NOWAIT clause is required to avoid implicit synchronisation and to enable pipelining.

  • Parareal iteration: lines 1.14–1.36. Here, both the fine integrator and the update step are performed inside a single loop over all time slices, parallelised by OMP DO directives. Because parts of the loop (the update step) have to be executed in serial order, the ORDERED directive has to be used in line 1.15. Again, to avoid implicit synchronisation at the end of the loop, the NOWAIT clause is required in line 1.35.

    • Fine integrator: lines 1.17–1.20. Before the fine integrator is executed, an OMP_LOCK is set to indicate that the thread will start writing into buffer q(p). Because thread \(p-1\) accesses this buffer in its update step, locks are necessary to prevent race conditions and incorrect solutions. After the lock is set, the thread proceeds with the computation of \(\mathcal {F}_{\delta t}(\mathbf {q}^k_p, t_{p+1}, t_p)\) and of the difference \(\delta q\) between the coarse and the fine value. Then, since q(p) is up to date and \(\delta q\) is ready, the lock can be released.

    • Update step: lines 1.21–1.33. The update step has to be performed in proper order, from first to last time slice. Therefore, it is enclosed in ORDERED directives, indicating that this part of the loop is to be executed in serial order. Then, as in the MPI and non-pipelined versions, the update step is initialised with \(\mathbf {q}^{k+1}_0 = \mathbf {q}_0\). For every time slice, the coarse value of the updated initial guess is computed and the update performed. The updated end value is written into buffer \(q(p+1)\) to serve as the new starting value for the following time slice. However, to prevent thread p from writing into \(q(p+1)\) while thread \(p+1\) is still running the fine integrator, thread p sets OMP_LOCK number \(p+1\) while performing the update.

Algorithm 1: Parareal with pipelining in OpenMP (listing).
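Since Algorithm 1 is given as a listing only, the following compact Fortran sketch illustrates the combination of OMP PARALLEL, OMP DO with NOWAIT, ORDERED and locks described above, applied to the scalar model problem already used in the sketch in Sect. 2. It mirrors the structure of Algorithm 1 but is not the benchmarked code; in particular, buffer and variable names are illustrative and, as for the benchmarked code, correctness of the lock-based hand-off between neighbouring threads should be verified over repeated runs (cf. Sect. 4).

  program parareal_omp_pipelined_sketch
    use omp_lib
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: nslices = 8, niter = 4, nfine = 100
    real(dp), parameter :: lambda = -1.0_dp, t_final = 1.0_dp
    real(dp) :: q(0:nslices)          ! q(p): current approximation at time t_p
    real(dp) :: qc(0:nslices-1)       ! coarse value G(q^{k-1}_p) over slice p
    real(dp) :: delta(0:nslices-1)    ! F(q^{k-1}_p) - G(q^{k-1}_p)
    integer(omp_lock_kind) :: lck(0:nslices)   ! lck(p) protects buffer q(p)
    real(dp) :: dts, qstart, qfine, gnew
    integer  :: p, k, j

    dts  = t_final / nslices
    q(0) = 1.0_dp
    do p = 0, nslices
       call omp_init_lock(lck(p))
    end do

  !$omp parallel num_threads(nslices) default(shared) private(p, k, j, qstart, qfine, gnew)

    ! Prediction step: thread p runs the coarse method from t_0 to t_p on its
    ! own, so that no thread has to wait for another one (hence the NOWAIT).
  !$omp do schedule(static,1)
    do p = 0, nslices-1
       qstart = q(0)
       do j = 1, p
          qstart = euler(qstart, dts, 1)
       end do
       q(p)  = qstart                  ! q^0_p
       qc(p) = euler(qstart, dts, 1)   ! G(q^0_p), reused in the first iteration
    end do
  !$omp end do nowait

    do k = 1, niter
  !$omp do ordered schedule(static,1)
       do p = 0, nslices-1
          ! Fine integrator: lock p protects the starting value in q(p) from
          ! being overwritten by thread p-1's correction for this iteration.
          call omp_set_lock(lck(p))
          qstart   = q(p)                        ! q^{k-1}_p
          qfine    = euler(qstart, dts, nfine)   ! F(q^{k-1}_p)
          delta(p) = qfine - qc(p)
          call omp_unset_lock(lck(p))
          ! Update step, executed in slice order:
  !$omp ordered
          call omp_set_lock(lck(p+1))     ! thread p+1 may still be in its fine step
          gnew    = euler(q(p), dts, 1)   ! G(q^k_p); q(p) was deposited by thread p-1
          q(p+1)  = gnew + delta(p)       ! update (2)
          qc(p)   = gnew                  ! kept for the next iteration
          call omp_unset_lock(lck(p+1))
  !$omp end ordered
       end do
  !$omp end do nowait
    end do

  !$omp end parallel

    do p = 0, nslices
       call omp_destroy_lock(lck(p))
    end do
    print '(a,es13.5)', 'Parareal solution: ', q(nslices)
    print '(a,es13.5)', 'Exact solution:    ', exp(lambda*t_final)

  contains

    ! Forward Euler with nsteps steps for q' = lambda*q over a slice of length dt.
    pure function euler(q0, dt, nsteps) result(qe)
      real(dp), intent(in) :: q0, dt
      integer,  intent(in) :: nsteps
      real(dp) :: qe
      integer  :: i
      qe = q0
      do i = 1, nsteps
         qe = qe + (dt/nsteps)*lambda*qe
      end do
    end function euler

  end program parareal_omp_pipelined_sketch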

The implementation described here computes a fixed number of iterations K. While useful for testing, this is not necessarily optimal, since iterations are performed even on time slices that have already converged. A simple optimisation would be to leave threads idle for time slices that have converged. For larger problems, where the whole time interval cannot be covered with time slices, either restarting or some form of “sliding window” should be employed. Both cases require replacing the outer loop over iterations by some form of adaptive iteration control and are not considered here.

4 Numerical Results

This section compares the pipelined OpenMP implementation to a straightforward MPI variant with respect to runtime, memory footprint and energy consumption. The code used here for benchmarking is written in Fortran 90 and available for download [16]. It is a special-purpose code that solves the nonlinear 3D Burgers’ equation

$$\begin{aligned} u_{t} + u \cdot \nabla u = \nu \Delta u \end{aligned}$$
(5)

on \([0,1]^{3} \subset \mathbb {R}^{3}\) with periodic boundary conditions using finite differences. Both implementations of Parareal use the same modules to provide the coarse and fine integrators and the spatial discretisation. Tests guarantee that the two implementations of Parareal produce results that are identical up to a tolerance of \(\varepsilon = 10^{-14}\), that is, essentially up to round-off error. To detect possible race conditions, the comparison is run multiple times; up to 100 instances of the test were passed on both architectures used. Furthermore, both implementations of Parareal use three auxiliary buffers per time slice: q to store the fine value and for communication, \(\delta q\) to store the difference \(\mathcal {F}_{\delta t}(q) - \mathcal {G}_{\Delta t}(q)\) needed in the correction step and \(q_c\) to store the coarse value from the previous iteration.
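A possible layout of these per-slice buffers, with hypothetical names, is sketched below; the benchmarked code may organise its storage differently.

  module slice_storage
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    type :: slice_buffers
       real(dp), allocatable :: q(:)        ! fine value and communication buffer
       real(dp), allocatable :: delta_q(:)  ! F(q) - G(q) needed in the correction step
       real(dp), allocatable :: q_c(:)      ! coarse value from the previous iteration
    end type slice_buffers
  end module slice_storage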

A strong stability preserving Runge-Kutta method (RK3-SSP) [18] with a fifth order WENO finite difference discretisation [18] for the advection term and a fourth order centred difference for the diffusion term is used for \(\mathcal {F}_{\delta t}\). For \(\mathcal {G}_{\Delta t}\), a first order forward Euler method with a simple first order upwind stencil for the advection term and a second order centred stencil for the diffusion term is used. Being a simplified version of the Navier-Stokes equations, (5) is a popular benchmark, and finite difference stencils are a widely used motif in computational science, so that the results can be expected to carry over, at least qualitatively, to more general scenarios.
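For illustration, reduced to one spatial dimension, a single step of such a coarse discretisation – forward Euler with first order upwinding for the advection term and a second order centred stencil for the diffusion term on a periodic grid – could look as follows; the benchmarked code treats the full three-dimensional problem (5) and is organised differently.

  ! One coarse forward Euler step for the 1D analogue u_t + u*u_x = nu*u_xx
  ! with periodic boundaries (illustration only).
  subroutine coarse_step_1d(u, nx, dx, dt, nu)
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer,  intent(in)    :: nx
    real(dp), intent(inout) :: u(nx)
    real(dp), intent(in)    :: dx, dt, nu
    real(dp) :: rhs(nx), adv, diff
    integer  :: i, il, ir

    do i = 1, nx
       il = merge(nx, i-1, i == 1)      ! periodic left neighbour
       ir = merge(1,  i+1, i == nx)     ! periodic right neighbour
       ! first order upwind discretisation of u*u_x, direction set by the sign of u
       if (u(i) >= 0.0_dp) then
          adv = u(i) * (u(i) - u(il)) / dx
       else
          adv = u(i) * (u(ir) - u(i)) / dx
       end if
       ! second order centred stencil for the diffusion term
       diff = nu * (u(ir) - 2.0_dp*u(i) + u(il)) / dx**2
       rhs(i) = -adv + diff
    end do
    u = u + dt * rhs
  end subroutine coarse_step_1d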

Parameters for the simulation are a viscosity of \(\nu = 0.02\) and a spatial discretisation on both levels with \(N_x = N_y = N_z = 40\) grid points. The simulation is run until \(T=1.0\) with a coarse time step of \(\Delta t = 1/192\) and a fine time step of \(\delta t = 1/240\). Because of the high computational cost of the WENO-5 method in comparison to a cheap first order upwind scheme, and because RK3-SSP needs three evaluations of the right-hand side per step while the Euler method needs only one, the coarse propagator is about a factor of forty faster, even though the coarse step is only a factor of 1.25 larger than the fine one. Using a coarse time step \(\Delta t \approx \delta t\) prevents stability issues in the coarse propagator and improves convergence of Parareal.

To fix the number of iterations to a meaningful value that guarantees comparable accuracy of Parareal and the serial fine integrator, we estimate the discretisation error of \(\mathcal {F}_{\delta t}\) by comparing against a reference solution computed with time step \(\delta t / 10\). This gives an estimate of about \(e_\mathrm{fine} \approx 5.9 \times 10^{-5}\) for the relative error of the fine method at \(T=1\) and of about \(e_\mathrm{coarse} \approx 7.3 \times 10^{-2}\) for the coarse method. For \(P=24\) time slices, the defect between Parareal and the fine solution is approximately \(1.4 \times 10^{-4}\) after three iterations and \(1.5 \times 10^{-5}\) after four. We therefore fix the number of iterations to \(K=4\), so that for all values of P Parareal produces a solution with the same accuracy as the fine integrator.

Benchmarks are run on two systems. The first is a commodity work station with an 8-core Intel Xeon(R) E5-1660 CPU and 32 GB of memory running CentOS Linux 7. The flags -O3 and -fopenmp were used when compiling the code, for maximum optimisation and to enable OpenMP. As the code is stand-alone, no external libraries have to be linked. The MPI implementation used is mpich-3.0.4, compiled with GCC-4.8.5.

The second system is one node of Piz Dora at the Swiss National Supercomputing Centre CSCS. Dora is a Cray XC40 with a total of 1,256 compute nodes. Each node contains two 12-core Intel Broadwell CPUs and 64 or 128 GB of RAM; nodes are connected through a Cray Aries interconnect using a dragonfly network topology. Two compilers are tested, GCC-4.9.2 and the Cray Fortran compiler version 8.3.12, both using the MPICH MPI library version 7.2.2. The compiler flags -O3 and -fopenmp (GCC) or omp (Cray compiler) are used for compilation. Performance data for each completed job is generated using the Cray Resource Utilisation Reporting tool RUR [4]. RUR collects compute node statistics for each job and provides data on user and system time, maximum memory used, amount of I/O operations, consumed energy and other metrics. However, it only collects data for a full node and not for individual CPUs or cores.

4.1 Wall Clock Time and Speedup

First, runtime and speedup compared to the serial execution of the fine integrator are assessed. On both the Linux work station and Piz Dora, five runs are performed for each variant of Parareal and each value of P, and the average runtime is reported. Measured runtimes are quite stable across runs: the largest relative standard deviation of all five-run ensembles is smaller than 0.05 on the Linux system and smaller than 0.01 on Dora. Therefore, the plots show only average values without error bars, as these would be barely visible and only clutter the figures.

Figure 2 shows runtimes in seconds depending on the number of cores on Dora using the Cray compiler (left) and on the Linux work station (right). The Cray compiler generates faster code than GCC in the case studied here, but for comparison results using GCC on Dora are shown in Fig. 4. The runtime of the serial fine integrator is indicated by a horizontal black line. The CPU in the Linux system has a slightly higher clock speed, so runtimes are somewhat shorter than on Dora: for \(P=8\) cores, for example, OpenMP-Parareal runs in slightly less than 5 s on the work station, compared to about 5.7 s on Dora. Differences between the OpenMP and MPI versions are small on both systems, but for \(P=8\) OpenMP-Parareal is marginally faster than the MPI version on the work station.

Fig. 2. Five-run averages of runtime, with relative standard deviation below 0.01 on Piz Dora (2a) and below 0.05 on the Linux work station (2b).

Fig. 3. Speedup computed from the average runtimes shown in Fig. 2, on Piz Dora (3a) and the Linux work station (3b).

In addition, Fig. 3 shows the speedup relative to the fine integrator run serially. The black line indicates projected speedup according to (4). Both versions fall short of the theoretically possible speedup. Because of overheads, running P instances of \(\mathcal {F}_{\delta t}\) on P cores does take longer than running a single instance. However, differences between MPI and OpenMP are small. As far as runtimes and speedup are concerned, there is no indication that using the more complex OpenMP version provides significant benefits.

Fig. 4. Runtime (4a) and speedup (4b) on Piz Dora using GCC.

4.2 Memory Footprint

The memory footprint of the code is measured only on Piz Dora, where RUR is available. In contrast to runtime and energy, the memory footprint, as expected, does not vary between runs. Therefore, Fig. 5 shows the data from a single run without averaging. The bars indicate the maximum required memory in MegaByte (MB), while the black line indicates the expected memory consumption when using P cores, computed as

$$\begin{aligned} m(P) = P \times m_\mathrm{serial} \end{aligned}$$
(6)

where \(m_\mathrm{serial}\) is the value measured for a reference run of the fine integrator. Because copies of the solution have to be stored for every time slice, the total memory required for Parareal can be expected to increase linearly with the number of cores used for the parallelisation in time. Note, however, that the memory required per core stays constant if (6) holds.

Fig. 5. Maximum memory allocated in MegaByte for GCC (5a) and the Cray compiler (5b) for the three different versions of Parareal, depending on the number of cores P used. The black line indicates the expected memory consumption, computed as the number of cores times the memory footprint of the serial fine integrator.

For the OpenMP variant compiled with GCC, the memory footprint shown in Fig. 5a exactly matches the expected values. The Cray compiler, shown in Fig. 5b, leads to a smaller-than-expected memory footprint, but memory requirements still increase linearly with the number of time slices. For both compilers, the MPI version causes a noticeable overhead in terms of memory footprint, most likely because of internal allocation of additional buffers for sending and receiving [15]. For both OpenMP and MPI, the total memory footprint is larger for the Cray than for the GCC compiler, but the effect is much more pronounced for the MPI implementation (329 MB versus 261 MB) than for the OpenMP variant (200 MB versus 193 MB).

It is important to note that both implementations allocate three auxiliary buffers per core. The memory overhead of MPI thus does not simply stem from allocating an additional buffer for communication but originates from within the MPI library. The OpenMP implementation avoids this overhead.

It should be pointed out that the MPI baseline relies on two-sided MPI_SEND and MPI_RECV calls for communication. Exploring whether an implementation based on one-sided remote memory access in MPI [9] retains the advantages of OpenMP would be an interesting continuation of the presented work.

4.3 Energy-to-Solution

The RUR tool reports the energy-to-solution for every completed job. Because RUR can only measure energy usage for a full node, results are reported here for runs using the full number of \(P=24\) cores available on a Piz Dora node. Unfortunately, this makes measuring the energy consumption of the fine propagator largely meaningless, since it can use only a single core of the node. Therefore, no corresponding measurements were taken to quantify the overhead of Parareal in terms of energy.

In contrast to runtimes and memory footprint, energy measurements show significant variations between runs due to random fluctuations. Thus, the presented values are averages over ensembles of 50 runs for each version of Parareal. This number of runs has been sufficient to reduce the relative standard deviation to below 0.09 in both configurations and therefore gives a robust indication of actual energy requirements. Table 1 shows the results including \(95\%\) confidence intervals, assuming energy-to-solution is normally distributed. For comparison, energy-to-solution is also shown for a simple non-pipelined OpenMP implementation where only the loop around \(\mathcal {F}_{\delta t}\) is parallelised using OMP PARALLEL DO.

Table 1. Average energy-to-solution for the three different variants run on \(P=24\) cores. Since RUR only measures energy consumption of a full node, it was not possible to meaningfully measure energy-to-solution of the serial propagator and quantify the energy overhead.

When using the GCC compiler, MPI-Parareal consumes more energy than both the pipelined and the non-pipelined OpenMP version. The averages lie well outside the confidence interval for the MPI version, so this is very unlikely to be a chance result. Moreover, because runtimes are almost identical for MPI and pipelined OpenMP, the differences in energy-to-solution cannot be attributed solely to differences in time-to-solution. This is supported by the fact that non-pipelined OpenMP, despite being significantly slower, still consumes less energy than the MPI variant. Tracking down the precise reason for the differences in energy-to-solution and power requirement will require detailed tracing of power uptake, which is only possible on specially prepared machines [10], and is thus left for future work.

For code generated with the Cray compiler, both OpenMP and MPI lead to almost identical energy requirements. Confidence intervals are \(784.24 \pm 11.93\) J for MPI and \(784.24 \pm 12.18\) J for OpenMP. It seems likely that the compiler optimises the message passing to take advantage of the shared memory on the single node. Supposedly, the MPI version handles communication in a way that is similar to what is explicitly coded in the OpenMP version. However, as shown in Subsect. 4.2, this automatic optimisation comes at the expense of a significantly larger memory footprint.

Note that the energy consumption of Parareal has been studied previously [1]. By comparing against a simple theoretical model, it has been shown that the energy overhead of Parareal (defined as the energy-to-solution of Parareal divided by the energy-to-solution of the serial fine integrator) is mostly due to Parareal’s intrinsically suboptimal parallel efficiency. While improving the parallel efficiency of parallel-in-time integration clearly remains the main avenue for improving energy efficiency, the results here suggest that in some cases a shared memory approach can provide non-trivial additional savings.

5 Summary

The paper introduces and analyses an OpenMP implementation of the parallel-in-time method Parareal with pipelining. Pipelining hides some of the cost of the serial coarse correction step in Parareal and is important for optimising its efficiency, even though it cannot relax the inherent limit on parallel efficiency given by the inverse of the number of required iterations. Pipelining comes naturally in a distributed memory MPI implementation but is not straightforward when using OpenMP. The new OpenMP implementation is compared to the standard MPI variant in terms of runtime, memory footprint and energy consumption for both the Cray compiler and GCC. Both versions produce essentially identical runtimes. For both compilers, using OpenMP reduces the memory footprint, with a more pronounced effect for the Cray compiler. In terms of energy-to-solution, the results strongly depend on the compiler: while for GCC the OpenMP version is more energy efficient than the MPI version, there is no difference for the Cray compiler.

The results show that contemplating a shared memory strategy for implementing “parallel-across-the-steps” methods like Parareal can be worthwhile: even though it is more complicated, it can reduce memory requirements. However, more advanced features like task-based parallelism in OpenMP, remote memory access in MPI or a more advanced iteration control that prevents superfluous computation on converged time slices are not explored here. Another open question is whether the benefits carry over to the full space-time parallel case, where a parallel-in-time method is combined with spatial decomposition. For Parareal without pipelining, the potential of such a hybrid space-time parallel approach has been illustrated [11], but whether this carries over to the pipelined version introduced here remains to be seen.