
1 Introduction

Scientific applications in life science and many other fields can benefit from Parallel-in-Time (PinT) methods, which have the potential to extract additional parallelism in applications governed by evolutionary models, allowing for concurrency along the temporal dimension as well. Consider, as an example, the analysis, reconstruction and denoising of ultrasound images arising from 2D/3D echocardiography [2, 18, 24]. In the European Exascale Software Initiative (EESI) 2014 roadmap, PinT approaches are recommended as a means of developing efficient applications for Exascale computing, thus taking a significant step beyond “traditional” HPC. On the other side, deploying application codes by means of scientific libraries, such as PETSc (the Portable, Extensible Toolkit for Scientific computing) [1], can be considered a good investment [2] to maximize the availability of PinT algorithms for scientific applications.

Recent advances in PETSc have concerned the improvement of multilevel, multidomain and multiphysics algorithms. The most relevant capabilities allow users to test different solvers (linear, nonlinear, and time stepping) for their complex simulations, without making premature choices about algorithms and data structures [1]. Nevertheless, PETSc does not provide any parallel-in-time support.

The MultiGrid-In-Time (MGRIT) algorithm is a PinT approach based on Multigrid Reduction (MGR) techniques [3]. Although it is already implemented in the software package XBraid [4], we are developing a modular multilevel parallel implementation [5] based on PETSc. The main goal of our approach is to provide a model predicting the performance gain achievable using the MGRIT approach instead of a time-marching integrator, independently of whether parallelism in the space dimension is introduced or not. The performance model in [7], instead, aims at selecting the best parallel configuration (i.e. how much parallelism is to be devoted to space vs. time). Therefore, we analyze the performance parameters of the algorithm in the mathematical framework presented in other works by the same authors [8]. We intentionally overlook spatial parallelism; in this way we describe the performance improvement regardless of the execution time needed to implement the characteristic function of the problem. We believe that both performance models could be employed for the successful implementation of the MGRIT algorithm [9].

In the second section we describe the basic idea of the algorithm to be implemented, summarizing the main results described in other works [3, 7, 10, 11]. In the third section we briefly define the tools of the performance evaluation framework we need. In the fourth one we give some details about the PETSc implementation of MGRIT and write a performance model describing the expected performance gain, which depends mainly on the number of processors, the number of time-discretization points and the number of grid levels. Finally, in the last section we introduce what we are currently working on and the next planned steps.

2 MGRIT Algorithm: Basic Idea

The basic idea of MGRIT comes from the two-grid formulation of the Parareal method [11] for solving an Ordinary Differential Equation (ODE) and its discretization

$$\begin{aligned} u_t = f(u, t) \text {, with } u(0) = u_0 \text { and } t \in [0, T ]. \end{aligned}$$
(1)
$$\begin{aligned} u(t+\delta t) =\varPhi (u(t),u(t+\delta t)) + g(t+\delta t). \end{aligned}$$
(2)

where \(\varPhi \) is a linear (or nonlinear) operator that encapsulates the chosen time stepping solver, and g incorporates all solution-independent terms. The application of \(\varPhi \) is either a matrix-vector multiplication, e.g. for forward Euler, or a spatial solve, e.g. for backward Euler [13].
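To illustrate the two regimes, consider a 1D heat equation discretized in space. The sketch below (plain Python/NumPy, with hypothetical sizes n, dx, dt chosen only for illustration) shows how an explicit step is a single matrix-vector product while an implicit step requires a spatial solve:

```python
import numpy as np

# Hypothetical 1D heat equation u_t = L u on n interior points,
# with the standard second-order finite-difference stencil.
n, dx, dt = 8, 0.1, 0.001
L = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) / dx**2

u = np.ones(n)

# Forward Euler: applying Phi is one matrix-vector product,
#   u_{i+1} = (I + dt*L) u_i.
Phi_explicit = np.eye(n) + dt * L
u_explicit = Phi_explicit @ u

# Backward Euler: applying Phi means solving a spatial system,
#   (I - dt*L) u_{i+1} = u_i.
A_implicit = np.eye(n) - dt * L
u_implicit = np.linalg.solve(A_implicit, u)
```

In both cases the cost of one \(\varPhi \) application is essentially fixed for a fixed spatial grid, which is what the performance model in Sect. 4.1 relies on.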

The MGRIT algorithm is detailed in several works [3, 7, 10, 13]. Briefly, let \(t_i = i\delta _t\), \(i = 0, 1, ..., N\), be a discretization of [0, T] with spacing \(\delta _t = \frac{T}{N}\) (this mesh will be called the F-grid), and let \(t_j = j\varDelta T\), \(j = 0, 1, ..., N_\varDelta \), with \(N_\varDelta = \frac{N}{m}\) and \(m > 1\) (the C-grid). We rewrite problem (2) on the F-grid, denoted as the Fine problem:

$$\begin{aligned} Au= \begin{bmatrix} I&0&\cdots&0&0\\ -\varPhi _0&I&\cdots&0&0\\ 0&-\varPhi _1&\cdots&0&0\\ \vdots \\ 0&0&\cdots&-\varPhi _{N-1}&I \end{bmatrix}\cdot \begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_N \end{bmatrix}= \begin{bmatrix} g_0 \\ g_1 \\ \vdots \\ g_N \end{bmatrix} =g \end{aligned}$$
(3)

From (3), a C-grid problem is obtained by introducing appropriate interpolation and restriction operators P and R (see definitions in [10] and description in [9]). Then the multigrid reduction approximates the resulting C-grid operator \(A_{\varDelta }\) by \(B_{\varDelta }\), which is based on a new coarse propagator \(\varPhi _{\varDelta ,i}\) arising from the re-discretization of problem (2) on the C-grid and which is less expensive to evaluate. In this way we have to solve the so-called Coarse problem:

$$\begin{aligned} B_{\varDelta }u_{\varDelta }= \begin{bmatrix} I&0&\cdots&0&0\\ -\varPhi _{\varDelta ,0}&I&\cdots&0&0\\ 0&-\varPhi _{\varDelta ,1}&\cdots&0&0\\ \vdots \\ 0&0&\cdots&-\varPhi _{\varDelta ,N_\varDelta -1}&I \end{bmatrix} \cdot \begin{bmatrix} u_{\varDelta ,0} \\ u_{\varDelta ,1} \\ \vdots \\ u_{\varDelta ,N_\varDelta } \end{bmatrix}= \begin{bmatrix} g_{\varDelta ,0} \\ g_{\varDelta ,1} \\ \vdots \\ g_{\varDelta ,N_\varDelta } \end{bmatrix} =g_{\varDelta } \end{aligned}$$
(4)

Let us assume that the model equation (1) is linear, so that the \(\varPhi _i\) are linear, and let \(\varPhi _i\equiv \varPhi \), \(i=1, 2, \ldots , N\). Then Parareal can be derived as an approximate Schur-complement approach with F-relaxation (relaxationFootnote 1 applied to the so-called fine points, or F-points, i.e. the points on the F-grid that are not also on the C-grid), i.e. a two-level multigrid method. The MGRIT algorithm essentially extends the Parareal approach to more grids: it uses discretization, relaxation, restriction and projection operators at each grid level, according to different kinds of cycles. The key difference from Parareal lies in a new relaxation operator, the FCF-relaxation: the application of the F-relaxation and the C-relaxation one after another, once or more times [7].
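The relaxations just described can be sketched on the scalar model problem \(u_{i+1}=\phi u_i + g_{i+1}\). The helpers below are a minimal Python sketch (the scalar propagator phi, the coarsening factor m and the grids are illustrative assumptions, not the actual PETSc implementation) that makes the F/C/FCF structure explicit:

```python
import numpy as np

def f_relax(u, g, phi, m):
    # F-relaxation: each F-point (index not a multiple of m) is updated
    # from its left neighbour; the intervals between consecutive C-points
    # are independent of each other, so they can all run concurrently.
    u = u.copy()
    for i in range(1, len(u)):
        if i % m != 0:
            u[i] = phi * u[i - 1] + g[i]
    return u

def c_relax(u, g, phi, m):
    # C-relaxation: each C-point (index a multiple of m) is updated from
    # the F-point immediately to its left; again independent updates.
    u = u.copy()
    for i in range(m, len(u), m):
        u[i] = phi * u[i - 1] + g[i]
    return u

def fcf_relax(u, g, phi, m):
    # FCF-relaxation: F-, then C-, then F-relaxation applied in sequence.
    return f_relax(c_relax(f_relax(u, g, phi, m), g, phi, m), g, phi, m)
```

Note that the exact solution of the model problem is a fixed point of all three relaxations, which is the defining property of a smoother.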

The MGRIT algorithm for solving the linear case, as detailed in [3], is listed in Algorithm 1.

[Algorithm 1: the MGRIT algorithm for the linear case, as given in [3]]

where:

  • l is the current level, \(0\le l\le L\), and L is the coarsest level,

  • \(m_l\) is the coarsening step at each level l, with \(m_0=1\), and \(\delta _l\) is the discretization time step at each level l, where \(\delta _l=\delta _{l-1}\cdot m_l\),

  • \(N_l\) is the number of time steps at each level l, with \(N_0>N_1>...>N_L\), and \(A_l\) is the matrix at level l,

  • \(u^{(l)}\) and \(g^{(l)}\) are the solution and right-hand side vectors at level l,

  • \(R_{Inj}\) is the restriction/injection operator from a level to the coarser one, and P is the ideal interpolation (see [3]) corresponding to an injection from the coarser level to a finer one, followed by an F-relaxation with a zero right-hand side (see [13]).
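A minimal two-level version of the scheme above can be sketched for the scalar model problem \(u_{i+1}=\phi u_i + g_{i+1}\). The code below is illustrative only: the coarse propagator phi_c = phi**m is the exact Schur-complement choice, which makes the two-level cycle a direct solver, whereas a re-discretized phi_c (the realistic, cheaper case) only approximates it. The small relaxation helpers are repeated here so the sketch is self-contained.

```python
import numpy as np

def f_relax(u, g, phi, m):
    # F-relaxation on the scalar model u_i = phi*u_{i-1} + g_i:
    # update the F-points (indices not multiples of m).
    u = u.copy()
    for i in range(1, len(u)):
        if i % m != 0:
            u[i] = phi * u[i - 1] + g[i]
    return u

def c_relax(u, g, phi, m):
    # C-relaxation: update the C-points (indices that are multiples of m).
    u = u.copy()
    for i in range(m, len(u), m):
        u[i] = phi * u[i - 1] + g[i]
    return u

def mgrit_two_level(u, g, phi, m, phi_c=None):
    # One two-level MGRIT iteration.  With the exact coarse propagator
    # phi_c = phi**m the cycle solves the problem in one pass; a
    # re-discretized phi_c (the realistic case) only approximates it.
    if phi_c is None:
        phi_c = phi ** m
    u = f_relax(c_relax(f_relax(u, g, phi, m), g, phi, m), g, phi, m)  # FCF
    nc = (len(u) - 1) // m
    # Residual of the fine system, injected at the C-points.
    r = np.zeros(nc + 1)
    for j in range(1, nc + 1):
        i = j * m
        r[j] = g[i] - u[i] + phi * u[i - 1]
    # Solve the coarse error equation e_j - phi_c*e_{j-1} = r_j
    # sequentially (forward substitution), then correct the C-points.
    e = np.zeros(nc + 1)
    for j in range(1, nc + 1):
        e[j] = phi_c * e[j - 1] + r[j]
    for j in range(1, nc + 1):
        u[j * m] += e[j]
    # Ideal interpolation: an F-relaxation propagates the correction.
    return f_relax(u, g, phi, m)
```

The multilevel algorithm replaces the sequential coarse forward substitution with a recursive call on the C-grid, down to the coarsest level L.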

3 Preliminary Concepts and Definitions

The increasing need for parallel and scalable software, ready to exploit the new exascale architectures, has led to the development of many performance models, mainly based on architecture features [19,20,21,22,23, 26] or especially made for chosen algorithm classes [25, 27,28,29]. The model we present here is mainly focused on the dependencies among the computational tasks of the algorithm and is meant to be as general as possible.

We start by giving the definition of dependency relation on a set.

Definition 1

(Dependency relation). Let \(\mathcal {E}\) be a set and let \(\pi _{\mathcal {E}}\) be a strict partial order relation on \(\mathcal {E}\) describing a dependency relation between the elements. We say that any element of \(\mathcal {E}\), say A, depends on another element of \(\mathcal {E}\), say B, if \(A \pi _{\mathcal {E}} B\), and we write \(A \leftarrow B\). If A and B do not depend on each other we write \(A \nleftarrow B\).

Then, consider the set \(\varGamma \) of all computational problems and any element \(B_N\in \varGamma \), where N is the input data size, called the problem size. Any \(B_N\) can always be decomposed into at least one finite set of other computational problems, which we call a decomposition of \(B_N\). Given a decomposition into k subproblems \(B_{N_i}\), denoted \(D_{k}\), and taking into account the dependencies among the subproblems, we build a dependency matrix \(\mathcal {M}_{D_{k}}\) where each row essentially contains subproblems independent of one another and dependent on those in the previous rows. Let us introduce the dependency relation \(\pi _{D_{k}}\) such that \(B_{N_i} \pi _{D_{k}} B_{N_j}\), with \(i\ne j\), if and only if the solution of \(B_{N_j}\) must be found before the one of \(B_{N_i}\).

Definition 2

(Dependency Matrix). Given the partially ordered set \((D_{k},\pi _{D_{k}})\), the matrix \(M_{D_{k}}\), of size \(r_{D_{k}}\times c_{D_{k}}\), whose elements \(d_{i,j}\), are s.t. \(\forall i\in [0,r_{D_{k}}-1]\) and \(\forall s,\,j\in [0,c_{D_{k}}-1]\) it is \(d_{i,j} \nleftrightarrow d_{i,s}\) and s.t. \(\forall i\in [1,r_{D_{k}}-1]\quad \exists q \in [0,c_{D_{k}}-1] \) s.t. \(d_{i,j}\leftarrow d_{i-1,q}\quad \forall j \in [0, c_{D_{k}}-1]\), while the other elements are set equal to zero, is called the dependency matrix.

Given \(D_{k}\), \(c_{D_k}\) is the concurrency degree of \(B_N\) and \(r_{D_k}\) is the dependency degree of \(B_N\), according to the chosen decomposition, so that the dependency degree measures the amount of dependencies intrinsic to that decomposition. The number and size of the subproblems a problem is decomposed into determine the granularity of the decomposition. Granularity strongly affects the level of detail required to analyse an algorithm with this approach.
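The row-by-row construction just described can be sketched as follows (illustrative Python; the dependency DAG deps and the subproblem names are hypothetical):

```python
# Hypothetical helper: given a decomposition as a DAG (subproblem ->
# set of subproblems it depends on), build the rows of the dependency
# matrix level by level and read off the two degrees of Sect. 3.
def dependency_rows(deps):
    remaining = set(deps)
    rows = []
    done = set()
    while remaining:
        # A row collects subproblems whose dependencies are all solved;
        # a strict partial order guarantees this loop terminates.
        row = {p for p in remaining if deps[p] <= done}
        rows.append(sorted(row))
        done |= row
        remaining -= row
    return rows

# Example: B0 and B1 are independent, B2 depends on both, B3 on B2.
deps = {"B0": set(), "B1": set(), "B2": {"B0", "B1"}, "B3": {"B2"}}
rows = dependency_rows(deps)
concurrency_degree = max(len(r) for r in rows)  # c_D: widest row
dependency_degree = len(rows)                   # r_D: number of rows
```

For this toy decomposition the rows are [["B0", "B1"], ["B2"], ["B3"]], giving \(c_D=2\) and \(r_D=3\).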

The dependency matrix allows us to identify some properties of the algorithm design, such as the concurrency available in a problem when we choose one decomposition rather than another. So the first question must be about how to decompose the problem. That may seem obvious, but the choice can lead to algorithm characteristics that we want to emphasize and possibly exploit.

At this point we define the scale-up, using the cardinalities of two decompositions.

Definition 3

(Scale Up). Let us consider two decompositions \(D_{k_i}\) and \(D_{k_j}\) of \(B_N\), with cardinalities \(k_j\ne k_i\). The ratio \(SC(D_{k_i},D_{k_j}):=\frac{k_i}{k_j}\) is called the scale-up factor of \(D_{k_j}\) measured with respect to \(D_{k_i}\).

The next step is to assign the identified subproblems to the computing machine. First we introduce the machine \(\mathcal {M}_P\), equipped with \(P \ge 1\) processing elements with specific logical-operational capabilitiesFootnote 2, called computing operators of \(\mathcal {M}_P\) and denoted by functionsFootnote 3 \(I^{j}:\varGamma \rightarrow S\), where S is the set of the solutions of all the problems in \(\varGamma \) and \(S(B_N)\) is the solution of \(B_N\). Given \(\mathcal {M}_P\), the set without repetitions \(Cop_{\mathcal {M}_P}=\{I^{j}\}_{j\in [0,q-1]}\), where \(q\in \mathbb {N}\), characterizes the logical-operational capabilities of the machine \(\mathcal {M}_P\).

Definition 4

(Algorithm). Given the problem decomposition \(D_{k}\), an algorithm solving \(B_N\) on \(\mathcal {M}_P\), is the partially ordered set \((A_{k,P},\pi _{A_{k,P}})\), with not necessarily distinct elements, where \(A_{k,P}=\{{I^{i_0},I^{i_1},...I^{i_k}\}}\) such that \(I^{i_j}\in \)\(Cop_{\mathcal {M}_P}\) and

$$\begin{aligned} \forall B_{N_\nu } \in D_{k}(B_N)\quad \exists ! I^{i_j}\in A_{k,P}:I^{i_j}[B_{N_\nu }]=S(B_{N_\nu }). \end{aligned}$$
(5)

There is a bijective correspondence \(\gamma : B_{N_\nu } \in D_{k} \longleftrightarrow I^{i_j}\in A_{k,P}\). Every ordered subset of \(A_{k,P}\) is called sub-algorithm of \(A_{k,P}\).

By virtue of property (5), the operators of \(A_{k,P}\) inherit the dependencies existing between the subproblems in \(D_k\), but not the independencies because, for example, two operators may depend on the availability of computing units of \(\mathcal {M}_P\) during their executions [6].

Let \(AL_{B_{N}}\) (or simply AL) be the set of algorithms that solve \(B_{N}\), obtained by varying \(\mathcal {M}_P\), P and \(D_{k}\). Let us associate each algorithm of AL with a decomposition suited for \(\mathcal {M}_P\); that is, we introduce the surjective correspondence \(\psi : A_{k,P} \in AL\longrightarrow D_{k}\), which induces an equivalence relation \(\varrho \) on AL such that

$$\begin{aligned} \varrho (A_{k,P})= \{\widetilde{A_{k,P}}\in AL : \psi (\widetilde{A_{k,P}})=\psi (A_{k,P})\}. \end{aligned}$$
(6)

Therefore, \(\varrho (A_{k,P})\) is the set of algorithms of AL associated with the same decomposition \(D_{k}\)Footnote 4. Hence, \(\varrho \) induces the quotient set \(\frac{AL}{\varrho }\), the elements of which are disjoint subsets of AL determined by \(\varrho \), that is they are equivalence classes under \(\varrho \). In the following we consider \(A_{k,P}\) as a representative of its equivalence class in AL.

Definition 5

(Complexity). The cardinality of \(A_{k,P}\) is called complexity of \(A_{k,P}\). It is denoted as \(C(A_{k,P})\). That is \(C(A_{k,P}) :=card(A_{k,P})=k\).

Notice that, by virtue of property (5), it holds that

$$\begin{aligned} card(A_{k,P})=card(D_{k}(\mathcal {B}_{N_r}))=k, \quad \forall \, A_{k,P}\in \varrho (A_{k,P}). \end{aligned}$$
(7)

and so \(C(A_{k,P})=k\) equals the number of non-empty elements of \(M_{D_k}\) (see Definition 2). This means that every algorithm belonging to the same equivalence class under \(\varrho \) has the same complexity. Thus an integer (the complexity) is associated with each element \(\varrho (A_{k,P})\) of the quotient set \(\frac{AL}{\varrho }\), and this induces an ordering relation between the equivalence classes in \(\frac{AL}{\varrho }\). Therefore there is a minimum complexity for the algorithms that solve the problem \(\mathcal {B}_{N_r}\).

Let us now define more precisely the order relation on the algorithm set, as a second dependency relation \(\pi _{A_{k,P}}\) such that \(I^{i_j} \pi _{A_{k,P}} I^{i_r}\), with \(j\ne r\), if and only if \(I^{i_j}\) solves a subproblem dependent on the one solved by \(I^{i_r}\), and/or the execution of \(I^{i_j}\) needs to wait for the execution of \(I^{i_r}\) in order to use the same computing resource.

Definition 6

(Execution matrix). Given the partially ordered set \((A_{k,P},\pi _{A_{k,P}})\), the matrix \(\mathcal {E}_{k,P}\), of size \(r_{\mathcal {E}_{k,P}}\times c_{\mathcal {E}_{k,P}}\), withFootnote 5 \(c_{\mathcal {E}_{k,P}}=P\), whose elements \(e_{i,j}\), are s.t. \(\forall i\in [0,r_{\mathcal {E}_{k,P}}-1]\) and \(\forall s,\,j\in [0,P-1]\) it is \(e_{i,j} \nleftrightarrow e_{i,s}\) and s.t. \(\forall i\in [1,r_{\mathcal {E}_{k,P}}-1]\quad \exists q \in [0,P-1] \) s.t. \(e_{i,j}\leftarrow e_{i-1,q}\quad \forall j \in [0, P-1]\), while the other elements are set equal to zero, is called the execution matrix.

Within an equivalence class, the algorithm that solves a problem according to a given decomposition executed on a machine with just one processor is a sequential algorithm, and its execution matrix has just one column, since \(P=1\). In general, the execution matrix size describes the cost of the algorithm; empty entries in the matrix represent the algorithm overhead.

The number of rows \(r_{\mathcal {E}_{k,P}}\) is directly related to the execution time of the algorithm executed with that number P of processing units. Note that in order to compare two algorithms we must ensure that they are described by the same kind of operators, i.e. with the same granularity. In particular, if all the operators have the same execution time t, the algorithm execution time is proportional to \(r_{\mathcal {E}_{k,P}}\), and the speed up can be defined for an algorithm, in its equivalence class, as the ratio between its complexity and the number of rows. More specifically, if \(A_{k,P}\) is an algorithm built according to the decomposition \(D_k\) and executed on a machine with P processing units, we give the following definitions:

Definition 7

(Execution time). The quantity \(T(A_{k,P}):=r_{\mathcal {E}_{k,P}} \cdot t\) is called execution time of \(A_{k,P}\).

Definition 8

(Speed Up). Given the algorithm \(A_{k,P}\) executed with P computing units, the ratio \(S(A_{k,P}):=\frac{C(A_{k,P})}{r_{\mathcal {E}_{k,P}}}\) is called the Speed Up of \(A_{k,P}\) in its equivalence class.

This rewrites the classical speed-up formula, so we can say that its ideal value is P, and we can also show that, as P varies, it is limited by the concurrency degree of the problem under the same decomposition.

Briefly, given two different decompositions \(D_{k_i}\) and \(D_{k_j}\), with \(k_j\ne k_i\), and two machines with different numbers of processors, \(P_1=1\) and \(P>1\), for the two corresponding algorithms we define the General Speed Up of the parallel one with respect to the sequential one as the product of the Scale Up between the two decompositions and the classical speed up of the parallel one.

Definition 9

(General Speed Up). The ratio

$$GS(A_{k_j,P}, A_{k_i,1}):=SC(D_{k_i},D_{k_j})\cdot S(A_{k_j,P})= \frac{k_i}{k_j}\cdot \frac{k_j}{r_{\mathcal {E}_{A_{k_j,P}}}}= \frac{r_{\mathcal {E}_{A_{k_i,1}}}}{r_{\mathcal {E}_{A_{k_j,P}}}}$$

is called the General Speed Up of \(A_{k_j,P}\) with respect to \(A_{k_i,1}\).

Note that the ideal value of the General Speed Up is not limited by the number of processing units P.
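The three definitions combine as in the identity above; a tiny numeric sketch (with hypothetical cardinalities and row counts) may help:

```python
def scale_up(k_i, k_j):
    # SC(D_{k_i}, D_{k_j}) := k_i / k_j  (Definition 3)
    return k_i / k_j

def speed_up(k, rows):
    # S(A_{k,P}) := C(A_{k,P}) / r_E  (Definition 8)
    return k / rows

def general_speed_up(k_i, k_j, rows_par):
    # GS = SC * S; since a sequential algorithm has r_E = k_i rows,
    # this also equals rows_sequential / rows_parallel (Definition 9).
    return scale_up(k_i, k_j) * speed_up(k_j, rows_par)

# Hypothetical numbers: a sequential algorithm built on a decomposition
# with k_i = 1024 subproblems vs. a parallel one built on a coarser
# decomposition with k_j = 512 subproblems executed in 64 rows.
gs = general_speed_up(1024, 512, 64)  # (1024/512) * (512/64) = 16.0
```

Note that the scale-up factor here exceeds what the classical speed up alone would allow with 8 processors, which is exactly why the General Speed Up is not bounded by P.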

4 The PETSc Based Implementation of MGRIT for the Linear Case

At the top of the PETSc hierarchy there are the objects to solve ODEs and nonlinear systems, built on other objects needed to solve linear systems. In particular, the TS (TimeStepping) library provides a framework to solve ODEs and DAEs arising from the discretization of time-dependent PDEs. Users essentially have to provide the F function, the G function (if nonzero), the initial condition and the Jacobian.

We are now developing a kind of “parallel TS”, based on MGRIT, to be compared with the already provided sequential ones. The idea is to “simply” solve the linear system using a linear solver with a multigrid preconditioner.

The first step of the implementation is to provide the data structure to handle the time dimension together with the space ones, in the context of the PETSc DM or Distributed objects. This means (1) providing the basic operations for the new type, (2) handling the coarsening factor, and (3) providing the user interface to the function which describes the way of operating of \(\varPhi \), that is the spatial solver, and the time discretization calls.

Everything about the coarsening of the grids along the levels, the distribution of the points among the processes and the communications is handled by the PETSc DA (DistributedArray) object, linked to the solvers in a fully transparent fashion. Users can tune the behavior of the solver, and thus the actual structure of the scheme, through option settings (including tolerances and initial guesses for all the operators involved) at runtime.

The second step is the implementation of the F- and C-relaxations, which must be set as the down and up smoothers of the multigrid scheme, according to the PETSc design. Users will control all the parameters and solver choices, tuning them (even at runtime) to fit their own problems.

4.1 The Performance Model

First, we notice that the application of \(\varPhi \) is the dominant task. In the case of explicit time stepping, each application of \(\varPhi \) corresponds to a matrix-vector product, whose execution time is constant. In the case of implicit time stepping, each application of \(\varPhi \) amounts to a linear system solve. Using an optimal space solver, and fixing the stopping tolerance or the number of iterations and the initial guess choice for the spatial solver, the work required for one time step evaluation can be considered constant across all time levels (and associated time step sizes)Footnote 6 [10].

Let \(\varPhi _{i,j}\) be the subproblem of evaluating the function \(\varPhi \) at the j-th time instant of level i, with \(i=0,...,L\) and \(j=1,...,N_i\), and let \(\phi _{i,j}\) be the operator that solves it. Notice that there is no evaluation at the first instant of each grid.

Let \(N_{F_l}:=N_l-N_{l+1}\) be the number of F-points and \(N_{l+1}\) the number of C-points at each level l of the MGRIT algorithm. The relation between the number of F-points and C-points depends on the coarsening factor m, which can be the same for all levels or different for each one. In Algorithm 1 we note that, if L is the coarsest level and the solver of the system on the coarsest grid is sequential, the solve involves at least one \(\phi \)-execution for each time step on the L-th grid, i.e. \(N_L\) executions of \(\phi \). Otherwise, for each level \(l<L\),

  • the FCF-relaxation involves \(N_{F_l}\) F-relaxation steps (or \(\phi \)-executions), which can be performed in parallel, then \(N_{l+1}\) C-relaxation steps (or \(\phi \)-executions), which can be performed in parallel, then another \(N_{F_l}\) F-relaxation steps (or \(\phi \)-executions), which can be performed in parallel,

  • computing the residual requires one \(\phi \)-execution for each time step on the \((l+1)\)-th grid, that is \(N_{l+1}\) executions, which can be performed in parallel,

  • the ideal interpolation requires \(N_{F_l}\) F-relaxation steps (or \(\phi \)-executions), which can be performed in parallel.

Let us now define the dependency matrix \(\mathcal {M}_D\) (see Definition 2 in Sect. 3) of the time-space problem to be solved, according to its decomposition into the subproblems \(\varPhi _{i,j}\), for \(i=0, \ldots ,L\) and \(j=1,\ldots , Np\), where \(Np\in \{N_{F_l},N_{l+1},N_L\}\), and where each row essentially contains subproblems independent of one another and dependent on those in the previous rows (the \(\mathcal {M}_D\) matrix is well described in [9]).

The concurrency degree of the problem decomposed in this way is \(c_D\), i.e. the maximum number of simultaneous \(\varPhi \) evaluations. Since \(N_{F_l}>N_{l+1}\) and \(N_{F_l}>N_{F_{l+1}}\), which means that the number of F-points at any level is greater than the number of C-points at the same level and greater than the F-points at the next level, \(c_{D}=N_{F_0}\).

The dependency degree is \(r_{D}=5\cdot L+N_L\) since, with \(L+1\) levels, we have (1) 3 rows for each FCF-relaxation, i.e. \(3\cdot L\) rows, (2) 1 row for each residual computation, i.e. L rows, which are the longest rows in the matrix (those with the largest numbers of columns), (3) \(N_L\) rows for the coarsest-grid solver, and (4) 1 row for each ideal interpolation (F-relaxation), i.e. L rows.

Consider now a computing architecture with P processing elements, where \(P=\frac{c_{D}}{np}\) with \(np\in \mathbb {N}\) (this condition states that the points on the finest grid are equally distributed among the processors, i.e. \(c_{D}\) is a multiple of P) and \(P\le N_L\) (this condition states that on the coarsest grid each processor holds at least one point).

We can define the execution matrix \(\mathcal {E}_P\) of MGRIT (see Definition 6, in Sect. 3), consisting of the operators \(\phi _{i,k\cdot P+j}\), with \(i=0, \ldots ,L\) and \(j=1,...P\) and \(k=1,\ldots ,\frac{N_{F_l}}{P}\) for F-relaxation or \(k=1,\ldots ,\frac{N_{l+1}}{P}\) for C-relaxation and residual computation, considering that, for each level, the number of points of the grid is a multiple of PFootnote 7 (the \(\mathcal {E}_P\) matrix is well described in [9]).

Consider now the algorithm \(A_{N_0,1}\), which solves (2) with the same discretization in time as on the finest grid (same initial guess and same tolerance) but without introducing MGR or any parallelism, i.e. using a sequential time-stepping approach with the same discretization techniques and parameters as used by MGRIT on the finest grid. \(A_{N_0,1}\) consists of \(N_0\) executions of \(\phi \), leading to the execution matrix \(\mathcal {E}_1\) with one column and \(N_0\) rows (the \(\mathcal {E}_1\) matrix is well described in [9]).

We can prove the following (proof in [9]):

Theorem 1

Let us assume that the MGRIT algorithm runs on a computing architecture with \(P\le N_L\) processing elements, where \(P=\frac{N_0}{np}\) and \(np\in \mathbb {N}\). Let \(t_{\phi }\) be the execution time of \(\phi \), assumed to be the same \(\forall l \in [0,L] \).

Assume that MGRIT reaches the same accuracy as \(A_{N_0,1}\) in \(\nu \) iterations. Then the general speed-up \(GS(MGRIT_{N_{MGRIT},P},A_{N_0,1})\) of MGRIT with respect to \(A_{N_0,1}\) is

$$\begin{aligned} GS(MGRIT_{N_{MGRIT},P},A_{N_0,1})=\frac{N_0}{\nu \cdot \left( \sum _{l=0}^{L-1}\left( 3\cdot \frac{N_{F_l}}{P}+2\cdot \frac{N_{l+1}}{P}\right) +N_L\right) } \end{aligned}$$
(8)
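Equation (8) is straightforward to evaluate numerically. The sketch below assumes a uniform coarsening factor m at every level, so that \(N_l = N_0/m^l\) (an assumption for illustration, since the factor may differ per level), and uses hypothetical parameter values:

```python
def mgrit_speedup(N0, m, L, P, nu):
    # General speed-up of Eq. (8), assuming a uniform coarsening factor
    # m at every level, so N_l = N0 / m**l and N_Fl = N_l - N_{l+1}.
    levels = [N0 // m ** l for l in range(L + 1)]
    cost = 0.0
    for l in range(L):
        NFl = levels[l] - levels[l + 1]
        cost += 3 * NFl / P + 2 * levels[l + 1] / P
    cost += levels[L]  # sequential coarsest-grid solve: N_L executions
    return N0 / (nu * cost)

# Hypothetical scenario: N0 = 4096 time steps, m = 4, L = 3 coarsening
# levels, P = 64 processors, nu = 8 MGRIT iterations.
gs = mgrit_speedup(4096, 4, 3, 64, 8)  # ≈ 1.74 for these numbers
```

Such a sketch makes it easy to see, for instance, that increasing P shrinks the relaxation and residual terms but leaves the sequential coarsest-grid term \(N_L\) untouched, so the level count and coarsening factors must be balanced against \(\nu \).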

5 Conclusions and Future Work

Summarizing, we introduced a mathematical framework to propose a speed-up model for our implementation of the MGRIT algorithm. It describes the impact of several factors (i.e. the number of time steps and the number of processors) on the dependencies among operators, and thus on the algorithm performance, regardless of the execution of \(\phi \). Any choice related to its implementation affects the unit time \(t_{\phi }\) and/or the numerical accuracy of the results. The required accuracy will limit one or more parameters in a way that is beyond the scope of this paper. If \(\varPhi \) is nonlinear, each application becomes an iterative nonlinear solve, whose conditioning usually depends on the time step size [13].

The main topics we are now focusing on are the following:

  • definition of a memory access matrix to take into account the communications, which can significantly affect the software speed up, limiting the number of processing elements and grid levels to be used,

  • parallel implementation of \(\varPhi \), to handle different levels of parallelism, exploiting the capabilities of heterogeneous architectures, such as multicore clusters and GPUs, to efficiently treat the parallelism in the spatial dimension [14,15,16,17],

  • validation of all the results arising from this design approach through the execution of the resulting software prototype on a suitable set of problems. The validation activities should provide PETSc users with the guidelines needed to efficiently use the new TS object to solve their problems.