1 Introduction

The HPC community has long acknowledged the central role of hardware faults in large-scale simulations. In 2013, the Blue Waters petascale system exhibited a mean time between failures (MTBF) of only 4.2 h [7]. The MTBF is becoming so small that it approaches the checkpointing interval itself [5], which makes clear that alternative approaches to resilience are not merely convenient but urgently necessary. This concern has brought together experts from across the scientific spectrum to develop new solutions [6].

Fault tolerance can be addressed either at the system level (error-correcting codes, checkpointing, replication) or at the algorithmic level. There have been promising advances in both areas in recent years, such as in-memory checkpointing [21] at the system level or the use of machine learning techniques to identify PDE solvers that are robust against faults [4] at the algorithmic level. In some cases the two approaches can be combined to get the best of both worlds [16], but this is not possible in general. Although some algorithms are fault tolerant by construction (e.g., iterative, randomized, and restarted schemes), most do not have properties that can be exploited for algorithmic fault tolerance; such algorithms will largely rely on system-level resilience.

We are now approaching the exascale era, which aggravates the problem of resilience. Given how little is known about the hardware- and system-level specifications of future exascale systems, it is crucial to exploit algorithm-based fault tolerance beyond checksum schemes. Algorithm-based resilience, which we interpret broadly as using the properties of numerical schemes to overcome failures, has the advantage of not relying heavily on the specifications of the system. Moreover, the resource overhead required to implement resilient algorithms is usually much smaller than that of the more common system-level approaches (most notably replication and checkpointing). This is the approach we have adopted for solving high-dimensional PDEs efficiently, using an algorithm that can deal with faults at the algorithmic level: the Sparse Grid Combination Technique (SGCT) [8].

The SGCT is an extrapolation method that alleviates the curse of dimensionality for a wide variety of problems; we are particularly interested in solving PDEs. The idea of the SGCT is the following: since solving a d-dimensional PDE on a finely discretized grid with, say, \(2^n\) discretization points per dimension is usually infeasible (it requires a total of \(\mathcal {O}(2^{dn})\) points, leading to the so-called curse of dimensionality), one solves the PDE multiple times, each time on an anisotropic grid of smaller resolution. The different solutions are then combined with appropriate weights in order to approximate the fine-grid solution. This algorithm has been shown to scale to more than \(180\,000\) cores [12] and has since been made fault tolerant (the Fault Tolerant Combination Technique, FTCT) [10]. This makes it a promising candidate for an exascale-ready algorithm, and in this paper we support this claim. In particular, we introduce our fault-simulation layer; we present an efficient parallelization strategy for the FTCT and apply it to the well-established PDE toolbox DUNE [2]; we show that the overhead to ensure fault tolerance on a full supercomputer is very small, since it requires neither checkpointing nor process replication; and we show the very first results for a fault-tolerant solution of a higher-dimensional PDE on a massively parallel system: the time-dependent advection-diffusion equation in 5 dimensions.

In previous work we applied the FTCT to the plasma physics code GENE [15], which solves the (5+1)D gyrokinetic equations; we carried out serial tests with simulated faults and gave initial estimates of the computational costs [17]. The authors in [19] carried out parallel tests in 2 and 3 dimensions, but only with up to about 3000 cores. Our parallelization strategy extends to any dimension and performs well on a full supercomputer, a significant improvement over the state of the art.

2 The Sparse Grid Combination Technique

Consider a function \(u(\varvec{x}) \in V \subset C([0,1]^d)\), where \(\varvec{x} = (x_1,\dots ,x_d)\). This function could represent the solution of a PDE, which we would like to approximate via a discrete function \(u_{\varvec{i}}(\varvec{x}) \in V_{\varvec{i}}\subset V\), where the multi-index \(\varvec{i} \in \mathbb {N}^d\) denotes the discretization level. The SGCT (introduced by Griebel et al. in the nineties [8]) mitigates the curse of dimensionality. In its most general form it can be stated as

$$\begin{aligned} u_{\varvec{n}} \approx u_{\varvec{n}}^{(c)} = \sum _{\varvec{i} \in \mathcal I} c_{\varvec{i}} u_{\varvec{i}}. \end{aligned}$$
(1)

We call the weights \(c_{\varvec{i}}\) combination coefficients; each \(u_{\varvec{i}}\) is a solution of the PDE on a coarse, anisotropic regular grid \(\varOmega _{\varvec{i}} := \varOmega _{i_1}\times \cdots \times \varOmega _{i_d}\), which has mesh size \(h_{\varvec{i}}:=(h_{i_1},\dots ,h_{i_d}) := 2^{-\varvec{i}}\), i.e., \(h_{i_j} = 2^{-i_j}\). Each \(u_{\varvec{i}}\) is called a component solution, the corresponding grids \(\varOmega _{\varvec{i}}\) are called component grids, \(u_{\varvec{n}}^{(c)}\) is the combined solution, and the grid \(\varOmega _{\varvec{n}}^{(c)}\), on which the combined solution lives, is the combined grid. The set \(\mathcal {I}\) is a set of multi-indices that defines over which grids the combination is performed. Only certain choices of the index set give combinations that properly approximate \(u_{\varvec{n}}\). For instance, the classical combination technique is given by

$$\begin{aligned} u_{\varvec{n}}^{(c)} = \sum _{q=0}^{d-1}(-1)^q \begin{pmatrix} d-1 \\ q \end{pmatrix} \sum _{\Vert \varvec{i} \Vert _1 = n+(d-1)-q} u_{\varvec{i}}. \end{aligned}$$
(2)
Fig. 1. The classical combination technique in 2D for \(\varvec{n}=(3,3)\) (\(n=3\)).

An example in 2D can be seen in Fig. 1 with the choice \(\varvec{n}=(3,3)\) (\(n=3\)). The five component solutions \(u_{\varvec{i}}\) are weighted with coefficients \(c_{\varvec{i}} = \pm 1\). The grids depicted have boundary points in both dimensions (\(2^{i_j}+1\) points in dimension j), but one can also leave out the boundary points in a given dimension. Each of these grids has only \(\mathcal {O}(h_n^{-1})\) discretization points, and there are \(\mathcal {O}(d(\log h_n^{-1})^{d-1})\) such grids [18]. The crucial advantage of the combination technique is that the solutions on the component grids are independent of each other and can therefore be computed in parallel. The results then have to be combined in a reduction step, either once at the end of the computation or, for time-dependent problems, after a certain number of time steps.
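Instantiating (2) for the setting of Fig. 1 (\(d=2\), \(n=3\)) makes the five weighted component solutions explicit:

$$\begin{aligned} u_{(3,3)}^{(c)} = \left( u_{(1,3)} + u_{(2,2)} + u_{(3,1)} \right) - \left( u_{(1,2)} + u_{(2,1)} \right) , \end{aligned}$$

i.e., the grids on the diagonal \(\Vert \varvec{i} \Vert _1 = 4\) enter with coefficient \(+1\) and those on the diagonal \(\Vert \varvec{i} \Vert _1 = 3\) with coefficient \(-1\).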

In general, the combination coefficients in (1) are calculated using the formula

$$\begin{aligned} c_{\varvec{i}} = \sum _{\varvec{i} \le \varvec{j} \le \varvec{i}+\varvec{1}} (-1)^{|\varvec{j}-\varvec{i}|} \, \chi _{\mathcal {I}}(\varvec{j}), \end{aligned}$$
(3)

with \(\varvec{i} \ge \varvec{1}\), where \(\chi _{{}_\mathcal {I}}\) is the indicator function of set \(\mathcal {I}\) [9]. This is an application of the inclusion/exclusion principle if the set \(\mathcal {I}\) is viewed as a lattice.
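To illustrate (3), the following is a minimal C++ sketch (not the interface of our actual framework) that evaluates \(c_{\varvec{i}}\) by enumerating the \(2^d\) upper neighbors \(\varvec{j}\) of \(\varvec{i}\); the type MultiIndex and the set representation of \(\mathcal {I}\) are illustrative assumptions.

#include <set>
#include <vector>

using MultiIndex = std::vector<int>;

// Combination coefficient c_i by inclusion/exclusion, Eq. (3):
// sum over all j with i <= j <= i+1 of (-1)^{|j-i|_1} * chi_I(j).
int combinationCoefficient(const MultiIndex& i,
                           const std::set<MultiIndex>& I) {
  const int d = static_cast<int>(i.size());
  int c = 0;
  // Enumerate the 2^d upper neighbors j = i + b, b in {0,1}^d.
  for (int b = 0; b < (1 << d); ++b) {
    MultiIndex j = i;
    int dist = 0;  // |j - i|_1
    for (int k = 0; k < d; ++k) {
      if (b & (1 << k)) { ++j[k]; ++dist; }
    }
    if (I.count(j)) c += (dist % 2 == 0) ? 1 : -1;
  }
  return c;
}

For the truncated example in the next paragraph, \(\mathcal {I} = \{(2,2),(2,3),(3,2)\}\) yields \(c_{(2,2)} = -1\) and \(c_{(2,3)} = c_{(3,2)} = 1\), in agreement with the stated combination.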

Since the most anisotropic grids in the combination technique can introduce some instabilities, one can truncate the combination by defining a minimum level of resolution \(\varvec{i}_{\text {min}}\) s.t. \(\varvec{i}_{\text {min}}\le \varvec{i} \le \varvec{n},\,\,\, \forall \varvec{i} \in \mathcal {I}\). For example, the choice \(\varvec{n} = (3,3)\) and \(\varvec{i}_{\text {min}}=(2,2)\) results in the combination \(u_{(3,3)}^{(c)} = u_{(2,3)} + u_{(3,2)} - u_{(2,2)}\).

The combined solution is usually represented in the sparse grid space. In order to combine the solutions \(u_{\varvec{i}}\) efficiently, they first have to be transformed from their nodal basis representation into the hierarchical basis of the sparse grid by hierarchization [12]. In the sparse grid space they can be combined simply by adding up the (weighted) hierarchical surpluses at all grid points. The inverse operation (going from the hierarchical back to the nodal basis) is called dehierarchization.
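As an illustration of hierarchization, here is a minimal sketch of the one-dimensional pass for piecewise-linear hat functions on \(2^n+1\) points (boundary included); this is the textbook serial version, not our distributed implementation from [12]. In d dimensions, the pass is applied successively along each dimension.

#include <vector>

// In-place 1D hierarchization of nodal values u on a grid with
// 2^n + 1 points (boundary included). After the call, u holds the
// hierarchical surpluses of the piecewise-linear hat basis.
void hierarchize1D(std::vector<double>& u, int n) {
  for (int l = n; l >= 1; --l) {
    const int stride = 1 << (n - l);  // distance to the parent points
    const int last = static_cast<int>(u.size()) - 1;
    // Points introduced on level l sit at odd multiples of stride.
    for (int pos = stride; pos < last; pos += 2 * stride) {
      u[pos] -= 0.5 * (u[pos - stride] + u[pos + stride]);
    }
  }
}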

These are all the components required to implement the combination technique in serial. A parallel (and scalable) combination technique presents several challenges, many of which we have addressed in previous work. We have developed a framework to run the combination technique on a massively parallel system, which we now briefly describe.

3 A Software Framework for a Massively Parallel Combination Technique

The combination technique offers two levels of parallelism: first, all component solutions \(u_{\varvec{i}}\) can be computed independently of each other; second, each \(u_{\varvec{i}}\) can be computed with an existing parallel solver. Nevertheless, the parallel combination technique is not straightforward to implement, especially for the time-dependent problems we are most interested in. The basic workflow is given in Algorithm 1 (sketched below).

Algorithm 1. Basic workflow of the parallel combination technique.
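Since the listing itself is not reproduced here, the following self-contained C++ sketch outlines the workflow it describes; all function and type names are ours, and in the original listing the combine call corresponds to line 10.

#include <vector>

struct Task { /* component grid u_i and its solver state */ };

// Stubs standing in for the framework's operations (names are ours):
void solve(Task&, int steps) {}        // advance the PDE solver in time
void hierarchize(Task&) {}             // nodal -> hierarchical basis
void combine(std::vector<Task>&) {}    // global weighted reduction of surpluses
void dehierarchize(Task&) {}           // hierarchical -> nodal basis

// Sketch of Algorithm 1 for a time-dependent PDE; each process group
// executes this loop on the tasks it has been assigned.
void runCombinationTechnique(std::vector<Task>& tasks,
                             int numCombinations, int stepsPerCombination) {
  for (int c = 0; c < numCombinations; ++c) {
    for (Task& t : tasks) solve(t, stepsPerCombination);
    for (Task& t : tasks) hierarchize(t);
    combine(tasks);  // corresponds to line 10 of the original listing
    for (Task& t : tasks) dehierarchize(t);
  }
}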
Fig. 2. Manager-worker model. In this example, 11 tasks are distributed among 4 groups of processes, each composed of 2 nodes, with 4 processes per node. A task corresponds to a component grid distributed over all processes of a group.

We need three ingredients to make the combination technique scale: (i) a good parallelization strategy; (ii) a suitable load balancing scheme to distribute the workload properly; and (iii) efficient algorithms for distributed systems for the hierarchization/dehierarchization and to combine the component solutions.

(i) Parallelization Strategy: A Manager-Worker Model. To parallelize the combination technique we use the manager-worker model depicted in Fig. 2. First, we create groups of MPI processes, where each group may comprise several compute nodes. A manager process generates a list of tasks – in this case, the list of component solutions \(u_{\varvec{i}}\) to be computed – and distributes them evenly among the groups. Each group in turn has a master process that coordinates the work within its group and communicates with the manager. Each group computes all the component grids it has been assigned, one after the other, asynchronously and independently of the rest of the groups. Almost all of the MPI communication happens within the groups; only the combination step requires inter-group communication. This framework is implemented within the sparse grid library SG\(^{++}\) [1]. A more detailed description can be found in [13].
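A minimal sketch of how such a layout can be set up with plain MPI; this mirrors Fig. 2 but is not the actual SG\(^{++}\) code, and the choice of manager rank and group size is illustrative.

#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int groupSize = 4;                 // e.g., read from a config file
  const bool isManager = (rank == size - 1);

  // Workers are grouped by color; the manager takes no group communicator.
  int color = isManager ? MPI_UNDEFINED : rank / groupSize;
  MPI_Comm groupComm;
  MPI_Comm_split(MPI_COMM_WORLD, color, rank, &groupComm);

  if (!isManager) {
    int groupRank;
    MPI_Comm_rank(groupComm, &groupRank);
    const bool isMaster = (groupRank == 0);  // talks to the manager
    // ... compute assigned component grids, combine, etc.
    (void)isMaster;
    MPI_Comm_free(&groupComm);
  }
  MPI_Finalize();
}

Splitting by color gives each group its own communicator, so that solver-internal communication never leaves the group.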

(ii) Load Balancing. The different component solutions require different amounts of time to compute, and the manager has to take this into account when distributing the tasks. We have developed a load balancing scheme that estimates the runtime of each \(u_{\varvec{i}}\) based on its number of grid points and its degree of anisotropy [11].
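A possible shape of such a cost model is sketched below; the actual model in [11] is more elaborate, and the anisotropy penalty used here is purely illustrative.

#include <algorithm>
#include <vector>

// Illustrative runtime estimate for a component grid with level
// vector i: proportional to the number of grid points, with a
// hypothetical penalty for strongly anisotropic grids.
double estimateCost(const std::vector<int>& i) {
  double points = 1.0;
  int lmax = 0, lmin = 1 << 30;
  for (int l : i) {
    points *= (1 << l) + 1;  // 2^l + 1 points per dimension (with boundary)
    lmax = std::max(lmax, l);
    lmin = std::min(lmin, l);
  }
  const double anisotropyPenalty = 1.0 + 0.1 * (lmax - lmin);  // assumed
  return points * anisotropyPenalty;
}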

(iii) Efficient and Scalable Algorithms for Distributed Systems. The combination step and the hierarchization/dehierarchization steps require inter-node communication. Keeping this overhead small is crucial for the parallel combination technique. In previous work we have exploited the hierarchical structure of sparse grids to optimize the combination step [14]. The resulting algorithm outperforms non-hierarchical algorithms by orders of magnitude. We have also adapted the hierarchization algorithm to scale well in distributed systems [12]. Furthermore, we have presented a massively parallel implementation of the combination step, which scales up to more than \(180\,000\) cores in [13].

These three ingredients are sufficient to ensure scalability with the combination technique, but the algorithm would crash in the presence of faults. There is, however, a well-understood version of the combination technique that can tolerate faults. In the following, we describe the principle behind it and the implementation in our framework.

4 The Fault-Tolerant Combination Technique

Existing petascale systems can experience faults at various levels, from outright node failure to I/O network malfunctions or software errors [5]. In this work we focus on hardware failures at the node level, which are already quite frequent and will become more common as we approach exascale [5]. What effect do faults have on the combination technique (Algorithm 1)? The answer partly depends on when the faults occur. Recall that there are four main operations in the algorithm: solve, hierarchize, combine, and dehierarchize. In previous work we have shown that most of the computing time is spent in the actual PDE solver, even in the worst-case scenario where we combine after every time step [13]. It is therefore reasonable to assume that faults are most likely to occur during the solver call.

Fig. 3. The fault-tolerant combination technique in 2D with two faults, and the resulting alternative combination.

Looking back at our parallelization scheme (Fig. 2), we can see what happens if one or multiple nodes fail: the component grids assigned to the corresponding process groups are lost, at least for the number of time steps being simulated at the time of the fault. We avoid the need for checkpointing by applying the Fault Tolerant Combination Technique (FTCT) [10], an algorithmic approach to recover the combined solution in the presence of faults. The FTCT is illustrated in Fig. 3. If faults occur, the component solutions that are lost are given a combination coefficient of zero, and an alternative combination is computed. The new combination avoids the lost solutions, at the price of a sparse grid with fewer grid points. It is thus a lossy approach, but as we will show in Sect. 6, the losses are very small. Furthermore, some additional component grids that are not originally needed for the combination (e.g., solution \(u_{(1,4)}\) in the figure) have to be computed: they are required for the recovery step. In rare cases, some of the coarsest component solutions have to be recomputed. Nevertheless, the total overhead is very small. For more details we refer the reader to [10, 17].
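As a small worked example (in the setting of Fig. 1 rather than Fig. 3): if \(u_{(2,2)}\) is lost from the classical 2D combination with \(n=3\), removing \((2,2)\) from the index set and re-evaluating the coefficients via (3) yields the alternative combination

$$\begin{aligned} u_{(3,3)}^{(c)} = u_{(1,3)} + u_{(3,1)} - u_{(1,1)}, \end{aligned}$$

where \(u_{(1,2)}\) and \(u_{(2,1)}\) receive coefficient zero and the coarse solution \(u_{(1,1)}\) – one of the additional grids kept for recovery – enters with coefficient \(-1\).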

4.1 Implementation of the FTCT

Our implementation is based on the ULFM specification [3], currently the most mature specification of a fault-tolerant MPI. It adds functionality to the MPI standard that makes it possible to detect crashed processes and to exclude these processes from future communication.

A process group can be detected as failed in two ways. First, the master process detects that a process in its own group has failed and notifies the manager. This is implemented by an MPI_Barrier on the group, which returns an error code if a process of the group has failed. Second, if a master process itself crashes, this is detected by the manager: the manager waits for a message from the master process of each group, which signals whether the computation was successful or whether a process of the group failed. This is implemented with MPI_Wait, which returns an error code if the master process has failed. If the application code is not prepared to handle errors in MPI calls (e.g., it does not have a predefined exit strategy), it might block forever after a process fails. We would deal with such cases by using time-outs in the manager process: if a group does not finish its computations within reasonable time (relative to the other groups), it is marked as failed. However, we have not yet implemented such a procedure.
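A minimal sketch of the first detection path (function naming is ours; in the ULFM reference implementation the fault tolerance additions carry an MPIX_ prefix):

#include <mpi.h>

// The master posts a barrier on its group communicator; under a
// fault-tolerant MPI, a process failure in the group surfaces as an
// error code instead of crashing the application.
bool groupHasFailed(MPI_Comm groupComm) {
  // Errors must be returned to the caller rather than aborting:
  MPI_Comm_set_errhandler(groupComm, MPI_ERRORS_RETURN);
  int err = MPI_Barrier(groupComm);
  return err != MPI_SUCCESS;  // e.g., MPI_ERR_PROC_FAILED under ULFM
}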

After a fault has been detected, the following recovery steps are performed:

1. Create new MPI communicators that exclude the process ranks of the whole failed group, using the functions MPI_Comm_revoke and MPI_Comm_shrink (a minimal sketch of this step follows the list).

2. Compute new combination coefficients to correct the combined solution according to the FTCT algorithm.

3. Redistribute the tasks of the failed group to the living groups. Perform initialization routines if necessary (e.g., set up data structures).

4. If some of the tasks have to be recomputed, initialize them with the last combined solution and compute them for the required number of time steps. The last combined solution is still available on all surviving process groups from the last combination step (essentially an in-memory checkpoint of the combined solution that we get “for free” at each combination step).
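The communicator repair of step 1, sketched with the function names used above (the ULFM prototype exposes them as MPIX_Comm_revoke and MPIX_Comm_shrink; error handling is omitted):

#include <mpi.h>

// All surviving processes call this after a failure has been detected.
void excludeFailedGroup(MPI_Comm& globalComm) {
  // Invalidate the communicator so that no survivor blocks forever on
  // pending communication with the failed processes ...
  MPI_Comm_revoke(globalComm);
  // ... then derive a new, smaller communicator from the survivors.
  MPI_Comm shrunk;
  MPI_Comm_shrink(globalComm, &shrunk);
  globalComm = shrunk;  // continue all further communication on this one
}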

The corrected combined solution is computed during the next combination step (line 10 in Algorithm 1). The lost tasks that were not recomputed have a combination coefficient of zero and do not contribute to the combined solution. Afterwards, the combination coefficients are reset to the original coefficients (for the next combination step) and Algorithm 1 resumes normally. We only use the reduced set of component grids to recover the combined solution at the current time step; we continue the computation with the full original set of component grids.

Our algorithm can tolerate any number of faults, as long as at least one healthy group is left to continue the computations. In this work we only detect and correct process failures during the computation phase, but the same procedure could be used for failures during the combination step. To protect the algorithm against the very unlikely event of the manager process failing, one could use a replication strategy, e.g., a second manager process on a different node.

4.2 Fault Simulation Layer

For our experiments we did not use an actual fault-tolerant MPI implementation, because none was readily available on the HPC systems we had access to. Instead, we implemented a home-grown fault simulation layer between our framework and the actual MPI system, which can simulate failed processes. This layer extends the MPI system by borrowing some functions of the ULFM interface and emulating the external behaviour of ULFM, e.g., returning appropriate error codes. Our internal implementation can differ from ULFM, since we do not handle actual process faults, only simulated ones. The functions we extended include the most common point-to-point and collective operations, as well as some of the new fault tolerance functions, such as MPI_Comm_shrink and MPI_Comm_revoke. With the fault simulation layer we can let processes crash virtually. When a process calls a function named kill_me() at any point in the code, it stops its normal operation and goes idle. From then on it only handles background operations of the fault simulation layer, and it can be detected as failed by other processes. If, for example, this process was the destination of an MPI_Send, or if it participated in a collective communication such as MPI_Allreduce, these functions would (according to the specification) return on the other processes with the error code MPI_ERR_PROC_FAILED. Details on the fault simulation layer can be found in [20].
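To give a flavour of the interposition idea, here is a deliberately simplified sketch; the real layer [20] covers many more functions, propagates failure information between processes in the background, and handles collectives.

#include <mpi.h>
#include <set>

namespace faultsim {
// ULFM defines MPI_ERR_PROC_FAILED; outside ULFM we mimic it with a
// stand-in error code of our own.
const int ERR_PROC_FAILED = MPI_ERR_OTHER;

std::set<int> deadRanks;  // ranks that called kill_me()

// Simulate the crash of the calling process. In the real layer the
// failure is made visible to all other processes in the background.
void kill_me(int myRank) { deadRanks.insert(myRank); }

// Wrapped point-to-point send: fails like ULFM if the destination is
// virtually dead, otherwise forwards to the underlying MPI library.
int Send(const void* buf, int count, MPI_Datatype type, int dest,
         int tag, MPI_Comm comm) {
  if (deadRanks.count(dest)) return ERR_PROC_FAILED;
  return MPI_Send(buf, count, type, dest, tag, comm);
}
}  // namespace faultsim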

Of course, the fault simulation layer does not provide representative results for the runtime or scalability of the new fault tolerance functions, e.g., MPI_Comm_shrink. However, we do not expect these functions to add significant overhead to our FTCT algorithm, since their contribution to the overall runtime is very low. In contrast, if we used a preliminary FT-MPI implementation (e.g., ULFM, which we expect to be significantly slower than the vendor-specific implementations), we would expect a noticeable setback in the performance and scalability of our algorithms, since they all rely heavily on fast communication. The fault simulation layer does come with a certain overhead, but it has the advantage that we can choose, for a given MPI function, between its fault-tolerant version and its standard version. By using the fault-tolerant version only where it is really necessary (e.g., for detecting faults and shrinking the communicators), but not in the application code or in performance-critical parts, the impact on performance is negligible.

Assuming that a future FT-MPI implementation provided by the system vendor will be comparably fast to existing implementations, we can obtain realistic numbers for the overhead of the FTCT on current HPC systems for large numbers of cores. This is a major goal of this work. In the future we plan to compare the performance of the FTCT using ULFM with our current approach in order to quantify these arguments. A further advantage of the fault simulation layer is its portability: since it runs out of the box with any standard MPI implementation, we can easily repeat our experiments on a different HPC system without worrying about installing and running ULFM.

5 Experimental Setup

Our test problem is the d-dimensional advection-diffusion equation

$$\begin{aligned} \partial _t u - \varDelta u + \varvec{a} \cdot \nabla u = f&\quad \text {in } \varOmega \times [0, T) \\ u(\cdot ,t) = 0&\quad \text {on } \partial \varOmega \times [0, T) \nonumber \end{aligned}$$
(4)

with \(\varOmega = [0,1]^d\), \(\varvec{a} = (1,1,\dots ,1)^T\) and initial condition \(u(\cdot ,0) = e^{-100 \sum \nolimits _{i=1}^d (x_i-0.5)^2 }\), implemented in the PDE framework DUNE-pdelab. For the spatial discretization we use the finite volume element (FVE) method on rectangular d-dimensional grids, and a simple explicit Euler scheme for the time integration. Future work will include a geometric multigrid solver that works with regular anisotropic grids. We chose this problem because its numerics are well understood and it allows us to investigate our algorithms in arbitrary dimensionality.
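For reference, one time step of the explicit Euler scheme applied to (4) on a component grid reads (the standard scheme, stated here for completeness):

$$\begin{aligned} u^{k+1} = u^{k} + \varDelta t \left( \varDelta _h u^{k} - \varvec{a} \cdot \nabla _h u^{k} + f \right) , \end{aligned}$$

where \(\varDelta _h\) and \(\nabla _h\) denote the discrete operators of the FVE discretization, and \(\varDelta t\) is subject to the usual stability restriction of explicit schemes.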

The solve function in Algorithm 1 corresponds to computing a component solution \(u_{\varvec{i}}\) with DUNE for one time step \(\varDelta t\) in parallel, using all processes of the corresponding process group. As DUNE uses its own data structures, after each call to DUNE the data has to be copied and converted to the data structure of the component grid. After the combination step, the data is copied back into DUNE to set the new initial values for the next time step. For our experiments we use the same time step size for all component grids and combine after each time step. We simulate process failures by calling the kill_me function of our fault simulation layer; the iteration at which the process fails and the rank of the failing process are defined a priori in a parameter file.

6 Results

6.1 Convergence

The theoretical convergence of the combination technique (and of its fault-tolerant version) has been studied before [8, 10], and our experiments confirm previous results. In all our convergence experiments we vary the number of process groups and let one random group fail, so the percentage of failed tasks varies accordingly. Which group fails makes almost no difference, since the tasks are distributed homogeneously among the groups; we therefore show the results of a single simulation run. Additionally, we did not observe a noticeable difference in the quality of the solution when varying the time step at which the fault occurs. The results shown correspond to a fault simulated during the second iteration (so that all tasks are already assigned to the groups). We compute the relative \(l_2\)-error \(e = \frac{ \Vert u_{\varvec{n}}^{(c)} - u_{\text {ref}}\Vert _2 }{ \Vert u_{\text {ref}}\Vert _2 }\) at the end of the simulation (\(t=0.10\) and \(\varDelta t = 10^{-4}\) in 2D; \(t=0.05\) and \(\varDelta t = 10^{-3}\) in 5D), interpolating each component solution to the resolution of the reference grid. We combine after every time step.

Figure 4 (left) shows the convergence of the combination technique in 2D with \(\varvec{i}_{\text {min}}=(3,3)\) and increasing \(\varvec{n}\), compared to a full grid reference solution of level \(\varvec{n} = (11,11)\). The recovered combination technique with faults is only minimally worse than without faults (\(1\%\)–\(3\%\) worse), even when half of the tasks fail. The difference is more visible in 5D, Fig. 4 (right), where we chose \(\varvec{i}_{\text {min}}= (3,3,3,3,3)\) and a reference solution of level \(\varvec{n} = (6,6,6,6,6)\). To the best of our knowledge, this is the first fully fault-tolerant convergence experiment in 5 dimensions with the combination technique. Figure 5 (left) shows a one-dimensional projection of the combined solution at \(t=0.05\) for \(\varvec{n} = (5,5,5,5,5)\), computed on four groups; the solutions with and without faults match excellently.

Fig. 4. Convergence of the advection-diffusion equation using the combination technique, with and without faults. A single process fault causes an entire group to fail. Left: 2D case. Right: 5D case.

Fig. 5. Left: One-dimensional projection of the 5D solution with and without faults for \(\varvec{i}_{\text {min}}=(3,3,3,3,3)\) and \(\varvec{n} = (5,5,5,5,5)\). Right: 5D scaling experiments. The number in the legend indicates the number of process groups.

6.2 Scalability

We performed scaling experiments on the supercomputer Hazel Hen to investigate the overhead of the FTCT in a massively parallel setup. We used a 5D combination technique with \(\varvec{n} = (8,8,7,7,7)\) and \(\varvec{i}_{\text {min}}=(4,4,3,3,3)\), resulting in 126 component grids. Note that a computation on the full grid \(\varOmega _{\varvec{n}}\) would no longer be feasible – not even on the full machine. For the parallelization we used 8192, 16384, 32768, and 65536 processes, distributed over 8, 16, or 32 process groups of size 512, 1024, 2048, or 4096. In all cases one process failed in the second iteration and the entire corresponding group was removed. For a given number of groups, the failing process was always in the same group.

Figure 5 (right) shows the time to solve all the component grids for one time step (using all groups, before the fault occurs), the time to redistribute the component grids of the failed group, and the time to recompute tasks where necessary. Our application code has rather poor node-level performance, but it scales well when the size of the process groups is increased. The time for the combination can be neglected in these experiments, since it was below one second in all cases. Two factors cause our curves to look slightly erratic. First, we show results for only one experiment per configuration, due to the long time and large computing resources required for each simulation, which makes statistical studies infeasible. Second, there is some degree of randomness in the assignment of the tasks to the process groups, so even when the same group fails, the tasks to be redistributed or recomputed can vary from run to run. The time to redistribute a task is essentially the time its initialization routine takes. For 16 and 32 groups the number of tasks to be redistributed is lower than the number of groups, so the time to redistribute is dominated by the slowest task. Furthermore, in our case the initialization function does not scale with the number of processes. This explains why the time to redistribute did not always decrease.

It is not easy to specify the exact overhead of the FTCT, since it depends on various parameters, such as the expected number of time steps between two failures, the number of time steps computed per solve step, and the ratio between the initialization time and the cost of one time step. However, we can easily formulate upper bounds for the two most costly steps of the recovery process: the redistribution step can never take longer than initializing all tasks, and the recompute step can never take longer than the solve step. Hence, if the cost of initialization is small compared to the total amount of work done between two process failures, the overhead of the FTCT is negligible. This is the main conclusion to be drawn from our experiments.
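In our notation (with \(T_{\text {init}}(u_{\varvec{i}})\) denoting the initialization time of task \(u_{\varvec{i}}\)), these bounds read

$$\begin{aligned} T_{\text {redistribute}} \le \sum _{\varvec{i} \in \mathcal {I}} T_{\text {init}}(u_{\varvec{i}}), \qquad T_{\text {recompute}} \le T_{\text {solve}}. \end{aligned}$$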

After the recovery, the time for the solve step increases, since fewer process groups are available. In the future we plan to mitigate this problem by not removing the whole process group, but instead shrinking it to a smaller size. The optimal case for our algorithm would be an MPI system that allows requesting new MPI processes at runtime after a node or process failure has happened.

7 Conclusions

Our parallelization strategy for the combination technique can handle hard faults on a large-scale supercomputer at the algorithmic level with only a small overhead. The accuracy of the fault-tolerant combination technique remains very close to that of the original combination technique after faults occur. In low dimensions this had been observed before; we were able to show that it also holds in a higher-dimensional setting. In the future we plan to repeat our experiments with an actual fault-tolerant MPI implementation, explore ways to exclude single processes instead of whole groups, and investigate the effect of silent data corruption in addition to hard faults.