1 Introduction

We design numerical schemes and software packages for direct parallel solvers of large sparse linear systems. Sparse direct solvers are mandatory, for example, when the linear system is very ill-conditioned [5]. An industrial software tool must be robust and versatile, which calls for high-performance sparse direct solvers, and parallelism is then necessary for reasons of memory capacity and acceptable solution time. Moreover, in order to solve efficiently 3D problems with several million unknowns, which is now a reachable challenge with modern supercomputers, we must achieve good scalability in time and control the memory overhead. Solving a sparse linear system by a direct method is a highly irregular problem that raises challenging algorithmic issues and requires a sophisticated implementation scheme in order to fully exploit the capabilities of modern supercomputers.

There are two main approaches in direct solvers: the multifrontal approach [2, 7], and the supernodal one [11, 18]. Both can be described by a computational tree whose nodes represent computations and whose edges represent transfers of data. In the case of the multifrontal method, at each node, some steps of Gaussian elimination are performed on a dense frontal matrix and the remaining Schur complement, or contribution block, is passed to the parent node for assembly. In the case of the supernodal method, the distributed memory version uses a right-looking formulation which, having computed the factorization of a supernode corresponding to a node of the tree, immediately sends the data to update the supernodes corresponding to ancestors in the tree. In a parallel context, contributions to the same block can be aggregated locally before being sent, which can significantly reduce the number of messages. Independently of these methods, a static or dynamic scheduling of block computations can be used. For homogeneous parallel architectures, it is useful to find an efficient static scheduling.

In order to achieve an efficient parallel sparse factorization, we perform three sequential preprocessing phases:

  1. The ordering step, which computes a symmetric permutation of the initial matrix such that the factorization process will exhibit as much concurrency as possible while incurring low fill-in.

  2. The block symbolic factorization step, which determines the block data structure of the factorized matrix associated with the partition resulting from the ordering phase. From this block structure, one can deduce the weighted elimination quotient graph that describes all dependencies between column-blocks, as well as the supernodal elimination tree.

  3. The block scheduling/mapping step, which consists in mapping the resulting blocks onto the processors. During this mapping phase, a static optimized scheduling of the computational and communication tasks, according to models calibrated for the target machine, can be computed.

When these preprocessing phases are done, the computation on the actual data, that is, the numerical factorization, can start.

The optimization problem that needs to be solved at the scheduling/mapping stage is known to be NP-hard, and is usually solved using a proportional mapping heuristic [16]. This mono-constraint heuristic induces idle times during the numerical factorization. In this paper, we extend the proportional mapping and scheduling heuristic to reduce these idle times. We first detail in Sect. 2 the proportional mapping heuristic, its issues, and related work, before describing the original application in the context of the PaStiX solver [12] in Sect. 3. Then, in Sect. 4, we explain the proposed solution before studying its impact on a large set of test cases in Sect. 5. Conclusions and future work directions are presented in Sect. 6.

2 Problem Statement and Related Work

Among the different mapping strategies used by both supernodal and multifrontal sparse direct solvers, the subtree-to-subcube mapping [8] and the proportional mapping [16] are the most popular. These approaches are tree partitioning techniques, where the set of resources mapped on a node of the tree is split into disjoint subsets, each mapped to a child subtree.

The proportional mapping method performs a top-down traversal of the elimination tree, during which each node is assigned a set of computational resources. All the resources are assigned to the root node, which performs the last task. Then, the resources are split recursively following a balancing criterion: the set of resources dedicated to a node is split among its children, proportionally to their weight or to any other balancing criterion. This recursive process ends at the leaves of the tree, or when entire subtrees are mapped onto a single resource.

The original version of the proportional mapping [16] computes the splitting of resources depending on the workload of each subtree, but more sophisticated metrics can also be used. In [17], a scheduling strategy was proposed for tree-shaped task graphs, where the time to compute a parallel task (for instance the root node of the elimination tree) is considered proportional to the length of the task and to a given parallel efficiency. This method was proven efficient in [3] for a multifrontal solver. The proportional mapping technique is widely used because it helps reduce the volume of data transfers thanks to its data locality. In addition, it allows us to exhibit both tree and node parallelism.

Note that alternative solutions to the proportional mapping have been proposed, such as the 2D block-cyclic distribution of SuperLU [14], or the 1D cyclic distribution of symPACK [13]. In the latter, the lack of load balancing in the mapping is compensated by a complex and advanced communication scheme that balances the computations within the nodes to obtain good performance out of this mapping strategy.

As stated earlier, sparse direct solvers commonly use the proportional mapping heuristic to distribute supernodes (full sets of columns sharing the same row pattern, i.e., a 1D distribution) onto the processors. Note that each supernode can be split into smaller nodes to increase the level of parallelism, which modifies the original supernode tree structure as shown in Fig. 2. This heuristic provides a set of candidate processors for each supernode, which is then refined dynamically when going up the tree, as in MUMPS [1] or PaStiX [12], with a simulation stage that assigns a single processor among the candidates while providing a static optimized scheduling. The proportional mapping stage, by its construction, may however introduce idle time in the scheduling, as illustrated in Fig. 1. The ten candidate processors of the root supernode are distributed among its two children of respective weights 4 and 6. The Gantt diagram points out the issue of using a single-criterion heuristic to set the mapping: no work is given to processor \(p_9\) due to the low level of parallelism of the right supernode, whereas it could benefit the left supernode.

Fig. 1. Illustration of proportional mapping: elimination tree on the left, and Gantt diagram on the right.

A naive way to handle this issue is to avoid the proportional mapping stage, and to consider only the scheduling stage with all processors as candidates for each node of the tree. The drawback of this method is that 1) it does not preserve data locality, and 2) it drastically increases the complexity of the scheduling step. This solution has been implemented in the PaStiX solver for comparison, and it is referred to as All2All, since all processors are candidates for all nodes.

3 Description of the Application

At a coarse-grain level, the computation can be viewed as a tree T whose nodes (or vertices) represent supernodes of the matrix, and whose dependencies are directed towards the root of the tree. Because sparse matrices usually represent physical constraints, and thanks to the nested dissection used to order the matrix, supernodes at the bottom of the tree are usually small while supernodes at the top are much larger. Each supernode is itself a small DAG (Directed Acyclic Graph) of tasks, as illustrated in Fig. 2. A more refined view shows that the dependencies between two supernodes consist of dependencies between tasks of these supernodes. Another way to put it is that the computation is described as a DAG of tasks, the tasks are partitioned into supernodes, and the quotient graph of supernodes is the tree T (with some transitivity edges). Note that with the 1D distribution targeted here, the DAG within a supernode can also be seen as a tree with dependencies directed toward its root. Thus, in this paper, we use either nodes or supernodes to denote the vertices of the tree T, as they can be used interchangeably.

This two-level structure allows us to reduce the cost of the analysis stage by considering only the first level (supernodes), while increasing the degree of parallelism during the numerical factorization thanks to the finer-grain computations of the second level (tasks).
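To make this two-level structure concrete, a minimal toy representation is sketched below in Python. The class and field names are ours and are purely illustrative; they do not correspond to the actual PaStiX data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)  # identity-based hashing, so instances can be used as dict keys
class Task:
    """A fine-grain task (factorization, solve, or update kernel) within a supernode."""
    kernel: str                                        # e.g. "POTRF", "TRSM", "GEMM"
    cost: float                                        # predicted number of operations
    predecessors: List["Task"] = field(default_factory=list)

@dataclass(eq=False)
class Supernode:
    """A vertex of the tree T: a set of columns sharing the same row pattern."""
    tasks: List[Task] = field(default_factory=list)
    children: List["Supernode"] = field(default_factory=list)
    parent: Optional["Supernode"] = None

    @property
    def weight(self) -> float:
        """w_i: total number of operations of all tasks within the node."""
        return sum(t.cost for t in self.tasks)
```

This toy structure is reused by the illustrative sketches in the rest of this section and in Sect. 4.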

Fig. 2. Structure of the computation: tree of supernodes, each supernode being made of several tasks.

We denote by root(T) the node at the root of tree T, and by \(w_i\) the computational weight of node i, for \(1\le i \le n\): this is the total number of operations of all tasks within node i. Also, parent(i) is the parent of node i in the tree (except for the root), and child(i) is the set of children of node i in the tree. Given a subtree \(T_i\) of T (rooted at \(root(T_i)\)), \(W_i = \sum _{j\in T_i} w_j\) is the computational weight of this subtree.
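With the toy representation above, the subtree weight \(W_i\) is simply the recursive sum of the node weights, as the following helper (used by the later sketches) shows:

```python
def subtree_weight(node: Supernode) -> float:
    """W_i: sum of w_j over all nodes j of the subtree rooted at `node`."""
    return node.weight + sum(subtree_weight(c) for c in node.children)
```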

As stated above, each node i of the tree is itself made of \(n_i \ge 1\) tasks \(i_1, \ldots , i_{n_i}\), whose dependencies follow a directed acyclic graph (DAG). Each of these tasks is a linear algebra kernel (such as matrix factorization, triangular solve or matrix product) on block matrices. Hence, given a node i and its parent \(j=parent(i)\) in the tree, only some of the tasks of i need to be completed before j is started, which allows some pipelining in the processing of the tree.

When running on a parallel platform with a set P of p processors, nodes and tasks are distributed among the available processing resources (processors) in order to ensure good load balancing. If node i is executed on \(alloc[i]=k\) processors, its execution time is \(f_i(k)\); this time depends on \(w_i\) and on the structure of the DAG of tasks.
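For instance, a simple illustrative model in the spirit of [17] is \(f_i(k) = w_i / (k\, e_i(k))\), where \(e_i(k) \in (0,1]\) is the parallel efficiency of node i on k processors. This particular formula is only an example and is not the model calibrated in PaStiX; the algorithms below merely require that \(f_i\) be non-increasing in k.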

Following the structure of the application, the mapping is done in two phases: the first phase, detailed in Sect. 3.1, consists in using the Proportional Mapping algorithm  [16] to compute a mapping of nodes to subsets of processors. The second phase, detailed in Sect. 3.2, refines this mapping by allocating each task of a node i to a single processor of the subset allocated to i in the first step.


3.1 Coarse-Grain Load Balancing Using Proportional Mapping

The proportional mapping process follows the sketch of Algorithm 1. First, all processors are allocated to the root of the tree. Then, we compute the total weight of its subtrees (i.e., the sum of the weights of their nodes), and allocate processors to subtrees so that the load is balanced. The same procedure is then applied recursively on each subtree.
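A simplified reconstruction of this recursive splitting is sketched below, reusing the toy structure of Sect. 3. It performs a purely integer split and does not share processors between sibling subtrees; the actual Algorithm 1 in PaStiX handles these refinements (see Algorithm 3 below).

```python
from typing import Dict, List

def proportional_mapping(node: Supernode, procs: List[int],
                         alloc: Dict[Supernode, List[int]]) -> None:
    """Recursively split the processor list `procs` among the children of `node`,
    proportionally to their subtree weights (simplified: integer split, no
    sharing of a processor between sibling subtrees)."""
    alloc[node] = list(procs)
    if not node.children:
        return
    if len(procs) <= len(node.children):
        # too few processors to split further: each remaining subtree goes to one processor
        for i, child in enumerate(node.children):
            proportional_mapping(child, [procs[i % len(procs)]], alloc)
        return
    total = sum(subtree_weight(c) for c in node.children)
    remaining = list(procs)
    for i, child in enumerate(node.children):
        if i == len(node.children) - 1:
            share = len(remaining)                    # the last child takes what is left
        else:
            share = max(1, round(len(procs) * subtree_weight(child) / total))
            share = min(share, len(remaining) - (len(node.children) - 1 - i))
        proportional_mapping(child, remaining[:share], alloc)
        remaining = remaining[share:]
```

Calling proportional_mapping(root, list(range(p)), alloc) on an empty dictionary alloc fills it with the candidate processor list of every node.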

Apart from balancing the load among branches of the tree, the proportional mapping is known for its good data locality: a processor is allocated to the nodes of a single path from a leaf to the root, and only to nodes on this path. Thus, the data produced by a node and used by its ancestors mostly stay on a single processor, and no data transfer is needed except for the necessary redistribution of data in the upper levels of the tree. This is particularly interesting in a distributed context, where communications among processors are costly.

One may wonder whether Algorithm 1 really optimizes the load balance, as subtrees with similar total weight \(W_i\) may exhibit different levels of parallelism, and thus end up with different completion times, as illustrated in the example of Fig. 1. The formula \( W_i/|{P_i}| \) correctly computes the duration of the subtree processing only under perfect parallelism. We propose here another mapping algorithm that optimizes the total computation time under the constraint of perfect data locality. It iteratively adds processors to the root, and recursively to the subtree with the largest completion time (see Algorithm 2). In this mapping algorithm, \( alloc [i]\) represents the number of processors allocated to node i, and \( endTime [i]\) represents the completion time of node i. We assume that the function \(f_i(k)\), which gives the duration of node i on k processors, is non-increasing with k and is known to the algorithm.
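A possible reconstruction of this greedy allocation is given below. The duration model f(node, k) is assumed to be provided and non-increasing in k, as in Theorem 1 below, and the completion-time estimate deliberately ignores the pipelining between a node and its parent; the exact implementation may differ in these details.

```python
import math
from typing import Callable, Dict, Iterator

def all_nodes(node: Supernode) -> Iterator[Supernode]:
    """Enumerate all nodes of the subtree rooted at `node`."""
    yield node
    for c in node.children:
        yield from all_nodes(c)

def greedy_mapping_int(root: Supernode, p: int,
                       f: Callable[[Supernode, int], float]) -> Dict[Supernode, int]:
    """Each of the p processors is assigned to one leaf-to-root path, chosen by
    descending from the root into the child subtree with the largest current
    completion time; alloc[i] is the number of processors allocated to node i."""
    alloc = {n: 0 for n in all_nodes(root)}

    def end_time(node: Supernode) -> float:
        """Completion time of the subtree rooted at `node` under `alloc`
        (infinite as long as some node of the subtree has no processor)."""
        below = max((end_time(c) for c in node.children), default=0.0)
        return math.inf if alloc[node] == 0 else below + f(node, alloc[node])

    for _ in range(p):
        node = root
        while True:
            alloc[node] += 1                          # the new processor joins this node
            if not node.children:
                break
            node = max(node.children, key=end_time)   # descend into the slowest subtree
    return alloc
```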


Theorem 1

The \( GreedyMappingInt \) algorithm (Algorithm 2) computes an allocation with minimum total completion time under the constraint that each processor is only allocated to nodes on a path from a leaf to the root.

Note that this result, proven in the companion research report  [9], does not require a particular speed function for tasks: it is valid when the processing time of a task does not increase with the number of processors allocated to the task.


However, both previous mapping algorithms suffer from a major problem when used in a practical context, because they forbid allocating a processor to more than one child of a node. First, some nodes, especially leaves, have very small weights, and several of them should be mapped on the same processor. Second, allocating integer numbers of processors to nodes creates unbalanced workloads, for example when three processors have to be allocated to two identical subtrees. All implementations of the proportional mapping tackle this problem (including the first one in [16]). For example, the actual implementation in PaStiX, as sketched in Algorithm 3, allows "border processors" to be shared among branches, and keeps track of the occupation of each processor to ensure load balancing. It first computes the total time needed to process the whole tree, and sets the initial availability time of each processor to an equal share of this total time. Whenever some (fraction of a) node is allocated to a processor, its availability time is reduced. Hence, if a processor is shared between two subtrees \(T_1,T_2\), the work allocated to it for \(T_1\) is taken into account when allocating resources for \(T_2\). Note also that during the recursive allocation process, the subtrees are sorted by non-increasing total weights before being mapped to processors. This allows us to group small subtrees together in order to map them on a single processor, and to avoid unnecessary splitting of processors.
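A heavily simplified reconstruction of this mechanism is sketched below, for illustration only: it keeps the per-processor availability accounting and the shared border processor between sibling subtrees, but omits many details of the actual Algorithm 3.

```python
from typing import Dict, List

def pastix_like_mapping(root: Supernode, procs: List[int]) -> Dict[Supernode, List[int]]:
    """Each processor starts with an equal share of the total work as its
    availability; children are handled by non-increasing subtree weight, and
    the processor at the border between two sibling subtrees may be shared."""
    alloc: Dict[Supernode, List[int]] = {}
    avail = {p: subtree_weight(root) / len(procs) for p in procs}

    def assign(node: Supernode, candidates: List[int]) -> None:
        alloc[node] = list(candidates)
        if len(candidates) == 1:
            # a whole subtree mapped on a single processor: charge its full weight at once
            avail[candidates[0]] -= subtree_weight(node)
            for n in all_nodes(node):
                alloc[n] = list(candidates)
            return
        for p in candidates:                            # charge the node's own work,
            avail[p] -= node.weight / len(candidates)   # spread over its candidates
        idx = 0                                         # first candidate still available
        for child in sorted(node.children, key=subtree_weight, reverse=True):
            need, chosen = subtree_weight(child), []
            while True:
                p = candidates[idx]
                chosen.append(p)
                if avail[p] >= need or idx == len(candidates) - 1:
                    break                               # p is the (possibly shared) border processor
                need -= max(avail[p], 0.0)
                idx += 1
            assign(child, chosen)                       # the recursion updates avail as work is assigned

    assign(root, list(procs))
    return alloc
```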

3.2 Refined Mapping

After allocating nodes of the tree to subsets of processors, a precise mapping of each task to a processor has to be computed. In PaStiX, this is done by simulating the actual factorization, based on predictions of both the running times of tasks and the time needed for data transfers. The refined mapping process is detailed in Algorithm 4. Thanks to the previous phase, we know that each task can run on a subset of processors (the subset associated with the node it belongs to), called the candidate processors of this task. We associate with each processor a ready queue, containing tasks whose predecessors have already completed, and a waiting queue, with tasks that still have some unfinished predecessor. At the beginning of the simulation, each task is put in the waiting queue of all its candidate processors (except tasks without predecessors, which are put in the ready queue of their candidate processors). Queues are sorted by decreasing depth of the tasks in the graph (tasks without predecessors are ordered first). The depth considered here is an estimate of the critical path length from the task to the root of the tree T.

A ready time is associated with both tasks and processors (a simplified sketch of the resulting simulation is given after this list):

  • The ready time RP[k] of processor k is the completion time of the current task being processed by k (initialized to 0).

  • The ready time RT[i] of task i is the earliest time at which i can be started, given its input dependencies. This is at least the completion time of each of its predecessors, but it also takes into account the time needed for data movement when a predecessor of i is not mapped on the same processor as i. The ready time of a task with a predecessor that has not started yet is set to \(+\infty \).
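A simplified list-scheduling view of this simulation is sketched below: it keeps the ready-time bookkeeping of RP and RT, but replaces the per-processor ready and waiting queues by a single traversal of the tasks in topological order, so it is only a coarse approximation of the actual Algorithm 4. The comm_time function (the predicted transfer time between two tasks mapped on different processors) is an assumed input.

```python
import math
from typing import Callable, Dict, List, Tuple

def simulate_mapping(tasks: List[Task],
                     candidates: Dict[Task, List[int]],
                     comm_time: Callable[[Task, Task], float],
                     procs: List[int]) -> Tuple[Dict[Task, int], Dict[Task, float]]:
    """Map each task on the candidate processor on which it can start earliest.
    `tasks` must be given in topological order, and `candidates[t]` comes from
    the proportional mapping phase."""
    RP = {k: 0.0 for k in procs}                       # processor ready times
    mapping: Dict[Task, int] = {}
    end: Dict[Task, float] = {}
    for t in tasks:
        best_k, best_start = None, math.inf
        for k in candidates[t]:
            # earliest start on k: k must be free and all inputs must have arrived
            start = RP[k]
            for u in t.predecessors:
                arrival = end[u] + (0.0 if mapping[u] == k else comm_time(u, t))
                start = max(start, arrival)
            if start < best_start:
                best_k, best_start = k, start
        mapping[t] = best_k                            # refined mapping of task t
        end[t] = best_start + t.cost
        RP[best_k] = end[t]
    return mapping, end
```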

4 Proposed Mapping Refinement

Our objective is to correct the potential load imbalance (and thus the idle times) created by the proportional mapping, as outlined in Sect. 2, without degrading data locality too much. We propose a heuristic based on work stealing [4] that extends the simulation-based refined mapping phase (Algorithm 4) with Algorithm 5. Intuitively, whenever the simulation predicts that a processor will be idle, this processor tries to steal some tasks from its neighbors.


In the proposed refinement, we replace the update of the ready and waiting queues in the last line of Algorithm 4 by a call to \( UpdateQueuesWithStealing \) (Algorithm 5). For each processor k, we first detect whether k will have some idle time, and we compute the duration d of this idle slot. This happens in particular when the ready queue is empty and the ready time of the first task in the waiting queue is strictly larger than the ready time of the processor (\(RT[i]>RP[k]\)). Whenever both queues are empty, the processor will be idle forever, and d is set to a large value. Then, if an idle time is detected (the ready queue is empty and d is positive), a task is stolen from a neighbor processor using the function \( StealTask \). Otherwise, the ready and waiting queues are updated as previously: the tasks of the waiting queue that will be freed before the processor becomes available are moved to the ready queue.

When stealing tasks, we distinguish between two cases, depending on whether we use shared or distributed memory. In shared memory, the two possible victims of the stealing operation are the two neighbors of processor k, considering that processors are arranged in a ring. In the case of distributed memory, we first try to steal from the two neighbor processors within the same cluster, that is, within the set of processors that share the same memory. Stealing from a distant processor is considered only when clusters are reduced to a single element. Once the steal victims are identified (set S), we consider the first task of their ready queues and select the one that can start the soonest. If this task is able to start during the idle slot of processor k (and thus reduce its idle time), it is copied into the ready queue of k.
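The following sketch is our reconstruction of this stealing logic from the description above; the real Algorithm 5 operates inside the PaStiX simulator and differs in details. Processors are assumed to be integer identifiers, ready[k] and waiting[k] are the priority-sorted queues of processor k, RP and RT are the ready times defined in Sect. 3.2, and clusters[k] lists the processors sharing memory with k.

```python
import math
from typing import Dict, List

def update_queues_with_stealing(k: int, procs: List[int],
                                ready: Dict[int, List[Task]],
                                waiting: Dict[int, List[Task]],
                                RP: Dict[int, float], RT: Dict[Task, float],
                                clusters: Dict[int, List[int]]) -> None:
    """If processor k is predicted to be idle, try to steal a task; otherwise
    promote the waiting tasks that become ready before k is available."""
    if not ready[k]:
        # duration d of the predicted idle slot of processor k
        d = RT[waiting[k][0]] - RP[k] if waiting[k] else math.inf
        if d > 0:
            steal_task(k, procs, ready, RP, RT, d, clusters)
            return
    while waiting[k] and RT[waiting[k][0]] <= RP[k]:
        ready[k].append(waiting[k].pop(0))

def steal_task(k: int, procs: List[int], ready: Dict[int, List[Task]],
               RP: Dict[int, float], RT: Dict[Task, float], d: float,
               clusters: Dict[int, List[int]]) -> None:
    """Steal at most one task from the two ring neighbors of k, chosen within
    its cluster when possible, and among all processors only when the cluster
    is reduced to a single element."""
    pool = clusters[k] if len(clusters[k]) > 1 else procs
    i = pool.index(k)
    victims = {pool[(i - 1) % len(pool)], pool[(i + 1) % len(pool)]} - {k}
    options = [(RT[ready[v][0]], v) for v in victims if ready[v]]
    if not options:
        return
    start, victim = min(options)
    if start <= RP[k] + d:                             # the stolen task fits in the idle slot
        ready[k].append(ready[victim][0])              # make it available to k as well
```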

5 Experimental Results

Experiments were conducted on the Plafrim supercomputer, and more precisely on the miriel cluster. Each node is equipped with two 12-core Intel Xeon E5-2680v3 processors running at 2.50 GHz and 128 GB of memory. The Intel MKL 2019 library is used for sequential BLAS kernels. Another shared memory experiment was performed on the crunch cluster from the LIP, where a node is equipped with four 8-core Intel Xeon E5-4620 processors running at 2.20 GHz and 378 GB of memory. On this platform, the Intel MKL 2018 library is used for sequential BLAS kernels. The PaStiX version used for our experiments is based on the public git repository at the tag 6.1.0.

In the following, the different methods used to compute the mapping are compared. All to All, referred to as All2All, and proportional mapping, referred to as PropMap, are available in the PaStiX library, and the newly introduced method is referred to as Steal. When the option to restrict task stealing to the same MPI node is enabled, we refer to it as StealLocal. In all the following experiments, we compare these versions with respect to the All2All strategy, which gives the scheduling algorithm the most flexibility to balance the load, but does not consider data locality. The multi-threaded variant is referred to as SharedMem, while for the distributed settings, pMt stands for p MPI nodes with t threads each. All distributed settings fit within a single physical node.

In order to make a fair comparison between the methods, we use a set of 34 matrices from the SuiteSparse Matrix Collection [6]. The matrix sizes range from 72K to 3M unknowns. The number of floating-point operations required to perform the \(LL^t\), \(LDL^t\), or LU factorization ranges from 111 GFlop to 356 TFlop, and the problems come from various application fields.

Fig. 3. MPI communication number (left) and volume (right) for the three methods: PropMap, Steal, and StealLocal, with respect to All2All.

Communications. We first report the relative results in terms of communications among processors in different clusters (MPI nodes), which are of great importance for the distributed memory version. The number and the volume of communications normalized to All2All are depicted in Fig. 3a and Fig. 3b respectively. One can observe that all three strategies largely outperform the All2All heuristic, which does not take communications into account. The number of communications especially explodes with All2All, as it mainly moves around the leaves of the elimination tree, which creates many more communications of small volume. This confirms the need for a proportional-mapping-based strategy to minimize the number of communications. Both the number and the volume of communications also confirm the need for the local stealing algorithm to keep them as low as possible. Indeed, Steal generates 6.19 times more communications on average than PropMap, while StealLocal is as good as PropMap. Note the exception of the 24M1 case, where Steal and StealLocal are identical: no local task can be stolen. These conclusions are similar when looking at the volume of communications, where the ratio between Steal and PropMap drops to 1.92.

The distribution of the results is shown using boxplots. A boxplot shows five summary statistics (median, two hinges and two whiskers), and all “outlying” points individually. The lower and upper hinges correspond to the first and third quartiles. The upper whisker extends from the hinge to the largest value no further than \(1.5 \,\times \,\)IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles). The lower whisker extends from the hinge to the smallest value within \(1.5\,\times \,\)IQR of the hinge. Data beyond the end of the whiskers are called “outlying” points and are plotted individually [15].

Data Movements. Figure 4 depicts the number and volume of data movements normalized to All2All and summed over all the MPI nodes for different MPI settings. A data movement is defined as a write operation to the remote memory region of another core of the same MPI node. Note that accumulations in local buffers before sending, also called fan-in in sparse direct solvers, are always counted as remote writes. This explains why all MPI configurations have an equivalent number of data movements. As expected, the proportional mapping heuristics outperform All2All by a large factor on both the number and the volume, which can have an important impact on NUMA architectures. Compared to PropMap, Steal and StealLocal are equivalent and have respectively \(1.38\times \) and \(1.32\times \) more data movements on average, which translates into volume increases of \(9\%\) and \(8\%\). Note that in the shared memory case, StealLocal behaves as Steal since there is only one MPI node.

Fig. 4. Shared memory data movements number (left) and volume (right) within MPI nodes for PropMap, Steal, and StealLocal, with respect to All2All.

Fig. 5. Final simulation cost in seconds (left), and simulation cost of PropMap, Steal, and StealLocal, normalized to All2All (right).

Fig. 6. Estimated factorization time (left), and that of PropMap, Steal, and StealLocal, normalized to All2All (right).

Simulation Cost. Figure 5 shows the simulation cost in seconds (the duration of the refined mapping via simulation) on the left, and that of PropMap, Steal and StealLocal with respect to All2All on the right. Figure 6 shows the simulated factorization time obtained with these heuristics, together with a normalized version. Note that, for the sake of clarity, some large outliers are removed from Fig. 6a. As stated in Sect. 2, the All2All strategy allows for more flexibility in the scheduling, hence it results in a better simulated factorization time on average. However, its simulation cost is already \(4\times \) larger for this relatively small number of cores. Figure 5a shows that the proposed heuristics have a simulation cost similar to the original PropMap, while Fig. 6 shows that their simulated factorization time gets closer to that of All2All, and can even outperform it in extreme cases. Indeed, in the 24M1 case, Steal outperforms All2All due to bad decisions taken by the latter at the beginning of the scheduling: the bad mapping of the leaves is never recovered and induces extra communications that explain this difference. In conclusion, the proposed heuristic, StealLocal, manages to generate better schedules with a better load balance than the original PropMap heuristic, with little or no overhead on the mapping algorithm. This strategy is also able to limit the volume of communications and data movements, as expected.

Factorization Time for Shared Memory. Figure 7 presents the factorization time and its normalized version in a shared memory environment, on both the miriel and crunch machines. Note that we present only the results for Steal, as StealLocal and Steal behave similarly in a shared memory environment. For the sake of clarity, some large outliers are removed from Fig. 7a. On miriel, with a smaller number of cores and fewer NUMA effects, all these algorithms have almost similar factorization times, with variations of a few tens of GFlop/s over 500 GFlop/s on average. Steal slightly outperforms PropMap, and both are slower than All2All, by \(1\%\) and \(2\%\) respectively on average. On crunch, with more cores and more NUMA effects, the difference between Steal and PropMap increases in favor of Steal. Both remain slightly behind All2All, by \(2\%\) and \(4\%\) respectively; indeed, All2All outperforms them since it has the greatest flexibility, and communications have less impact in a shared memory environment.

Fig. 7. Factorization time (left), and that of PropMap and Steal, normalized to All2All (right), on miriel and crunch. White diamonds represent mean values.

6 Conclusion

In this paper, we revisit the classical mapping and scheduling strategies for sparse direct solvers. The goal is to efficiently schedule the task graph corresponding to an elimination tree, so that the factorization time can be minimized. Thus, we aim at finding a trade-off between data locality (focus of the traditional PropMap strategy) and load balancing (focus of the All2All strategy). First, we improve upon PropMap by proposing a refined (and optimal) mapping strategy with an integer number of processors. Next, we design a new heuristic, Steal, together with a variant StealLocal, which predicts processor idle times in PropMap and assigns tasks to idle processors. This leads to a limited loss of locality, but improves the load balance of PropMap.

Extensive experimental and simulation results, in both shared memory and distributed memory settings, demonstrate that the Steal approach generates almost the same number of data movements as PropMap, so the loss in locality is not significant, while it leads to better simulated factorization times, very close to those of All2All, thus improving the load balance of the schedule.

PaStiX has only recently been extended to work in distributed settings, and hence we plan to perform further experiments on distributed platforms, in order to assess the performance of Steal on the numerical factorization in distributed environments. Future work may also include the design of novel strategies to further improve the performance of sparse direct solvers.