Task scheduling using a block dependency DAG for block-oriented sparse Cholesky factorization☆
Introduction
Sparse Cholesky factorization, used to solve large sparse systems of linear equations, is a computationally intensive operation encountered in scientific and engineering applications including structural analysis, linear programming, and circuit simulation, and much work has been done on parallelizing it. The performance of parallel Cholesky factorization is greatly influenced by the method used to map a sparse matrix onto the processors of a parallel system. Based on the mapping method, parallel sparse Cholesky factorization methods are classified into the column-oriented Cholesky, the supernodal Cholesky, the amalgamated supernodal Cholesky, and the 2-D block Cholesky. The earliest work is based on the column-oriented Cholesky, in which a single column is mapped to a single processor [8], [17]. In the supernodal Cholesky, a supernode, i.e., a group of consecutive columns with the same row structure, is mapped to a single processor [5], [26]. The amalgamated supernodal Cholesky uses the supernode amalgamation technique, in which several small supernodes are merged into a larger supernode that is then mapped to a single processor [4], [30]. In the 2-D block Cholesky, the matrix is decomposed into rectangular blocks, and each block is mapped to a single processor [10], [31].
Recent advanced methods for sparse Cholesky factorization are based on the 2-D block Cholesky, which processes non-zero blocks using Level 3 basic linear algebra subprograms (BLAS) [7], [8]. Such a 2-D decomposition is more scalable than a 1-D decomposition and offers a higher degree of concurrency [34], [35]. The 2-D decomposition also allows the use of efficient computation kernels such as Level 3 BLAS, which improve caching performance [30]. Even on a single-processor system, block factorization is performed efficiently [25].
Few works have been reported on the 2-D block Cholesky for distributed-memory systems. Rothberg and Gupta introduced the block fan-out algorithm [31], and Dumitrescu et al. introduced the block fan-in algorithm [10]. Gupta, Karypis, and Kumar [16] also used a 2-D mapping to implement a multifrontal method. In [29], Rothberg showed that a block fan-out algorithm using a 2-D decomposition outperforms a panel multifrontal method using a 1-D decomposition. Even though the block fan-out algorithm increases concurrency and reduces the overall communication volume, the achieved performance is not satisfactory due to load imbalance among the processors. Therefore, several load-balancing heuristics have been proposed in [32].
However, load balance is not the only key to improving the performance of parallel block sparse Cholesky factorization. Mapping for load balance only guarantees that the computation is well distributed among processors; it does not guarantee that the computation is well scheduled with respect to communication requirements. Thus, communication dependencies among blocks may leave some processors waiting even when loads are balanced.
In this paper, we introduce a task scheduling method using a DAG-based task graph which represents the behavior of block sparse Cholesky factorization with exact computation and communication costs. As we will show in Section 3, a task graph for sparse Cholesky factorization is different from a conventional parallel task graph. Hence we propose a new heuristic algorithm which attempts to minimize the completion time while preserving the earliest start time of each task in a graph. It has been reported that a limitation on memory space can adversely affect performance [37]. But we do not consider the memory space limitations in this paper, since we assume that the factorization is done on a distributed-memory system with sufficient memory to handle the work assigned to each processor. Even though there have been recent efforts for scheduling irregular computations on parallel systems [11], [18], [19], this paper presents the first work that deals with the entire framework of applying a scheduling approach for block-oriented sparse Cholesky factorization in a distributed system.
The next section describes the block fan-out method for parallel sparse Cholesky factorization. In Section 3, the sparse Cholesky factorization is modeled as a DAG-based task graph, and the characteristics of a task for this problem are summarized. Since the characteristics of this type of task are different from those of the conventional precedence-constrained parallel task, a new task scheduling algorithm is proposed in Section 4. The performance of the proposed scheduling algorithm is compared with the previous processor mapping methods using experiments on a Myrinet cluster system in Section 5. Finally, in Section 6, we summarize and conclude the paper.
Section snippets
Block-oriented sparse Cholesky factorization
This section describes the block fan-out method for sparse Cholesky factorization, which is an efficient method for distributed-memory systems [3], [20], [29], [31]. The block Cholesky factorization method decomposes a sparse matrix into rectangular blocks, and then factorizes it with dense matrix operations.
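As a sketch of what "factorizes it with dense matrix operations" involves, the following pure-Python fragment performs one step of a 2 × 2 block Cholesky factorization. The helper names are illustrative only; a real implementation would call Level 3 BLAS/LAPACK kernels on each block.

```python
import math

def chol(a):
    """Dense Cholesky of a small SPD matrix (list of lists); returns lower-triangular L."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(a[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def trsm(Ljj, Aij):
    """Solve Lij * Ljj^T = Aij for the off-diagonal block Lij."""
    Lij = [row[:] for row in Aij]
    for r in range(len(Aij)):
        for c in range(len(Ljj)):
            s = Lij[r][c] - sum(Lij[r][k] * Ljj[c][k] for k in range(c))
            Lij[r][c] = s / Ljj[c][c]
    return Lij

def syrk_update(A, Lik, Ljk):
    """A -= Lik * Ljk^T: subtract one outer-product update from a target block."""
    for r in range(len(A)):
        for c in range(len(A[0])):
            A[r][c] -= sum(Lik[r][k] * Ljk[c][k] for k in range(len(Lik[0])))

# A 4x4 SPD matrix partitioned into four 2x2 blocks (only the lower triangle is stored)
A11 = [[4.0, 2.0], [2.0, 5.0]]
A21 = [[2.0, 3.0], [2.0, 3.0]]
A22 = [[6.0, 4.0], [4.0, 7.0]]

L11 = chol(A11)             # factor the diagonal block
L21 = trsm(L11, A21)        # triangular solve for the subdiagonal block
syrk_update(A22, L21, L21)  # update the trailing block
L22 = chol(A22)             # factor the updated trailing block
```

Here `chol`, `trsm`, and `syrk_update` mirror the dense kernels (POTRF, TRSM, SYRK) that a Level 3 BLAS-based block Cholesky would invoke per block.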
Task model with communication costs
Since a non-zero block Li,j is assigned to one processor [31], all block operations for a block can be treated as one task. This means that a task is executed in one processor, and a task consists of several subtasks for block operations. This section describes the characteristics of tasks, and proposes a task graph that represents the execution sequence of block factorization. The task graph, referred to as a block dependency DAG, reflects the costs of computations and communications and the
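As an illustration of how such block-to-block dependencies can be enumerated, here is a minimal sketch in which tasks are indexed by block coordinates (i, j). The edge rules follow the standard data flow of the block fan-out method; the function name and representation are assumptions, not the paper's exact formulation.

```python
def block_dependency_dag(nonzero_blocks):
    """Enumerate edges of a block dependency DAG.

    Task (i, j) computes block L[i][j]. It depends on:
      - task (j, j) when i > j, since the triangular solve needs the
        factored diagonal block; and
      - tasks (i, k) and (j, k) for k < j when both blocks are non-zero,
        since they supply the outer-product updates to block (i, j).
    """
    nz = set(nonzero_blocks)
    edges = set()
    for (i, j) in nz:
        if i > j:
            edges.add(((j, j), (i, j)))
        for k in range(j):
            if (i, k) in nz and (j, k) in nz:
                edges.add(((i, k), (i, j)))
                edges.add(((j, k), (i, j)))
    return edges

# A dense 3x3 block lower triangle yields the familiar Cholesky data flow
dag = block_dependency_dag([(0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2)])
```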
Task scheduling using a block dependency DAG
The problem of finding an optimal schedule for a weighted DAG is known to be NP-hard in the strong sense [33], [36]. When each task in a block dependency DAG consists of only one or two subtasks, scheduling it reduces to this classical NP-hard scheduling problem; hence finding an optimal schedule of the tasks in a block dependency DAG is also NP-hard. Therefore, a heuristic algorithm is presented in this section.
The proposed scheduling algorithm consists of two parts: early-start clustering and ACM.
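The details of the algorithm are elided in this snippet. As a hedged illustration of the general idea of scheduling a task DAG with communication costs (a generic greedy list scheduler, not the paper's early-start clustering or ACM), consider the following sketch; all names and the placement heuristic are illustrative.

```python
def list_schedule(tasks, deps, cost, comm, n_procs):
    """Greedy list scheduling of a task DAG with communication delays.

    tasks: ids in topological order; deps: task -> iterable of predecessors;
    cost[t]: computation time of t; comm[(u, t)]: message delay when u and t
    run on different processors. Returns (placement, finish times, makespan).
    """
    proc_free = [0.0] * n_procs
    place, finish = {}, {}
    for t in tasks:
        best = None
        for p in range(n_procs):
            ready = proc_free[p]
            for u in deps.get(t, ()):
                delay = 0.0 if place[u] == p else comm.get((u, t), 0.0)
                ready = max(ready, finish[u] + delay)
            f = ready + cost[t]
            if best is None or f < best[0]:
                best = (f, p)
        f, p = best
        place[t], finish[t] = p, f
        proc_free[p] = f
    return place, finish, max(finish.values())

# A fork DAG with expensive messages: the scheduler clusters all three
# tasks onto one processor rather than pay the communication delay.
place, finish, makespan = list_schedule(
    tasks=["a", "b", "c"],
    deps={"b": ["a"], "c": ["a"]},
    cost={"a": 1.0, "b": 1.0, "c": 1.0},
    comm={("a", "b"): 10.0, ("a", "c"): 10.0},
    n_procs=2,
)
```

The example shows why scheduling, unlike pure load balancing, must weigh communication: with cheap messages the two successors would spread across processors, but here clustering them onto one processor gives the shorter makespan.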
Performance comparison
The performance of the proposed scheduling method is compared with various other mapping methods. The methods used for comparison are as follows:
- Wrap: 1-D wrap mapping [13] simply allocates all blocks in column j, i.e., L*,j, to processor j mod p, where p is the number of processors.
- Cyclic: 2-D cyclic mapping [31] allocates Li,j to the processor at position (i mod pr, j mod pc) of a pr × pc processor grid.
- Balance: balance mapping [32] attempts to balance the workload among processors; the decreasing-number heuristic is used for ordering within a row or column.
- Schedule: the task scheduling algorithm proposed in this paper.
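The two static mappings above can be sketched as simple index functions. This is a minimal illustration; the grid dimensions `pr`, `pc` and the row-major linearization are assumptions, not necessarily the paper's exact conventions.

```python
def wrap_map(j, p):
    """1-D wrap: every block of column j goes to processor j mod p."""
    return j % p

def cyclic_map(i, j, pr, pc):
    """2-D cyclic: block (i, j) goes to processor (i mod pr, j mod pc)
    of a pr x pc grid, linearized row-major."""
    return (i % pr) * pc + (j % pc)
```

Both mappings are oblivious to block weights and dependencies, which is precisely why a balanced or scheduled assignment can outperform them.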
Conclusion
We introduced a task scheduling approach for block-oriented sparse Cholesky factorization on a distributed-memory system. The block Cholesky factorization problem is modeled as a block dependency DAG, which represents the execution behavior of 2-D decomposed blocks. Using a block dependency DAG, we proposed a task scheduling algorithm consisting of early-start clustering and ACM. Based on experiments using a Myrinet cluster system, we have shown that the proposed scheduling algorithm
Acknowledgements
We would like to thank Cleve Ashcraft for his valuable comments on this work. In addition, he provided us with his technical reports and the SPOOLES library, which is used for ordering and supernode amalgamation.
References (39)
- et al., Run-time techniques for exploiting irregular task parallelism on distributed memory architectures, J. Parallel Distrib. Comput. (1997)
- et al., Parallel Cholesky factorization on a shared memory processor, Lin. Algebra Appl. (1986)
- et al., Comparison of clustering heuristics for scheduling DAGs on multiprocessors, J. Parallel Distrib. Comput. (1992)
- et al., Multiprocessor scheduling with communication delays, Parallel Comput. (1990)
- C.C. Ashcraft, The domain/segment partition for the factorization of sparse symmetric positive definite matrices, ...
- C.C. Ashcraft, SPOOLES: An object-oriented sparse matrix library, in: Proc. of 1999 SIAM Conference on Parallel ...
- C.C. Ashcraft, S. Eisenstat, J. Liu, A. Sherman, A comparison of three column-based distributed sparse factorization ...
- et al., The influence of relaxed supernode partitions on the multifrontal method, ACM Trans. Math. Software (1989)
- et al., Progress in sparse matrix methods for large linear systems on vector supercomputers, Int. J. Supercomput. Appl. (1987)
- et al., A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Software (1990)
- Numerical Linear Algebra for High Performance Computers, SIAM
- Sparse matrix test problems, ACM Trans. Math. Software
- Two-dimensional block partitionings for the parallel sparse Cholesky factorization, Numer. Algorithms
- Sparse Cholesky factorization on a local memory multiprocessor, SIAM J. Sci. Stat. Comput.
- On the granularity and clustering of directed acyclic task graphs, IEEE Trans. Parallel Distrib. Syst.
- Highly scalable parallel algorithms for sparse matrix factorization, IEEE Trans. Parallel Distrib. Syst.
- Parallel algorithms for sparse linear systems, SIAM Rev.
Cited by (6)
- A survey of direct methods for sparse linear systems (2016, Acta Numerica)
- Application of GPU on-orbit and self-adaptive scheduling by its internal thermal sensor (2018, Proceedings of the International Astronautical Congress, IAC)
- An optimized task duplication based scheduling in parallel system (2016, International Journal of Intelligent Systems and Applications)
- A novel power-conscious scheduling algorithm for data-intensive precedence-constrained applications in cloud environments (2014, International Journal of High Performance Computing and Networking)
- IPM based sparse LP solver on a heterogeneous processor (2012, Computational Management Science)
- A parallel algorithm for stiffness matrix decomposition using threadpool method (2012, Advanced Materials Research)
☆ This research was supported in part by the Ministry of Education of Korea through its BK21 program toward the Electrical and Computer Engineering Division at POSTECH.