Parallel Computing

Volume 29, Issue 1, January 2003, Pages 135-159

Task scheduling using a block dependency DAG for block-oriented sparse Cholesky factorization

https://doi.org/10.1016/S0167-8191(02)00220-X

Abstract

Block-oriented sparse Cholesky factorization decomposes a sparse matrix into rectangular subblocks; each block can then be handled as a computational unit in order to increase data reuse in a hierarchical memory system. The factorization method also increases the degree of concurrency and reduces the overall communication volume, so it performs more efficiently on a distributed-memory multiprocessor system than the customary column-oriented factorization method. Until now, however, the mapping of blocks to processors has been designed for load balance with restricted communication patterns. In this paper, we model the tasks with a block dependency DAG that represents the execution behavior of block sparse Cholesky factorization in a distributed-memory system. Since the characteristics of tasks for block Cholesky factorization differ from those of the conventional parallel task model, we propose a new task scheduling algorithm using a block dependency DAG. The proposed algorithm consists of two stages: early-start clustering and affined cluster mapping (ACM). The early-start clustering stage clusters tasks while preserving the earliest start time of each task without limiting parallelism. After task clustering, the ACM stage allocates clusters to processors, considering both communication cost and load balance. Experimental results on a Myrinet cluster system show that the proposed task scheduling approach outperforms other processor mapping methods.

Introduction

Sparse Cholesky factorization is a computationally intensive operation commonly encountered in scientific and engineering applications including structural analysis, linear programming, and circuit simulation. Much work has been done on parallelizing sparse Cholesky factorization, which is used for solving large sparse systems of linear equations. The performance of parallel Cholesky factorization is greatly influenced by the method used to map a sparse matrix onto the processors of a parallel system. Based on the mapping method, parallel sparse Cholesky factorization methods are classified into the column-oriented Cholesky, the supernodal Cholesky, the amalgamated supernodal Cholesky, and the 2-D block Cholesky. The earliest work is based on the column-oriented Cholesky in which a single column is mapped to a single processor [8], [17]. In the supernodal Cholesky, a supernode, which is a group of consecutive columns with the same row structure, is mapped to a single processor [5], [26]. The amalgamated supernodal Cholesky uses the supernode amalgamation technique in which several small supernodes are merged into a larger supernode, and an amalgamated supernode is then mapped to a single processor [4], [30]. In the 2-D block Cholesky, a matrix is decomposed into rectangular blocks, and a block is mapped to a single processor [10], [31].

Recent advanced methods for sparse Cholesky factorization are based on the 2-D block Cholesky, which processes non-zero blocks using Level 3 basic linear algebra subprograms (BLAS) [7], [8]. Such a 2-D decomposition is more scalable than a 1-D decomposition and has an increased degree of concurrency [34], [35]. The 2-D decomposition also allows the use of efficient computation kernels such as Level 3 BLAS, improving caching performance [30]. Even on a single-processor system, block factorizations perform efficiently [25].
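To make the role of Level 3 BLAS concrete, the dominant operation in a block factorization is the update A_ij ← A_ij − L_ik L_jk^T, which maps onto a single matrix-matrix multiply (GEMM). The following NumPy sketch is purely illustrative (the function name and in-place convention are our own; NumPy's @ operator dispatches to an underlying BLAS GEMM):

```python
import numpy as np

def block_update(A_ij, L_ik, L_jk):
    """Level 3 BLAS update of one block: A_ij <- A_ij - L_ik * L_jk^T.

    Illustrative sketch only. Each argument is a dense rectangular
    block; performing the update as one GEMM over the whole block
    reuses each loaded entry many times, which is the source of the
    improved caching performance."""
    A_ij -= L_ik @ L_jk.T   # one GEMM call per block update
    return A_ij
```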

Few works have been reported on the 2-D block Cholesky for distributed-memory systems. Rothberg and Gupta introduced the block fan-out algorithm [31]; similarly, Dumitrescu et al. introduced the block fan-in algorithm [10]. Gupta, Karypis, and Kumar [16] also used a 2-D mapping to implement a multifrontal method. In [29], Rothberg showed that a block fan-out algorithm using the 2-D decomposition outperforms a panel multifrontal method using a 1-D decomposition. Even though the block fan-out algorithm increases concurrency and reduces overall communication volume, the performance achieved is not satisfactory due to load imbalance among the processors. Therefore, several load-balancing heuristics have been proposed in [32].

However, load balance is not the only parameter that determines the performance of parallel block sparse Cholesky factorization. Mapping for load balance only guarantees that the computation is evenly distributed among processors; it does not guarantee that the computation is well scheduled with respect to communication requirements. Thus, communication dependencies among blocks may force some processors to wait even when loads are balanced.

In this paper, we introduce a task scheduling method using a DAG-based task graph that represents the behavior of block sparse Cholesky factorization with exact computation and communication costs. As we will show in Section 3, a task graph for sparse Cholesky factorization differs from a conventional parallel task graph. Hence we propose a new heuristic algorithm that attempts to minimize the completion time while preserving the earliest start time of each task in the graph. It has been reported that limited memory space can adversely affect performance [37]; we do not consider memory space limitations in this paper, since we assume that the factorization is done on a distributed-memory system with sufficient memory for the work assigned to each processor. Even though there have been recent efforts on scheduling irregular computations on parallel systems [11], [18], [19], this paper presents the first work that deals with the entire framework of applying a scheduling approach to block-oriented sparse Cholesky factorization in a distributed system.

The next section describes the block fan-out method for parallel sparse Cholesky factorization. In Section 3, the sparse Cholesky factorization is modeled as a DAG-based task graph, and the characteristics of a task for this problem are summarized. Since the characteristics of this type of task are different from those of the conventional precedence-constrained parallel task, a new task scheduling algorithm is proposed in Section 4. The performance of the proposed scheduling algorithm is compared with the previous processor mapping methods using experiments on a Myrinet cluster system in Section 5. Finally, in Section 6, we summarize and conclude the paper.


Block-oriented sparse Cholesky factorization

This section describes the block fan-out method for sparse Cholesky factorization, an efficient method for distributed-memory systems [3], [20], [29], [31]. The block Cholesky factorization method decomposes a sparse matrix into rectangular blocks and then factorizes the matrix using dense operations on those blocks.
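For reference, the sketch below shows a right-looking blocked Cholesky factorization of a dense matrix; the block fan-out method applies the same three block operations (diagonal factorization, triangular solve, outer-product update) but visits only non-zero blocks and exchanges blocks among processors. The uniform block size nb and the function name are our own illustrative simplifications; the sparse algorithm partitions the matrix according to its non-zero structure.

```python
import numpy as np

def blocked_cholesky(A, nb):
    """Dense right-looking blocked Cholesky, A = L * L^T (sketch).

    Overwrites the lower triangle of A with L and returns it. The
    three block operations below are the ones a sparse block fan-out
    factorization performs on non-zero blocks only."""
    n = A.shape[0]
    for k in range(0, n, nb):
        ke = min(k + nb, n)
        # 1. Factorize the diagonal block: L_kk = chol(A_kk).
        A[k:ke, k:ke] = np.linalg.cholesky(A[k:ke, k:ke])
        L_kk = A[k:ke, k:ke]
        # 2. Triangular solve for the subdiagonal blocks:
        #    L_ik = A_ik * L_kk^{-T}.
        for i in range(ke, n, nb):
            ie = min(i + nb, n)
            A[i:ie, k:ke] = np.linalg.solve(L_kk, A[i:ie, k:ke].T).T
        # 3. Outer-product update of the trailing blocks (GEMM):
        #    A_ij <- A_ij - L_ik * L_jk^T for i >= j > k.
        for j in range(ke, n, nb):
            je = min(j + nb, n)
            for i in range(j, n, nb):
                ie = min(i + nb, n)
                A[i:ie, j:je] -= A[i:ie, k:ke] @ A[j:je, k:ke].T
    return np.tril(A)
```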

Task model with communication costs

Since a non-zero block Li,j is assigned to one processor [31], all block operations for a block can be treated as one task. This means that a task is executed on one processor, and a task consists of several subtasks for block operations. This section describes the characteristics of such tasks and proposes a task graph that represents the execution sequence of block factorization. The task graph, referred to as a block dependency DAG, reflects the costs of computations and communications and the precedence constraints among tasks.
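As an illustration of how such a graph can be assembled, the sketch below derives the dependency edges from a given non-zero block structure: the factorization of a diagonal block must precede the solves in its column, and the solve results L_ik and L_jk must both arrive before block (i, j) can be updated. The plain-dictionary representation is our own choice for illustration, not the paper's data structure.

```python
from collections import defaultdict

def block_dependency_dag(nonzero_blocks):
    """Build a block dependency DAG: one task per non-zero block (i, j)
    with i >= j; an edge u -> v means task v consumes a block that
    task u produces. Returns task -> set of successor tasks.

    nonzero_blocks: set of (i, j) block coordinates, i >= j."""
    succ = defaultdict(set)
    for (i, j) in nonzero_blocks:
        if i > j:
            # The solve L_ij = A_ij * L_jj^{-T} needs the factorized
            # diagonal block, so task (j, j) precedes task (i, j).
            succ[(j, j)].add((i, j))
        # The update A_ij <- A_ij - L_ik * L_jk^T is required for
        # every earlier column k in which both source blocks exist.
        for k in range(j):
            if (i, k) in nonzero_blocks and (j, k) in nonzero_blocks:
                succ[(i, k)].add((i, j))
                succ[(j, k)].add((i, j))
    return succ
```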

Task scheduling using a block dependency DAG

The problem of finding an optimal schedule for a weighted DAG is known to be NP-hard in the strong sense [33], [36]. When each task in a block dependency DAG consists of only one or two subtasks, scheduling the block dependency DAG reduces to this conventional NP-hard scheduling problem; hence finding an optimal schedule of tasks in a block dependency DAG is also NP-hard. Therefore, a heuristic algorithm is presented in this section.

The proposed scheduling algorithm consists of two stages: early-start clustering, followed by affined cluster mapping (ACM). The early-start clustering stage clusters tasks while preserving the earliest start time of each task without limiting parallelism; the ACM stage then allocates the resulting clusters to processors, considering both communication cost and load balance.
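The skeleton below is a deliberately simplified sketch of this two-stage structure, not the paper's actual algorithm: stage 1 merges each task with its latest-arriving predecessor, which zeroes that edge's communication cost and thus preserves the task's earliest start time; stage 2 greedily places each cluster on the processor where projected load minus communication affinity is smallest. The cost model, tie-breaking, and data layout are our own assumptions, and serialization of tasks within a cluster is ignored for brevity.

```python
from collections import defaultdict

def two_stage_schedule(preds, comp, comm, P, topo_order):
    """Simplified two-stage skeleton: early-start clustering, then an
    ACM-style greedy cluster-to-processor mapping (illustrative only).

    preds: task -> iterable of predecessor tasks
    comp:  task -> computation cost
    comm:  (u, v) -> communication cost of edge u -> v
    P:     number of processors
    topo_order: all tasks in topological order"""
    # --- Stage 1: early-start clustering ------------------------------
    cluster, finish = {}, {}
    for t in topo_order:
        cluster[t] = t
        arrivals = [(finish[p] + comm[(p, t)], p) for p in preds[t]]
        if arrivals:
            _, critical = max(arrivals, key=lambda a: a[0])
            # Co-locating t with its latest-arriving predecessor zeroes
            # that edge's communication, preserving t's earliest start.
            cluster[t] = cluster[critical]
            start = max(finish[p] if cluster[p] == cluster[t]
                        else finish[p] + comm[(p, t)] for p in preds[t])
        else:
            start = 0.0
        finish[t] = start + comp[t]
    # --- Stage 2: affined cluster mapping (greedy sketch) -------------
    members = defaultdict(list)
    for t in topo_order:
        members[cluster[t]].append(t)
    load, where = [0.0] * P, {}
    # Place the heaviest clusters first so large loads balance early.
    for c in sorted(members, key=lambda c: -sum(comp[t] for t in members[c])):
        work = sum(comp[t] for t in members[c])
        def score(p):
            # Projected load minus communication saved by co-locating
            # c with clusters already mapped to processor p.
            saved = sum(w for (u, v), w in comm.items()
                        if (cluster[u] == c and where.get(cluster[v]) == p)
                        or (cluster[v] == c and where.get(cluster[u]) == p))
            return load[p] + work - saved
        best = min(range(P), key=score)
        where[c] = best
        load[best] += work
    return {t: where[cluster[t]] for t in topo_order}
```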

Performance comparison

The performance of the proposed scheduling method is compared with various other mapping methods. The methods used for comparison are as follows (a sketch of the wrap and cyclic mappings appears after this list):

  • Wrap: 1-D wrap mapping [13] simply allocates all blocks in column j, i.e., L*,j, to processor (j mod P).

  • Cyclic: 2-D cyclic mapping [31] allocates Li,j to processor (i mod Pr, j mod Pc).

  • Balance: Balance mapping [32] attempts to balance the workload among processors. The decreasing number heuristic is used for ordering within a row or column.

  • Schedule: the proposed task scheduling method, consisting of early-start clustering followed by ACM.
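For concreteness, the wrap and cyclic mappings can be computed directly from the block indices. The sketch below is our own illustration; Pr and Pc denote the dimensions of the processor grid used by the cyclic mapping, with processors numbered row-major.

```python
def wrap_mapping(j, P):
    """1-D wrap: every block in column j, i.e. L*,j, goes to processor j mod P."""
    return j % P

def cyclic_mapping(i, j, Pr, Pc):
    """2-D cyclic: block L_ij goes to grid position (i mod Pr, j mod Pc),
    i.e. processor (i mod Pr) * Pc + (j mod Pc) in row-major numbering."""
    return (i % Pr) * Pc + (j % Pc)
```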

Conclusion

We introduced a task scheduling approach for block-oriented sparse Cholesky factorization on a distributed-memory system. The block Cholesky factorization problem is modeled as a block dependency DAG, which represents the execution behavior of 2-D decomposed blocks. Using the block dependency DAG, we proposed a task scheduling algorithm consisting of early-start clustering and ACM. Based on experiments using a Myrinet cluster system, we have shown that the proposed scheduling algorithm outperforms the other processor mapping methods.

Acknowledgements

We would like to thank Cleve Ashcraft for his valuable comments on this work. In addition, he provided us with his technical reports and the SPOOLES library, which is used for ordering and supernode amalgamation.

References (39)

  • J.J. Dongarra et al., Numerical Linear Algebra for High Performance Computers, SIAM (1998)
  • I.S. Duff, Sparse numerical linear algebra: direct methods and preconditioning, Technical report, CERFACS, Toulouse...
  • I.S. Duff et al., Sparse matrix test problems, ACM Trans. Math. Software (1989)
  • B. Dumitrescu et al., Two-dimensional block partitionings for the parallel sparse Cholesky factorization, Numer. Algorithms (1997)
  • A. George et al., Sparse Cholesky factorization on a local memory multiprocessor, SIAM J. Sci. Stat. Comput. (1988)
  • A. Gerasoulis et al., On the granularity and clustering of directed acyclic task graphs, IEEE Trans. Parallel Distrib. Syst. (1993)
  • A. Gupta et al., Highly scalable parallel algorithms for sparse matrix factorization, IEEE Trans. Parallel Distrib. Syst. (1997)
  • M.T. Heath et al., Parallel algorithms for sparse linear systems, SIAM Rev. (1991)
  • P. Henon, P. Ramet, J. Roman, A mapping and scheduling algorithm for parallel sparse fan-in numerical factorization,...

This research was supported in part by the Ministry of Education of Korea through its BK21 program toward the Electrical and Computer Engineering Division at POSTECH.
