## COORDINATED SCIENCE LABORATORY College of Engineering

# A MULTILEVEL PARALLEL SOLVER FOR BLOCK TRIDIAGONAL AND BANDED LINEAR SYSTEMS 

Ibrahim N. Hajj Stig Skelboe

| 1a. REPORT SECURITY CLASSIFICATION Unclassified | 1b. RESTRICTIVE MARKINGS None $\qquad$ |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| 20. SECURITY CLASSIFICATION AUTHORITY | 3. DISTRIBUTION/AVAILABILITY OF REPORT Approved for public release; distribution unlimited |  |  |  |
| 2b. DECLASSIFICATION/DOWNGRADING SCHEDULE |  |  |  |  |
| 4. PERFORMING ORGANIZATION REPORT NUMBER(S) (DAC-15) UILU-ENG-89-2223 (DAC-15) | 5. MONITORING ORGANIZATION REPORT NUMBER(S) |  |  |  |
| 6e. NAME OF PERFORMING ORGANIZATION 6b. OFFICE SYMBOL <br> (if appliceble)  <br> Coordinated Science Lab N/A <br> University of Illinois  | 7. NAME OF MONITORING ORGANIZATION Office of Naval Research |  |  |  |
| 6C ADDRESS (Chy, Stato, and ZIP Cooh) <br> 1101 W. Springfield Ave. <br> Urbana, IL 61801 | 7b. ADORESS (City, State, and IIP Code) <br> Arlington, VA 22217 <br> Santa Clara, CA |  |  |  |
| 80. <br> NAME OF FUNDING/SPONSORING 3b. OFFICE SYMBOL organization Joint Services (If applicable) Electronics Program \& INTEL | $\begin{aligned} & \text { 9. PROCUREMEN } \\ & \text { NOOO14--8 } \end{aligned}$ | $\begin{aligned} & \text { VT INSTRUME } \\ & 84-\mathrm{C}-014 \mathrm{~S} \end{aligned}$ | DENTIFIC | MBER |
| BC ADDRESS (Chy, State, and 2IP Code) <br> Arlington, VA 22217 <br> Santa Clara, CA | 10. SOURCE OF FUNDING NUMBERS |  |  |  |
|  | PROGRAM ELEMENT NO. | PROJECT NO. | $\begin{aligned} & \text { TASK } \\ & \text { NO. } \end{aligned}$ | WORK UNIT <br> ACCESSION NO. |

11. TITLE (Include Security Claspification)
"A Multilevel Parallel Solver for Block Tridiagonal and Banded Linear Systems"
12. PERSONAL AUTHOR(S)

Hajj, Ibrahim N. and Skelboe, Stig

| 13a. TYPE OF REPORT |
| :--- |
| Technical |

13b. TIME COVERED $\qquad$

| 14. DATE OF REPORT (Year, Month, Day) | 15. PAGE COUNT |
| :---: | :---: |
| 1989 August 25 | 31 |

16. SUPPLEMENTARY NOTATION

17. SUBJECT TERMS (Continue on reverse if necessary and identify by block number)

Parallel solver, banded linear systems, block tridiagonal, cyclic reduction, multilevel decomposition, hypercube, multiprocessing.
19. ABSTRACT (Continue on reverse if necessary and identify by block number)
over
This paper describes an efficient algorithm for the parallel solution of systems of linear equations with a block tridiagonal coefficient matrix. The algorithm comprises a multilevel LU-factorization based on block cyclic reduction and a corresponding solution algorithm.

The paper includes a general presentation of the parallel multilevel LU-factorization and solution algorithms, but the main emphasis is on implementation principles for a message passing computer with hypercube topology. Problem partitioning, processor allocation and communication requirements are discussed for the general block tridiagonal algorithm.
20. DISTRIBUTION IAVAILABILITY OF ABSTRACT

X UNCLASSIFIEDNNLIMITED $\square$ SAME AS RPT. $\quad \square$ DTIC USERS | 21. ABSTRACT SECURITY CLASSIFICATION |
| :---: |
| Unclassified |

Band matrices can be cast into block tridiagonal form, and this special but important problem is dealt with in detail. It is demonstrated how the efficiency of the general block tridiagonal multilevel algorithm can be improved by introducing the equivalent of two-way Gaussian elimination for the first and the last partitioning and by carefully balancing the load of the processors. The presentation of the multilevel band solver is accompanied by detailed complexity analyses.

The properties of the parallel band solver were evaluated by implementing the algorithm on an Intel iPSC hypercube parallel computer and solving a larger number of banded linear equations using 2 to 32 processors. The results of the evaluation include speed-up over a sequential processor, and the measured values are in good agreement with the theoretical values resulting from complexity analysis. It is found that the maximum asymptotic speed-up of the multilevel LU-factorization using $p$ processors and load balancing is approximated well by the expression $(p+6) / 4$.

Finally, the multilevel parallel solver is compared with solvers based on row interleaved organization and with other block solvers.

# A MULTILEVEL PARALLEL SOLVER FOR BLOCK TRIDIAGONAL AND BANDED LINEAR SYSTEMS 

Ibrahim N. Hajj *<br>Stig Skelboe ${ }^{\dagger}$

[^0]
#### Abstract

This paper describes an efficient algorithm for the parallel solution of systems of linear equations with a block tridiagonal coefficient matrix. The algorithm comprises a multilevel LU-factorization based on block cyclic reduction and a corresponding solution algorithm.

The paper includes a general presentation of the parallel multilevel LU-factorization and solution algorithms, but the main emphasis is on implementation principles for a message passing computer with hypercube topology. Problem partitioning, processor allocation and communication requirements are discussed for the general block tridiagonal algorithm.

Band matrices can be cast into block tridiagonal form, and this special but important problem is dealt with in detail. It is demonstrated how the efficiency of the general block tridiagonal multilevel algorithm can be improved by introducing the equivalent of two-way Gaussian elimination for the first and the last partitioning and by carefully balancing the load of the processors. The presentation of the multilevel band solver is accompanied by detailed complexity analyses.

The properties of the parallel band solver were evaluated by implementing the algorithm on an Intel iPSC hypercube parallel computer and solving a larger number of banded linear equations using 2 to 32 processors. The results of the evaluation include speed-up over a sequential processor, and the measured values are in good agreement with the theoretical values resulting from complexity analysis. It is found that the maximum asymptotic speed-up of the multilevel LU-factorization using $p$ processors and load balancing is approximated well by the expression $(p+6) / 4$.

Finally, the multilevel parallel solver is compared with solvers based on row interleaved organization and with other block solvers.


## 1 Introduction

Many technical and scientific problems involve the solution of linear systems of equations

$$
\begin{equation*}
A x=b \tag{1.1}
\end{equation*}
$$

where A can be structured as a block tridiagonal matrix. The discretization of boundary value problems for both ordinary and partial differential equations may lead to band matrices which can be structured into block tridiagonal matrices. Likewise, the analysis of linearly connected substructures (mechanical, electrical,...) often leads to block tridiagonal matrices. These matrices may be numerically symmetric, symmetric in structure only, or nonsymmetric but still block tridiagonal in structure.

This paper presents an algorithm for the efficient solution of (1.1) on a parallel computer with medium to large number of processors. A version of the algorithm for solving structurally symmetric band systems has been implemented for the Intel iPSC hypercube, and the performance of the implementation of the algorithm is evaluated in great detail.

The multilevel parallel solver is based on block cyclic reduction [1] which permits separate LU-factorization and solution stages. The formulation of the block cyclic reduction which we have used is related to nested dissection [2,3], and it is an LU-factorization and solution of a block reordered system. The communication required by the band matrix implementation of the parallel solver is shown to be negligible.

Our implementation of the parallel solver introduces some enhancements to standard block cyclic reduction which may also be used for a general block tridiagonal matrix. The first enhancement is two-way Gaussian elimination for the first and the last block. This eliminates fill-ins in the two blocks and permits the second enhancement, an efficient load balancing which improves the efficiency with the equivalent of 6 extra processors. The third enhancement is a block parallel organization which improves processor utilization at the lower levels of the algorithm, with a modest communication penalty.

The performance of the implementation of the multilevel parallel band solver on the Intel iPSC with 32 processors is evaluated carefully, and the measured execution times are compared with predictions derived from complexity analysis.

Direct parallel solvers can be classified into two different groups: block methods (as the one described here and $[4,5,6,7]$ ) and row interleaved methods [ 8$]$. The block methods pay a heavy penalty in terms of fill-ins while communication cost is negligible. The block methods can exploit many processors, limited only by the dimension/bandwidth ratio.

The row interleaved algorithm is computationally identical to a sequential Gaussian elimination. The pivot row is broadcasted and the processors perform the elimination in parallel. Ideal speed-up is prevented by communication which is fairly expensive on medium grain parallel processors such as the Intel iPSC [9]. The row interleaved algorithm can only exploit a number of processors corresponding to the half bandwidth.

The block methods are therefore advantageous for narrow band problems and medium to large number of processing elements. For wide band problems and few processors, the row interleaved algorithm is the better. It is also worth mentioning at this point that almost all reported implementations of parallel solvers of banded systems are done on
shared-memory vector machines, such as the Alliant and the Cray computers, while our implementation is done on a distributed-memory system with a limited number of processors and no vectorization, namely, the Intel iPSC.

This paper is organized as follows. In section 2 block tridiagonal matrices are briefly described. Sections 3 and 4 explain the multilevel LU-factorization and solution steps. Section 5 describes the general principles for implementing the multilevel algorithm on a hypercube, including the communication scheme among the processors. The details of the implementation on the hypercube are given in section 6. Section 6.1 describes the partitioning and allocation of tasks to different processors and the ordering of the first and the last partitions to reduce fill-ins by using 2 -way Gaussian elimination. The performance of the multilevel algorithm is estimated in section 6.2 using complexity analysis. The important problem of load balancing is addressed in section 6.3. As a result of the complexity analysis it becomes clear that uniform partitioning of the band matrix will lead to poor load balance for the LU-factorization. Balance equations for selecting the sizes of the partitions are then derived. The performance of the parallel band matrix solver is presented in section 7. This includes the executing time graph model and actual numerical results. A comparison between the multilevel approach and the row interleaved factorization and solution approach is given in section 7.3.

## 2 Block tridiagonal matrices

Consider the system of linear equations (1.1) where $\mathrm{A}, \mathrm{x}$ and b are partitioned as follows:

$$
A=\left[\begin{array}{ccccc}
A_{1} & B_{1} & & &  \tag{2.1}\\
C_{2} & A_{2} & B_{2} & & \\
& C_{3} & A_{3} & B_{3} & \\
& & \cdots & \ldots & \ldots \\
& & & C_{N} & A_{N}
\end{array}\right], x=\left[\begin{array}{c}
x_{1} \\
x_{2} \\
x_{3} \\
\cdots \\
x_{N}
\end{array}\right], b=\left[\begin{array}{c}
b_{1} \\
b_{2} \\
b_{3} \\
\ldots \\
b_{N}
\end{array}\right]
$$

N is odd by assumption and

$$
\begin{gathered}
A_{r} \epsilon R^{n_{r} \times n_{r}} \text { and } x_{r}, b_{r} \epsilon R^{n_{r}} \text { for } r=1,2, . ., N . \\
B_{r} \epsilon R^{n_{r} \times n_{r+1}} \text { for } r=1,2, . ., N-1 \text { and } \\
C_{r} \epsilon R^{n_{r-1} \times n_{r}} \text { for } r=2,3, . ., N .
\end{gathered}
$$

The entries of A are zero outside the tridiagonal band of matrices.


Figure 2.1.: Examples of typical block tridiagonal matrices.

Figure 2.1 shows three different examples of typical block tridiagonal matrices. The first matrix occurs in the lower levels of the multilevel algorithm applied to a structurally symmetric band matrix. The second matrix is a nonsymmetric band matrix with a block tridiagonal structure superimposed on it. The third example illustrates two connected arbitrary nonsymmetric sub-structures grouped into a block tridiagonal form.

The first level of the multilevel algorithm (block cyclic reduction) is based on the reordering of A given in (2.2).

$$
\tilde{A}=\left[\begin{array}{llllllllll}
A_{1} & & & & & B_{1} & & & &  \tag{2.2}\\
& A_{3} & & & & C_{3} & B_{3} & & & \\
& & A_{5} & & & & C_{5} & B_{5} & & \\
& & & \ldots & & & & \cdots & \cdots & \\
\hdashline & & & & A_{N} & & & & \\
\hdashline C_{2} & B_{2} & & & & A_{2} & & & & \\
& C_{4} & B_{4} & & & & A_{4} & & & \\
& & C_{6} & \ldots & & & & A_{6} & & \\
& & & \ldots & & & & & \cdots & \\
& & & \ldots & B_{N-1} & & & & & A_{N-1}
\end{array}\right]
$$

The sets $\left\{C_{r}, A_{r}, B_{r}, B_{r-1}, C_{r+1}\right\}$ for $\mathrm{r}=2,4, . ., \mathrm{N}-1$ are called separators since they separate the matrix A into independent blocks $A_{r}, r=1,3, . ., N$. The reordering of A into $\tilde{A}$ therefore involves a symmetric row column reordering where the separators are moved to the last rows and columns. If A is diagonally dominant, so is $\tilde{A}$ since the symmetric
reordering preserves diagonal dominance. Obviously, any symmetry of A will also be preserved.

N is assumed to be odd, and this is merely a convenient assumption. If N is originally even, either two block rows and columns can be merged to reduce N by one or the reordering of (2.2) can be modified slightly to account for an even value of N .

In practical applications, the dimensions of separators, $n_{r}$ for $\mathrm{r}=2,4, \ldots, \mathrm{~N}-1$, are usually small compared with the dimensions of the remaining blocks, $n_{r}$ for $\mathrm{r}=1,3, \ldots, \mathrm{~N}$, but this is not a necessary condition for the application of the multilevel algorithm although it may be a necessary condition for obtaining high efficiency.

## 3 Multilevel LU-factorization

The first level of the multilevel LU-factorization is a standard LU-factorization of $\tilde{A}$ defined in (2.2) stopping after $n_{1}+n_{3} \ldots+n_{N}$ pivot rows. This leaves the lower righthand block of dimension $n_{2}+n_{4} \ldots+n_{N-1}$ unfactored. The partially factored matrix is called $\overline{\mathrm{A}}$ and is defined in (3.1)

$$
\bar{A}=\left[\begin{array}{cccc:ccccc}
L_{1} U_{1} & & & & & \bar{B}_{1} & & &  \tag{3.1}\\
& L_{3} U_{3} & & & & \bar{C}_{3} & \bar{B}_{3} & & \\
& & L_{5} U_{5} & & & & \bar{C}_{5} & \bar{B}_{5} & \\
& & & \ldots & & & & \\
& & & L_{N} U_{N} & & & \ldots & \ldots & \ldots \\
& & & & \bar{A}_{2} & \bar{D}_{2} & & & \\
\hdashline \bar{C}_{2} & \bar{B}_{2} & & & \bar{B}_{4} & & & \bar{E}_{4} & \bar{A}_{4} \\
& \bar{C}_{4} & \bar{D}_{4} & & \\
& & \bar{C}_{6} & \ldots & & & \bar{E}_{6} & \bar{A}_{6} & \ldots \\
& & & \ldots & & & & \ldots & \ldots \\
& & & \ldots & \bar{B}_{N-1} & & & & \ldots \\
& & & & & & \bar{A}_{N-1}
\end{array}\right]
$$

The submatrices in (3.1) are related to the ones in (2.2) as follows

$$
\begin{equation*}
L_{r} U_{r}=A_{r}, r=1,3, . ., N \tag{3.2a}
\end{equation*}
$$

where $L_{r}$ and $U_{r}$ are lower and upper triangular matrices, respectively.

$$
\begin{gather*}
\bar{B}_{r}=L_{r}^{-1} B_{r}, r=1,3, . ., N-2  \tag{3.2b}\\
\bar{C}_{r}=L_{r}^{-1} C_{r}, r=3,5, . ., N  \tag{3.2c}\\
\bar{C}_{r+1}=C_{r+1} U_{r}^{-1}, r=1,3, . ., N-2  \tag{3.2~d}\\
\bar{B}_{r-1}=B_{r-1} U_{r}^{-1}, r=3,5, . ., N  \tag{3.2e}\\
\bar{A}_{s}=A_{s}-\bar{C}_{s} \bar{B}_{s-1}-\bar{B}_{s} \bar{C}_{s+1}, s=2,4, . ., N-1 \tag{3.2f}
\end{gather*}
$$

$$
\begin{align*}
& \bar{D}_{s}=D_{s}-\bar{B}_{s} \bar{B}_{s+1}, s=2,4, . ., N-3  \tag{3.2~g}\\
& \bar{E}_{s}=E_{s}-\bar{C}_{s} \bar{C}_{s-1}, s=4,6, \ldots, N-1 \tag{3.2h}
\end{align*}
$$

The tableau in (3.1) and the relations (3.2) reveal the block parallelism in computing $\bar{A}$. The factorization of a diagonal block $A_{r}$ ( r is odd) and the associated row operations leading to $\bar{C}_{r}, \bar{B}_{r}, \bar{B}_{r-1}, \bar{C}_{r+1}, \bar{D}_{r-1}$ and $\bar{E}_{r+1}$ can be performed independenly of the corresponding computations for different $r$-values. The only overlap is in the computation of $\bar{A}_{s}$ as specified in (3.2f). At the end of this section, a convenient organization for the parallel computation of $\bar{A}$ is presented.

Notice that $D_{s}=0(s=2,4, \ldots, N-3)$ and $E_{s}=0(s=4,6, . ., N-1)$ for the block tridiagonal matrix A defined in (2.1). However, the multilevel algorithm handles at no extra cost the case where the separators are extended with $D_{s} \neq 0$ and $E_{s} \neq 0$. In this case the even numbered rows and columns of $A$ (the separators) will contain five matrix blocks: $E_{s}, C_{s}, A_{s}, B_{s}, D_{s}$ and $D_{s-2}, B_{s-1}, A_{s}, C_{s+1}, E_{s+2}$, respectively for $\mathrm{s}=4,6, . ., \mathrm{N}-3$. For $\mathrm{s}=2, D_{0}$ and $E_{2}$ will be absent while $D_{N-1}$ and $E_{N+1}$ will be absent for $\mathrm{s}=\mathrm{N}$-1.

The $D_{s}$ and $E_{s}$ matrices have the following dimensions:

$$
\begin{aligned}
& D_{s} \epsilon R^{n_{s} \times n_{s+2}} \text { for } s=2,4, . ., N-3 \\
& E_{s} \epsilon R^{n_{s} \times n_{s-2}} \text { for } s=4,6, \ldots, N-1
\end{aligned}
$$

In the tridiagonal case (2.1), the matrices $\bar{D}_{s}$ and $\bar{E}_{s}$ are fill-ins. The LU-factorization will in general also create fill-ins in the original blocks of A unless they are full from the outset.

The four blocks of $\bar{A}$ seperated by the dashed lines can be expressed in a compact form as:

$$
\bar{A}=\left[\begin{array}{cc}
L U & V  \tag{3.3a}\\
W & A_{t}
\end{array}\right]
$$

where

$$
\tilde{A}=\left[\begin{array}{cc}
L & 0  \tag{3.3b}\\
W & I
\end{array}\right]\left[\begin{array}{cc}
U & V \\
0 & A_{t}
\end{array}\right]
$$

$A_{t}$ is the Schur complement and L and U are lower and upper block triangular matrices, respectively, composed of $L_{r}$ and $U_{r}, \mathrm{r}=1,3, . ., \mathrm{N}$.

The multilevel algorithm (block cyclic reduction) is now based on the fact that $A_{t}$ is a block tridiagonal matrix which can be reordered into a form similar to $\tilde{A}$ in (2.2), partially LU-factored like $\bar{A}$ leaving a block tridiagonal partially factored lower right-hand block etc.

If $N=2^{d+1}-1$, the process will terminate after d levels with a Schur complement consisting of just one block which is then factored. If N is composed differently, some
of the intermediate block tridiagonal matrices will have an even block dimension, but as mentioned in the previous section, this is only a minor complication.

The partial factorization of $\tilde{A}$ leading to $\bar{A}$ can be performed conveniently in parallel by partitioning the original block tridiagonal matrix A as follows:

$$
\begin{gather*}
Q_{1}=\left[\begin{array}{ll}
A_{1} & B_{1} \\
C_{2} & A_{2}
\end{array}\right] \text { and } \mathrm{Q}_{\mathrm{N}}=\left[\begin{array}{cc}
A_{N} & C_{N} \\
B_{N-1} & 0
\end{array}\right] \\
Q_{r}=\left[\begin{array}{ccc}
A_{r} & B_{r} & C_{r} \\
C_{r+1} & A_{r+1} & E_{r+1} \\
B_{r-1} & D_{r-1} & 0
\end{array}\right] \text { for } \mathrm{r}=3,5, . ., \mathrm{N}-2 \tag{3.4c}
\end{gather*}
$$

$D_{r-1}=0$ and $E_{r+1}=0$ when A is block tridiagonal, and $D_{r-1} \neq 0$ and $E_{r+1} \neq 0$ only when $A$ has extended separators as discussed priviously. The Q -matrices can be considered slightly reordered samples of $A$ which are straightforward to establish.

The Q-matrices can be partially factored in parallel leading to the $\bar{Q}$-matrices defined in (3.5):

$$
\begin{align*}
& \bar{Q}_{1}=\left[\begin{array}{cc}
L_{1} U_{1} & \bar{B}_{1} \\
\bar{C}_{2} & \hat{A}_{2}
\end{array}\right] \text { and } \bar{Q}_{N}=\left[\begin{array}{cc}
L_{N} U_{N} & \bar{C}_{N} \\
\bar{B}_{N-1} & \tilde{A}_{N-1}
\end{array}\right]  \tag{3.5a,b}\\
& \bar{Q}_{r}=\left[\begin{array}{ccc}
L_{r} U_{r} & \bar{B}_{r} & \bar{C}_{r} \\
\bar{C}_{r+1} & \hat{A}_{r+1} & \bar{E}_{r+1} \\
\bar{B}_{r-1} & \bar{D}_{r-1} & \tilde{A}_{r-1}
\end{array}\right] \text { for } r=3,5, . ., N-2 . \tag{3.5c}
\end{align*}
$$

The $\bar{Q}$-matrices can be computed conveniently by applying standard LU-factorization to the Q-matrices and stopping when $A_{r}$, for r being odd, is completely factored.

The entries of the $\bar{Q}$-matrices are as defined in (3.2) except for $\hat{A}_{s}$ and $\tilde{A}_{s}$ :

$$
\begin{gathered}
\hat{A}_{s}=A_{s}-\bar{C}_{s} \bar{B}_{s-1} \text { and } \\
\tilde{A}_{s}=-\bar{B}_{s} \bar{C}_{s+1} \text { leading to } \\
\bar{A}_{s}=\hat{A}_{s}+\tilde{A}_{s} \text { for } \mathrm{s}=2,4, \ldots, \mathrm{~N}-1
\end{gathered}
$$

Pivoting has not been considered so far since the possibilities are limited in the multilevel algorithm. From the partial LU-factorizations leading to the $\bar{Q}$-matrices, however, it is obvious that pivoting can be done as usual during the factorization of $A_{r}$ for $r$ being odd as long as the search for a pivot element is limited to $A_{r}$. If $A_{r}$ is singular, the algorithm requires fundamental modifications to work properly.

The next level of the multilevel algorithm involves the partial LU-factorization of $A_{t}$ defined in (3.3). The block representation of $A_{t}$ as given in (3.1) is

$$
A_{t}=\left[\begin{array}{ccccc}
\bar{A}_{2} & \bar{D}_{2} & & & \\
\bar{E}_{4} & \bar{A}_{4} & \bar{D}_{4} & & \\
& \bar{E}_{6} & \bar{A}_{6} & \ldots & \\
& & \cdots & \cdots & \ldots \\
& & & \ldots & \bar{A}_{N-1}
\end{array}\right]
$$

The $A_{t}$-matrix is sampled similarly to A to create $Q_{2}, Q_{6}, . ., Q_{N-1}$ defined in (3.6). From (3.5) it is seen that $Q_{2}, Q_{6}, . ., Q_{N-1}$ can also be obtained by sampling $\bar{Q}_{r}$ for r being odd. The relationships are listed in (3.7) where $\left(\bar{Q}_{s-1}, \bar{Q}_{s+1}, \bar{Q}_{s+3}\right) \rightarrow Q_{s}$ specifies that $Q_{s}$ is composed of samples from $\bar{Q}_{s-1}, \bar{Q}_{s+1}$ and $\bar{Q}_{s+3}$ etc.

$$
\begin{gather*}
Q_{2}=\left[\begin{array}{cc}
\bar{A}_{2} & \bar{D}_{2} \\
\bar{E}_{4} & \bar{A}_{4}
\end{array}\right] \text { and } Q_{N-1}=\left[\begin{array}{cc}
\bar{A}_{N-1} & \bar{E}_{N-1} \\
\bar{D}_{N-3} & 0
\end{array}\right]  \tag{3.6a,b}\\
Q_{s}=\left[\begin{array}{ccc}
\bar{A}_{s} & \bar{D}_{s} & \bar{E}_{s} \\
\bar{E}_{s+2} & \bar{A}_{s+2} & 0 \\
\bar{D}_{s-2} & 0 & 0
\end{array}\right] \text { for } s=6,10, \ldots, N-5 .  \tag{3.6c}\\
\left(\bar{Q}_{s-1}, \bar{Q}_{s+1}, \bar{Q}_{s+3}\right) \rightarrow Q_{s} \text { for } s=2,6,10, \ldots, N-5 .  \tag{3.7a}\\
\left(\bar{Q}_{N-2}, \bar{Q}_{N}\right) \rightarrow Q_{N-1} . \tag{3.7~b}
\end{gather*}
$$

The partial LU-factorization of $A_{t}$ can now be performed in parallel by partially factoring $Q_{2}, Q_{6}, . ., Q_{N-1}$ which are resampled to create $Q_{4}, Q_{8}, . ., Q_{N-3}$ etc. until the LU-factorization is completed by factoring just one block.

Each partial LU-factorization of a Q-matrix completes the factorization of the first block row and column leaving the lower right-hand $2 \times 2$ block unfactored and the subject of the next level of factorization.

So far it has been implied that only one processor is to be used for the partial factorization of a Q-matrix. Since each level of the multilevel algorithm deals with only half the number of Q -matrices as the previous level, one might consider using more than one processor for each Q-matrix to try to keep all processors busy.

In the block parallel organization described in [10], one processor is assigned to each Q-matrix in the top level, two processors in the next level etc. up to 4 processors for the $2 \times 2$ Q-matrices and 8 processors for the $3 \times 3$ Q-matrices. The natural partitioning of the Q-matrices into blocks is used to allocate one or several blocks to each processor. Since the LU-factorization is partial, this approach results in good load balance and moderate communication overhead.

## 4 Multilevel solution

The purpose of the solution step is to compute $x$ of (1.1). The solution is expressed symbolically as

$$
\begin{equation*}
x=A^{-1} b \tag{4.1}
\end{equation*}
$$

The reordering of A into $\tilde{A}$ is expressed by the permutation matrix $H_{1}$ such that $H_{1} A H_{1}^{T}=\tilde{A}$. This leads to the equation

$$
\begin{equation*}
\tilde{A} \tilde{x}=\tilde{b} \tag{4.2}
\end{equation*}
$$

where $\tilde{x}=H_{1} x$ (and $\left.\mathrm{x}=\mathrm{H}_{1}^{\mathrm{T}} \tilde{\mathrm{x}}\right)$ and $\tilde{b}=H_{1} b$. Equation (4.2) can now be expressed by the partial factorization in (3.3b),

$$
\left[\begin{array}{cc}
L & 0  \tag{4.3}\\
W & I
\end{array}\right]\left[\begin{array}{cc}
U & V \\
0 & A_{t}
\end{array}\right]\left[\begin{array}{l}
x_{u} \\
x_{t}
\end{array}\right]=\left[\begin{array}{l}
b_{u} \\
b_{t}
\end{array}\right]
$$

where $\left(x_{u}, x_{t}\right)$ and $\left(b_{u}, b_{t}\right)$ are partitionings of $\tilde{x}$ and $\tilde{b}$, respectively, corresponding to the block factorization of $\tilde{A}$.

Equation (4.3) is solved by the standard approach,

$$
\begin{gather*}
y_{u}=L^{-1} b_{u}, \bar{b}_{t}=b_{t}-W y_{u}  \tag{4.4a,b}\\
x_{t}=A_{t}^{-1} \bar{b}_{t}  \tag{4.4c}\\
x_{u}=U^{-1}\left(y_{u}-V x_{t}\right) \tag{4.4d}
\end{gather*}
$$

The computation of $x_{t}$ in (4.4c) expresses symbolically the solution of a block tridiagonal system of equations similar to (1.1), and (4.4c) is therefore analogous to (4.1). The algorithm outlined by the equations (4.1), (4.2), (4.3) and (4.4) is therefore applied recursively until all components of the solution are eventually computed.

When the complete multilevel LU-factorization is available, the multilevel solution is obtained by a recursive application of the relations (4.4a,b,d) until (4.4c) involves just one LU-factored block. In order to describe the multilevel solution algorithm the vectors $x_{u}, x_{t}, b_{u}$ and $b_{t}$ are defined from the partitioning in (2.1) and the permutation $H_{1}$.

$$
x_{u}=\left[\begin{array}{c}
x_{1} \\
x_{3} \\
\cdots \\
x_{N}
\end{array}\right], x_{t}=\left[\begin{array}{c}
x_{2} \\
x_{4} \\
\ldots \\
x_{N-1}
\end{array}\right], b_{u}=\left[\begin{array}{c}
b_{1} \\
b_{3} \\
\ldots \\
b_{N}
\end{array}\right], b_{t}=\left[\begin{array}{c}
b_{2} \\
b_{4} \\
\ldots \\
b_{N-1}
\end{array}\right]
$$

Furthermore partitioned vectors $y_{u}$ and $\bar{b}_{t}$ similar to $x_{u}$ and $b_{t}$, respectively, are defined. The detailed block relations corresponding to (4.4 a,b,d) are:

$$
\begin{gather*}
y_{r}=L_{r}^{-1} b_{r} \text { for } \mathrm{r}=1,3, . . \mathrm{N}  \tag{4.5a}\\
\bar{b}_{r+1}=b_{r+1}-\bar{C}_{r+1} y_{r}-\bar{B}_{r+1} y_{r+2} \text { for } \mathrm{r}=1,3, . ., \mathrm{N}-2 \tag{4.5b}
\end{gather*}
$$

$$
\begin{gather*}
x_{1}=U_{1}^{-1}\left(y_{1}-\bar{B}_{1} x_{2}\right), x_{N}=U_{N}^{-1}\left(y_{N}-\bar{C}_{N} x_{N-1}\right)  \tag{4.5d}\\
x_{r}=U_{r}^{-1}\left(y_{r}-\bar{C}_{r} x_{r-1}-\bar{B}_{r} x_{r+1}\right) \text { for } \mathrm{r}=3,5, \ldots, \mathrm{~N}-2
\end{gather*}
$$

Each of the three sets of relations $(4.5 \mathrm{a}),(4.5 \mathrm{~b})$ and (4.5d) can be computed in parallel using $(\mathrm{N}+1) / 2$ processors. The computation of $y_{r}$ and $x_{r}$ requires factorized matrix blocks from $\bar{Q}_{r}$ defined in (3.5) while $\bar{b}_{s}$ requires blocks from both $\bar{Q}_{s-1}$ and $\bar{Q}_{s+1}$.

## 5 Hypercube implementation

This section describes the general principles for the implementation of the multilevel algorithm on a hypercube parallel computer of dimension d . The number of processors is $p=2^{d}$, and it will be assumed that all processors are used for the top level of the multilevel algorithm which means that $p$ is related to the block dimension N through the following relation,

$$
\begin{equation*}
p=2^{d}=(N+1) / 2 \tag{5.1}
\end{equation*}
$$

If N is determined by the problem (which is not the case for a band matrix) and violates relation (5.1), the multilevel algorithm may require minor modifications. If $N>2 p-1$, more processors may be simulated by running several processes on each processor. If $N<2 p-1$, some processors may be left idle or more than one processor may be used for some or all Q-matrices (3.4).

It will be assumed in the rest of this section that (5.1) is satisfied. The Q-matrices of the top level, level 1, are allocated to the processors as follows. Matrix $Q_{2 i+1}$ is allocated to processors $P_{i}$ for $\mathrm{i}=0,1, \ldots, \mathrm{p}-1$. This is expressed formally as

$$
\begin{equation*}
Q_{2 i+1} \longrightarrow P_{i} \text { for } \mathrm{i}=0,1, \ldots, \mathrm{p}-1 \tag{5.2a}
\end{equation*}
$$

The allocation relation for level 2 is

$$
\begin{equation*}
Q_{4 i+2} \longrightarrow P_{2 i} \text { for } \mathrm{i}=0,1, \ldots,(\mathrm{p} / 2)-1 \tag{5.2b}
\end{equation*}
$$

The general allocation relation for level $\ell$ is

$$
\begin{equation*}
Q_{2^{\ell} i+2^{\ell-1}} \longrightarrow P_{2^{(l-1)} i} \text { for } \mathrm{i}=0,1, . .,\left(\mathrm{p} / 2^{(\ell-1)}\right)-1 \tag{5.3}
\end{equation*}
$$

The allocation principle is illustrated in Fig. 5.1 for $\mathrm{p}=8$ and consequently $\mathrm{N}=15$. The processors are labeled $0,1, . ., 7$ expressed as binary numbers. Processors in a hypercube can be labeled such that processors whose labels differ in only one bit position are

| Processor | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Level |  |  |  |  |  |  |  |  |
| 1 | $Q_{1}$ | $Q_{3}$ | $Q_{5}$ | $Q_{7}$ | $Q_{9}$ | $Q_{11}$ | $Q_{13}$ | $Q_{15}$ |
| 2 |  | $\swarrow$ |  | $\swarrow$ |  | $\swarrow$ |  | $\swarrow$ |
|  | $Q_{2}$ |  | $Q_{6}$ |  | $Q_{10}$ |  | $Q_{14}$ |  |
| 3 |  | $Q_{4}$ |  |  |  |  |  |  |
|  |  |  |  |  |  | $Q_{12}$ |  |  |
|  |  |  |  |  |  |  |  |  |
| 4 | $Q_{8}$ |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |

Figure 5.1: Processor allocation and communication structure for a hypercube implementation of the multilevel algorithm.
neighbors. The allocation relation in (5.3) is constructed to permit multilevel factorization and solution using only neighbor to neighbor communication.

The multilevel LU-factorization proceeds as follows. At level $1, Q_{1}, Q_{3}, . ., Q_{N}$ are partially LU-factored in parallel. Then the unfactored parts of $\bar{Q}_{3}, \bar{Q}_{7}, . ., \bar{Q}_{N}$ are passed on to the processors storing $\bar{Q}_{1}, \bar{Q}_{5}, . ., \bar{Q}_{N-2}$ (see Fig. 5.1). This involves only neighbor to neighbor communication, and it can be done completely in parallel. However, according to (3.7a) the construction of $Q_{2}, Q_{6}, . ., Q_{N-5}$ requires contributions from three $\bar{Q}$-matrices from the previous level, and thus more communication and a more elaborate communication scheme to limit communication distance are required.

In order to circumvent this problem and use the simple allocation and communication scheme examplified in Fig. 5.1, the Q-matrices of levels other than the first one are redefined slightly.

$$
\begin{gather*}
Q_{2}^{\prime}=\left[\begin{array}{ll}
\bar{A}_{2} & \bar{D}_{2} \\
\bar{E}_{4} & \hat{A}_{4}
\end{array}\right], Q_{N-1}^{\prime}=\left[\begin{array}{ll}
\bar{A}_{N-1} & \bar{E}_{N-1} \\
\bar{D}_{N-3} & \tilde{A}_{N-3}
\end{array}\right]  \tag{5.4a,b}\\
Q_{s}^{\prime}=\left[\begin{array}{ccc}
\bar{A}_{s} & \bar{D}_{s} & \bar{E}_{s} \\
\bar{E}_{s+2} & \hat{A}_{s+2} & 0 \\
\bar{D}_{s-2} & 0 & \tilde{A}_{s-2}
\end{array}\right], \text { for } \mathrm{s}=6,10, . ., \mathrm{N}-5 \tag{5.4c}
\end{gather*}
$$

The definition in (5.4) will supersede (3.6) in the following, and the prime symbol will be left out from the Q-matrices defined by (5.4). Likewise, the construction rule (3.7) is superseded by

$$
\begin{equation*}
\left(\bar{Q}_{s-1}, \bar{Q}_{s+1}\right) \rightarrow Q_{s} \text { for } \mathrm{s}=2,6, . ., \mathrm{N}-1 \tag{5.5}
\end{equation*}
$$

The construction of the general matrix $Q_{s}$ in (5.4c) is easily verified by writing $\bar{Q}_{s-1}$ and $\bar{Q}_{s+1}$ in detail, based on (3.5c):

$$
\bar{Q}_{s-1}=\left[\begin{array}{ccc}
L_{s-1} U_{s-1} & \bar{B}_{s-1} & \bar{C}_{s-1} \\
\bar{S}_{s} & \bar{A}_{s} & \bar{E}_{s} \\
\bar{B}_{s-2} & \bar{D}_{s-2} & \tilde{A}_{s-2}
\end{array}\right], \bar{Q}_{s+1}=\left[\begin{array}{ccc}
L_{s+1} U_{s+1} & \bar{B}_{s+1} & \bar{s}_{s+1} \\
\bar{C}_{s+2} & \hat{A}_{s+2} & \bar{E}_{s+2} \\
\bar{B}_{s} & \bar{D}_{s} & \bar{A}_{s}
\end{array}\right]
$$

$Q_{s}$ is composed from the lower right-hand $2 \times 2$ blocks and $\bar{A}_{s}$ is computed as $\bar{A}_{s}=$ $\hat{A}_{s}+\tilde{A}_{s}$.

The observation leading to the redefinition of the Q -matrices for levels greater than one is that the partial factorization of a Q-matrix leaves the first block row and column completely factored and the rest unfactored. This means that the first block row and column (and especially $A_{s}, \mathrm{~s}=2,6, \ldots, \mathrm{~N}-1$ ) must hold the final values before the factorization while intermediate values (like $\hat{A}_{s}$ and $\tilde{A}_{s}, s=4,8, . ., \mathrm{N}-3$ ) suffice for the remaining entries of the Q -matrix.

After the partial LU-factorization, the $\bar{Q}$-matrices of level 2 will have the same structure as the $\bar{Q}$-matrices of level 1 defined in (3.5). The allocation algorithm ensures that the Q-matrices on each level are allocated to processors that are pairwize neighbors. This means that the partial LU-factorization of the $Q$-matrices of level $\ell$ and neighbor to neighbor communication to compose the $Q$-matrices of level $\ell+1$ can be continued until the last level, $\mathrm{d}+1$. At the last level $Q_{(N+1) / 2}$, which is only one block, is computed by adding the lower right-hand blocks of $\bar{Q}_{(N+1) / 4}$ and $\bar{Q}_{3(N+1) / 4}$, and finally $Q_{(N+1) / 2}$ is fully LU -factored.

The blocks of the b -vector defined in (2.1) are allocated with the corresponding Ablocks (diagonal blocks) which means that

$$
\begin{gather*}
\left(b_{2 i+1}, b_{2 i+2}\right) \longrightarrow P_{i} \text { for } \mathrm{i}=0,1, \ldots, \mathrm{p}-2  \tag{5.6a}\\
b_{N} \longrightarrow P_{p-1} \tag{5.6b}
\end{gather*}
$$

With this allocation, the first level of the solution algorithm defined in $(4.5 \mathrm{a}, \mathrm{b})$ is performed as follows

$$
\begin{gather*}
P_{0}: y_{1}=L_{1}^{-1} b_{1}, \hat{b}_{2}=b_{2}-\bar{C}_{2} y_{1}  \tag{5.7a,b}\\
P_{i}: y_{r}=L_{r}^{-1} b_{r}, \hat{b}_{r+1}=b_{r+1}-\bar{C}_{r+1} y_{r}  \tag{5.7c,d}\\
\tilde{b}_{r-1}=-\bar{B}_{r-1} y_{r}  \tag{5.7e}\\
\text { for } \mathrm{r}=2 \mathrm{i}+1 \text { and } \mathrm{i}=1,2, \ldots, \mathrm{p}-2 \\
P_{p-1}: y_{N}=L_{N}^{-1} b_{N}, \tilde{b}_{N-1}=-\bar{B}_{N-1} y_{N} \tag{5.7f,g}
\end{gather*}
$$

The partial $\bar{b}$-vectors $\hat{b}_{s}$ and $\tilde{b}_{s}$ are associated with the unfactored diagonal blocks $\hat{A}_{s}$ and $\tilde{A}_{s}$. This implies that they are communicated along the same routes and the calculation $\bar{b}_{s}=\hat{b}_{s}+\tilde{b}_{s}$ takes place in the same processor and at the same level as the calculation $\bar{A}_{s}=\hat{A}_{s}+\tilde{A}_{s}$.

The first communication step then involves

$$
\begin{aligned}
& \text { Send }\left(\tilde{b}_{4 i+2}, \hat{b}_{4 i+4}\right) \text { to } P_{2 i} \text { for } i=0,1, \ldots,(p / 2)-2 . \\
& \text { Send } \tilde{b}_{N-1} \text { to } P_{p-2}
\end{aligned}
$$

At level 2 the computation of $\bar{b}_{2}, \bar{b}_{6}, . ., \bar{b}_{N-1}$ are completed according to (4.5b),

$$
\begin{equation*}
\bar{b}_{4 i+2}=\hat{b}_{4 i+2}+\tilde{b}_{4 i+2} \text { for } \mathrm{i}=0,1, . .,(\mathrm{p} / 2)-1 \tag{5.8}
\end{equation*}
$$

The remaining $\bar{b}$-vectors, $\bar{b}_{4}, \bar{b}_{8}, . ., \bar{b}_{N-3}$ are not computed at level 2 since they are not needed at this level. Besides, their computation would require communication among non-neighbors.

The computation algorithm and communication scheme as outlined can be continued down to the bottom level and up again according to (4.5d). Instead of giving a formal algorithm which will come out rather complicated, the informal description of the multilevel solution algorithm will be supplemented with an example for $\mathrm{N}=7$ and $\mathrm{p}=4$ processors. The example shown in Fig. 5.2 also includes the $\bar{Q}$-matrices and a specification of the level where they were computed.

The communication pattern of the solution phase can be deduced from the data dependencies among the different levels of computations shown in Fig. 5.2.

Proc.
00
Level
$\left.\begin{array}{llll} \\ 1 & {\left[\begin{array}{ll}L_{1} U_{1} & \bar{B}_{1} \\ \bar{C}_{2} & \hat{A}_{2}\end{array}\right]} \\ 2 & {\left[\begin{array}{llll}L_{2} U_{2} & \bar{D}_{2} \\ \bar{E}_{4} & \hat{\hat{A}}_{4}\end{array}\right]} & {\left[\begin{array}{llll}L_{3} U_{3} & \bar{B}_{3} & \bar{C}_{3} \\ \bar{C}_{4} & \hat{A}_{4} & \bar{E}_{4} \\ \bar{B}_{2} & \bar{D}_{2} & \tilde{A}_{2}\end{array}\right]} & {\left[\begin{array}{lll}L_{5} U_{5} & \bar{B}_{5} & \bar{C}_{5} \\ \bar{C}_{6} & \hat{A}_{6} & \bar{E}_{6} \\ \bar{B}_{4} & \bar{D}_{4} & \tilde{A}_{4}\end{array}\right]}\end{array}\right]\left[\begin{array}{lll}L_{6} U_{6} & \bar{E}_{6} \\ \bar{D}_{4} & \tilde{\tilde{A}}_{4}\end{array}\right] \quad\left[\begin{array}{lll}L_{7} U_{7} & \bar{C}_{7} \\ \bar{B}_{6} & \tilde{A}_{6}\end{array}\right]$

2

3

$$
-4-4
$$

3x

$$
\bar{b}_{2}=\hat{b}_{2}+\tilde{b}_{2}
$$

$$
\bar{b}_{6}=\hat{b}_{6}+\tilde{b}_{6}
$$

$$
y_{6}=L_{6}^{-1} b_{6}
$$

$$
\tilde{\tilde{b}}_{4}=\tilde{b}_{4}-\overline{\bar{D}}_{4} y_{6}
$$

Figure 5.2: Parallel solution algorithm for $\mathrm{N}=7$ based on multilevel LU-factorization.

## 6 Parallel band matrix solver for hypercube

### 6.1 Partitioning and allocation

The principles of a multilevel parallel solver described in the previous sections were used for an implementation on the Intel iPSC hypercube. The implementation is restricted to linear systems with structurally symmetric band matrices. This restriction is exploited in the implementation to improve the basic implementation principles described in the previous section. The improvement involves the equivalent of 2-way Gaussian elimination for the partial factorization of $Q_{1}$ and $Q_{N}$ to permit an efficient load distribution. Programs and documentation are presented in [11].

The band matrix is charaterized by dimension n and by bandwidth $\mathrm{w}=1+2 \mathrm{~m}$ where m is the number of upper or lower off-diagonal elements.

The block tridiagonal structure of (2.1) is imposed on the band matrix by choosing N and $n_{1}, n_{2} . ., n_{N}$ properly. There exist the following constraints:

$$
\sum_{r=1}^{N} n_{r}=n \text { and } \mathrm{n}_{\mathrm{r}} \geq \mathrm{m} \text { for } \mathrm{r}=1,2, . ., \mathrm{N}
$$

N is chosen to match the actual hypercube (or sub-cube) of dimension d which means that relation (5.1) is fulfilled.

The dimension of the separators is chosen to the minimum dimension,

$$
n_{s}=m, s=2,4, . ., N-1
$$

in order to minimize the amount of work involved in the lower levels of the algorithm.
The dimensions of the odd numbered blocks are in general determined by the load distribution algorithm which will be described later. However, a special case should be mentioned here, namely the minimum dimension problem where $n_{r}=m, r=1,2, . ., N$. This is the case where the blocks of the block tridiagonal matrix in (2.1) are chosen as small as possible. This could be the case if the algorithm is excuted by a fine grain parallel computer where the maximum number of blocks for a particular band matrix are required in order to exploit as many processors as possible.

Besides, the lower right-hand block tridiagonal matrix of (3.1) (called $A_{t}$ in (3.3)) is a minimum dimension matrix with all blocks of dimension $m$ when $\bar{A}$ originates from a block tridiagonal matrix with seperators of dimension m . The minimum dimension property also applies to the lower levels.

The structure of $\bar{Q}_{r}$ for $\mathrm{r}=3,5, . ., \mathrm{N}-2$ is given in Fig. 6.1. The densely dotted areas correspond to the non-zero entries of $Q_{r}$ with $D_{r-1}=0, E_{r+1}=0$ and the lightly dotted areas correspond to fill-ins created by the partial factorization. $\bar{Q}_{1}$ and $\bar{Q}_{N}$ are depicted similarly in Fig. 6.2 and 6.3.

The partial factorization of $Q_{r}$ leading to $\bar{Q}_{r}$ for $\mathrm{r}=3,5, . ., \mathrm{N}$ results in rather severe amounts of fill-ins, not just in terms of zero blocks being filled, like $\tilde{A}_{r-1}, \bar{D}_{r-1}$ and $\bar{E}_{r+1}$ but also fill-ins inside $B_{r-1}$ and $C_{r} . Q_{1}$ is partially factored completely without fill-ins while $Q_{N}$ is similar to the general $Q_{r}$-matrix in this respect.

However, it is possible to make the partial LU-factorization of level 1 symmetric by modifying the factorization of $Q_{N}$. The ordering of the band matrix $A_{N}$ of $Q_{N}$ is reversed by a symmetric row column reordering. The column reordering also includes $B_{N-1}$ and the row reordering includes $C_{N}$. After this reordering, the partial factorization of $Q_{N}$ leads to a matrix $\bar{Q}_{N}$ with a structure identical to $\bar{Q}_{1}$ shown in Fig. 6.2.

The reordering of $Q_{N}$ results in a considerable saving in the operations count of the partial LU-factorization. The large differences in operations count in partially factoring $Q_{1}$ and $Q_{N}$ on one side and $Q_{3}, Q_{5}, \ldots, Q_{N-2}$ on the other side are exploited in the load distribution algorithm.


Figure 6.1. The structure of $\overline{\mathrm{Q}}_{\mathrm{T}}$ for $\mathrm{r}=3,5, \ldots, \mathrm{~N}-2$


Figure 6.2. The structure of $\overline{\mathrm{Q}}_{1}$


Figure 6.3. The structure of $\overline{\mathrm{Q}}_{\mathrm{N}}$

The Q-matrices of the lower levels are all block matrices with full $\mathrm{m} \times \mathrm{m}$ blocks. The extreme matrices, corresponding to $Q_{1}$ and $Q_{N}$, are $2 \times 2$ block matrices while the interior matrices have block dimension 3. The matrix at the lowest level $Q_{(N+1) / 2}$ is just one block of dimension $m$.

### 6.2 Complexity analysis

The performance of the multilevel algorithm can be estimated on the basis of complexity analysis. Let $F_{r}$ denote the number of floating point operations required for the partial factorization of $Q_{r}$. We then have the following operations counts.

Level 1

$$
\begin{equation*}
F_{r}\left(n_{r}\right)=n_{r} m(2 m+1), r=1, N . \tag{6.1a}
\end{equation*}
$$

$$
\begin{equation*}
F_{r}\left(n_{r}\right)=2 n_{r} m(4 m+1), r=3,5, . ., N-2 \tag{6.1b}
\end{equation*}
$$

Level $\ell(2 \leq \ell \leq d)$

$$
\begin{gather*}
F_{2(\ell-1)}=F_{N+1-2^{\ell-1}}=\frac{14}{3} m^{3}-\frac{3}{2} m^{2}-\frac{1}{6} m  \tag{6.1c}\\
F_{2^{\ell} i+2^{\ell-1}}=\frac{38}{3} m^{3}-\frac{5}{2} m^{2}-\frac{1}{6} m  \tag{6.1d}\\
\text { for } i=1,2, . .,\left(p / 2^{\ell-1}\right)-2
\end{gather*}
$$

Level d+1

$$
\begin{equation*}
F_{2^{d}}=\frac{2}{3} m^{3}-\frac{1}{2} m^{2}-\frac{1}{6} m \tag{6.1e}
\end{equation*}
$$

The operations count of a standard band LU-factorization, called $F_{B L U}$ is

$$
\begin{equation*}
F_{B L U}=n m(2 m+1)-\frac{4}{3} m^{3}-\frac{3}{2} m^{2}-\frac{1}{6} m \tag{6.2}
\end{equation*}
$$

A rough estimate of the total complexity of the multilevel algorithm is $F_{M L U} \approx$ $2 n m(4 m+1)$ for $n \gg m$ and $N \gg 1$. This implies that $F_{M L U} / F_{B L U} \approx 4$ which is the penalty for being able to do the LU-factorization in parallel. According to this, the maximum speed-up from the multilevel algorithm using p processors is expected to be $\mathrm{p} / 4$. A more accurate speed-up calculation including the effect of the load balancing will be given in the next section.

The parallel complexity of the multilevel LU-factorization called $F_{P L U}$ can be computed as the sum of the dominating complexities at each level,

$$
\begin{gather*}
F_{P L U}=F_{2^{d}-1}\left(n_{2^{d}-1}\right)+(d-2) F_{2^{d}-2}+F_{2^{(d-1)}}+F_{2^{d}} \\
=2 n_{2^{d}-1} m(4 m+1)+(d-2)\left(\frac{38}{3} m^{3}-\frac{5}{2} m^{2}-\frac{1}{6} m\right)+\frac{16}{3} m^{3}-2 m^{2}-\frac{1}{3} m \tag{6.3}
\end{gather*}
$$

$F_{2^{d}-1}$ and $F_{2^{d}-2}$ represent complexities at top and intermediate levels, respectively. The expression in (6.3) is correct under certain assumptions about the partitioning, e.g. $n_{1}=$ $n_{3}=. . n_{N}$ or the partitioning defined by the load balance relations (6.10). $F_{P L U}$ is valid for $d \geq 2$, and using $p=2^{d}$ processors, $F_{P L U}$ gives the execution time for an LU-factorization when multiplied by the average floating point excution time called $T_{F}$.

The communication time model is based on the model [9]

$$
\begin{equation*}
t=T_{0}+M / B \tag{6.4}
\end{equation*}
$$

where the latency $T_{0}=1.2 \mathrm{msec}$ and bandwidth $\mathrm{B}=1 \mathrm{MByte} / \mathrm{sec}$ are measured values for neighbor to neighbor communication on the Intel iPSC. The model in (6.4) is valid for messages of length M bytes fulfilling $0 \leq \mathrm{M} \leq 1024$ bytes.

The multilevel LU-factorization algorithm involves the communication of $2 \mathrm{~m} \times 2 \mathrm{~m}$ matrices from one level to a lower, except to the last level where only an $m \times m$ matrix is
transferred. The $2 \mathrm{~m} \times 2 \mathrm{~m}$ matrices of double precision numbers fit into 1 KByte for $\mathrm{m} \leq 5$. Under this assumption, the communication model corresponding to (6.3) is

$$
\begin{equation*}
T_{C L U}=(d-1)\left(T_{0}+32 m^{2} / B\right)+T_{0}+8 m^{2} / B \tag{6.5}
\end{equation*}
$$

Table 6.1 shows $T_{C L U} /\left(T_{F} F_{P L U}\right)$ for the minimum dimension problem ( $F_{P L U}$ is computed from (6.3) with $n_{2^{d}-1}=\mathrm{m}$ ). $T_{F}=50 \mu \mathrm{sec}$ is an approximate value of the gross floating point execution time measured for the multilevel algorithm. Communication is $\mathrm{O}\left(\mathrm{m}^{2}\right)$ while computation is $\mathrm{O}\left(\mathrm{m}^{3}\right)$.

| $\mathrm{d} \backslash \mathrm{m}$ | 1 | 2 | 3 | 4 | 5 | 6 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 2 | 3.75 | 0.48 | 0.153 | 0.070 | 0.041 | 0.034 |
| 3 | 3.68 | 0.39 | 0.125 | 0.058 | 0.033 | 0.030 |
| 4 | 3.63 | 0.36 | 0.115 | 0.054 | 0.031 | 0.029 |
| 5 | 3.61 | 0.34 | 0.109 | 0.051 | 0.030 | 0.028 |

Table 6.1 Communication to computation ratio, $T_{C L U} /\left(T_{F} F_{P L U}\right)$, for the multilevel LU-factorization algorithm applied to the minimum dimension problem.

This is clearly reflected by Table 6.1 which shows that communication dominates for $\mathrm{m}=1$ but loses significance very quickly for increasing values of m . The model (6.4) and therefore also (6.5) is only valid for $\mathrm{m} \leq 5$. The model was extended also to include $\mathrm{m}=6$ which is shown in Table 6.1. This demonstrates that it is sufficient to take the communication cost into account for $\mathrm{m} \leq 5$.

The communication to computation ratios of Table 6.1 can be considered worst-case values for the multilevel algorithm. The operations count $F_{P L U}$ in (6.3) is $\mathrm{O}\left(n_{2^{d}-1} \mathrm{~m}^{2}\right)$ while $T_{C L U}$ in (6.5) is $\mathrm{O}\left(\mathrm{m}^{2}\right)$. This means that communication becomes negligible for $n_{r} \gg m, \mathrm{r}=1,3, . ., \mathrm{N}$, even for $\mathrm{m}=1$.

The speed-up of the parallel multilevel LU-factorization algorithm is defined as the execution time of a standard band LU-factorization on a single processor divided by the execution time of the parallel algorithm executed on $p$ processors.

The speed-up is computed as $S_{L U}=T_{F} F_{B L U} /\left(T_{F} F_{P L U}+T_{C L U}\right)$ where the complexities are given in (6.2), (6.3) and (6.5). Table 7.1 shows speed-up values computed for the minimum dimension problem $\left(n_{2^{d}-1}=m\right)$ for selected values of $m$ and d. Theoretical values of $S_{L U}$ are given in parantheses for comparison with experimentally determined values.

The multilevel solution algorithm consists of a forward sweep (levels 1 to $d+1$ ) and a backward sweep (levels $\mathrm{d}+1$ to 1), cf. Fig. 5.2. Analogously to the LU -factorization, only the computations of level 1 depend on the total dimension $n$ of the problem while the lower levels only depend on m .

The complexity of the solution algorithm is stated separately for the forward and the backward sweep. Let $\bar{F}_{r}$ denote the complexity of the computational step during the forward sweep involving the partially factored matrix $\bar{Q}_{r}$, e.g. the complexity of

$$
y_{r}=L_{r}^{-1} b_{r}, \hat{b}_{r+1}=b_{r+1}-\bar{C}_{r+1} y_{r}, \tilde{b}_{r-1}=-\bar{B}_{r-1} y_{r}
$$

Similarly $\tilde{F}_{r}$ denotes the complexity of a computational step involving $\bar{Q}_{r}$ during the backward sweep. We then have the following complexities:

Level 1

$$
\begin{gather*}
\bar{F}_{1}\left(n_{1}\right)=2 n_{1} m, \bar{F}_{N}\left(n_{N}\right)=2 n_{N} m-m  \tag{6.6a,b}\\
\bar{F}_{r}\left(n_{r}\right)=4 n_{r} m-m, r=3,5, . ., N-2  \tag{6.6c}\\
\tilde{F}_{1}\left(n_{1}\right)=n_{1}(2 m+1), \tilde{F}_{N}\left(n_{N}\right)=n_{N}(2 m+1)  \tag{6.6~d,e}\\
\tilde{F}_{r}\left(n_{r}\right)=n_{r}(4 m+1), r=3,5, . ., N-2 . \tag{6.6f}
\end{gather*}
$$

Level $\ell(2 \leq \ell \leq d)$

$$
\begin{gather*}
\bar{F}_{2(\ell-1)}=\bar{F}_{N+1-2(l-1)}=\tilde{F}_{2(l-1)}=\tilde{F}_{N+1-2(l-1)}=3 m^{2}  \tag{6.6~g}\\
\bar{F}_{2^{\ell} i+2^{(\ell-1)}}=5 m^{2}-m, \tilde{F}_{2^{\ell} i+2^{(\ell-1)}}=5 m^{2} \text { for } i=1,2, \ldots,\left(p / 2^{(\ell-1)}\right)-2 \tag{6.6~h,i}
\end{gather*}
$$

Level d+1

$$
\begin{equation*}
\bar{F}_{(N+1) / 2}=\tilde{F}_{(N+1) / 2}=m^{2} \tag{6.6j}
\end{equation*}
$$

The parallel complexity of the solution algorithm $F_{P S}$ can now be computed as the sum of the dominating complexities at each level. The expression (6.7) is valid under the same assumptions as (6.3).

$$
\begin{equation*}
F_{P S}=(d-2)\left(10 m^{2}-m\right)+n_{2^{d}-1}(8 m+1)+8 m^{2}-m \tag{6.7}
\end{equation*}
$$

The communication model for the multilevel solution algorithm is

$$
\begin{equation*}
T_{C S}(m)=2\left[(d-1)\left(T_{0}+16 m / B\right)+T_{0}+8 m / B\right] \tag{6.8}
\end{equation*}
$$

The parameters $T_{0}$ and B are given in connection with (6.5), and (6.8) is valid for $\mathrm{m} \leq 64$

| $\mathrm{d} \backslash \mathrm{m}$ | 1 | 2 | 5 | 10 | 15 | 20 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 2 | 5.36 | 1.43 | 0.24 | 0.062 | 0.028 | 0.016 |
| 3 | 5.36 | 1.38 | 0.23 | 0.058 | 0.027 | 0.015 |
| 4 | 5.36 | 1.35 | 0.22 | 0.056 | 0.026 | 0.015 |
| 5 | 5.36 | 1.34 | 0.22 | 0.055 | 0.025 | 0.015 |

Table 6.2. Communication to computation ratio, $T_{C S} /\left(T_{F} F_{P S}\right)$, for the multilevel solution algorithm applied to minimum dimension problems.

Table 6.2 shows how communication affects the efficiency of the multilevel solution algorithm applied to minimum dimension problems, i.e. the computational complexity $F_{P S}$ is computed from (6.7) for $n_{2^{d}-1}=m$. Comparing with Table 6.1, it is seen that the solution algorithm is communication bounded to a larger extent than the LU-factorization algorithm.

Communication cost only depends on $m$ and $d(6.8)$ while the computational complexity is $O\left(n_{2^{d}-1} m\right)$. This means that Table 6.3 shows a worst-case situation, and that communication cost even for $m=1$ becomes negligible for $n_{2^{d}-1} / m \gg 1$.

The operations count of a standard forward-backward solution based on an LUfactored band matrix is

$$
\begin{equation*}
F_{B S}=4 n m+n-2 m^{2}-2 m \tag{6.9}
\end{equation*}
$$

The speed-up of the multilevel solution algorithm can now be computed for the minimum dimension problem as $S_{S}=T_{F} F_{B S} /\left(T_{F} F_{P S}+T_{C S}\right)$ where the complexities are given in (6.7) (for $n_{2^{d}-1}=m$ ), (6.8) and (6.9). Table 7.3 in section 7 below shows speed-up values computed from $S_{S}$ for selected values of $m$ and d. The speed-up values estimated by $S_{S}$ are given in parantheses for comparison with experimentally determined values.

### 6.3 Load balancing

From (6.1 a,b) it is clear that a uniform partitioning of the band matrix, $n_{1}=n_{3}=. .=$ $n_{N}$, will lead to poor load balance for the LU-factorization. A first attempt to improve the load balance would be to choose $n_{3}=n_{5}=. .=n_{N-2}$ and $n_{1}=n_{N}=4 n_{3} \frac{4 m+1}{4 m+2}$. This choice results in $F_{1}=F_{3}=. .=F_{N}$ and thus load balance in level 1. However, some imbalance still remains in the lower levels as specified by ( $6.1 \mathrm{c}, \mathrm{d}$ ).

The lower levels could be load balanced after the same principle followed in level 1 by chosing some of $n_{2}, n_{4}, . ., n_{N-1}$ greater than m, but this is undesirable since it would increase the computational load of the lower levels where parallelism is harder to exploit.

The load balancing scheme chosen for the band matrix can be explained by referring to the complexity relations (6.1) and to Fig. 5.1. The factorization of $Q_{1}$ and $Q_{3}$ finish at the same time when $n_{1}=4 n_{3} \frac{4 m+1}{4 m+2}$ permitting processor $P_{0}$ to start the factorization of $Q_{2}$ without idle time. Likewise, the factorization of $Q_{5}$ and $Q_{7}$ complete at the same time when $n_{5}=n_{7}$. The ratio $n_{5} / n_{3}$ is now adjusted such that the factorization of $Q_{2}$ and $Q_{6}$ finish at the same time. This would complete the load balancing of Fig. 5.1 since the matrix partitioning is symmetric $n_{1}=n_{N}, n_{3}=n_{N-2}, \ldots$

The balancing principle is easily generalized and (6.10) gives a general set of balance equations based on (6.1). For a given set of values $n, m$, and $d$, the load balance equations determine $n_{1}, n_{3}, . ., n_{N}$. The equations are easily solved by expressing $n_{3}, n_{7}, .$. by $n_{1}$ and substituting these expressions into the last equation and solving it for $n_{1}$.

$$
\begin{gather*}
n_{r}=n_{N+1-r}, r=1,3, \ldots,(N-1) / 2 \quad \text { (symmetry) }  \tag{6.10a}\\
n_{5}=n_{7} \tag{6.10b}
\end{gather*}
$$

$$
\begin{gather*}
n_{9}=n_{11}=n_{13}=n_{15}  \tag{6.10c}\\
\ldots  \tag{6.10d}\\
n_{2^{(d-1)}+1}=n_{2^{(d-1)}+3}=\ldots=n_{2^{d-1}}  \tag{6.10e}\\
F_{1}\left(n_{1}\right)=F_{3}\left(n_{3}\right)  \tag{6.10f}\\
F_{1}\left(n_{1}\right)+F_{2}=F_{7}\left(n_{7}\right)+F_{6}  \tag{6.10~g}\\
F_{1}\left(n_{1}\right)+2 F_{2}=C_{15}\left(n_{15}\right)+2 F_{14}  \tag{6.10h}\\
\ldots  \tag{6.10i}\\
F_{1}\left(n_{1}\right)+(d-2) F_{2}=F_{2^{d}-1}\left(n_{2^{d}-1}\right)+(d-2) F_{2^{d-2}} \\
2\left(n_{1}+n_{3}+2 n_{7}+4 n_{15 \cdot .}+2^{d-2} n_{2^{d}-1}\right)+\left(2^{d}-1\right) m=n
\end{gather*}
$$

The solution of (6.10) will in general not lead to integer values of $n_{1}, n_{3}, \ldots$, but the resulting values are rounded to satisfy the last equation which states that the total number of equations is n . When n is too small, an effective load balance is not obtained. The balance equations (6.10) yield $n_{r}$-values smaller than $m$ which is not permitted by the present implementation. This situation is dealt with by increasing the $n_{r}$-values smaller than m to become m , and by reducing the $n_{r}$-values greater than m correspondingly.

Communication cost has not been included into the load balance equations since communication cost is such a small fraction that it can only be taken properly into consideration for $\mathrm{m} \leq 2$. This can be seen from the following argument.

Equation (6.10e) is modified to include communication cost:

$$
F_{1}\left(n_{1}\right)=F_{3}\left(n_{3}\right)+\left(T_{0}+32 m^{2} / B\right) / T_{F}
$$

Since $F_{3}\left(n_{3}\right)=2 n_{3} m(4 m+1)$, communication is only going to affect $n_{3}$ after rounding if

$$
2\left(T_{0}+32 m^{2} / B\right) / T_{F} \geq 2 m(4 m+1)
$$

The break-even value for m is $\mathrm{m}=2.52$ which means that communication cost is too small in a relative sense to be included in the load balance if $m>2$. This is in good agreement with Table 6.1.

It is obvious from (6.1) that the execution time for the multilevel LU-factorization is proportional to $n$ when $n_{1}=n_{3}=\ldots=n_{N}$. When the partitioning is based on the load balance equations (6.10), the complexity $F_{P L U}$ defined in (6.3) and thus the execution time is still proportional to n since $n_{2^{d}-1}$ in (6.3) depends linearly on n through the load balance equations (6.10). The derivative obtained by solving (6.10) is:

$$
\begin{equation*}
\partial n_{2^{d}-1} / \partial n=\left[4(4 m+1) /(2 m+1)+2+4+8+\ldots+2^{d-1}\right]^{-1} \tag{6.11}
\end{equation*}
$$

The multilevel solution algorithm can be load balanced using the equations (6.10) where $\bar{F}_{r}$ functions from (6.6) are substituted for $F_{r}$. This entails some approximations besides ignoring communication cost. First, (6.10) has $n_{1}=n_{N}$ while $\bar{F}_{1}\left(n_{1}\right) \neq \bar{F}_{N}\left(n_{1}\right)$;
and secondly, $\bar{F}_{r} \neq \tilde{F}_{r}$ which means that balance is only obtained for the forward sweep. The discrepancies, however, are either $\mathrm{O}(\mathrm{m})$ or $\mathrm{O}\left(n_{r}\right)$ and do not lead to serious imbalance.

A crude approximation to the solution algorithm load balance can be based on the equation $\bar{F}_{1}\left(n_{1}\right)=\bar{F}_{3}\left(n_{3}\right), n_{1}=n_{N}$ and $n_{3}=n_{5}=. .=n_{N-2}$. This leads to $n_{1} \approx$ $2 n_{3}$ which should be compared with the corresponding relation $n_{1} \approx 4 n_{3}$ for the LUfactorization. This implies that optimum load balance requires different partitionings of the band matrix for LU-factorization and solution. Since the solution must be based on the result of a factorization and the load distribution chosen for the factorization, load distribution should in practice be chosen to minimize total execution time of LUfactorization and solution(s).

The minimization of the execution time of a multilevel LU-factorization followed by a sequence of solution steps (e.g. for pseudo-Newton iteration) is a complicated problem and will not be addressed. If only one solution per LU-factorization is needed, the forward sweep of the solution algorithm can be merged with LU-factorization leading to a multilevel Gaussian elimination, and this part can easily be load balanced by solving (6.10) with $F_{r}+\bar{F}_{r}$ substituted for $F_{r}$. The backward sweep will not be well balanced, but execution time can only be reduced by a different load distribution for a very large number of processors and very small value of m.

## 7 Performance of the parallel band matrix solver

### 7.1 Execution time model

Figure 7.1 shows an execution time model for sequential band LU-factorization (marked $\left.T_{B L U}\right)$. Execution time for the sequential algorithm is proportional to $F_{B L U}$ defined in (6.2).

The execution time graph for the parallel multilevel algorithm has more complicated features. For a given number of processors, $\mathrm{p}=(\mathrm{N}+1) / 2$, the smallest problem that can be solved has dimension $n=m$. Therefore the execution time graph starts at $n=m N$.


Figure 7.1.: Executing time T as a function of problem dimension $n$ for a given set of $m$ and $N\left(=2^{d+1}-1\right)$.

The load balance equations (6.10) lead to the following relations:

$$
n_{1}>n_{3}>n_{7}>n_{15}>\ldots>n_{2^{d}-1}
$$

This means that the smallest value of n for which load balance is effective ( $\mathrm{n}=n_{L}$ ) is defined by $n_{2^{d}-1}=\mathrm{m}$. This value is substituted into $(6.10 \mathrm{~h})$ from which $n_{1}$ can be computed. The remaining $n_{r}$-values can now be computed from equations ( $6.10 \mathrm{e}, \mathrm{f}, \mathrm{g}, .$. ) and $n_{L}$ is computed from (6.10i) as

$$
n_{L}=2\left(n_{1}+n_{3}+2 n_{7}+4 n_{15}+. .+2^{d-2} m\right)+\left(2^{d}-1\right) m
$$

Execution time is constant ( $T_{P L U}=T_{m}$ ) for $\mathrm{m} \mathrm{N} \leq \mathrm{n} \leq n_{L}$.
For $\mathrm{n}>n_{L}$, the execution time increases linearly with n , according to (6.3) and (6.11). The speed-up of the parallel multilevel algorithm over the sequential algorithm for a given value of n is derived from Fig. 7.1 as $S_{L U}=T_{B L U} / T_{P L U}$. The speed-up is a nonlinear function of $n$.

Two speed-up values are of particular interest, namely the speed-up of the minimum dimension problem,

$$
\begin{equation*}
S_{L U}^{\min }=T_{B L U} /\left.T_{P L U}\right|_{n=m N} \tag{7.1}
\end{equation*}
$$

and the maximum asymptotic speed-up computed for $n>n_{L}$ :

$$
\begin{equation*}
S_{L U}^{\infty}=\left(\partial T_{B L U} / \partial n\right) /\left(\partial T_{P L U} / \partial n\right) \tag{7.2}
\end{equation*}
$$

The speed-up of the minimum dimension problem $S_{L U}^{\min }$ is a worst-case value. It is characterized by only two parameters, m and N , and it does not involve load balancing.

The maximum asymptotic speed-up cannot be attained since

$$
S_{L U}^{\infty}=\lim _{n \rightarrow \infty} S_{L U}
$$

and it corresponds to the speed-up obtained by neglecting the computation involved in the lower levels of the multilevel algorithm. For a given value of n , the speed-up, $S_{L U}$, is bounded as follows:

$$
S_{L U}^{\min } \leq S_{L U}<S_{L U}^{\infty}
$$

The discussion so far has only been concerned with LU-factorization. The solution algorithms have the same qualitative properties and $S_{S}^{\min }$ and $S_{S}^{\infty}$ are defined analogously to $S_{L U}^{m i n}$ and $S_{L U}^{\infty}$ in (7.1) and (7.2).

### 7.2 Numerical results

Figure 7.2 shows an example of the problems which were solved by the parallel multilevel LU-factorization in order to verify the properties of the algorithm experimentally. The graphs of Fig. 7.2 are of the types presented in Fig. 7.1. The dots indicate measured values. All results in this section are based on measurements reported in [11].

An interesting feature of Fig. 7.2 is that the graphs intersect. This has as a consequence that certain problems are solved more efficiently by fewer processors than the maximum number. The phenomenon originates from the load balance algorithm and from the fact that a doubling of the number of processors increases the depth of the multilevel algorithm by one.


Figure 7.2.: Execution time versus problem size for sequential ( $\mathrm{d}=0$ ) and multilevel parallel LU-factorization $(\mathrm{d} \geq 1)$. Bandwidth $\mathrm{w}=21$ corresponding to $\mathrm{m}=10$.

| $\mathrm{m} \backslash \mathrm{p}$ | 2 |  | 4 |  | 8 |  | 16 |  | 32 |  |
| :---: | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 2 | - | $(1.06)$ | 0.7 | $(0.78)$ | 0.8 | $(1.03)$ | 1.14 | $(1.54)$ | 2.13 | $(2.44)$ |
| 5 | 1.4 | $(1.56)$ | 0.95 | $(0.99)$ | 1.10 | $(1.19)$ | 1.60 | $(1.72)$ | 2.57 | $(2.67)$ |
| 10 | 1.5 | $(1.69)$ | 0.96 | $(0.99)$ | 1.14 | $(1.17)$ | 1.64 | $(1.67)$ | 2.57 | $(2.59)$ |
| 15 | 1.6 | $(1.71)$ | 0.96 | $(0.95)$ | 1.14 | $(1.10)$ | 1.65 | $(1.57)$ | 2.56 | $(2.43)$ |
| 20 | 1.6 | $(1.72)$ | 0.97 | $(0.95)$ | 1.15 | $(1.10)$ | 1.65 | $(1.57)$ | 2.53 | $(2.43)$ |

Table 7.1. Measured and predicted (in paranthesis) values of speed-up of the multilevel parallel LU-factorization over a sequential band factorization. The speed-up, $S_{L U}^{\min }$, applies to the minimum dimension problem, $\mathrm{N}=2 \mathrm{p}-1$ and $\mathrm{n}=\mathrm{m} \mathrm{N}$.

Table 7.1 shows measured and predicted (in parantheses) values of speed-up for the parallel multilevel LU-factorization algorithm applied to the minimum dimension problem. All execution times are measured with a resolution of 5 msec . This means that execution time for $m=2$ and $p=2$ cannot be measured ( $2-3 \mathrm{msec}$ ) and that execution time for $m=5$ and $p=2$ is not very accurate ( $20-25 \mathrm{msec}$ ). The measurements only include LUfactorization and corresponding communication for the parallel algorithm. Downloading of programs, set-up of problem etc. are excluded from the execution time measurements.

The predicted speed-up values are computed as explained in Section 6.1. The expression for $F_{P L U}$ given in (6.3) does not include $\mathrm{p}=2(\mathrm{~d}=1)$. A special formula, which is easily derived, was used for this column in Table 7.1.

There is good agreement between measured and predicted speed-up values except for $\mathrm{m}=2$ where overhead like procedure calls and initialization leads to smaller speed-up than expected from the model which only includes floating point operations.

For $m=15$ and $m=20$ the observed speed-up is slightly greater than the predicted speed-up for $p \geq 4$. This phenomenon could be explained by the fact that the block structure of the Q-matrices in the multilevel algorithm leads to the equivalent of unrolling of the loops of the factorization algorithm. The sequential band matrix factorization is programmed in a straightforward style.

Table 7.2 shows measured and predicted (in paranthesis) values of the maximum asymptotic speed-up for the parallel multilevel $L U$-factorization. Since these speed-up values correspond to neglecting the computational expense of the lower levels, communication is also neglected in the model. The predicted speed-up values are computed from $(6.2),(6.3),(6.11)$ and (7.2). There is very close agreement between measured and predicted values since the top level of the multilevel algorithm is programmed very similarly to the sequential band matrix factorization. This means that the measured speed-up is very close to the ratio of floating point operations counts.

The maximum asymptotic speed-up of the parallel LU-factorization using p processors and load balancing has the following limit which is easily derived from (6.1a,b) and (6.2):

$$
\begin{equation*}
S_{L U}^{\infty} \rightarrow(p+6) / 4 \text { for } m \rightarrow \infty \tag{7.3}
\end{equation*}
$$

Put into words, the performance of the multilevel LU -factorization algorithm with $p$ processor and load balancing according to (6.10) is identical to the performance with $\mathrm{p}+6$ processors and no load balancing ( $n_{1}=n_{3}=\ldots=n_{N}$ ).

| $\mathrm{m} \backslash \mathrm{p}$ | 2 |  | 4 |  | 8 |  | 16 |  | 32 |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 2 | 1.99 | $(2)$ | 2.55 | $(2.56)$ | 3.66 | $(3.67)$ | 5.90 | $(5.89)$ | 10.4 | $(10.3)$ |
| 5 | 1.99 | $(2)$ | 2.51 | $(2.52)$ | 3.56 | $(3.57)$ | 5.65 | $(5.67)$ | 9.84 | $(9.86)$ |
| 10 | 1.99 | $(2)$ | 2.50 | $(2.51)$ | 3.52 | $(3.54)$ | 5.57 | $(5.59)$ | 9.69 | $(9.68)$ |
| 15 | 1.99 | $(2)$ | 2.50 | $(2.51)$ | 3.51 | $(3.52)$ | 5.50 | $(5.56)$ | 9.66 | $(9.62)$ |
| 20 | 1.99 | $(2)$ | 2.49 | $(2.51)$ | 3.49 | $(3.52)$ | 5.48 | $(5.54)$ | 9.72 | $(9.59)$ |

Table 7.2. Measured and predicted (in paranthesis) values of maximum asymptotic speed-up, $S_{L U}^{\infty}$, of the multilevel parallel LU-factorization over
a sequential band factorization.

| $\mathrm{m} \backslash \mathrm{p}$ | 2 |  | 4 |  | 8 |  | 16 |  | 32 |  |
| :---: | :--- | ---: | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 2 | - | $(0.51)$ | 1.3 | $(0.73)$ | 1.8 | $(1.06)$ | 2.0 | $(1.66)$ | 2.7 | $(2.69)$ |
| 5 | 1.5 | $(1.2)$ | 1.25 | $(1.36)$ | 2.0 | $(1.91)$ | 3.21 | $(2.94)$ | 5.11 | $(4.73)$ |
| 10 | 1.5 | $(1.5)$ | 1.67 | $(1.56)$ | 2.41 | $(2.16)$ | 3.88 | $(3.30)$ | 6.17 | $(5.31)$ |
| 15 | 1.45 | $(1.58)$ | 1.77 | $(1.60)$ | 2.58 | $(2.21)$ | 4.07 | $(3.37)$ | 6.60 | $(5.41)$ |
| 20 | 1.59 | $(1.61)$ | 1.78 | $(1.61)$ | 2.65 | $(2.23)$ | 4.11 | $(3.39)$ | 6.67 | $(5.44)$ |

Table 7.3. Measured and predicted (in paranthesis) values of speed-up of the multilevel parallel solution algorithm over a sequential forward-backward band substitution algorithm. The speed-up $S_{S}^{\min }$ applies to the minimum dimension problem, $\mathrm{N}=2 \mathrm{p}-1$ and $\mathrm{n}=\mathrm{mN}$.

In Table 7.3 the speed-up values of the parallel solution algorithm are given for the minimum dimension problem. The solution algorithm examplified by Fig. 5.2 is applied to a multilevel LU-factorization produced by the parallel multilevel LU-factorization algorithm.

The measured speed-up values in Table 7.3 for $\mathrm{m}=2$ and $\mathrm{m}=5$ with $\mathrm{p} \leq 8$ are rather inaccurate because of the resolution of 5 msec in the execution time measurements.

Disregarding the inaccurate speed-up measurements, the observed speed-up is consistently greater than the predicted speed-up for $\mathrm{p}>2$. This somewhat surprising result was traced to an inadvertent exploitation of processor parallelism at the lower levels of the parallel solution algorithm. Floating point computation on the 80287 processor and index computation on the 80286 processor were to a certain degree overlapped in the parallel algorithm and not in the sequential solution algorithm.

There was no attempt to optimize either the sequential or the parallel implementation, and very careful optimization could probably improve the speed by $25 \%-50 \%$. However communication would still be insignificant for problems large enough to justify the use of a parallel computer. The experimental implementation therefore fulfils its purpose of demonstrating the feasibility of the multilevel algorithm.

The predicted asymptotic speed-up $S_{L U}^{\infty}$ involves the computation of $\partial T_{P L U} / \partial n$ which is computed as

$$
\partial T_{P L U} / \partial n=\left(\partial F_{2^{d}-1} / \partial n_{2^{d}-1}\right)\left(\partial n_{2^{d}-1} / \partial n\right) T_{F}
$$

Because of the nature of the load balance algorithm, we have the following result for $\mathrm{n}>n_{L}$ :

$$
\begin{equation*}
\partial T_{P L U} / \partial n=\left(\partial F_{r} / \partial n_{r}\right)\left(\partial n_{r} / \partial n\right) T_{F} \tag{7.4}
\end{equation*}
$$

for $r=1,3, . ., N$.
$S_{L U}^{\infty}$ can now be expressed as

$$
S_{L U}^{\infty}=\left(\partial T_{B L U} / \partial n\right) /\left[\left(\partial F_{1} / \partial n_{1}\right)\left(\partial n_{1} / \partial n\right) T_{F}\right]=\left(\partial n_{1} / \partial n\right)^{-1}
$$

The predicted asymptotic speed-up for the multilevel solution algorithm with load distribution for optimum $L U$-factorization speed can be expressed as

$$
S_{S}^{\infty}=\left(\partial T_{B S} / \partial n\right) /\left[\left(\partial\left(\bar{F}_{1}+\tilde{F}_{1}\right) / \partial n_{1}\right)\left(\partial n_{1} / \partial n\right) T_{F}\right]=\left(\partial n_{1} / \partial n\right)^{-1}
$$

which is equal to $S_{L U}^{\infty}$.
Since load balance is not for optimum solution algorithm speed, a relation similar to (7.4) is not valid. The partitioning $n_{1}$ is approximately twice as large as the optimum value. Therefore we have $\partial F_{P S} / \partial n=\partial\left(\bar{F}_{1}+\tilde{F}_{1}\right) / \partial n$.

A table analogous to Table 7.2 with measured and predicted values of $S_{S}^{\infty}$ for load distribution for optimum LU-factorization was constructed. As expected, it was essentially identical to Table 7.2, and it was therefore omitted.

Load distribution is chosen to be optimum for LU-factorization since this is close to minimum execution time for one LU-factorization followed by one solution step.

When one LU-factorization is computed followed by a large number of solution steps (e.g. Newton iteration) it may be advantageous to load balance for optimum solution speed. In this case we have:

$$
S_{S}^{\infty} \rightarrow(p+2) / 2 \text { for } m \rightarrow \infty
$$

This speed-up is almost twice as large as the corresponding limit value when load distribution is with respect to LU-factorization as stated in (7.3).

### 7.3 Row interleaved factorization and solution

An alternative parallel LU-factorization and solution approach is the row interleaved algorithm [8]. Contrary to the multilevel algorithm, the row interleaved algorithm does not pay any computational penalty for the parallelization, only communication penalty. Each pivot row must be broadcast to all processors to permit parallel factorization. The execution time model for the row interleaved LU-factorization is as follows:

$$
T_{R L U}=n\left[m(2 m+1) T_{F} / 2^{d}+d\left(T_{0}+8 m / B\right)\right]
$$

The row interleaved LU-factorization can be compared with the multilevel LU-factorization by comparing asymptotic speed-ups. For small values of m we have $S_{L U}^{\infty}>S_{R L U}^{\infty}$ where $S_{R L U}^{\infty}$ is defined analogously to $S_{L U}^{\infty}$ :

$$
S_{R L U}^{\infty}=\left(\partial T_{B L U} / \partial n\right) /\left(\partial T_{R L U} / \partial n\right)
$$

Table 7.4 gives the integer m -values for which the algorithms break even, $S_{L U}^{\infty} \approx S_{R L U}^{\infty}$.

| p | 4 | 8 | 16 | 32 |
| :---: | :---: | :---: | :---: | :---: |
| m | 13 | 15 | 21 | 31 |

Table 7.4. Integer values of m for which row interleaved and multilevel LU-factorizations break even in asymptotic speed-up.

The entries of Table 7.5 are computed from the equation

$$
\partial T_{R L U} / \partial n=\partial T_{P L U} / \partial n
$$

where

$$
\partial T_{P L U} / \partial n \approx 4 m(2 m+1) T_{F} /\left(2^{d}+6\right)
$$

This value is a good approximation assuming load distribution and "large" m .
For $m$-values larger than those given in Table 7.4, the row interleaved LU-factorization will be superior to the multilevel algorithm. For $\mathrm{p}=2$, the multilevel algorithm is always superior.

The values of table 7.4 were compared with measurements of an implementation of the row interleaved algorithm and measurements of the multilevel algorithm. The measurements resulted in $\mathrm{m}=14$ and $\mathrm{m}=20$ for $\mathrm{p}=8$ and $\mathrm{p}=16$, respectively. This is in good agreement with the predictions of the model.

The row interleaved solution algorithm performs very poorly since the number og broadcasts is the same as for the LU-factorization while computation is $\mathrm{O}(\mathrm{m})$ for each broadcast for the solution algorithm compared with $\mathrm{O}\left(\mathrm{m}^{2}\right)$ for the LU-factorization.

Concluding the comparison of multilevel and row interleaved algorithms, the former is superiour for narrow band problems while the latter takes over for wide band problems. The LU-factorizations break even for the m-values given in Table 7.5. For one LU-factorization and one solution step, the m-values corresponding to break even will increase.

It is obvious that the m-values of Table 7.4 are sensitive to the communication and computation performance of the parallel computer as modeled by $T_{0}, \mathrm{~B}$ and $T_{F}$. However, there is no trend in parallel computer technology towards a substantial shift of the breakeven values of $m$.

Finally, the multilevel solution method and the implementation techniques on the hypercube discribed in this paper should also be applicable to the other members of the family of permutations for parallel solution of block tridiagonal matrices proposed in [7].

## Acknowledgement

The careful implementation of the multilevel algorithms together with the numerical experiments were done by Per Ulkjær Andersen who also contributed to the principles behind load balancing, processor allocation and communication for the multilevel algorithm.

The work of I.N.Hajj was supported in part by the U.S.Joint Services Electronics Program and by Intel Corporation.

## References

[1 ] D. Heller, "Some aspects of the cyclic reduction algorithm for block tridiagonal systems", SIAM J. Numer.Anal., Vol. 13, no. 4, pp. 484-496, 1976.
[2 ] G. Birkhoff and A. George, "Elimination by nested dissection", in Complexity of Sequential and Numerical Algorithms edited by J. F. Traub, pp. 221-269, Academic Press, 1973.
[3 ] I. N. Hajj, "Sparsity considerations in network solution by tearing, IEEE Trans. Circuit and Systems, Vol. CAS-27, no. 5, pp. 357-366, 1980.
[4 ] J. J. Dongarra and A. H. Sameh, "One some parallel banded system solvers", Parallel Computing 1 (1984) 223-235.
[5 ] S. L. Johnsson, "Solving narrow banded systems on emsemble architectures", ACM Trans. Math. Software, Vol. 11, no. 3, pp. 271-288, 1985.
[6 ] U. Meier, "A parallel partition method for solving banded systems of linear equations", Parallel Computing 2 (1985) 33-43.
[7] S. Utku, M. Salama and R. J. Melosh, "A family of permutations for concurrent factorization of block tridiagonal matrices", IEEE Trans. on Computers, Vol. 38, No. 6, pp. 812-824, June 1989.
[8 ] Y. Saad and H. M. Schultz, "Parallel direct methods for solving banded linear systems", Linear Algebra and its Applications, Vol. 88/89, pp. 623-650, 1987.
[9 ] D. K. Bradley, "First and second generation hypercube performance", Report No. UIUCDCS-R-88-1455, Dept. Computer Science, University of Illinois, Urbana Champaign, USA, 1988.
[10 ] I. N. Hajj and S. Skelboe, "Multilevel parallel solver for banded linear systems", in Aspects of Computation on Asynchronous Parallel Processors, edited by M. H. Wright, IFIP 1989, pp. 69-78.
[11 ] P. Ulfkjær Andersen, "Implementation of a two-level a multilevel and a block parallel multilevel parallel solver for banded linear systems", Report, Department of Computer Science, University of Copenhagen, 1988.


[^0]:    *Coordinated Science Laboratory and the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
    ${ }^{\dagger}$ Department of Computer Science, University of Copenhagen, Universitetsparken 1, DK-2100 Copenhagen, Denmark

