
Parallel Computing

Volume 25, Issue 3, March 1999, Pages 311-319

An efficient parallel version of the Householder-QL matrix diagonalisation algorithm

https://doi.org/10.1016/S0167-8191(98)00116-1

Abstract

In this paper we report an effective parallelisation of the Householder routine for the reduction of a real symmetric matrix to tri-diagonal form and of the QL algorithm for the diagonalisation of the resulting matrix. The Householder algorithm scales like αN³/P + βN²log₂(P) and the QL algorithm like γN² + δN³/P as the number of processors P is increased for fixed problem size. The constant parameters α, β, γ and δ are obtained empirically. When only the eigenvalues are required the Householder method scales as above while the QL algorithm remains sequential. The code is implemented in C in conjunction with the Message Passing Interface (MPI) libraries and verified on a sixteen-node IBM SP2 using real matrices that occur in the simulation of the properties of crystalline materials.

Introduction

The parallelisation of methods for finding the eigenvalues and eigenvectors of real symmetric matrices has received little attention over the years, for two reasons. Firstly, for very large systems quite often only a few eigenvalues and eigenvectors are needed. Secondly, the parallelisation of the best sequential techniques for matrix diagonalisation has proved difficult and ineffective.

In the simulation of the dynamics of large structures, only the lowest few eigenvalues, corresponding to the most likely occurring low-frequency modes of vibration, are needed. Similarly, in solid state physics, states at or near the ground state are those most likely to be occupied, at least at low temperatures, and so make the largest contribution to the material properties. For these reasons, in subjects like structural mechanics, techniques have concentrated on reducing the number of degrees of freedom in the system, from perhaps millions to a few hundred, enabling classical sequential techniques to be used. In physics and chemistry, where the original matrix is regularly structured and sparse, iterative techniques like the Lanczos [9] or Davidson [2] methods, which recover the lowest few eigenvalues and their corresponding eigenvectors, can be used effectively.

There are, however, many problems for which it is desirable to have all of the eigenvalues and eigenvectors of the system. Physical simulations of quantum mechanical systems that have a few particles present at finite temperature cannot be properly analysed by methods intended for large numbers of particles, such as statistical mechanics, and are sufficiently far from their ground states as to be influenced by all available states. Similarly, in the design of mechanical structures, acoustic resonances might occur over a broad bandwidth. In such cases all eigenvalues and eigenvectors of large symmetric matrices are needed, for which no algorithm of complexity less than O(N³), where N is the dimension of the matrix to be diagonalised, exists, regardless of the sparsity of the matrix.

The most efficient algorithm for the diagonalisation of large symmetric matrices is the Householder tri-diagonalisation procedure followed by the QL diagonalisation algorithm. The operation count for the Householder method is approximately N³ whether or not eigenvectors are required, and the operation count for the QL method is roughly 30N² when the eigenvectors are not required and 3N³ when they are. These counts presume that the average number of QL iterations per eigenvalue is about 1.7 [10], an assumption that is confirmed for the test matrices that we use. The only viable alternative is the Jacobi method, which requires about 20N³ operations when eigenvectors are not required and about 40N³ when they are, assuming fewer than 10 Jacobi cycles for convergence [10]. The Jacobi method parallelises readily [11] but requires more than 20 processors before becoming competitive with the sequential Householder-QL algorithm. Previous attempts at parallelising the Householder and QL algorithms have resulted in inefficient solutions [4] or have altered the algorithms in some way [12], [3], [1], [8].
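One way to see where the `more than 20 processors' figure comes from is a rough break-even estimate using the operation counts quoted above for the eigenvalue-only case; this is our own back-of-the-envelope arithmetic, not a derivation taken from the paper.

```latex
% Eigenvalue-only break-even estimate (ours), from the counts quoted above.
\[
  T_{\mathrm{Householder\text{-}QL}} \approx N^3 + 30N^2 \approx N^3,
  \qquad
  T_{\mathrm{Jacobi}}(P) \approx \frac{20N^3}{P},
\]
\[
  \frac{20N^3}{P} < N^3
  \;\Longleftrightarrow\;
  P > 20 .
\]
```

Even with ideal speed-up, the parallel Jacobi method only overtakes the sequential Householder-QL combination once more than about 20 processors are used.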

Section snippets

Description of the algorithm

Clear descriptions of the Householder and QL algorithms can be found in [10] and [6]. If A is an N×N symmetric matrix, then a sequence of N−2 transformations of the form PN−2 ⋯ P1 A P1 ⋯ PN−2 reduces the matrix to symmetric tri-diagonal form. The matrix Pk is formed from the requirement that the matrix Ak = Pk Ak−1 Pk, where Ak−1 = Pk−1 ⋯ P1 A P1 ⋯ Pk−1, is tri-diagonal up to the kth column. For example, for a 6×6 matrix, A1 has the form

    d1  e1  0   0   0   0
    e1  d2  x   x   x   x
    0   x   x   x   x   x
    0   x   x   x   x   x
    0   x   x   x   x   x
    0   x   x   x   x   x

The Pk matrices can be written as Pk = I − 2WkWkᵀ,
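The snippet below is a minimal sequential sketch of a single such transformation. The routine name, the row-major N×N storage and the full-range update loops are our own illustrative choices, not the data layout of the parallel code described in the paper, which distributes this work over P processors.

```c
/* Minimal sequential sketch of one Householder step: zero the entries
 * of column k below the sub-diagonal by applying P = I - 2 w w^T from
 * both sides of the symmetric matrix A (row-major, n x n).           */
#include <math.h>
#include <stdlib.h>

void householder_step(double *a, int n, int k)
{
    double *w = calloc(n, sizeof(double));   /* reflection vector      */
    double *p = calloc(n, sizeof(double));   /* p = A w, later q       */

    /* Build w from the part of column k below the sub-diagonal.      */
    double norm2 = 0.0;
    for (int i = k + 1; i < n; i++)
        norm2 += a[i * n + k] * a[i * n + k];
    double alpha = -copysign(sqrt(norm2), a[(k + 1) * n + k]);
    if (alpha == 0.0) { free(w); free(p); return; }  /* already reduced */

    w[k + 1] = a[(k + 1) * n + k] - alpha;   /* sign chosen to avoid   */
    for (int i = k + 2; i < n; i++)          /* cancellation           */
        w[i] = a[i * n + k];
    double wnorm = 0.0;
    for (int i = k + 1; i < n; i++)
        wnorm += w[i] * w[i];
    wnorm = sqrt(wnorm);
    for (int i = k + 1; i < n; i++)
        w[i] /= wnorm;                       /* now P = I - 2 w w^T    */

    /* p = A w (w is zero in its first k+1 entries).                  */
    for (int i = 0; i < n; i++)
        for (int j = k + 1; j < n; j++)
            p[i] += a[i * n + j] * w[j];

    /* Rank-2 update A <- A - 2 w q^T - 2 q w^T, q = p - (w^T p) w.   */
    double wtp = 0.0;
    for (int i = k + 1; i < n; i++)
        wtp += w[i] * p[i];
    for (int i = 0; i < n; i++)
        p[i] -= wtp * w[i];                  /* p now holds q          */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] -= 2.0 * (w[i] * p[j] + p[i] * w[j]);

    free(w);
    free(p);
}
```

Applying this step for k = 0, …, N−3 reproduces the sequence of transformations above; the parallel version amounts to distributing the vector and rank-2 update loops across processors.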

Test cases

The algorithms were tested using matrices that occur in a many-body quantum mechanical model of crystals for which there are two states per atom. The total number of available states for the system is thus 2^N for an N-atom system. The nature of the problem is such that the matrices for l = 0, 1, …, N particles on N crystal sites can be generated separately, and the resulting matrices have dimension C(N, l) = N!/((N − l)! l!). The general sparsity pattern of the test matrices is shown in Fig. 1, in which non-zero
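For concreteness, the small helper below (ours, not taken from the paper) evaluates these block dimensions without forming the intermediate factorials; the example values of N and l in main are arbitrary.

```c
/* Illustrative helper: dimension of the l-particle block,
 * C(N, l) = N! / ((N - l)! l!), computed so every intermediate
 * result is an exact integer.                                        */
#include <stdio.h>

unsigned long long block_dimension(unsigned N, unsigned l)
{
    if (l > N) return 0;
    if (l > N - l) l = N - l;            /* use the smaller factor     */
    unsigned long long c = 1;
    for (unsigned i = 1; i <= l; i++)
        c = c * (N - l + i) / i;         /* exact division at each step */
    return c;
}

int main(void)
{
    /* e.g. 16 crystal sites with 8 particles -> a 12870 x 12870 block */
    printf("%llu\n", block_dimension(16, 8));
    return 0;
}
```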

Results

The results presented below were all obtained on an IBM SP2, a heterogeneous 23-node machine, using the subset of 16 `thin1' nodes on which our code was tested. Each `thin1' node consists of a 66 MHz Power2 processor and 128 Mbytes of RAM with a 64-bit memory bus, a 32 Kbyte instruction cache and a 64 Kbyte data cache. The latency and bandwidth for the machine, as measured by simple processor-to-processor `ping-pong' benchmarks, are 100 μs and 20 Mbytes/s respectively [7].
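The snippet below sketches the kind of processor-to-processor `ping-pong' test referred to; it is our own illustration rather than the benchmark code of [7], and the message size and repetition count are arbitrary choices.

```c
/* Two-rank ping-pong: rank 0 sends a buffer to rank 1 and back; half
 * the round-trip time gives latency (small messages) or bandwidth
 * (large messages).  Run with at least two MPI processes.            */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1 << 20;          /* 1 Mbyte message            */
    const int reps = 100;
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(nbytes);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * reps);  /* one-way time     */
    if (rank == 0)
        printf("one-way time %g s, bandwidth %g Mbytes/s\n",
               t, nbytes / t / 1.0e6);
    free(buf);
    MPI_Finalize();
    return 0;
}
```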

For the scaling test presented in

Summary

We have shown that a relatively straightforward parallelisation of the Householder tri-diagonalisation algorithm and the QL method for finding the eigenvalues and eigenvectors of the resulting matrix shows a time complexity like αN³/P + βN²log₂(P) for the Householder problem and like γN² + δN³/P for the QL algorithm for a fixed problem size on increasing numbers of processors, P. The log₂(P) term is empirically fitted to the data, but is consistent with MPI_Bcast and MPI_Allgather being implemented
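One plausible place where these collectives enter a parallel Householder step is the distributed matrix-vector product p = Aw. The sketch below assumes a row-block decomposition of the matrix, which is our own illustrative choice and not a description of the authors' decomposition.

```c
/* Each rank owns counts[rank] consecutive rows of A (a_loc).  The
 * Householder vector w is broadcast from `root`, each rank computes
 * its slice of p = A w, and MPI_Allgatherv reassembles the full p on
 * every rank.  Illustrative only; not the authors' code.             */
#include <mpi.h>
#include <stdlib.h>

void distributed_matvec(const double *a_loc, double *w, double *p,
                        int n, const int *counts, const int *displs,
                        int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_Bcast(w, n, MPI_DOUBLE, root, comm);      /* share w           */

    int nloc = counts[rank];
    double *p_loc = malloc(nloc * sizeof(double));
    for (int i = 0; i < nloc; i++) {              /* local rows of A w */
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += a_loc[i * n + j] * w[j];
        p_loc[i] = s;
    }

    /* Reassemble p = A w on every rank. */
    MPI_Allgatherv(p_loc, nloc, MPI_DOUBLE,
                   p, (int *)counts, (int *)displs, MPI_DOUBLE, comm);
    free(p_loc);
}
```

With tree-based implementations, each such collective costs O(N log₂P) per reduction step, and summing over the roughly N steps is consistent with the βN²log₂(P) term in the fitted model.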

