
Parallel Computing

Volume 25, Issue 3, March 1999, Pages 311-319

An efficient parallel version of the Householder-QL matrix diagonalisation algorithm

https://doi.org/10.1016/S0167-8191(98)00116-1

Abstract

In this paper we report an effective parallelisation of the Householder routine for the reduction of a real symmetric matrix to tri-diagonal form and of the QL algorithm for the diagonalisation of the resulting matrix. The Householder algorithm scales like αN³/P + βN²log₂(P) and the QL algorithm like γN² + δN³/P as the number of processors P is increased for fixed problem size. The constant parameters α, β, γ and δ are obtained empirically. When only the eigenvalues are required the Householder method scales as above while the QL algorithm remains sequential. The code is implemented in C in conjunction with the Message Passing Interface (MPI) libraries and verified on a sixteen-node IBM SP2 using real matrices that occur in the simulation of the properties of crystalline materials.

Introduction

The parallelisation of methods for finding the eigenvalues and eigenvectors of real symmetric matrices has received little attention over the years, for two reasons. Firstly, for very large systems quite often only a few eigenvalues and eigenvectors are needed. Secondly, the parallelisation of the best sequential techniques for matrix diagonalisation has proved difficult and ineffective.

In the simulation of the dynamics of large structures, only the lowest few eigenvalues, corresponding to the most likely occurring low-frequency modes of vibration, are needed. Similarly, in solid state physics, states at or near the ground state are those most likely to be occupied, at least at low temperatures, and so make the largest contribution to the material properties. For these reasons, in subjects like structural mechanics, techniques have concentrated on reducing the number of degrees of freedom in the system, from perhaps millions to a few hundred, enabling classical sequential techniques to be used. In physics and chemistry, where the original matrix is regularly structured and sparse, iterative techniques like the Lanczos [9] or Davidson [2] methods, which recover the lowest few eigenvalues and their corresponding eigenvectors, can be used effectively.

There are, however, many problems for which it is desirable to have all of the eigenvalues and eigenvectors of the system. Physical simulations of quantum mechanical systems that have a few particles present at finite temperature cannot be properly analysed by methods intended for large numbers of particles, such as statistical mechanics, and are sufficiently far from their ground states as to be influenced by all available states. Similarly, in the design of mechanical structures, acoustic resonances might occur over a broad bandwidth. In such cases all eigenvalues and eigenvectors of large symmetric matrices are needed, for which no algorithm of complexity less than O(N³), where N is the dimension of the matrix to be diagonalised, exists, regardless of the sparsity of the matrix.

The most efficient algorithm for the diagonalisation of large symmetric matrices is the Householder tri-diagonalisation procedure followed by the QL diagonalisation algorithm. The operation count for the Householder method is approximately N³ whether or not eigenvectors are required, and the operation count for the QL method is roughly 30N² when the eigenvectors are not required and 3N³ when they are. These counts presume that the average number of QL iterations per eigenvalue is about 1.7 [10], an assumption that is confirmed for the test matrices that we use. The only viable alternative is the Jacobi method, which requires about 20N³ operations when eigenvectors are not required and about 40N³ when they are, assuming fewer than 10 Jacobi cycles for convergence [10]. The Jacobi method parallelises readily [11] but requires more than 20 processors before becoming competitive with the sequential Householder-QL algorithm. Previous attempts at parallelising the Householder and QL algorithms have resulted in inefficient solutions [4] or have altered the algorithms in some way [12], [3], [1], [8].
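One way to see where the `more than 20 processors' figure comes from is a rough break-even estimate using the operation counts quoted above for the eigenvalue-only case; this is our own back-of-the-envelope arithmetic, not a derivation taken from the paper.

```latex
% Eigenvalue-only break-even estimate (ours), from the counts quoted above.
\[
  T_{\mathrm{Householder\text{-}QL}} \approx N^3 + 30N^2 \approx N^3,
  \qquad
  T_{\mathrm{Jacobi}}(P) \approx \frac{20N^3}{P},
\]
\[
  \frac{20N^3}{P} < N^3
  \;\Longleftrightarrow\;
  P > 20 .
\]
```

Even with ideal speed-up, the parallel Jacobi method only overtakes the sequential Householder-QL combination once more than about 20 processors are used.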

Section snippets

Description of the algorithm

Clear descriptions of the Householder and QL algorithms can be found in [10] and [6]. If A is an N×N symmetric matrix, then a sequence of N−2 transformations of the form PN−2 ⋯ P1 A P1 ⋯ PN−2 reduces the matrix to symmetric tri-diagonal form. The matrix Pk is formed from the requirement that the matrix Ak = Pk Ak−1 Pk, where Ak−1 = Pk−1 ⋯ P1 A P1 ⋯ Pk−1, is tri-diagonal up to the kth column. For example, for a 6×6 matrix, A1 has the form

    d1  e1  0   0   0   0
    e1  d2  x   x   x   x
    0   x   x   x   x   x
    0   x   x   x   x   x
    0   x   x   x   x   x
    0   x   x   x   x   x

The Pk matrices can be written as Pk = I − 2WkWkᵀ,
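The snippet below is a minimal sequential sketch of a single such transformation. The routine name, the row-major N×N storage and the full-range update loops are our own illustrative choices, not the data layout of the parallel code described in the paper, which distributes this work over P processors.

```c
/* Minimal sequential sketch of one Householder step: zero the entries
 * of column k below the sub-diagonal by applying P = I - 2 w w^T from
 * both sides of the symmetric matrix A (row-major, n x n).           */
#include <math.h>
#include <stdlib.h>

void householder_step(double *a, int n, int k)
{
    double *w = calloc(n, sizeof(double));   /* reflection vector      */
    double *p = calloc(n, sizeof(double));   /* p = A w, later q       */

    /* Build w from the part of column k below the sub-diagonal.      */
    double norm2 = 0.0;
    for (int i = k + 1; i < n; i++)
        norm2 += a[i * n + k] * a[i * n + k];
    double alpha = -copysign(sqrt(norm2), a[(k + 1) * n + k]);
    if (alpha == 0.0) { free(w); free(p); return; }  /* already reduced */

    w[k + 1] = a[(k + 1) * n + k] - alpha;   /* sign chosen to avoid   */
    for (int i = k + 2; i < n; i++)          /* cancellation           */
        w[i] = a[i * n + k];
    double wnorm = 0.0;
    for (int i = k + 1; i < n; i++)
        wnorm += w[i] * w[i];
    wnorm = sqrt(wnorm);
    for (int i = k + 1; i < n; i++)
        w[i] /= wnorm;                       /* now P = I - 2 w w^T    */

    /* p = A w (w is zero in its first k+1 entries).                  */
    for (int i = 0; i < n; i++)
        for (int j = k + 1; j < n; j++)
            p[i] += a[i * n + j] * w[j];

    /* Rank-2 update A <- A - 2 w q^T - 2 q w^T, q = p - (w^T p) w.   */
    double wtp = 0.0;
    for (int i = k + 1; i < n; i++)
        wtp += w[i] * p[i];
    for (int i = 0; i < n; i++)
        p[i] -= wtp * w[i];                  /* p now holds q          */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] -= 2.0 * (w[i] * p[j] + p[i] * w[j]);

    free(w);
    free(p);
}
```

Applying this step for k = 0, …, N−3 reproduces the sequence of transformations above; the parallel version amounts to distributing the vector and rank-2 update loops across processors.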

Test cases

The algorithms were tested using matrices that occur in a many-body quantum mechanical model of crystals for which there are two states per atom. The total number of available states for the system is thus 2^N for an N-atom system. The nature of the problem is such that the matrices for l = 0, 1, …, N particles on N crystal sites can be generated separately, and the resulting matrices have dimension C(N, l) = N!/((N − l)! l!). The general sparsity pattern of the test matrices is shown in Fig. 1, in which non-zero
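For concreteness, the small helper below (ours, not taken from the paper) evaluates these block dimensions without forming the intermediate factorials; the example values of N and l in main are arbitrary.

```c
/* Illustrative helper: dimension of the l-particle block,
 * C(N, l) = N! / ((N - l)! l!), computed so every intermediate
 * result is an exact integer.                                        */
#include <stdio.h>

unsigned long long block_dimension(unsigned N, unsigned l)
{
    if (l > N) return 0;
    if (l > N - l) l = N - l;            /* use the smaller factor     */
    unsigned long long c = 1;
    for (unsigned i = 1; i <= l; i++)
        c = c * (N - l + i) / i;         /* exact division at each step */
    return c;
}

int main(void)
{
    /* e.g. 16 crystal sites with 8 particles -> a 12870 x 12870 block */
    printf("%llu\n", block_dimension(16, 8));
    return 0;
}
```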

Results

The results presented below were all obtained on an IBM SP2, a heterogeneous 23-node machine, using the subset of 16 `thin1' nodes on which our code was tested. Each `thin1' node consists of a 66 MHz Power2 processor and 128 Mbytes of RAM with a 64-bit memory bus, a 32 Kbyte instruction cache and a 64 Kbyte data cache. The latency and bandwidth for the machine, as measured by simple processor-to-processor `ping-pong' benchmarks, are 100 μs and 20 Mbytes/s respectively [7].
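The snippet below sketches the kind of processor-to-processor `ping-pong' test referred to; it is our own illustration rather than the benchmark code of [7], and the message size and repetition count are arbitrary choices.

```c
/* Two-rank ping-pong: rank 0 sends a buffer to rank 1 and back; half
 * the round-trip time gives latency (small messages) or bandwidth
 * (large messages).  Run with at least two MPI processes.            */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1 << 20;          /* 1 Mbyte message            */
    const int reps = 100;
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(nbytes);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * reps);  /* one-way time     */
    if (rank == 0)
        printf("one-way time %g s, bandwidth %g Mbytes/s\n",
               t, nbytes / t / 1.0e6);
    free(buf);
    MPI_Finalize();
    return 0;
}
```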

For the scaling test presented in

Summary

We have shown that a relatively straightforward parallelisation of the Householder tri-diagonalisation algorithm and the QL method for finding the eigenvalues and eigenvectors of the resulting matrix shows a time complexity like αN³/P + βN²log₂(P) for the Householder problem and like γN² + δN³/P for the QL algorithm for a fixed problem size on increasing numbers of processors, P. The log₂(P) term is empirically fitted to the data, but is consistent with MPI_Bcast and MPI_Allgather being implemented
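One plausible place where these collectives enter a parallel Householder step is the distributed matrix-vector product p = Aw. The sketch below assumes a row-block decomposition of the matrix, which is our own illustrative choice and not a description of the authors' decomposition.

```c
/* Each rank owns counts[rank] consecutive rows of A (a_loc).  The
 * Householder vector w is broadcast from `root`, each rank computes
 * its slice of p = A w, and MPI_Allgatherv reassembles the full p on
 * every rank.  Illustrative only; not the authors' code.             */
#include <mpi.h>
#include <stdlib.h>

void distributed_matvec(const double *a_loc, double *w, double *p,
                        int n, const int *counts, const int *displs,
                        int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_Bcast(w, n, MPI_DOUBLE, root, comm);      /* share w           */

    int nloc = counts[rank];
    double *p_loc = malloc(nloc * sizeof(double));
    for (int i = 0; i < nloc; i++) {              /* local rows of A w */
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += a_loc[i * n + j] * w[j];
        p_loc[i] = s;
    }

    /* Reassemble p = A w on every rank. */
    MPI_Allgatherv(p_loc, nloc, MPI_DOUBLE,
                   p, (int *)counts, (int *)displs, MPI_DOUBLE, comm);
    free(p_loc);
}
```

With tree-based implementations, each such collective costs O(N log₂P) per reduction step, and summing over the roughly N steps is consistent with the βN²log₂(P) term in the fitted model.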

