An efficient parallel version of the Householder-QL matrix diagonalisation algorithm
Introduction
The parallelisation of methods for finding the eigenvalues and eigenvectors of real symmetric matrices has received little attention over the years, for two reasons. Firstly, for very large systems quite often only a few eigenvalues and eigenvectors are needed. Secondly, the parallelisation of the best sequential techniques for matrix diagonalisation has proved difficult and ineffective.
In the simulation of the dynamics of large structures, only the lowest few eigenvalues, corresponding to the most likely occurring low-frequency modes of vibration, are needed. Similarly, in solid state physics, states at or near the ground state are those most likely to be occupied, at least at low temperatures, and so make the largest contribution to the material properties. For these reasons, in subjects like structural mechanics, techniques have concentrated on reducing the degrees of freedom in the system, from perhaps millions to a few hundred, enabling classical sequential techniques to be used. In physics and chemistry, where the original matrix is regularly structured and sparse, iterative techniques like the Lanczos [9] or Davidson [2] methods, which recover the lowest few eigenvalues and their corresponding eigenvectors, can be used effectively.
There are, however, many problems for which it is desirable to have all of the eigenvalues and eigenvectors of the system. Physical simulations of quantum mechanical systems that have a few particles present at finite temperature cannot be properly analysed by methods designed for large numbers of particles, such as statistical mechanics, and are sufficiently far from their ground states as to be influenced by all available states. Similarly, in the design of mechanical structures, acoustic resonances might occur over a broad bandwidth. In such cases all eigenvalues and eigenvectors of large symmetric matrices are needed, for which no algorithm of complexity less than O(N³), where N is the dimension of the matrix to be diagonalised, exists, regardless of the sparsity of the matrix.
The most efficient algorithm for the diagonalisation of large symmetric matrices is the Householder tri-diagonalisation procedure followed by the QL diagonalisation algorithm. The operation count for the Householder method is approximately N³ whether or not eigenvectors are required, and the operation count for the QL method is roughly 30N² when the eigenvectors are not required and 3N³ when they are. These counts presume that the average number of iterations is about 1.7 [10], an assumption that is confirmed for the test matrices that we use. The only viable alternative is the Jacobi method, which takes about 20N³ operations when eigenvectors are not required and about 40N³ when they are, assuming fewer than 10 Jacobi cycles for convergence [10]. The Jacobi method parallelises readily [11] but requires more than 20 processors before becoming competitive with the sequential Householder-QL algorithm. Previous attempts at parallelising the Householder and QL algorithms have resulted in inefficient solutions [4] or have altered the algorithms in some way [12], [3], [1], [8].
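The operation counts above can be compared directly. The following sketch encodes only the coefficients quoted in the text (function names and the ideal-speedup assumption are illustrative, not from the paper):

```python
# Operation-count models from the text: Householder ~ N^3, QL ~ 3N^3 with
# eigenvectors (30N^2 without), Jacobi ~ 40N^3 with eigenvectors (20N^3
# without). Parallel Jacobi is modelled with ideal speedup, which is why
# the real crossover (> 20 processors) is higher than this estimate.

def householder_ql_ops(n, eigenvectors=True):
    """Approximate flop count for sequential Householder + QL."""
    householder = n**3                       # ~N^3 either way, per the text
    ql = 3 * n**3 if eigenvectors else 30 * n**2
    return householder + ql

def jacobi_ops(n, eigenvectors=True, procs=1):
    """Approximate per-processor flop count for an ideally parallel Jacobi."""
    total = 40 * n**3 if eigenvectors else 20 * n**3
    return total / procs

n = 1000
# Processor count at which ideal parallel Jacobi matches sequential
# Householder-QL (real efficiency losses push this above 20):
print(jacobi_ops(n) / householder_ql_ops(n))  # 10.0
```

With less-than-ideal parallel efficiency the break-even point roughly doubles, consistent with the "more than 20 processors" figure above.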
Section snippets
Description of the algorithm
Clear descriptions of the Householder and QL algorithms can be found in [10] and [6]. If A is an N×N symmetric matrix, then a sequence of N−2 transformations of the form P_{N−2}…P_1 A P_1…P_{N−2} reduces the matrix to symmetric tri-diagonal form. The matrix P_k is formed from the requirement that the matrix A_k = P_k…P_1 A P_1…P_k is tri-diagonal up to the kth column. For example, for a 6×6 matrix, A_1 has the form

    [ x  x  0  0  0  0 ]
    [ x  x  x  x  x  x ]
    [ 0  x  x  x  x  x ]
    [ 0  x  x  x  x  x ]
    [ 0  x  x  x  x  x ]
    [ 0  x  x  x  x  x ]

where x denotes a generally non-zero entry. The P_k matrices can be written as P_k = I − 2 W_k W_k^T,
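A single step of this reduction is easy to verify numerically. The sketch below (a minimal NumPy illustration; the function name is not from the paper) builds P_1 = I − 2 W_1 W_1^T for the first column of a random symmetric 6×6 matrix and checks that P_1 A P_1 is tri-diagonal in its first row and column:

```python
# Minimal sketch of one Householder step P1 A P1 for a symmetric matrix.
import numpy as np

def first_householder_step(A):
    """Return P1 A P1, zeroing column 0 below the sub-diagonal."""
    n = A.shape[0]
    x = A[1:, 0].copy()                  # column 0 below the diagonal
    v = x.copy()
    # Add the norm with the sign of x[0] to avoid cancellation:
    v[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)
    w = np.zeros(n)
    w[1:] = v / np.linalg.norm(v)        # unit vector W_1
    P = np.eye(n) - 2.0 * np.outer(w, w) # P_1 = I - 2 W_1 W_1^T
    return P @ A @ P

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2                        # symmetric test matrix
A1 = first_householder_step(A)
print(np.allclose(A1[0, 2:], 0))         # True: tri-diagonal up to column 1
```

By symmetry the zeros appear in both the first row and the first column, matching the A_1 pattern shown above.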
Test cases
The algorithms were tested using matrices that occur in a many-body quantum mechanical model of crystals for which there are two states per atom. The total number of available states for the system is thus 2^N for an N-atom system. The nature of the problem is such that the matrices for l = 0, 1, …, N particles on N crystal sites can be generated separately, and the resulting matrices have dimension C(N,l) = N!/((N−l)! l!). The general sparsity pattern of the test matrices is shown in Fig. 1 in which non-zero
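The block dimensions C(N,l) and their relation to the 2^N available states can be checked directly (N = 12 below is an illustrative choice, not a value from the paper):

```python
# Dimensions of the l-particle blocks, C(N,l) = N!/((N-l)! l!), for the
# N-site, two-states-per-atom test problem described above.
from math import comb

N = 12                                    # illustrative number of atoms/sites
dims = [comb(N, l) for l in range(N + 1)]
print(sum(dims))   # 4096 = 2**12: the blocks together exhaust the 2^N states
print(max(dims))   # 924  = C(12, 6): the largest block to diagonalise
```

The half-filled (l = N/2) block is always the largest, so it dominates the diagonalisation cost.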
Results
The results presented below were all obtained on an IBM SP2, a heterogeneous 23-node machine with a subset of 16 `thin1' nodes on which our code was tested. Each `thin1' node consists of a 66 MHz Power2 processor, 128 Mbytes of RAM with a 64-bit memory bus, a 32 Kbyte instruction cache and a 64 Kbyte data cache. The latency and bandwidth for the machine, as measured by simple processor-to-processor `ping-pong' benchmarks, are s and 20 Mbytes/s [7].
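A ping-pong benchmark is usually summarised by the linear model T(m) = t_lat + m/BW. The sketch below uses the 20 Mbytes/s bandwidth quoted above; the latency value is a placeholder assumption, since the measured figure did not survive extraction here:

```python
# Latency/bandwidth model T(m) = t_lat + m/BW for point-to-point messages.
# BW is the SP2 figure from the text; T_LAT is an assumed, illustrative value.

T_LAT = 50e-6          # assumed latency in seconds (placeholder, not measured)
BW = 20e6              # 20 Mbytes/s, as measured on the SP2

def transfer_time(nbytes):
    """Predicted time to send a message of nbytes between two nodes."""
    return T_LAT + nbytes / BW

def effective_bandwidth(nbytes):
    """Achieved bandwidth; approaches BW only for large messages."""
    return nbytes / transfer_time(nbytes)

# Small messages are latency-dominated, large ones bandwidth-dominated:
print(effective_bandwidth(100) < effective_bandwidth(1_000_000) < BW)  # True
```

This latency/bandwidth split is what makes the per-row communication in the parallel Householder sweep sensitive to message size.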
For the scaling test presented in
Summary
We have shown that a relatively straightforward parallelisation of the Householder tri-diagonalisation algorithm and the QL method for finding the eigenvalues and eigenvectors of the resulting matrix shows a time complexity like αN² + βN³/P for the Householder problem and like γN² + δN³/P for the QL algorithm for fixed problem size on increasing numbers of processors, P. The N² term is empirically fitted to the data, but is consistent with MPI_Bcast and MPI_Allgather being implemented
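The fixed-problem-size behaviour of a model of this form can be illustrated as follows; the coefficient values are invented for illustration, not fitted to the SP2 timings:

```python
# Sketch of the scaling model T(P) = gamma*N^2 + delta*N^3/P quoted above.
# GAMMA and DELTA are assumed, illustrative per-operation costs in seconds.

GAMMA, DELTA = 1e-8, 1e-9

def ql_time(n, p):
    """Modelled wall-clock time on p processors for fixed problem size n."""
    return GAMMA * n**2 + DELTA * n**3 / p

def speedup(n, p):
    return ql_time(n, 1) / ql_time(n, p)

n = 2000
# Speedup grows with P but saturates at the Amdahl-style bound set by the
# non-parallelised N^2 (communication) term, 1 + DELTA*n/GAMMA:
print(speedup(n, 16) < speedup(n, 64) < 1 + DELTA * n / GAMMA)  # True
```

The N² term thus acts as the serial fraction: for fixed N, adding processors only shrinks the N³/P part.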
References (12)
- The iterative calculation of a few of the lowest eigenvalues and corresponding eigenvectors of large real-symmetric matrices, J. Computational Physics (1975)
- et al., Performance of a QR algorithm implementation on a multicluster of transputers, Computer Systems in Engineering (1995)
- A parallel processed scheme for the eigenproblem of positive-definite matrices, J. of Computational and Applied Math. (1994)
- et al., Parallel Householder method for linear systems, International Journal of Computer Mathematics (1995)
- G. Fox et al., Solving Problems on Concurrent Processors, Prentice-Hall, Englewood Cliffs,...
- V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing – Design and Analysis of Algorithms,...