A hybrid method for the parallel computation of Green’s functions

doi:10.1016/j.jcp.2009.03.035

Journal of Computational Physics

Volume 228, Issue 14, 1 August 2009, Pages 5020-5039

https://doi.org/10.1016/j.jcp.2009.03.035 Get rights and content

Abstract

Quantum transport models for nanodevices using the non-equilibrium Green’s function method require the repeated calculation of the block tridiagonal part of the Green’s and lesser Green’s function matrices. This problem is related to the calculation of the inverse of a sparse matrix. Because of the large number of times this calculation needs to be performed, this is computationally very expensive even on supercomputers. The classical approach is based on recurrence formulas which cannot be efficiently parallelized. This practically prevents the solution of large problems with hundreds of thousands of atoms. We propose new recurrences for a general class of sparse matrices to calculate Green’s and lesser Green’s function matrices which extend formulas derived by Takahashi and others. We show that these recurrences may lead to a dramatically reduced computational cost because they only require computing a small number of entries of the inverse matrix. Then, we propose a parallelization strategy for block tridiagonal matrices which involves a combination of Schur complement calculations and cyclic reduction. It achieves good scalability even on problems of modest size.

Introduction

Commercial electronic devices are rapidly approaching the scale where quantum mechanical effects are affecting the device characteristics [1]. For example, nano-transistors in the 10–30 nm range are strongly affected by quantum mechanical effects such as tunneling and leakage. To allow for continued scaling of silicon technology, new materials are introduced [2] and new nanomaterials such as nanowires [3] or carbon nanotubes [4] are currently being considered as wires or as active parts in future generations of transistors. Also, more exotic devices using DNA [5] or other small molecules [6] are currently under investigation for future use in electronic devices.

To model such devices, it is necessary to apply a quantum mechanical framework [2]. In a system where current flows from a large reservoir (source) to a large drain, the non-equilibrium Green’s function approach is applicable [7], [8], [9]. This is a very general framework which can be applied to all the systems mentioned above. This method involves the coupled solution of the Schrödinger equation for the quantum wave function and the Poisson equation for the electrostatic potential. The Schrödinger equation can be solved in multiple ways, including using a Kohn–Sham Hamiltonian. The main computational difficulty of this problem is the self-consistent solution of the Kohn–Sham Hamiltonian and the Poisson equation, which usually requires an iterative scheme with many iterations. The solution of this system requires the calculation of a Green’s function which, once discretized, is often written in the matrix form: $G \overset{def}{=} A^{- 1}, where A \overset{def}{=} E S^{o} - H - Σ^{L} - Σ^{R}$ where E is an energy point (a scalar), $S^{o}$ is an overlap matrix, H is the Hamiltonian of the system, and $Σ^{L}$ and $Σ^{R}$ are the self-energies. The Green’s function matrix is computed at many energy points and this is computationally quite expensive even for small matrices. New nanodevices usually require a description at the atomic level; however, their size is large enough that we may have thousands or even millions of atoms in the active part of the device and thus the dimension of A becomes very large. In such cases, direct inversion of A becomes impractical, and the application of the non-equilibrium Green’s function framework requires new efficient parallel algorithms that exploit the sparsity of A. The focus of this paper is the derivation and benchmarking of a new parallel algorithm that efficiently calculates select entries in G and the lesser Green’s function matrix $G^{<} \overset{def}{=} G Γ G^{†}$ where $Γ$ is at this point an arbitrary matrix. Assumptions regarding the non-zero entries of $Γ$ will be formulated later on. The lesser Green’s function matrix is needed to account for non-equilibrium and scattering effects. The notations in this paper are indicated in Table 1.

In most cases not all entries in G and $G^{<}$ are needed. Usually only the diagonal entries of the matrix are required in the iterative process. This leads to a significant reduction in computational cost, which can be realized using a number of different methods [10]. Most of them are related to a technique developed by Takahashi et al. [11], [12], [13]. In this approach, a block LDU factorization of A is computed. Simple algebra shows that: $G = D^{- 1} L^{- 1} + (I - U) G$ $= U^{- 1} D^{- 1} + G (I - L)$ where L, U, and D correspond, respectively, to the lower block unit triangular, upper block unit triangular, and diagonal block LDU factorization of A. The dense inverse G is treated conceptually as having a block structure based on that of A.

For example, consider the first equation and $j > i$ . The block entry $g_{ij}$ resides above the block diagonal of the matrix and therefore $[D^{- 1} L^{- 1}]_{ij} = 0$ . The first equation can then be written as: $g_{ij} = - \sum_{k > i} u_{ik} g_{kj}$ This allows computing the entry $g_{ij}$ if the entries $g_{kj} (k > i)$ below it are known. Two similar equations can be derived for the case $j < i$ and $j = i$ : $g_{ij} = - \sum_{k > i} g_{ik} ℓ_{kj}$ $g_{ii} = d_{ii}^{- 1} - \sum_{k > i} u_{ik} g_{ki}$ This approach leads to fast backward recurrences, as shown in [12].

The generalization of Takahashi’s method to computing $G^{<}$ is not straightforward. In particular Eqs. (3), (4) do not extend to this case. In this paper, we will provide a new way of deriving Eqs. (5), (6), (7). This derivation will then be extended to $G^{<}$ for which similar relations will be derived. These recurrences are most efficient when the sparsity graph of $Γ$ (Eq. (2)) is a subset of the graph of A, i.e., $γ_{ij} \neq 0 \Rightarrow a_{ij} \neq 0$ .

For the purpose of computer implementation, an important distinction must be made to distinguish what we term 1D, 2D and 3D problems. In principle, all problems are 3 dimensional. However, if the mesh or device is elongated in one direction, say z, then the mesh can be split into “slices” along the z direction. This gives rise to a matrix A with block tridiagonal structure. The problem is then termed 1 dimensional. Similarly, if the mesh is elongated in two directions, the matrix A assumes a block penta-diagonal form and the problem is called 2 dimensional.

This paper is organized as follows. We first describe general relations to compute Green’s functions (Sections 2 Recurrence formulas for the inverse matrix, 3 Sequential algorithm for 1D problems). These are applicable to meshes with arbitrary connectivity. In Section 4, we calculate the computational cost of these approaches for 1D and 2D Cartesian meshes and for discretizations involving nearest neighbor nodes only, e.g., a 3 point stencil in 1D or a 5 point stencil in 2D. We also compare the computational cost of this approach with a recently published method by Li and Darve (the FIND algorithm [14]).

In Section 5, a parallel implementation of the recurrences for 1 dimensional problems is proposed. The original algorithms by Takahashi [11] and Svizhenko [10] are not parallel since they are based on intrinsically sequential recurrences. However, we show that an appropriate reordering of the nodes and definition of the blocks in A lead to a large amount of parallelism. In practical cases for nano-transistor simulations, we found that the communication time was small and that the scalability was very good, even for small benchmark cases. This scheme is based on a combination of domain decomposition and cyclic reduction techniques¹ (see, for example, Varga and Hageman [16]). Section 6 has some numerical results.

Section snippets

Recurrence formulas for the inverse matrix

Consider a general matrix A written in block form: $A \overset{def}{=} [\begin{matrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{matrix}]$ From the LDU factorization of A, we can form the factors $L^{S} \overset{def}{=} A_{21} A_{11}^{- 1}, U^{S} \overset{def}{=} A_{11}^{- 1} A_{12}$ , and the Schur complement $S \overset{def}{=} A_{22} - A_{21} A_{11}^{- 1} A_{12}$ . The following equation holds for the inverse matrix $G = A^{- 1}$ : $G = [\begin{matrix} A_{11}^{- 1} + U^{S} S^{- 1} L^{S} & - U^{S} S^{- 1} \\ - S^{- 1} L^{S} & S^{- 1} \end{matrix}]$ This equation can be verified by direct multiplication with A. It allows computing the inverse matrix using backward recurrence formulas. These formulas can be obtained by considering step i of the LDU

Sequential algorithm for 1D problems

In this section, we present a sequential implementation of the relations presented above, for 1D problems. Section 5 will discuss the parallel implementation.

In 1 dimensional problems, matrix A assumes a block tridiagonal form. The LDU factorization is obtained using the following recurrences: $ℓ_{i + 1, i} = a_{i + 1, i} d_{ii}^{- 1}$ $u_{i, i + 1} = d_{ii}^{- 1} a_{i, i + 1}$ $d_{i + 1, i + 1} = a_{i + 1, i + 1} - a_{i + 1, i} d_{ii}^{- 1} a_{i, i + 1} = a_{i + 1, i + 1} - ℓ_{i + 1, i} a_{i, i + 1} = a_{i + 1, i + 1} - a_{i + 1, i} u_{i, i + 1}$ From Eq. (13), the backward recurrences for G are given by: $g_{i + 1, i} = - g_{i + 1, i + 1} ℓ_{i + 1, i}$ $g_{i, i + 1} = - u_{i}$

Computational cost for 1D and 2D cartesian meshes with nearest neighbor stencils

In this section, we determine the computational cost of the sequential algorithms. From the recurrence relations (13), (20), one can prove that the computational cost of calculating G and $G^{<}$ has the same scaling with problem size as the LDU factorization. If one uses a nested dissection ordering associated with the mesh [17], we obtain the following costs for Cartesian grids assuming a local discretization stencil:

Dimension	Cost
1D	$O (n d^{3})$
2D (square grid)	$O (n^{3 / 2} d^{3})$
3D (cubic grid)	$O (n^{2} d^{3})$

We now do a

Parallel algorithm for 1D problems

We present a parallel algorithm for the calculation of the Green’s function matrix G typically encountered in electronic structure calculations where the matrix A (cf. Eq. (1)) is assumed to be an $n \times n$ block matrix, and in block tridiagonal form as shown by $A = [\begin{matrix} a_{11} & a_{12} \\ a_{21} & a_{22} & a_{23} \\ a_{32} & a_{33} & a_{34} \\ ⋱ & ⋱ & ⋱ \end{matrix}] .$ where each block element $a_{ij}$ is a dense complex matrix. In order to develop a parallel algorithm, we assume that we have at our disposal a total of $P$ processing elements (e.g., single core on a modern processor).

Stability

Our elimination algorithm is equivalent to an unpivoted Gaussian elimination on a suitably permuted matrix PAP for some permutation matrix P [19], [20].

Fig. 7, Fig. 8 show the permutation corresponding to the Schur phase and the cyclic reduction phase for a $31 \times 31$ block matrix A.

In the case of our algorithm, both approaches are combined. The process is shown on Fig. 9. Once the Schur reduction phase has completed on A, we are left with a reduced block tridiagonal system. The rows and columns of

Conclusion

We have proposed new recurrence formulas to calculate certain entries of the inverse of a sparse matrix. This problem has application in quantum transport and quantum mechanical models, to calculate Green’s and lesser Green’s function matrices for example. This calculation scales like $N^{3}$ for a matrix of size N using a naive algorithm. We have shown that the computational cost can be reduced by orders of magnitude using novel recurrences Eqs. (13), (20). This is an extension of the work of K.

Acknowledgment

This work was supported by Grant Number 2106-04-0017, “Parallel Algorithms for Computational Nano-Science”, under the NABIIT program from the Danish Council for Strategic Research, the Army High Performance Computing and Research Center (AHPCRC) at Stanford University, funded by the United States Army (Grant number W911NF-07-2-0027), and the Stanford School of Engineering.

References (22)

E. Ben-Jacob et al.
DNA transistor and quantum bit element: realization of nano-biomolecular logical devices
Phys. Lett. A
(1999)
S. Li et al.
Computing entries of the inverse of a sparse matrix using the FIND algorithm
J. Comput. Phys.
(2008)
International technology roadmap for semiconductors, <http://www.itrs.net/Links/2007ITRS/Home2007.htm>,...
S. Shankar et al.
Density functional theory and beyond – opportunities for quantum methods in materials modeling semiconductor technology
J. Phys.: Condens. Matter
(2008)
R.S. Friedman et al.
Nanotechnology: high-speed integrated nanowire circuits
Nature
(2005)
J. Appenzeller et al.
Band-to-band tunneling in carbon nanotube field-effect transistors
Phys. Rev. Lett.
(2004)
S. Kubatkin et al.
Single-electron transistor of a single organic molecule with access to several redox states
Nature
(2003)
S. Datta
Electronic Transport in Mesoscopic Systems
(1997)
H. Haug et al.
Quantum Kinetics in Transport and Optics of Semiconductors
(1996)
M. Brandbyge et al.
Density-functional method for nonequilibrium electron transport
Phys. Rev. B
(2002)

A. Svizhenko et al.

Two-dimensional quantum mechanical modeling of nanotransistors

J. Appl. Phys.

(2002)

Cited by (32)

PSelInv – A distributed memory parallel algorithm for selected inversion: The non-symmetric case
2018, Parallel Computing
This paper generalizes the parallel selected inversion algorithm called PSelInv to sparse non-symmetric matrices. We assume a general sparse matrix A has been decomposed as $P A Q = L U$ on a distributed memory parallel machine, where L, U are lower and upper triangular matrices, and P, Q are permutation matrices, respectively. The PSelInvmethod computes selected elements of $A^{- 1}$ . The selection is confined by the sparsity pattern of the matrix A^T. Our algorithm does not assume any symmetry properties of A, and our parallel implementation is memory efficient, in the sense that the computed elements of $A^{- T}$ overwrites the sparse matrix $L + U$ in situ. PSelInv involves a large number of collective data communication activities within different processor groups of various sizes. In order to minimize idle time and improve load balancing, tree-based asynchronous communication is used to coordinate all such collective communication. Numerical results demonstrate that PSelInv can scale efficiently to 6,400 cores for a variety of matrices.
Improved recursive Green's function formalism for quasi one-dimensional systems with realistic defects
2017, Journal of Computational Physics
Citation Excerpt :
This is especially interesting for transport problems, because only a few elements of the Green's function are necessary. Based on this, fast recurrence formulas can be found similar to the RGF but based on the LU-decomposition [15,16]. Also combinations with the RGF like the over-bridging boundary-matching method in combination with the shifted conjugate-orthogonal conjugate-gradient method [17–21] are promising.
We derive an improved version of the recursive Green's function formalism (RGF), which is a standard tool in the quantum transport theory. We consider the case of disordered quasi one-dimensional materials where the disorder is applied in form of randomly distributed realistic defects, leading to partly periodic Hamiltonian matrices. The algorithm accelerates the common RGF in the recursive decimation scheme, using the iteration steps of the renormalization decimation algorithm. This leads to a smaller effective system, which is treated using the common forward iteration scheme. The computational complexity scales linearly with the number of defects, instead of linearly with the total system length for the conventional approach. We show that the scaling of the calculation time of the Green's function depends on the defect density of a random test system. Furthermore, we discuss the calculation time and the memory requirement of the whole transport formalism applied to defective carbon nanotubes.
Numerical simulations of time-resolved quantum electronics
2014, Physics Reports
Numerical simulation has become a major tool in quantum electronics both for fundamental and applied purposes. While for a long time those simulations focused on stationary properties (e.g. DC currents), the recent experimental trend toward GHz frequencies and beyond has triggered a new interest for handling time-dependent perturbations. As the experimental frequencies get higher, it becomes possible to conceive experiments which are both time-resolved and fast enough to probe the internal quantum dynamics of the system. This paper discusses the technical aspects–mathematical and numerical–associated with the numerical simulations of such a setup in the time domain (i.e. beyond the single-frequency AC limit). After a short review of the state of the art, we develop a theoretical framework for the calculation of time-resolved observables in a general multiterminal system subject to an arbitrary time-dependent perturbation (oscillating electrostatic gates, voltage pulses, time-varying magnetic fields, etc.) The approach is mathematically equivalent to (i) the time-dependent scattering formalism, (ii) the time-resolved non-equilibrium Green’s function (NEGF) formalism and (iii) the partition-free approach. The central object of our theory is a wave function that obeys a simple Schrödinger equation with an additional source term that accounts for the electrons injected from the electrodes. The time-resolved observables (current, density, etc.) and the (inelastic) scattering matrix are simply expressed in terms of this wave function. We use our approach to develop a numerical technique for simulating time-resolved quantum transport. We find that the use of this wave function is advantageous for numerical simulations resulting in a speed up of many orders of magnitude with respect to the direct integration of NEGF equations. Our technique allows one to simulate realistic situations beyond simple models, a subject that was until now beyond the simulation capabilities of available approaches.
A fast algorithm for sparse matrix computations related to inversion
2013, Journal of Computational Physics
Citation Excerpt :
However, in our work we aim to optimize the reuse of partial results across multiple eliminations. The second topic is regarding a hybrid algorithm that combines FIND-SS and the RGF-2D algorithm proposed in [16,27,28]. In the hybrid algorithm, the upward pass is executed in FIND-SS mode, but the downward pass is a mixture of FIND-SS and RGF-2D.
We have developed a fast algorithm for computing certain entries of the inverse of a sparse matrix. Such computations are critical to many applications, such as the calculation of non-equilibrium Green’s functions $G^{r}$ and $G^{<}$ for nano-devices. The FIND (Fast Inverse using Nested Dissection) algorithm is optimal in the big-O sense. However, in practice, FIND suffers from two problems due to the width-2 separators used by its partitioning scheme. One problem is the presence of a large constant factor in the computational cost of FIND. The other problem is that the partitioning scheme used by FIND is incompatible with most existing partitioning methods and libraries for nested dissection, which all use width-1 separators. Our new algorithm resolves these problems by thoroughly decomposing the computation process such that width-1 separators can be used, resulting in a significant speedup over FIND for realistic devices — up to twelve-fold in simulation. The new algorithm also has the added advantage that desired off-diagonal entries can be computed for free. Consequently, our algorithm is faster than the current state-of-the-art recursive methods for meshes of any size. Furthermore, the framework used in the analysis of our algorithm is the first attempt to explicitly apply the widely-used relationship between mesh nodes and matrix computations to the problem of multiple eliminations with reuse of intermediate results. This framework makes our algorithm easier to generalize, and also easier to compare against other methods related to elimination trees. Finally, our accuracy analysis shows that the algorithms that require back-substitution are subject to significant extra round-off errors, which become extremely large even for some well-conditioned matrices or matrices with only moderately large condition numbers. When compared to these back-substitution algorithms, our algorithm is generally a few orders of magnitude more accurate, and our produced round-off errors stay at a reasonable level.
Extension and optimization of the FIND algorithm: Computing Green's and less-than Green's functions
2012, Journal of Computational Physics
The FIND algorithm is a fast algorithm designed to calculate certain entries of the inverse of a sparse matrix. Such calculation is critical in many applications, e.g., quantum transport in nano-devices. We extended the algorithm to other matrix inverse related calculations. Those are required for example to calculate the less-than Green’s function and the current density through the device. For a 2D device discretized as an N_x × N_y mesh, the best known algorithms have a running time of $O (N_{x}^{3} N_{y})$ , whereas FIND only requires $O (N_{x}^{2} N_{y})$ . Even though this complexity has been reduced by an order of magnitude, the matrix inverse calculation is still the most time consuming part in the simulation of transport problems. We could not reduce the order of complexity, but we were able to significantly reduce the constant factor involved in the computation cost. By exploiting the sparsity and symmetry, the size of the problem beyond which FIND is faster than other methods typically decreases from a 130 × 130 2D mesh down to a 40 × 40 mesh. These improvements make the optimized FIND algorithm even more competitive for real-life applications.
Numerical methods for Kohn-Sham density functional theory
2019, Acta Numerica

View all citing articles on Scopus

View full text

A hybrid method for the parallel computation of Green’s functions

Abstract

Introduction

Section snippets

Recurrence formulas for the inverse matrix

Sequential algorithm for 1D problems

Computational cost for 1D and 2D cartesian meshes with nearest neighbor stencils

Parallel algorithm for 1D problems

Stability

Conclusion

Acknowledgment

Phys. Lett. A

J. Comput. Phys.

Density functional theory and beyond – opportunities for quantum methods in materials modeling semiconductor technology

J. Phys.: Condens. Matter

Nanotechnology: high-speed integrated nanowire circuits

Nature

Band-to-band tunneling in carbon nanotube field-effect transistors

Phys. Rev. Lett.

Single-electron transistor of a single organic molecule with access to several redox states

Nature

Electronic Transport in Mesoscopic Systems

Quantum Kinetics in Transport and Optics of Semiconductors

Density-functional method for nonequilibrium electron transport

Phys. Rev. B

Two-dimensional quantum mechanical modeling of nanotransistors

J. Appl. Phys.