A hybrid method for the parallel computation of Green’s functions
Introduction
Commercial electronic devices are rapidly approaching the scale where quantum mechanical effects are affecting the device characteristics [1]. For example, nano-transistors in the 10–30 nm range are strongly affected by quantum mechanical effects such as tunneling and leakage. To allow for continued scaling of silicon technology, new materials are introduced [2] and new nanomaterials such as nanowires [3] or carbon nanotubes [4] are currently being considered as wires or as active parts in future generations of transistors. Also, more exotic devices using DNA [5] or other small molecules [6] are currently under investigation for future use in electronic devices.
To model such devices, it is necessary to apply a quantum mechanical framework [2]. In a system where current flows from a large reservoir (source) to a large drain, the non-equilibrium Green’s function approach is applicable [7], [8], [9]. This is a very general framework which can be applied to all the systems mentioned above. This method involves the coupled solution of the Schrödinger equation for the quantum wave function and the Poisson equation for the electrostatic potential. The Schrödinger equation can be solved in multiple ways, including using a Kohn–Sham Hamiltonian. The main computational difficulty of this problem is the self-consistent solution of the Kohn–Sham Hamiltonian and the Poisson equation, which usually requires an iterative scheme with many iterations. The solution of this system requires the calculation of a Green’s function which, once discretized, is often written in the matrix form:where E is an energy point (a scalar), is an overlap matrix, H is the Hamiltonian of the system, and and are the self-energies. The Green’s function matrix is computed at many energy points and this is computationally quite expensive even for small matrices. New nanodevices usually require a description at the atomic level; however, their size is large enough that we may have thousands or even millions of atoms in the active part of the device and thus the dimension of A becomes very large. In such cases, direct inversion of A becomes impractical, and the application of the non-equilibrium Green’s function framework requires new efficient parallel algorithms that exploit the sparsity of A. The focus of this paper is the derivation and benchmarking of a new parallel algorithm that efficiently calculates select entries in G and the lesser Green’s function matrixwhere is at this point an arbitrary matrix. Assumptions regarding the non-zero entries of will be formulated later on. The lesser Green’s function matrix is needed to account for non-equilibrium and scattering effects. The notations in this paper are indicated in Table 1.
In most cases not all entries in G and are needed. Usually only the diagonal entries of the matrix are required in the iterative process. This leads to a significant reduction in computational cost, which can be realized using a number of different methods [10]. Most of them are related to a technique developed by Takahashi et al. [11], [12], [13]. In this approach, a block LDU factorization of A is computed. Simple algebra shows that:where L, U, and D correspond, respectively, to the lower block unit triangular, upper block unit triangular, and diagonal block LDU factorization of A. The dense inverse G is treated conceptually as having a block structure based on that of A.
For example, consider the first equation and . The block entry resides above the block diagonal of the matrix and therefore . The first equation can then be written as:This allows computing the entry if the entries below it are known. Two similar equations can be derived for the case and :This approach leads to fast backward recurrences, as shown in [12].
The generalization of Takahashi’s method to computing is not straightforward. In particular Eqs. (3), (4) do not extend to this case. In this paper, we will provide a new way of deriving Eqs. (5), (6), (7). This derivation will then be extended to for which similar relations will be derived. These recurrences are most efficient when the sparsity graph of (Eq. (2)) is a subset of the graph of A, i.e., .
For the purpose of computer implementation, an important distinction must be made to distinguish what we term 1D, 2D and 3D problems. In principle, all problems are 3 dimensional. However, if the mesh or device is elongated in one direction, say z, then the mesh can be split into “slices” along the z direction. This gives rise to a matrix A with block tridiagonal structure. The problem is then termed 1 dimensional. Similarly, if the mesh is elongated in two directions, the matrix A assumes a block penta-diagonal form and the problem is called 2 dimensional.
This paper is organized as follows. We first describe general relations to compute Green’s functions (Sections 2 Recurrence formulas for the inverse matrix, 3 Sequential algorithm for 1D problems). These are applicable to meshes with arbitrary connectivity. In Section 4, we calculate the computational cost of these approaches for 1D and 2D Cartesian meshes and for discretizations involving nearest neighbor nodes only, e.g., a 3 point stencil in 1D or a 5 point stencil in 2D. We also compare the computational cost of this approach with a recently published method by Li and Darve (the FIND algorithm [14]).
In Section 5, a parallel implementation of the recurrences for 1 dimensional problems is proposed. The original algorithms by Takahashi [11] and Svizhenko [10] are not parallel since they are based on intrinsically sequential recurrences. However, we show that an appropriate reordering of the nodes and definition of the blocks in A lead to a large amount of parallelism. In practical cases for nano-transistor simulations, we found that the communication time was small and that the scalability was very good, even for small benchmark cases. This scheme is based on a combination of domain decomposition and cyclic reduction techniques1 (see, for example, Varga and Hageman [16]). Section 6 has some numerical results.
Section snippets
Recurrence formulas for the inverse matrix
Consider a general matrix A written in block form:From the LDU factorization of A, we can form the factors , and the Schur complement . The following equation holds for the inverse matrix :This equation can be verified by direct multiplication with A. It allows computing the inverse matrix using backward recurrence formulas. These formulas can be obtained by considering step i of the LDU
Sequential algorithm for 1D problems
In this section, we present a sequential implementation of the relations presented above, for 1D problems. Section 5 will discuss the parallel implementation.
In 1 dimensional problems, matrix A assumes a block tridiagonal form. The LDU factorization is obtained using the following recurrences:From Eq. (13), the backward recurrences for G are given by:
Computational cost for 1D and 2D cartesian meshes with nearest neighbor stencils
In this section, we determine the computational cost of the sequential algorithms. From the recurrence relations (13), (20), one can prove that the computational cost of calculating G and has the same scaling with problem size as the LDU factorization. If one uses a nested dissection ordering associated with the mesh [17], we obtain the following costs for Cartesian grids assuming a local discretization stencil:Dimension Cost 1D 2D (square grid) 3D (cubic grid)
We now do a
Parallel algorithm for 1D problems
We present a parallel algorithm for the calculation of the Green’s function matrix G typically encountered in electronic structure calculations where the matrix A (cf. Eq. (1)) is assumed to be an block matrix, and in block tridiagonal form as shown bywhere each block element is a dense complex matrix. In order to develop a parallel algorithm, we assume that we have at our disposal a total of processing elements (e.g., single core on a modern processor).
Stability
Our elimination algorithm is equivalent to an unpivoted Gaussian elimination on a suitably permuted matrix PAP for some permutation matrix P [19], [20].
Fig. 7, Fig. 8 show the permutation corresponding to the Schur phase and the cyclic reduction phase for a block matrix A.
In the case of our algorithm, both approaches are combined. The process is shown on Fig. 9. Once the Schur reduction phase has completed on A, we are left with a reduced block tridiagonal system. The rows and columns of
Conclusion
We have proposed new recurrence formulas to calculate certain entries of the inverse of a sparse matrix. This problem has application in quantum transport and quantum mechanical models, to calculate Green’s and lesser Green’s function matrices for example. This calculation scales like for a matrix of size N using a naive algorithm. We have shown that the computational cost can be reduced by orders of magnitude using novel recurrences Eqs. (13), (20). This is an extension of the work of K.
Acknowledgment
This work was supported by Grant Number 2106-04-0017, “Parallel Algorithms for Computational Nano-Science”, under the NABIIT program from the Danish Council for Strategic Research, the Army High Performance Computing and Research Center (AHPCRC) at Stanford University, funded by the United States Army (Grant number W911NF-07-2-0027), and the Stanford School of Engineering.
References (22)
- et al.
DNA transistor and quantum bit element: realization of nano-biomolecular logical devices
Phys. Lett. A
(1999) - et al.
Computing entries of the inverse of a sparse matrix using the FIND algorithm
J. Comput. Phys.
(2008) - International technology roadmap for semiconductors, <http://www.itrs.net/Links/2007ITRS/Home2007.htm>,...
- et al.
Density functional theory and beyond – opportunities for quantum methods in materials modeling semiconductor technology
J. Phys.: Condens. Matter
(2008) - et al.
Nanotechnology: high-speed integrated nanowire circuits
Nature
(2005) - et al.
Band-to-band tunneling in carbon nanotube field-effect transistors
Phys. Rev. Lett.
(2004) - et al.
Single-electron transistor of a single organic molecule with access to several redox states
Nature
(2003) Electronic Transport in Mesoscopic Systems
(1997)- et al.
Quantum Kinetics in Transport and Optics of Semiconductors
(1996) - et al.
Density-functional method for nonequilibrium electron transport
Phys. Rev. B
(2002)
Two-dimensional quantum mechanical modeling of nanotransistors
J. Appl. Phys.
Cited by (32)
PSelInv – A distributed memory parallel algorithm for selected inversion: The non-symmetric case
2018, Parallel ComputingImproved recursive Green's function formalism for quasi one-dimensional systems with realistic defects
2017, Journal of Computational PhysicsCitation Excerpt :This is especially interesting for transport problems, because only a few elements of the Green's function are necessary. Based on this, fast recurrence formulas can be found similar to the RGF but based on the LU-decomposition [15,16]. Also combinations with the RGF like the over-bridging boundary-matching method in combination with the shifted conjugate-orthogonal conjugate-gradient method [17–21] are promising.
Numerical simulations of time-resolved quantum electronics
2014, Physics ReportsA fast algorithm for sparse matrix computations related to inversion
2013, Journal of Computational PhysicsCitation Excerpt :However, in our work we aim to optimize the reuse of partial results across multiple eliminations. The second topic is regarding a hybrid algorithm that combines FIND-SS and the RGF-2D algorithm proposed in [16,27,28]. In the hybrid algorithm, the upward pass is executed in FIND-SS mode, but the downward pass is a mixture of FIND-SS and RGF-2D.
Extension and optimization of the FIND algorithm: Computing Green's and less-than Green's functions
2012, Journal of Computational PhysicsNumerical methods for Kohn-Sham density functional theory
2019, Acta Numerica