A hybrid method for the parallel computation of Green’s functions

https://doi.org/10.1016/j.jcp.2009.03.035Get rights and content

Abstract

Quantum transport models for nanodevices using the non-equilibrium Green’s function method require the repeated calculation of the block tridiagonal part of the Green’s and lesser Green’s function matrices. This problem is related to the calculation of the inverse of a sparse matrix. Because of the large number of times this calculation needs to be performed, this is computationally very expensive even on supercomputers. The classical approach is based on recurrence formulas which cannot be efficiently parallelized. This practically prevents the solution of large problems with hundreds of thousands of atoms. We propose new recurrences for a general class of sparse matrices to calculate Green’s and lesser Green’s function matrices which extend formulas derived by Takahashi and others. We show that these recurrences may lead to a dramatically reduced computational cost because they only require computing a small number of entries of the inverse matrix. Then, we propose a parallelization strategy for block tridiagonal matrices which involves a combination of Schur complement calculations and cyclic reduction. It achieves good scalability even on problems of modest size.

Introduction

Commercial electronic devices are rapidly approaching the scale where quantum mechanical effects are affecting the device characteristics [1]. For example, nano-transistors in the 10–30 nm range are strongly affected by quantum mechanical effects such as tunneling and leakage. To allow for continued scaling of silicon technology, new materials are introduced [2] and new nanomaterials such as nanowires [3] or carbon nanotubes [4] are currently being considered as wires or as active parts in future generations of transistors. Also, more exotic devices using DNA [5] or other small molecules [6] are currently under investigation for future use in electronic devices.

To model such devices, it is necessary to apply a quantum mechanical framework [2]. In a system where current flows from a large reservoir (source) to a large drain, the non-equilibrium Green’s function approach is applicable [7], [8], [9]. This is a very general framework which can be applied to all the systems mentioned above. This method involves the coupled solution of the Schrödinger equation for the quantum wave function and the Poisson equation for the electrostatic potential. The Schrödinger equation can be solved in multiple ways, including using a Kohn–Sham Hamiltonian. The main computational difficulty of this problem is the self-consistent solution of the Kohn–Sham Hamiltonian and the Poisson equation, which usually requires an iterative scheme with many iterations. The solution of this system requires the calculation of a Green’s function which, once discretized, is often written in the matrix form:G=defA-1,whereA=defESo-H-ΣL-ΣRwhere E is an energy point (a scalar), So is an overlap matrix, H is the Hamiltonian of the system, and ΣL and ΣR are the self-energies. The Green’s function matrix is computed at many energy points and this is computationally quite expensive even for small matrices. New nanodevices usually require a description at the atomic level; however, their size is large enough that we may have thousands or even millions of atoms in the active part of the device and thus the dimension of A becomes very large. In such cases, direct inversion of A becomes impractical, and the application of the non-equilibrium Green’s function framework requires new efficient parallel algorithms that exploit the sparsity of A. The focus of this paper is the derivation and benchmarking of a new parallel algorithm that efficiently calculates select entries in G and the lesser Green’s function matrixG<=defGΓGwhere Γ is at this point an arbitrary matrix. Assumptions regarding the non-zero entries of Γ will be formulated later on. The lesser Green’s function matrix is needed to account for non-equilibrium and scattering effects. The notations in this paper are indicated in Table 1.

In most cases not all entries in G and G< are needed. Usually only the diagonal entries of the matrix are required in the iterative process. This leads to a significant reduction in computational cost, which can be realized using a number of different methods [10]. Most of them are related to a technique developed by Takahashi et al. [11], [12], [13]. In this approach, a block LDU factorization of A is computed. Simple algebra shows that:G=D-1L-1+(I-U)G=U-1D-1+G(I-L)where L, U, and D correspond, respectively, to the lower block unit triangular, upper block unit triangular, and diagonal block LDU factorization of A. The dense inverse G is treated conceptually as having a block structure based on that of A.

For example, consider the first equation and j>i. The block entry gij resides above the block diagonal of the matrix and therefore [D-1L-1]ij=0. The first equation can then be written as:gij=-k>iuikgkjThis allows computing the entry gij if the entries gkj(k>i) below it are known. Two similar equations can be derived for the case j<i and j=i:gij=-k>igikkjgii=dii-1-k>iuikgkiThis approach leads to fast backward recurrences, as shown in [12].

The generalization of Takahashi’s method to computing G< is not straightforward. In particular Eqs. (3), (4) do not extend to this case. In this paper, we will provide a new way of deriving Eqs. (5), (6), (7). This derivation will then be extended to G< for which similar relations will be derived. These recurrences are most efficient when the sparsity graph of Γ (Eq. (2)) is a subset of the graph of A, i.e., γij0aij0.

For the purpose of computer implementation, an important distinction must be made to distinguish what we term 1D, 2D and 3D problems. In principle, all problems are 3 dimensional. However, if the mesh or device is elongated in one direction, say z, then the mesh can be split into “slices” along the z direction. This gives rise to a matrix A with block tridiagonal structure. The problem is then termed 1 dimensional. Similarly, if the mesh is elongated in two directions, the matrix A assumes a block penta-diagonal form and the problem is called 2 dimensional.

This paper is organized as follows. We first describe general relations to compute Green’s functions (Sections 2 Recurrence formulas for the inverse matrix, 3 Sequential algorithm for 1D problems). These are applicable to meshes with arbitrary connectivity. In Section 4, we calculate the computational cost of these approaches for 1D and 2D Cartesian meshes and for discretizations involving nearest neighbor nodes only, e.g., a 3 point stencil in 1D or a 5 point stencil in 2D. We also compare the computational cost of this approach with a recently published method by Li and Darve (the FIND algorithm [14]).

In Section 5, a parallel implementation of the recurrences for 1 dimensional problems is proposed. The original algorithms by Takahashi [11] and Svizhenko [10] are not parallel since they are based on intrinsically sequential recurrences. However, we show that an appropriate reordering of the nodes and definition of the blocks in A lead to a large amount of parallelism. In practical cases for nano-transistor simulations, we found that the communication time was small and that the scalability was very good, even for small benchmark cases. This scheme is based on a combination of domain decomposition and cyclic reduction techniques1 (see, for example, Varga and Hageman [16]). Section 6 has some numerical results.

Section snippets

Recurrence formulas for the inverse matrix

Consider a general matrix A written in block form:A=defA11A12A21A22From the LDU factorization of A, we can form the factors LS=defA21A11-1,US=defA11-1A12, and the Schur complement S=defA22-A21A11-1A12. The following equation holds for the inverse matrix G=A-1:G=A11-1+USS-1LS-USS-1-S-1LSS-1This equation can be verified by direct multiplication with A. It allows computing the inverse matrix using backward recurrence formulas. These formulas can be obtained by considering step i of the LDU

Sequential algorithm for 1D problems

In this section, we present a sequential implementation of the relations presented above, for 1D problems. Section 5 will discuss the parallel implementation.

In 1 dimensional problems, matrix A assumes a block tridiagonal form. The LDU factorization is obtained using the following recurrences:i+1,i=ai+1,idii-1ui,i+1=dii-1ai,i+1di+1,i+1=ai+1,i+1-ai+1,idii-1ai,i+1=ai+1,i+1-i+1,iai,i+1=ai+1,i+1-ai+1,iui,i+1From Eq. (13), the backward recurrences for G are given by:gi+1,i=-gi+1,i+1i+1,igi,i+1=-ui

Computational cost for 1D and 2D cartesian meshes with nearest neighbor stencils

In this section, we determine the computational cost of the sequential algorithms. From the recurrence relations (13), (20), one can prove that the computational cost of calculating G and G< has the same scaling with problem size as the LDU factorization. If one uses a nested dissection ordering associated with the mesh [17], we obtain the following costs for Cartesian grids assuming a local discretization stencil:

DimensionCost
1DO(nd3)
2D (square grid)O(n3/2d3)
3D (cubic grid)O(n2d3)

We now do a

Parallel algorithm for 1D problems

We present a parallel algorithm for the calculation of the Green’s function matrix G typically encountered in electronic structure calculations where the matrix A (cf. Eq. (1)) is assumed to be an n×n block matrix, and in block tridiagonal form as shown byA=a11a12a21a22a23a32a33a34.where each block element aij is a dense complex matrix. In order to develop a parallel algorithm, we assume that we have at our disposal a total of P processing elements (e.g., single core on a modern processor).

Stability

Our elimination algorithm is equivalent to an unpivoted Gaussian elimination on a suitably permuted matrix PAP for some permutation matrix P [19], [20].

Fig. 7, Fig. 8 show the permutation corresponding to the Schur phase and the cyclic reduction phase for a 31×31 block matrix A.

In the case of our algorithm, both approaches are combined. The process is shown on Fig. 9. Once the Schur reduction phase has completed on A, we are left with a reduced block tridiagonal system. The rows and columns of

Conclusion

We have proposed new recurrence formulas to calculate certain entries of the inverse of a sparse matrix. This problem has application in quantum transport and quantum mechanical models, to calculate Green’s and lesser Green’s function matrices for example. This calculation scales like N3 for a matrix of size N using a naive algorithm. We have shown that the computational cost can be reduced by orders of magnitude using novel recurrences Eqs. (13), (20). This is an extension of the work of K.

Acknowledgment

This work was supported by Grant Number 2106-04-0017, “Parallel Algorithms for Computational Nano-Science”, under the NABIIT program from the Danish Council for Strategic Research, the Army High Performance Computing and Research Center (AHPCRC) at Stanford University, funded by the United States Army (Grant number W911NF-07-2-0027), and the Stanford School of Engineering.

References (22)

  • E. Ben-Jacob et al.

    DNA transistor and quantum bit element: realization of nano-biomolecular logical devices

    Phys. Lett. A

    (1999)
  • S. Li et al.

    Computing entries of the inverse of a sparse matrix using the FIND algorithm

    J. Comput. Phys.

    (2008)
  • International technology roadmap for semiconductors, <http://www.itrs.net/Links/2007ITRS/Home2007.htm>,...
  • S. Shankar et al.

    Density functional theory and beyond – opportunities for quantum methods in materials modeling semiconductor technology

    J. Phys.: Condens. Matter

    (2008)
  • R.S. Friedman et al.

    Nanotechnology: high-speed integrated nanowire circuits

    Nature

    (2005)
  • J. Appenzeller et al.

    Band-to-band tunneling in carbon nanotube field-effect transistors

    Phys. Rev. Lett.

    (2004)
  • S. Kubatkin et al.

    Single-electron transistor of a single organic molecule with access to several redox states

    Nature

    (2003)
  • S. Datta

    Electronic Transport in Mesoscopic Systems

    (1997)
  • H. Haug et al.

    Quantum Kinetics in Transport and Optics of Semiconductors

    (1996)
  • M. Brandbyge et al.

    Density-functional method for nonequilibrium electron transport

    Phys. Rev. B

    (2002)
  • A. Svizhenko et al.

    Two-dimensional quantum mechanical modeling of nanotransistors

    J. Appl. Phys.

    (2002)
  • Cited by (32)

    • Improved recursive Green's function formalism for quasi one-dimensional systems with realistic defects

      2017, Journal of Computational Physics
      Citation Excerpt :

      This is especially interesting for transport problems, because only a few elements of the Green's function are necessary. Based on this, fast recurrence formulas can be found similar to the RGF but based on the LU-decomposition [15,16]. Also combinations with the RGF like the over-bridging boundary-matching method in combination with the shifted conjugate-orthogonal conjugate-gradient method [17–21] are promising.

    • A fast algorithm for sparse matrix computations related to inversion

      2013, Journal of Computational Physics
      Citation Excerpt :

      However, in our work we aim to optimize the reuse of partial results across multiple eliminations. The second topic is regarding a hybrid algorithm that combines FIND-SS and the RGF-2D algorithm proposed in [16,27,28]. In the hybrid algorithm, the upward pass is executed in FIND-SS mode, but the downward pass is a mixture of FIND-SS and RGF-2D.

    View all citing articles on Scopus
    View full text