GPU-accelerated artificial neural network potential for molecular dynamics simulation

https://doi.org/10.1016/j.cpc.2022.108655

Highlights

  • A flexible approach is used to implement the ANNP on GPU devices.

  • The GPU-accelerated ANNP shows a much higher speedup than CPU-only runs.

  • Forces of neighbor atoms are updated during the implementation, but without atomic operations.

Abstract

Artificial neural network potentials (ANNPs), obtained by training on large databases of first-principles calculations, have become popular in molecular dynamics (MD) simulation because they capture accurate physical and chemical properties. However, the complex evaluation procedure and heavy data dependence make CPU-only runs slow, which limits their application. In this contribution, we report a flexible computation method for ANNPs in LAMMPS, in which the simulation box is divided into several parts according to the resources of the accelerator, such as the size of its global memory and the number of work items (cores). The number of parts has little influence on performance as long as the number of atoms calculated per loop is larger than the number of work items on the device. In this approach, the forces of neighbor atoms are updated through hierarchical memory without atomic operations. Typical dynamic and static tests are performed to validate the implementation. The results show that our approach is 12 to 13 times faster on one graphics processing unit (GPU) than 8-MPI-task CPU-only runs. Additionally, the implementation supports both CUDA- and OpenCL-enabled GPU cards.

Introduction

Force fields in molecular dynamics simulation have been the subject of extensive study over the past few decades and play a crucial role in capturing the accurate physical and chemical properties of materials. Compared with traditional force fields, an artificial neural network potential (ANNP) trained on a large density functional theory (DFT) dataset has been demonstrated to reproduce these properties well [1], [2], [3], especially the physically informed ANNP [4]. However, the low performance caused by the complex procedures for force and energy decomposition greatly limits the wide application of ANNPs.

Of course, the problem can be solved by increasing the CPU clock speed or the number of cores while ignoring cost efficiency. However, considering the ratio of performance to electrical power, hybrid machines that combine a CPU with an accelerator, such as a GPU, a physics processing unit, or another multicore device, are a better choice [5]. Typically, there are thousands of work items (i.e., processors or threads) on the accelerator, which enables hybrid devices to attain high performance. However, the data dependence, more complex than in a serial calculation, requires modifying the algorithms or codes by, for example, (i) managing the hierarchical memory (e.g., register, local, shared, and global memory), (ii) restraining atomic operations to prevent memory collisions, and (iii) reducing high-latency global memory accesses [5]. For pairwise potentials such as the embedded atom model (EAM) [6], Morse [7], Lennard-Jones (LJ) [8], and ZBL [9] potentials, the data-dependence issue can be resolved easily to obtain significant speedups [10], [11].
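Point (ii) above, avoiding atomic operations when many work items accumulate into the same force array, can be illustrated with a toy sketch (all names and sizes here are hypothetical, not from the paper's code): each simulated work item writes into its own private buffer, and a final conflict-free reduction sums the buffers.

```python
import numpy as np

def reduce_private_buffers(n_atoms, pair_list, pair_forces, n_items=4):
    """Toy illustration: each "work item" accumulates forces for its
    assigned pairs into a private per-item buffer; a final reduction
    sums the buffers, so no atomic update of a shared array is needed."""
    private = np.zeros((n_items, n_atoms))          # one buffer per work item
    for w in range(n_items):                        # work items run independently
        for (i, j), f in zip(pair_list[w::n_items], pair_forces[w::n_items]):
            private[w, i] += f                      # Newton's third law pair update
            private[w, j] -= f
    return private.sum(axis=0)                      # conflict-free reduction

# usage: two pairs acting on three atoms
pairs = [(0, 1), (1, 2)]
forces = [1.0, 0.5]
total = reduce_private_buffers(3, pairs, forces)
```

On a real GPU the private buffers would live in registers or shared memory and the reduction would be a parallel tree sum; the sketch only shows why private accumulation removes the write conflict.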

However, the data dependences of three-body potentials are more complex, since these potentials require the calculation of every triplet of atoms and the updating of the force of each atom using three nested loops. Therefore, communication among parallel work items is necessary to accumulate forces at each step, which leads to atomic operations and increased global memory access. To avoid these, the redundant computation approach (RCA) was developed by Brown and Yamada [5] for three-body potentials. On the basis of this approach, the Stillinger-Weber potential [12] was first added to the GPU package of LAMMPS [13]. After that, implementations of the Tersoff potential [14] were described using the RCA. Both three-body potentials exhibit significant performance improvements. The RCA updates only the force of atom i in each work item. To achieve this, two inner loops are needed: (i) over triplets with the vertex of the angle at the position of atom i; (ii) over triplets with the vertex at a neighbor. Therefore, the neighbors of neighbors are required not only for local atoms (atoms belonging to the domain of a CPU core), but also for ghost atoms (atoms copied from the neighboring domain) in LAMMPS. By using the two loops, the number of force updates in global memory for n atoms is decreased from n(1 + 2bn²) to n, with bn neighbors per atom, which finally eliminates the interprocess communication. This approach can be used for other traditional three-body potentials such as the modified EAM [15], AIREBO [16], REBO [17], and bond-order potentials [18]. However, the drawback of the RCA is that the total number of loops for calculating the force of atom i is increased from bni² to bni² + (bni × bnj), where bni and bnj are the numbers of neighbors of atoms i and j (local and ghost atoms), respectively. The total number of loops in the RCA is, of course, approximately three times larger than that in the traditional approach, which increases the time of force decomposition.
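The loop-count arithmetic above can be sketched numerically (the neighbor counts below are hypothetical, chosen only for illustration):

```python
def loop_counts(bn_i, bn_j):
    """Inner-loop counts per atom i for the triplet part of a
    three-body potential (illustrative arithmetic only).
    traditional: triplets with the angle vertex at atom i.
    rca: additionally loops over triplets with the vertex at
    neighbors, giving a ratio of 1 + bn_j / bn_i."""
    traditional = bn_i ** 2
    rca = bn_i ** 2 + bn_i * bn_j
    return traditional, rca

# hypothetical counts: 30 neighbors of i, 30 neighbors per neighbor j
trad, rca = loop_counts(bn_i=30, bn_j=30)
```

The exact overhead factor depends on how bnj (which includes ghost atoms) compares with bni; the point is only that the RCA trades extra triplet loops for the elimination of neighbor-force writes.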

Therefore, the RCA is unsuitable for potentials whose energy and force are calculated by a complex procedure, even though it can reduce global memory access. The implementation of the ANNP was first described by Behler and Parrinello (BP) [19]; it uses feed-forward artificial neural networks (ANNs) to calculate the energies associated with the local configurations of atoms [18]. Generally, the feed-forward structure contains an input layer, one or a few hidden layers, and an energy output layer [20]. In the input layer, the symmetry functions, a kind of structural fingerprint containing the radial and angular information of the atomic structure, are calculated. The input data are then fed into the first hidden layer and processed using the weights, bias parameters, and activation function of each layer. After the data cross the hidden layers, the potential energy related to the initial local structure is obtained [1]. The atomic force can be calculated from the gradient of the energy function, but the procedure for estimating it is more complex than the feed-forward pass and costs much more time. It is quite similar to the back-propagation procedure [21], [22]: the derivative of the energy function with respect to the output data of each layer is calculated first, and then the derivative of the energy function with respect to the atomic coordinates is obtained. In view of this, the RCA may not be a good choice for accelerating the ANNP implementation, since it would require approximately bni × bnj additional loops repeating the complex energy and force decomposition of the ANNP.
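The feed-forward pass described above can be sketched in a few lines (layer sizes, weights, and the tanh activation below are hypothetical placeholders, not the parameters of the iron ANNP):

```python
import numpy as np

def atomic_energy(G, weights, biases):
    """Minimal Behler-Parrinello-style feed-forward pass (sketch):
    G is the symmetry-function vector of one atom; each hidden layer
    applies weights, a bias, and a tanh activation; the output layer
    is linear and yields the atomic energy contribution."""
    x = G
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.tanh(W @ x + b)                      # hidden layers
    return float(weights[-1] @ x + biases[-1])      # linear energy output

# usage with hypothetical sizes: 8 symmetry functions, two hidden layers of 10
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(10, 8)), rng.normal(size=(10, 10)), rng.normal(size=(1, 10))]
bs = [np.zeros(10), np.zeros(10), np.zeros(1)]
E_i = atomic_energy(rng.normal(size=8), Ws, bs)
```

The total energy is the sum of such per-atom contributions; the expensive part discussed in the text, the forces, would require differentiating this pass with respect to the atomic coordinates through the symmetry functions.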

To increase performance, several packages and libraries based on GPU devices have been developed, such as n2p2 [19] and CabanaMD [23] for ANN potentials, and OpenMM [24], TorchMD [25], SNAP [26], and JAX-MD [27] for machine learning potentials. In these packages, different libraries, e.g., Kokkos [28], PyTorch [29], or CUDA, are used to accelerate the potential decomposition. Despite this, the speed of some packages is still unsatisfactory. For example, JAX-MD has poor neighbor-list performance, which results in low performance for large systems [27]. The PyTorch library used in the TorchMD package is good for integrating potential training and MD simulation; however, it lacks supporting machinery such as neighbor-list building and updating, which leads to low performance compared with specialized MD packages [25]. The Kokkos library always uses undesirable thread-safe atomic operations for updating the symmetry functions and the forces of neighbors [13], [23], which limits the performance of SNAP [26] and n2p2 [23], as discussed below. Therefore, increasing performance is an urgent task for the wide use of these kinds of potentials.

In this paper, we propose a simple and flexible approach for computing the ANNP by segmenting the total number of atoms into several parts, depending on the size of the global memory and the number of work items on the accelerator. This approach allows the forces of neighbors to be updated while eliminating atomic operations, since each work item updates the forces of its neighbors through its own memory. Although the global memory access increases, the total number of loops for the triplet configurations of each atom decreases to bni². The performance of this approach is evaluated using the ANNP developed for pure iron by Mori and Ozaki [3]. The time taken for force calculation and memory access is monitored, and the results show that this approach is beneficial for implementing the complex ANNP.
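The segmentation idea can be sketched as follows (the atom count and work-item count below are hypothetical; a real run would take them from the device query and the memory budget):

```python
def split_into_parts(n_atoms, n_work_items):
    """Sketch of the segmentation idea: process atoms in chunks no
    larger than the number of work items, so each loop can keep the
    device saturated without exceeding its global-memory budget."""
    parts = []
    for start in range(0, n_atoms, n_work_items):
        parts.append(range(start, min(start + n_work_items, n_atoms)))
    return parts

# hypothetical device with 4096 work items, system of 10,000 atoms
parts = split_into_parts(n_atoms=10_000, n_work_items=4096)
# three parts: 4096, 4096, and 1808 atoms
```

As the text notes, as long as each part is at least as large as the number of work items, the exact number of parts has little influence on performance, because every loop keeps all work items busy.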

Section snippets

LAMMPS

LAMMPS, developed at Sandia National Laboratories [30], is a classical MD simulation code widely used to simulate solid-state, soft-matter, and other systems. It is parallelized using message passing interface (MPI) techniques and supports several accelerator packages. Each MPI task updates the position, velocity, force, and energy of the local atoms in its own non-overlapping subdomain, which is assigned via spatial decomposition techniques. However, updating these data for “ghost

Performance

In the performance tests, a relaxation simulation using the large Fe model was performed at a temperature of 300 K for 1 ps (1,000 time steps). Two GPU packages were built, using the CUDA toolkit and the OpenCL library, respectively. Two MPI tasks were required when using two GPU cards, to keep each GPU busy. Fig. 4 shows the results for 1,000 time steps for the GPU-accelerated versions compared with 8-MPI-task CPU-only runs. We can see that the GPU-accelerated version exhibits a significant

Discussion

It is known that the bandwidth of global memory in most GPU devices is lower than that of on-chip memory such as registers and shared memory. Therefore, reducing the number of global memory accesses is an advisable strategy for three-body potential implementations. However, it requires additional loops, which is unfavorable for complex three-body potentials, e.g., the ANNP [1], [19] and the physically informed neural network potential (PINN) [4], since they need more time for calculation

Conclusion

A simple approach (FCA) is proposed to implement the GPU-accelerated artificial neural network potential. In the FCA, the forces of neighbors are updated using hierarchical memory without atomic operations, since the time for updating is approximately equal to the time for force and energy calculation. The GPU-accelerated version built with the CUDA toolkit is 13 times (one card) and 26 times (two cards) faster than an 8-MPI-task CPU-only run, which enables the simulation

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We are grateful to the LAMMPS team for developing and sharing the source code of the software. We are also grateful to Dr. Mori and Dr. Artrith for providing all the parameters of the ANNP of iron. This work was supported by the Council for Science, Technology and Innovation (CSTI), Cross-ministerial Strategic Innovation Promotion Program (SIP), “Materials Integration” for Revolutionary Design System of Structural Materials (Funding agency: JST). All the codes described in this paper are

References (35)

  • W.M. Brown et al., Comput. Phys. Commun. (2013)
  • J.F. Ziegler et al., Nucl. Instrum. Methods B (2010)
  • I.V. Morozov et al., Comput. Phys. Commun. (2011)
  • S. Plimpton, J. Comput. Phys. (1995)
  • T.D. Nguyen, Comput. Phys. Commun. (2017)
  • N. Artrith et al., Comput. Mater. Sci. (2016)
  • G. Montavon et al., Digit. Signal Process. (2018)
  • A.P. Thompson et al., J. Comput. Phys. (2015)
  • S. Plimpton, Comput. Mater. Sci. (1995)
  • W.M. Brown et al., Comput. Phys. Commun. (2011)
  • W.M. Brown et al., Comput. Phys. Commun. (2012)
  • S. Lorenz et al., Chem. Phys. Lett. (2004)
  • J. Behler, Phys. Chem. Chem. Phys. (2011)
  • N. Artrith et al., Phys. Rev. B (2012)
  • H. Mori et al., Phys. Rev. Mater. (2020)
  • G.P.P. Pun et al., Nat. Commun. (2019)
  • M.S. Daw et al., Phys. Rev. B (1984)