Exploiting hierarchy parallelism for molecular dynamics on a petascale heterogeneous system

https://doi.org/10.1016/j.jpdc.2013.07.015

Highlights

  • Proposed a Three-Level Parallelization Scheme for Molecular Dynamics.

  • Employed hierarchical optimizations to address the bottlenecks of MD with TLPS.

  • Evaluated MD simulation with TLPS and optimizations on TH-1A.

Abstract

Heterogeneous systems whose nodes contain more than one type of computation unit, e.g., central processing units (CPUs) and graphics processing units (GPUs), are becoming popular because of their low cost and high performance. In this paper, we develop a Three-Level Parallelization Scheme (TLPS) for molecular dynamics (MD) simulation on heterogeneous systems. The scheme exploits multi-level parallelism by combining (1) inter-node parallelism using spatial decomposition via message passing, (2) intra-node parallelism using spatial decomposition via dynamically scheduled multi-threading, and (3) intra-chip parallelism using multi-threading and short vector extensions in CPUs, and multiple CUDA threads in GPUs. By using this hierarchy of parallelism together with optimizations such as intra-node communication hiding and memory optimizations on both CPUs and GPUs, we have implemented and evaluated an MD simulation on the petascale heterogeneous supercomputer TH-1A. The results show that MD simulations can be efficiently parallelized with our TLPS scheme and benefit from the optimizations.

Introduction

The area of high-performance computing (HPC) is changing rapidly, and the trend towards heterogeneous architectures that use accelerators in addition to CPUs has become popular. Heterogeneous systems are typically configured with two kinds of computing units: general-purpose multi-core CPUs and accelerators such as GPUs, FPGAs, and DSPs. With the aid of accelerators, heterogeneous systems can achieve higher performance with reduced cost and energy consumption [27]. However, real applications often have difficulty exploiting the full computation power of heterogeneous systems (especially large-scale systems), primarily because such systems contain an unprecedented number of computation cores. For example, TH-1A, a GPU-accelerated heterogeneous petascale supercomputer, contains hundreds of thousands of computation cores (both CPU cores and GPU cores) [51]. Moreover, these computation cores are of heterogeneous types requiring different programming models. The multi-core CPUs support traditional programming languages such as C, C++, and Fortran, and popular parallel programming interfaces such as MPI [18] and OpenMP [10], while the GPUs support languages such as CUDA [37] and OpenCL [34], which are designed for fine-grained parallelism. Hybrid programming models therefore have to be employed to meet the requirements of software development, and the development of efficient parallel applications on heterogeneous systems remains a challenge.

To address this challenge, we propose a hierarchical parallelization scheme for molecular dynamics (MD) that exploits multi-level parallelism on heterogeneous systems. MD is a vital tool in many areas, such as materials science [21], chemistry [41], and biology [3], because it provides a particle-level view of simulated systems and processes that is unobtainable through experiments [43]. Large-scale MD simulations involving multi-billion-particle systems are beginning to address broad scientific problems [43], [36], and thus require ever higher computation power. As a result, high-performance heterogeneous supercomputers could become important for MD research and development.

MD simulations typically follow the movement of particles by integrating their equations of motion over time [23]. Traditional parallel models employ spatial, force, or particle decomposition methods [39] to exploit parallelism on homogeneous systems. For heterogeneous systems, we have designed a particle–cell–patch based decomposition method with three types of decomposition units: particles, cells, and patches. With this hybrid decomposition method, we implement an MD simulation on TH-1A through our Three-Level Parallelization Scheme (TLPS), which includes (1) inter-node parallelism using spatial decomposition via message passing, (2) intra-node (CPU–GPU) parallelism using spatial decomposition via dynamically scheduled multi-threading, and (3) intra-chip parallelism using multi-threading and short vector extensions (SSE) in CPUs, and multiple CUDA threads in GPUs. We also employ several optimizations, such as intra-node communication hiding and memory optimizations on both CPUs and GPUs; a sketch of the decomposition units is given below.
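To make the hybrid decomposition concrete, the following is a minimal sketch of the three decomposition units, assuming particles are stored in a flat array sorted by cell; all type and field names are illustrative, since the paper does not publish its data layout.

    // Hypothetical sketch of the particle-cell-patch hierarchy.
    struct Particle {
        float x, y, z;       // position
        float vx, vy, vz;    // velocity
        float fx, fy, fz;    // accumulated force
    };

    struct Cell {            // cutoff-sized box of space
        int first;           // index of its first particle in the
        int count;           // cell-sorted particle array
    };

    struct Patch {           // contiguous block of cells assigned to one
        int firstCell;       // compute unit: a group of CPU cores or
        int cellCount;       // the GPU-controlling thread
    };

With such a layout, each cell's particles are contiguous in memory, so a patch corresponds to one contiguous range that can be processed by CPU threads in place or shipped to the GPU in a single transfer.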

By combining the TLPS scheme with these optimizations, our MD code on TH-1A achieves (1) nearly linear scalability across CPU cores, with up to 1.39× SIMD speedup on each CPU core and up to 18× speedup on one GPU over a single CPU core for the force computation tasks, (2) up to 2.34× and 1.54× speedups on one node over two CPUs and one GPU, respectively, (3) 1.4× and 1.25× speedups over pure CPUs and pure GPUs, respectively, on 7000 nodes, and (4) 414.17 Tflops on 7000 nodes using both CPUs and GPUs. These results show that MD simulations can be efficiently parallelized with our TLPS scheme and benefit from the optimizations.

This paper is organized as follows: Section 2 gives background on the TH-1A heterogeneous system and MD simulations. Section 3 describes our TLPS scheme. Section 4 presents the optimizations. Performance results and analysis are given in Section 5. Section 6 discusses related work. Conclusions are drawn in Section 7.

Section snippets

The TH-1A heterogeneous parallel system

TH-1A is a GPU-accelerated heterogeneous parallel system developed by the National University of Defense Technology (NUDT). It is configured with 7168 computation nodes connected by a specially designed network. Each computation node consists of two multi-core CPUs and one GPU connected via a PCI-E bus. By employing both CPUs and GPUs, TH-1A achieves a LINPACK performance of over 1 Pflops [51]. Fig. 1 gives the architecture of TH-1A.

The interconnect is an important component for

Three-level parallelization scheme

Our TLPS scheme for MD on large-scale heterogeneous systems combines: (1) inter-node parallelism by multi-patch based spatial decomposition using message passing; (2) intra-node (CPU–GPU) parallelism by patch-based spatial decomposition via dynamically scheduled multi-threading; and (3) intra-chip parallelism with multi-threading and short vector extensions (SSE) in CPUs, and multiple CUDA threads in GPUs. A skeleton of the scheme is sketched below.
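As a rough illustration of how the three levels fit together, the skeleton below assumes one MPI process per node and two hypothetical per-patch force routines; it is a sketch of the scheme's structure under those assumptions, not the paper's actual implementation.

    // TLPS skeleton (illustrative). Level 1: MPI across nodes.
    // Level 2: OpenMP threads dynamically claim patches, with one
    // designated thread driving the GPU. Level 3 lives inside the
    // per-patch routines (SSE intrinsics on CPUs, CUDA kernels on GPUs).
    #include <mpi.h>
    #include <omp.h>

    void compute_patch_cpu(int p) { /* SSE-vectorized force loop (sketch) */ }
    void compute_patch_gpu(int p) { /* copy patch p, launch CUDA kernels */ }

    void force_step(int numPatches) {
        #pragma omp parallel for schedule(dynamic)  // dynamic load balance
        for (int p = 0; p < numPatches; ++p) {
            if (omp_get_thread_num() == 0)   // GPU-controlling thread
                compute_patch_gpu(p);        // offloads the patches it claims
            else
                compute_patch_cpu(p);        // CPU threads compute in place
        }
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);              // level 1: inter-node
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        // ... assign this rank's patches by spatial decomposition ...
        // ... per timestep: exchange halo particles, run force_step(),
        //     then integrate positions and velocities ...
        MPI_Finalize();
        return 0;
    }

Because schedule(dynamic) hands out patches on demand, the faster consumer simply claims more of them, which is one way to realize the dynamic CPU–GPU workload balancing the scheme calls for.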

Communication hiding intra-node

The communication between CPU and GPU is prone to become a bottleneck during hybrid computation. For each patch, the GPU-controlling thread has to copy the patch from host memory to device memory first, and then copy the results back after computation. To keep these copy overheads as small as possible, a convenient method is to employ the CUDA stream model [37]. A stream is a sequence of commands that execute in order, whereas different streams may execute their commands out of order with respect to one another.
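The following is a minimal sketch of that overlap with two CUDA streams, assuming hypothetical pinned host buffers hIn/hOut, device buffers dIn/dOut, and a placeholder force kernel; the paper's actual kernels and buffering scheme are not published.

    #include <cuda_runtime.h>

    __global__ void forceKernel(const float4 *in, float4 *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];           // placeholder force computation
    }

    // Cycle two streams so the copies of one patch can overlap the kernel
    // of another. hIn/hOut must be pinned host memory (cudaHostAlloc),
    // otherwise cudaMemcpyAsync degrades to a synchronous copy.
    void run_patches(float4 **hIn, float4 **hOut, float4 **dIn,
                     float4 **dOut, const int *counts, int numPatches) {
        cudaStream_t s[2];
        for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);
        for (int p = 0; p < numPatches; ++p) {
            cudaStream_t st = s[p % 2];      // alternate between streams
            size_t bytes = counts[p] * sizeof(float4);
            cudaMemcpyAsync(dIn[p], hIn[p], bytes,
                            cudaMemcpyHostToDevice, st);  // upload patch p
            forceKernel<<<(counts[p] + 255) / 256, 256, 0, st>>>(
                dIn[p], dOut[p], counts[p]);              // compute patch p
            cudaMemcpyAsync(hOut[p], dOut[p], bytes,
                            cudaMemcpyDeviceToHost, st);  // download results
        }
        cudaDeviceSynchronize();             // wait for all patches
        for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
    }

Even on hardware with a single copy engine, this two-stream pattern lets the upload of patch p+1 proceed while the kernel for patch p runs, hiding much of the PCI-E transfer time.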

Problem model

We present performance results for a shock wave propagating through a copper target containing many cavities, as shown in Fig. 8. Compared with evenly distributed copper particles, the material around the cavities is under higher pressure and has more pronounced kinetic energy, denoted by deeper red in the figure. The boundaries of the target material are treated as rigid walls. The corresponding times in Fig. 8(a)–(c) are 6.4, 20.4, and 38.4 ns, respectively. Fig. 8(a) shows a

Large-scale MD simulations

For several decades, MD simulations have been an important tool in many areas. High-performance implementations of MD simulations mainly address two issues. The first is the timescale issue: allowing MD simulations to reach times of minutes or longer for a particular class of rare-event systems [26]. These simulations typically involve a limited number of particles. Many standard production packages, such as Chemistry at Harvard Molecular Mechanics (CHARMM) [7], Amber [9], and NAMD [35], keep on

Conclusions

In summary, we have developed a hierarchical parallelization scheme, TLPS, for MD on heterogeneous systems, achieving 1.4× and 1.25× speedups on 7000 nodes of the heterogeneous supercomputer TH-1A compared with pure CPUs and pure GPUs, respectively. The simulation delivers 414.17 Tflops on 7000 nodes with both CPUs and GPUs. Within each node, the dynamic-scheduling-based multi-threading between CPUs and GPUs achieves good workload balancing, yielding up to 2.34× and 1.54× speedups on one node over two CPUs and one GPU, respectively.

Acknowledgments

The authors thank the anonymous reviewers for their careful reading and helpful suggestions. This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61170049, and the National High Technology Research and Development Program of China (863 Program) under Grant No. 2012AA010903.

References (53)

  • H. Berendsen

    Bio-molecular dynamics comes of age

Science

    (1996)
  • K.J. Bowers et al.

    Scalable algorithms for molecular dynamics simulations on commodity clusters

  • M. Breternitz et al.

Compilation, architectural support, and evaluation of SIMD graphics pipeline programs on a general-purpose CPU

  • B.R. Brooks et al.

CHARMM: a program for macromolecular energy, minimization, and dynamics calculations

    J. Comput. Chem.

    (1983)
  • D.A. Case et al.

The Amber biomolecular simulation programs

    J. Comput. Chem.

    (2005)
  • R. Chandra et al.

    Parallel Programming in OpenMP

    (2000)
  • J.E. Davis et al.

    Towards large-scale molecular dynamics simulations on graphics processors

  • A. Duran et al.

Evaluation of OpenMP task scheduling strategies

  • M.S. Friedrichs et al.

    Accelerating molecular dynamic simulation on graphics processing units

    J. Comput. Chem.

    (2009)
  • T.C. Germann, K. Kadau, P.S. Lomdahl, 25 Tflop/s multibillion-atom molecular dynamics simulations and...
  • R. Giles et al.

A parallel scalable approach to short-range molecular dynamics on the CM-5

  • D. Göddeke et al.

    Using GPUs to improve multigrid solver performance on a cluster

    Internat. J. Comput. Sci. Eng.

    (2008)
  • M. Griebel et al.

    Numerical Simulation in Molecular Dynamics

    (2007)
  • W. Gropp et al.

    Using MPI—Portable Parallel Programming with the Message Passing Interface, Vol. 1

    (1999)
  • B. Hendrickson et al.

A multilevel algorithm for partitioning graphs, Tech. Rep., Sandia National Laboratories

    (1993)
  • B. Holian

    Molecular dynamics comes of age for shockwave research

    Shock Waves

    (2004)

    Qiang Wu received his Master degree in Computer Science from the National University of Defense Technology, China, in 2009. He is currently pursuing his Ph.D. degree at the National University of Defense Technology. His research interests include compiler techniques for high performance, compiler techniques for embedded systems, and parallel programming. He is a member of the IEEE.

    Canqun Yang received the M.S. and Ph.D. degrees in Computer Science from the National University of Defense Technology, China, in 1995 and 2008, respectively. Currently he is a Professor at the National University of Defense Technology. His research interests include programming languages and compiler implementation. He is the major designer dealing with the compiler system of the Tianhe Supercomputer.

Tao Tang received the Ph.D. degree in Computer Science from the National University of Defense Technology, China, in 2011. Currently he is an Assistant Professor at the National University of Defense Technology. His research interests include programming languages and compiler implementation.

Liquan Xiao received the Ph.D. degree in Computer Science from the National University of Defense Technology. Currently he is a Professor at the same university. His research interests include network and I/O systems.
