Exploiting hierarchy parallelism for molecular dynamics on a petascale heterogeneous system
Introduction
The area of high-performance computing (HPC) is changing rapidly, with a strong trend towards heterogeneous architectures that use accelerators in addition to CPUs. Heterogeneous systems are typically configured with two kinds of computing units: general-purpose multi-core CPUs and accelerators such as GPUs, FPGAs, and DSPs. With the aid of accelerators, heterogeneous systems can achieve higher performance with reduced cost and energy consumption [27]. However, real applications often have difficulty exploiting the full computational power of heterogeneous systems (especially large-scale systems), primarily because such systems contain an unprecedented number of computation cores. For example, the TH-1A, a GPU-accelerated heterogeneous petascale supercomputer, contains hundreds of thousands of computation cores (both CPU cores and GPU cores) [51]. Moreover, these computation cores are of heterogeneous types requiring different programming models. The multi-core CPUs support traditional programming languages such as C, C++, and FORTRAN, and popular parallel programming interfaces such as MPI [18] and OpenMP [10], whereas the GPUs support languages such as CUDA [37] and OpenCL [34], which are developed for fine-grained parallelism. Hybrid programming models therefore have to be employed to meet the requirements of software development, and developing efficient parallel applications on heterogeneous systems remains a challenge.
To address this challenge, we propose a hierarchical parallelization scheme for molecular dynamics (MD) to exploit multi-level parallelism on heterogeneous systems. MD is a vital tool in many areas such as materials science [21], chemistry [41], and biology [3], because it provides a particle-level view of simulated systems and processes that is unobtainable through experiments [43]. Large-scale MD simulations involving billions of particles are beginning to address broad scientific problems [43], [36], thus requiring ever higher computational power. As a result, high-performance heterogeneous supercomputers could become important for MD research and development.
MD simulations typically compute the motion of particles by numerically integrating their equations of motion over time [23]. Traditional parallel models employ spatial, force, or particle decomposition methods [39] to exploit parallelism on homogeneous systems. For heterogeneous systems, we have designed a particle–cell–patch based decomposition method that contains three types of decomposition units, namely, the particles, the cells, and the patches. With this hybrid decomposition method, we implement an MD simulation on TH-1A through our Three-Level Parallelization Scheme (TLPS), which includes (1) inter-node parallelism using spatial decomposition via message passing, (2) intra-node (CPU–GPU) parallelism using spatial decomposition via dynamically scheduled multi-threading, and (3) intra-chip parallelism using multi-threading and short vector extensions (SSE) on the CPUs, and multiple CUDA threads on the GPUs. We also employ several optimizations such as intra-node communication hiding and memory optimizations on both CPUs and GPUs.
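The particle–cell–patch decomposition above can be sketched as follows. This is an illustrative C++ sketch under assumed cell and patch sizes; all names (`cellIndex`, `patchOfCell`, `binParticles`) and constants are hypothetical, not the paper's actual code:

```cpp
#include <cstddef>
#include <vector>

// Sketch of a particle-cell-patch decomposition: particles are binned into
// cutoff-sized cells by position, and contiguous blocks of cells are grouped
// into patches (the unit distributed among nodes and devices).
struct Particle { double x, y, z; };

constexpr double kCellSize    = 2.5; // assumed cutoff-sized cell edge
constexpr int    kCellsPerDim = 8;   // assumed cells per box dimension
constexpr int    kPatchDim    = 4;   // assumed cells per patch edge

// Map a particle position to the linear index of its cell.
int cellIndex(const Particle& p) {
    auto clampDim = [](double c) {
        int i = static_cast<int>(c / kCellSize);
        return i < 0 ? 0 : (i >= kCellsPerDim ? kCellsPerDim - 1 : i);
    };
    int ix = clampDim(p.x), iy = clampDim(p.y), iz = clampDim(p.z);
    return (iz * kCellsPerDim + iy) * kCellsPerDim + ix;
}

// A patch owns a kPatchDim^3 block of neighbouring cells.
int patchOfCell(int cell) {
    int ix = cell % kCellsPerDim;
    int iy = (cell / kCellsPerDim) % kCellsPerDim;
    int iz = cell / (kCellsPerDim * kCellsPerDim);
    int per = kCellsPerDim / kPatchDim; // patches per dimension
    return ((iz / kPatchDim) * per + iy / kPatchDim) * per + ix / kPatchDim;
}

// Bin particle indices into their cells.
std::vector<std::vector<int>> binParticles(const std::vector<Particle>& ps) {
    std::vector<std::vector<int>> cells(
        kCellsPerDim * kCellsPerDim * kCellsPerDim);
    for (std::size_t i = 0; i < ps.size(); ++i)
        cells[cellIndex(ps[i])].push_back(static_cast<int>(i));
    return cells;
}
```

A patch is then a natural scheduling unit: a node exchanges patch halos with its neighbours, while within a node whole patches can be handed to either the CPU threads or the GPU.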
By combining the TLPS parallel scheme with these optimizations, our MD code running on TH-1A achieves (1) nearly linear scalability across CPU cores, a SIMD speedup on each CPU core, and a speedup on one GPU over a single CPU core for the force computation tasks; (2) speedups on one node over two CPUs alone and over one GPU alone; (3) speedups over pure-CPU and pure-GPU runs on 7000 nodes; and (4) 414.17 Tflops on 7000 nodes using both CPUs and GPUs. Our work shows that MD simulations can be efficiently parallelized with the TLPS scheme and benefit substantially from the optimizations.
This paper is organized as follows: Section 2 gives background on the TH-1A heterogeneous system and MD simulations. Section 3 describes our TLPS scheme. Section 4 presents the optimizations. The performance results and analysis are given in Section 5. Section 6 introduces related work. Conclusions are drawn in Section 7.
Section snippets
The TH-1A heterogeneous parallel system
TH-1A is a GPU-accelerated heterogeneous parallel system developed by the National University of Defense Technology (NUDT). It is configured with 7168 computation nodes connected by a custom-designed interconnect. Each computation node consists of two multi-core CPUs and one GPU connected via the PCI-E bus. By employing both CPUs and GPUs, TH-1A achieves a LINPACK performance of over 1 Pflops [51]. Fig. 1 gives the architecture of TH-1A.
The interconnect is an important component for
Three-level parallelization scheme
Our TLPS scheme for MD on large-scale heterogeneous systems combines: (1) inter-node parallelism by multi-patch based spatial decomposition using message passing; (2) intra-node (CPU–GPU) parallelism by patch based spatial decomposition via dynamically scheduled multi-threading; (3) intra-chip parallelism with multi-threading and short vector extension (SSE) in CPUs, and multiple CUDA threads in GPUs.
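The dynamically scheduled multi-threading at level (2) can be sketched with a shared atomic patch counter: CPU worker threads and one GPU-controlling thread each claim the next unprocessed patch, so the faster device naturally takes on more work. This is a minimal illustrative model, not the paper's implementation; the worker body here merely records which thread processed each patch in place of the real force computation:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Shared counter: each thread claims the next unprocessed patch.
std::atomic<int> nextPatch{0};

// Stand-in for processing one patch: record which thread handled it.
void worker(int numPatches, std::vector<int>& owner, int id) {
    for (;;) {
        int p = nextPatch.fetch_add(1); // atomically claim a patch
        if (p >= numPatches) break;
        owner[p] = id;                  // placeholder for force computation
    }
}

// Launch one GPU-controlling thread (id 0) plus numCpuThreads CPU workers.
std::vector<int> runSchedule(int numPatches, int numCpuThreads) {
    std::vector<int> owner(numPatches, -1);
    nextPatch = 0;
    std::vector<std::thread> ts;
    ts.emplace_back(worker, numPatches, std::ref(owner), 0);
    for (int i = 1; i <= numCpuThreads; ++i)
        ts.emplace_back(worker, numPatches, std::ref(owner), i);
    for (auto& t : ts) t.join();
    return owner;
}
```

Because work is pulled rather than statically assigned, the scheme tolerates the large and varying performance gap between the CPU threads and the GPU without any tuning of a static split ratio.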
Communication hiding intra-node
The communication between CPU and GPU can easily become a bottleneck during hybrid computation. For each patch, the GPU-controlling thread has to copy the patch from host memory to device memory first, and then copy the results back after computation. To keep these copy overheads as small as possible, a convenient method is to employ the CUDA stream model [37]. A stream is a sequence of commands that execute in order, whereas different streams may execute their commands out of order with
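As a rough illustration of what the stream model buys, the following plain C++ sketch just records the command schedule that a pipelined issue order produces: each patch chunk gets its own stream with an in-order host-to-device copy, kernel, and device-to-host copy, and commands are issued phase-by-phase across streams so one chunk's copies can overlap another chunk's compute. In real CUDA code each `H2D`/`kernel`/`D2H` triple would be a `cudaMemcpyAsync`, a kernel launch on the stream, and another `cudaMemcpyAsync` on per-stream buffers; everything here is illustrative:

```cpp
#include <string>
#include <vector>

// One recorded command: which stream it belongs to and what it does.
struct Cmd { int stream; std::string op; };

// Issue H2D/kernel/D2H phase-by-phase across streams: while stream s runs
// its kernel, stream s+1 can already perform its host-to-device copy.
std::vector<Cmd> issuePipelined(int numChunks) {
    const char* phases[] = {"H2D", "kernel", "D2H"};
    std::vector<Cmd> schedule;
    for (const char* phase : phases)
        for (int s = 0; s < numChunks; ++s)
            schedule.push_back({s, phase});
    return schedule;
}
```

Within each stream the three commands stay in order, which is all correctness requires; across streams the interleaving lets the PCI-E transfers hide behind kernel execution.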
Problem model
We present the performance results for a shock wave propagating through a target copper material containing numerous cavities, as shown in Fig. 8. Compared with evenly distributed copper particles, the material with cavities is at a higher pressure and exhibits more pronounced kinetic energy, indicated by deeper red in the figure. The boundaries of the target material are treated as rigid walls. The corresponding times in Fig. 8(a)–(c) are 6.4, 20.4, and 38.4 ns, respectively. Fig. 8(a) shows a
Large-scale MD simulations
For several decades, MD simulations have been an important tool in many areas. High-performance implementations of MD simulations mainly address two issues. One is the timescale issue of allowing MD simulations to reach timescales of minutes or longer for a particular class of rare-event systems [26]. Such simulations typically involve a limited number of particles. Many standard production packages such as Chemistry at Harvard Molecular Mechanics (CHARMM) [7], Amber [9], and NAMD [35] keep on
Conclusions
In summary, we have developed TLPS, a hierarchical parallelization scheme for MD on heterogeneous systems, achieving speedups on 7000 nodes of the heterogeneous supercomputer TH-1A over both pure-CPU and pure-GPU runs. The simulation delivers 414.17 Tflops on 7000 nodes with both CPUs and GPUs. Within each node, the dynamic-scheduling-based multi-threading between CPUs and GPUs achieves good workload balancing.
Acknowledgments
The authors thank the anonymous reviewers for their careful reading and helpful suggestions. This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61170049, and the National High Technology Research and Development Program of China (863 Program) under Grant No. 2012AA010903.
Qiang Wu received his Master degree in Computer Science from the National University of Defense Technology, China, in 2009. He is currently pursuing his Ph.D. degree at the National University of Defense Technology. His research interests include compiler techniques for high performance, compiler techniques for embedded systems, and parallel programming. He is a member of the IEEE.
References (53)
- et al., General purpose molecular dynamics simulations fully implemented on graphics processing units, J. Comput. Phys. (2008)
- et al., Accelerated molecular dynamics simulation with the parallel fast multipole algorithm, Chem. Phys. Lett. (1992)
- et al., Implementing molecular dynamics on hybrid high performance computers: particle–particle particle-mesh, Comput. Phys. Comm. (2012)
- et al., Dynamic load balancing in computational mechanics, Comput. Methods Appl. Mech. Engrg. (2000)
- et al., VMD: visual molecular dynamics, J. Mol. Graph. (1996)
- et al., Molecular dynamics simulations of the relaxation processes in the condensed matter on GPUs, Comput. Phys. Comm. (2011)
- Fast parallel algorithms for short-range molecular dynamics, J. Comput. Phys. (1995)
- et al., Multimillion atom simulation of materials on parallel computers: nanopixel, interfacial fracture, nanoindentation, and oxidation, Appl. Surf. Sci. (2001)
- et al., Tabulated potentials in molecular dynamics simulations, Comput. Phys. Comm. (1999)
- et al., Molecular dynamics simulations of aqueous ions at the liquid–vapor interface accelerated using graphics processors, J. Comput. Chem. (2011)
- Bio-molecular dynamics comes of age, Science (New York, NY)
- Scalable algorithms for molecular dynamics simulations on commodity clusters
- Compilation, architectural support, and evaluation of SIMD graphics pipeline programs on a general-purpose CPU
- CHARMM: a program for macromolecular energy, minimization, and dynamics calculations, J. Comput. Chem.
- The Amber biomolecular simulation programs, J. Comput. Chem.
- Parallel Programming in OpenMP
- Towards large-scale molecular dynamics simulations on graphics processors
- Evaluation of OpenMP task scheduling strategies
- Accelerating molecular dynamic simulation on graphics processing units, J. Comput. Chem.
- A parallel scalable approach to short-range molecular dynamics on the CM-5
- Using GPUs to improve multigrid solver performance on a cluster, Internat. J. Comput. Sci. Eng.
- Numerical Simulation in Molecular Dynamics
- Using MPI: Portable Parallel Programming with the Message Passing Interface, Vol. 1
- A multilevel algorithm for partitioning graphs, Tech. Rep., Citeseer
- Molecular dynamics comes of age for shockwave research, Shock Waves
Canqun Yang received the M.S. and Ph.D. degrees in Computer Science from the National University of Defense Technology, China, in 1995 and 2008, respectively. Currently he is a Professor at the National University of Defense Technology. His research interests include programming languages and compiler implementation. He is the major designer dealing with the compiler system of the Tianhe Supercomputer.
Tao Tang received the Ph.D. degree in Computer Science from the National University of Defense Technology, China, in 2011. Currently he is an Assistant Professor at the National University of Defense Technology. His research interests include programming languages and compiler implementation.

Liquan Xiao received the Ph.D. degree in Computer Science from the National University of Defense Technology. Currently he is a Professor at the same university. His research interests include network and I/O systems.