Performance characteristics of biomolecular simulations on high-end systems with multi-core processors
Introduction
A better understanding of biomolecules is key to unraveling the mechanistic details of the biochemical processes that occur in all living cells. Biomolecular structure, dynamics and function span multiple scales of time and length [7], [8], [9], [10]. In the past, experimental techniques have provided a wealth of insight into the workings of biomolecules; more recently, theoretical and computational multi-scale modeling techniques based on biomolecular simulations continue to provide novel insights [10]. Until recently, the computing power required for simulating the length and time scales relevant to biomolecules was beyond the reach of even the fastest supercomputers. In particular, the dynamics and functions of biomolecules span more than 15 orders of magnitude in time, while available computing power falls short by 4–6 orders of magnitude of what is needed to simulate the desired time scales [11]. The availability of petaFLOPS-scale computing power in the near future holds great promise for this area. However, many of the popular biomolecular simulation codes in use today were designed decades ago with a different programming paradigm in mind, and it is becoming evident that biosimulation software technology has not kept pace with changes in hardware. In particular, with the introduction of multi-core processors in systems with tens to hundreds of thousands of processing cores, it is unclear whether existing biomolecular simulation frameworks will be able to scale and to utilize these resources effectively [12], [18].
Microprocessor vendors today have the ability to produce chips with an ever-increasing number of transistors, so duplicating existing cores is a straightforward way to address physical and power constraints and limited instruction-level parallelism. However, because all cores of a processor share access to the processor's resources, including memory, I/O links and off-node communication, contention for these resources can limit the achievable performance when using more than one core per processor. Applications such as biomolecular simulations can perform well on systems with these multi-core processors, but only if they expose enough parallelism to use the multiple cores within their collective memory bandwidth limitations [14].
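The bandwidth contention described above can be illustrated with a minimal sketch (not from the paper): time one memory-streaming worker versus two running concurrently. On a multi-core chip whose cores share a memory link, the two-worker run typically takes noticeably longer than half the serial time for the same total work. The sizes and worker function here are illustrative assumptions, not measurements from the study.

```python
import time
from multiprocessing import Process

N = 2_000_000  # elements per worker; chosen (arbitrarily) to exceed typical caches

def stream_sum(n):
    """Allocate and walk a large list once, so the working set streams from memory."""
    data = [1.0] * n
    return sum(data)

def run_workers(count):
    """Wall-clock time for `count` concurrent streaming workers.

    With a shared memory bus, per-worker time grows as bandwidth saturates,
    so run_workers(2) is usually well above run_workers(1) despite the
    second core being otherwise idle.
    """
    procs = [Process(target=stream_sum, args=(N,)) for _ in range(count)]
    t0 = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - t0

if __name__ == "__main__":
    print(f"1 worker: {run_workers(1):.3f}s, 2 workers: {run_workers(2):.3f}s")
```

A pure-Python loop understates the effect relative to a tuned kernel such as STREAM, but the qualitative behavior is the same: adding cores does not add memory bandwidth.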
The fundamental question for biomolecular simulation frameworks is whether multiple cores per processor can provide performance commensurate with initial expectations. The memory and I/O (network) bandwidth shared by the cores in a socket calls into question both how efficiently an application can use multiple cores and which methods provide the highest efficiency. In this preliminary study, we characterize the computation, communication and memory efficiencies of a scalable biomolecular simulation framework called LAMMPS [23] on Cray XT systems; the XT3 and XT4 systems together provide 119 teraFLOPS of processing power. These systems, at Oak Ridge National Laboratory (ORNL), are based on dual-core Opteron processors. We also evaluated the performance of LAMMPS on Intel dual-core and quad-core systems, which offer distinct microarchitectural features. We find that the performance gap between single-core and dual-core execution times depends on the problem size as well as the size of the target system. In addition, we evaluated a number of processor affinity techniques for managing memory placement on multi-core systems. Our experiments on a stand-alone dual-core system show that the choice of MPI task and memory placement scheme results in significant performance variations for our target test cases.
The paper is organized as follows. Section 2 provides a brief introduction to biomolecular simulations, the LAMMPS framework, our test cases, and the architecture and programming environment of our target Cray XT3 system. Section 3 presents an overview of related research on biomolecular simulation frameworks on high-end supercomputers. Section 4 presents our performance evaluation and data collection experiments and results. Section 5 outlines conclusions and future plans.
Section snippets
Molecular dynamics simulations
Numerous applications use molecular dynamics (MD) for biomolecular simulations. MD and related techniques are computer simulation methodologies in which the time evolution of a set of interacting particles is modeled by integrating the equations of motion. The underlying MD technique is based on the laws of classical mechanics, most notably Newton's second law, F = ma. The MD steps performed in LAMMPS or other MD engines consist of three calculations: determining the energy of a system and the forces on …
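The integration of F = ma can be sketched with the velocity-Verlet scheme commonly used in MD engines. This is a generic one-particle illustration under assumed units, not LAMMPS's actual implementation:

```python
def velocity_verlet(x, v, force, m, dt):
    """One velocity-Verlet step: integrate F = ma for position x and velocity v."""
    a = force(x) / m
    x_new = x + v * dt + 0.5 * a * dt * dt   # position update
    a_new = force(x_new) / m                  # force at the new position
    v_new = v + 0.5 * (a + a_new) * dt        # velocity from averaged accelerations
    return x_new, v_new

# Example: a 1-D harmonic oscillator with force -k*x (k = m = 1, arbitrary units)
k = 1.0
x, v = 1.0, 0.0
for _ in range(1000):
    x, v = velocity_verlet(x, v, lambda q: -k * q, 1.0, 0.01)

# The total energy 0.5*v**2 + 0.5*k*x**2 stays close to the initial 0.5,
# which is why symplectic integrators of this kind are standard in MD.
```

In a real MD code the force evaluation dominates the cost, since it involves bonded and non-bonded interactions over many particles; the integrator itself is cheap.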
Related research
Qbox is a first-principles molecular dynamics (FPMD) code that has been shown to scale to a relatively high number of processors [17]. FPMD differs from classical MD in its ability to combine a quantum mechanical description of electrons with a classical description of atomic nuclei. Qbox is a parallel implementation of the FPMD method designed specifically for large parallel platforms, including BlueGene/L. Simulations have been performed using up to 32,768 processors, and performance …
Experiments and results
We performed three sets of experiments to collect and analyze runtime and performance data. First, we collected runtime scaling data on the XT3 platform in single-core and dual-core modes and measured the impact of a runtime option called small_pages. Second, we instrumented the code with TAU (Tuning and Analysis Utilities) [6] and gathered hardware counter and MPI performance data. Finally, on our dual-core Linux system, we measured the impact of different processor and …
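The processor-placement experiments above rely on binding each task to a specific core. A minimal sketch of such pinning, using the Linux-only `os.sched_setaffinity` call as a stand-in for whatever MPI launcher or runtime flags the study actually used:

```python
import os

def pin_to_core(core):
    """Pin the calling process to a single core (Linux-only).

    In an MPI setting, each rank would do this (or the launcher would do it
    on the rank's behalf) so that a task and the memory it touches stay on
    one core's local resources instead of migrating between cores.
    """
    os.sched_setaffinity(0, {core})          # 0 means "the calling process"
    return os.sched_getaffinity(0)           # report the resulting allowed set

if __name__ == "__main__":
    print("allowed cores before:", sorted(os.sched_getaffinity(0)))
    print("allowed cores after :", sorted(pin_to_core(0)))
```

Memory placement is the complementary knob: on NUMA systems, tools such as `numactl` control which memory node backs a process's allocations, and mismatched task and memory placement is one source of the performance variations the study reports.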
Conclusions and future plans
In this study, we have presented a performance characterization of a large-scale parallel molecular dynamics code, LAMMPS, for simulating biologically relevant systems. The focus of this study is performance on multi-core processors, as upcoming generations of supercomputers, including Cray's XT3 and XT4 as well as IBM's Blue Gene, continue to pack increasingly more computing power through the use of multi-core processors. The results on the XT3 system indicate that the …
Acknowledgements
The authors would like to thank the National Center for Computational Sciences (NCCS) for access to the Cray XT3 and for support (INCITE award). This research used resources of the NCCS at Oak Ridge National Laboratory, which is supported by the Office of Science of the US Department of Energy under Contract No. DE-AC05-00OR22725.
References (24)
- et al., AMBER, a package of computer programs for applying molecular mechanics, normal mode analysis, molecular dynamics and free energy calculations to simulate the structural and energetic properties of molecules, Comput. Phys. Commun. (1995)
- Fast parallel algorithms for short-range molecular dynamics, J. Comp. Phys. (1995)
- The Cell project at IBM research
- Cray Performance Analysis Tools (CrayPAT)
- LAMMPS benchmarks
- Performance API (PAPI)
- National Center for Computational Sciences at ORNL
- Tuning and Analysis Utilities (TAU)
- et al., Protein dynamics and enzymatic catalysis: investigating the peptidyl–prolyl cis/trans isomerization activity of cyclophilin A, Biochemistry (2004)
- Computational studies of the mechanism of cis/trans isomerization in HIV-1 catalyzed by cyclophilin A, Proteins (2004)
- Role of protein dynamics in reaction rate enhancement by enzymes, J. Am. Chem. Soc.
- Enzymes: an integrated view of structure, dynamics and function, Microb. Cell Factories