Elsevier

Parallel Computing

Volume 34, Issue 11, November 2008, Pages 640-651

Performance characteristics of biomolecular simulations on high-end systems with multi-core processors

https://doi.org/10.1016/j.parco.2008.05.003

Abstract

Biological processes occurring inside cells involve multiple scales of time and length; many popular theoretical and computational multi-scale techniques utilize biomolecular simulations based on molecular dynamics. Until recently, the computing power required for simulating the relevant scales was beyond the reach of even the fastest supercomputers. The availability of petaFLOPS-scale computing power in the near future holds great promise. Unfortunately, biosimulation software technology has not kept up with the changes in hardware. In particular, with the introduction of multi-core processing technologies in systems with tens of thousands of processing cores, it is unclear whether the existing biomolecular simulation frameworks will be able to scale and to utilize these resources effectively. While multi-core processing systems provide higher processing capabilities, their memory and network subsystems pose new challenges to application and system software developers. In this study, we attempt to characterize the computation, communication and memory efficiencies of biomolecular simulations on teraflops-scale Cray XT systems, which contain dual-core Opteron processors. We find that application efficiency on the multi-core processors decreases as the size of the simulated system grows. Further, we measure the communication overhead of using both cores of a processor simultaneously and find that the slowdown in MPI communication performance can significantly lower the achievable performance in dual-core execution mode. We conclude that not only must biomolecular simulations be aware of the underlying multi-core hardware in order to achieve maximum performance, but the system software must also provide processor and memory placement features on high-end systems. Our results on stand-alone multi-core AMD and Intel systems confirm that combinations of processor and memory affinity schemes cause significant performance variations for our target test cases.

Introduction

A better knowledge of biomolecules is the key to understanding the mechanistic details of the various biochemical processes that occur in all living cells. Biomolecular structure, dynamics and function span multiple scales of time and length [7], [8], [9], [10]. In the past, experimental techniques have provided a wealth of insight into the workings of biomolecules; more recently, theoretical and computational multi-scale modeling techniques based upon biomolecular simulations continue to provide novel insights [10]. Until recently, the computing power required for simulating the length and time-scales relevant to biomolecules was beyond the reach of even the fastest supercomputers. In particular, the dynamics and functions of biomolecules span more than 15 orders of magnitude in time; the available computing power falls short by 4–6 orders of magnitude in its ability to simulate the desired time-scales [11]. The availability of petaFLOPS-scale computing power in the near future holds great promise for this area. Many of the popular biomolecular simulation codes in use today were designed several decades ago with a different programming paradigm in mind. Unfortunately, it is now becoming evident that biosimulation software technology has not kept up with the change in hardware. In particular, with the introduction of multi-core processing technologies in systems with tens to hundreds of thousands of processing cores, it is unclear whether the existing biomolecular simulation frameworks will be able to scale and to utilize these resources effectively [12], [18].

Microprocessor vendors today have the ability to produce chips with an ever-increasing number of transistors; duplicating existing cores is therefore a straightforward way to address physical and power constraints and limited instruction-level parallelism. However, because all cores of a processor share its resources, including memory, I/O links and off-node communication, contention for these resources can limit the achievable performance when using more than one core per processor. Applications such as biomolecular simulations can perform well on systems with multi-core processors, but only if they expose enough parallelism to use the multiple cores within their collective memory bandwidth limitations [14].

The fundamental question for biomolecular simulation frameworks is whether multiple cores per processor can provide performance commensurate with initial expectations. The shared memory and I/O (network) bandwidth of multiple cores in a socket draws into question both how efficiently an application can use multiple cores and what methods provide the highest efficiency. In this preliminary study, we characterize the computation, communication and memory efficiencies of a scalable biomolecular simulation framework called LAMMPS [23] on Cray XT systems; the XT3 and XT4 systems together provide 119 teraflops of processing power. The systems at the Oak Ridge National Laboratory (ORNL) are based on dual-core Opteron processors. We also evaluated the performance of LAMMPS on dual-core and quad-core Intel systems, which offer distinct microarchitectural features. We identify that the performance gap between single-core and dual-core execution times depends on the problem size as well as the size of the target system. In addition, we evaluated a number of processor affinity techniques for managing memory placement on multi-core systems. Our experiments on a stand-alone dual-core system show that an appropriate selection of MPI task and memory placement schemes results in significant performance variations for our target test cases.

The paper is organized as follows: In Section 2, we provide a brief introduction to biomolecular simulations, the LAMMPS framework, our test cases, and the architecture and programming environment of our target Cray XT3 system. An overview of the related research efforts in the area of biomolecular simulation frameworks on high-end supercomputers is presented in Section 3. Performance evaluation and data collection experiments and results are presented in Section 4. Conclusions and future plans are outlined in Section 5.

Section snippets

Molecular dynamics simulations

Numerous applications use molecular dynamics (MD) for biomolecular simulations. MD and related techniques can be defined as a computer simulation methodology in which the time evolution of a set of interacting particles is modeled by integrating the equations of motion. The underlying MD technique is based on the laws of classical mechanics, most notably Newton's second law, F = ma. The MD steps performed in LAMMPS or other MD engines consist of three calculations: determining energy of a system and forces on
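The time-stepping loop described above can be illustrated with a minimal sketch of velocity Verlet integration, a scheme commonly used in MD engines. This is a generic 1-D toy example (the harmonic force, particle parameters and function names are illustrative, not taken from LAMMPS):

```python
import numpy as np

def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equation F = ma with the velocity Verlet scheme.

    Each step updates positions, recomputes forces at the new positions,
    then updates velocities -- the basic structure of an MD time step.
    """
    a = force(x) / mass
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # position update
        a_new = force(x) / mass              # force evaluation at new positions
        v = v + 0.5 * (a + a_new) * dt       # velocity update
        a = a_new
    return x, v

# Toy usage: a 1-D harmonic oscillator (F = -k x) standing in for the
# bonded/non-bonded force evaluation of a real biomolecular force field.
k, m, dt = 1.0, 1.0, 0.01
x, v = velocity_verlet(np.array([1.0]), np.array([0.0]),
                       lambda x: -k * x, m, dt, 1000)
```

Because velocity Verlet is symplectic, the total energy of the toy oscillator stays close to its initial value (0.5 here) over many steps, which is one reason this family of integrators dominates in MD codes.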

Related research

Qbox is a first-principles molecular dynamics (FPMD) code which has been shown to scale to a relatively high number of processors [17]. FPMD differs from classical MD in that it combines a quantum mechanical description of electrons with a classical description of atomic nuclei. Qbox is a parallel implementation of the FPMD method designed specifically for large parallel platforms, including BlueGene/L. Simulations have been performed using up to 32,768 processors and performance

Experiments and results

We performed three sets of experiments to collect and to analyze runtime and performance data. First, we collected runtime scaling data on the XT3 platforms in single-core and dual-core mode and measured the impact of a runtime option called small_pages. Second, we instrumented the code with TAU (Tuning and Analysis Utilities) [6] and gathered hardware counter and MPI performance data. Finally, on our dual-core system running the Linux operating system, we measured the impact of different processor and
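The processor placement experiments mentioned above rely on OS-level affinity controls. As a minimal illustration (assuming a Linux host, since `os.sched_setaffinity` is Linux-specific, and using core id 0 purely as an example), a process can pin itself to a single core so that, under a first-touch policy, its memory allocations stay local to that core's socket:

```python
import os

def pin_to_core(core_id):
    """Restrict the calling process to a single CPU core.

    Pid 0 means the calling process; the returned set is the affinity
    mask actually in effect after the call.
    """
    os.sched_setaffinity(0, {core_id})
    return os.sched_getaffinity(0)

allowed = pin_to_core(0)  # pin this process to core 0
```

In practice, MPI launchers and batch systems expose equivalent placement controls, so an application rarely calls these primitives directly; the sketch only shows the mechanism the placement schemes build on.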

Conclusions and future plans

In this study, we have described the performance characterization of a large-scale parallel molecular dynamics code, LAMMPS, for simulating biologically relevant systems. The focus of this study is to characterize performance on multi-core processors, as the upcoming generations of supercomputers, including Cray's XT3 and XT4 as well as IBM's Blue Gene, continue to pack increasingly more computing power by utilizing multi-core processors. The results on the XT3 system indicate that the

Acknowledgements

The authors would like to thank the National Center for Computational Sciences (NCCS) for access to the Cray XT3 and for support (INCITE award). This research used resources of the NCCS at the Oak Ridge National Laboratory, which is supported by the Office of Science of the US Department of Energy under Contract No. DE-AC05-00OR22725.

References (24)

  • P.K. Agarwal, Role of protein dynamics in reaction rate enhancement by enzymes, J. Am. Chem. Soc. (2005)
  • P.K. Agarwal, Enzymes: an integrated view of structure, dynamics and function, Microb. Cell Factories (2006)