tinyMD: Mapping molecular dynamics simulations to heterogeneous hardware using partial evaluation

https://doi.org/10.1016/j.jocs.2021.101425

Highlights

  • Device, data layout and communication abstractions in AnyDSL are presented.

  • Benefits achieved using the AnyDSL framework for our application are summarized.

  • Performance considerations for molecular dynamics are discussed.

  • A strategy to couple our approach with the waLBerla framework is described.

  • Performance comparison results between our approach and miniMD are shown.

Abstract

This paper investigates the suitability of the AnyDSL partial evaluation framework for implementing tinyMD: an efficient, scalable, and portable simulation of pairwise interactions among particles. We compare tinyMD with the miniMD proxy application, which scales very well on parallel supercomputers. We discuss the differences between both implementations and contrast their performance on single-node CPU and GPU targets, as well as their scalability on the SuperMUC-NG and Piz Daint supercomputers. Additionally, we demonstrate tinyMD's flexibility by coupling it with the waLBerla multi-physics framework. This allows us to execute tinyMD simulations using the load-balancing mechanism implemented in waLBerla.

Introduction

Nowadays, compute-heavy simulation software typically runs on different types of high-end processors to complete simulations in a reasonable amount of time. However, such high-end hardware requires highly specialized code that is precisely adapted to the respective hardware in order to get anywhere near peak performance. Moreover, in some cases different algorithmic variants are better suited to different kinds of hardware. For example, writing applications for a GPU, a massively parallel device with a very peculiar memory hierarchy, is very different from implementing applications for a CPU.

Very large problems even call for distributed systems in which we have to partition the workload among multiple computers within a network. This entails proper data communication between these systems; transfer latencies should be hidden as much as possible.

In this paper we focus on molecular dynamics (MD) simulations, which study the interactions among particles and how these interactions affect their motion. To achieve peak performance in these simulations, the implementation must use the data access pattern best suited to the target architecture.

We base our implementation, tinyMD, on AnyDSL, a partial evaluation framework for writing high-performance applications and libraries. We compare this implementation with miniMD, a parallel and scalable proxy application with GPU support; miniMD is written in C++ and is based on Kokkos [1]. Additionally, we couple tinyMD with the waLBerla [2], [3] multi-physics simulation framework. This allows us to exploit waLBerla's load-balancing mechanism [4] within tinyMD simulations. We discuss the advantages tinyMD provides in easing the coupling with different technologies.

We use the Lennard-Jones potential to calculate the forces between atoms in the experiments that compare against miniMD, whereas in the load-balancing experiments we rely on the spring-dashpot force model, which is common in discrete element method (DEM) simulations [5]. Our goal is to compare and discuss the differences in implementation and performance between both applications. We present experimental results for single-node performance on both CPU and GPU targets, as well as multi-node results on CPU nodes of the SuperMUC-NG supercomputer and on GPU-accelerated nodes of the Piz Daint supercomputer.
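For reference, the Lennard-Jones potential between two particles at distance r takes the standard 12-6 form, where ε is the depth of the potential well and σ is the distance at which the potential crosses zero:

```latex
V(r) = 4\varepsilon \left[ \left(\frac{\sigma}{r}\right)^{12}
                         - \left(\frac{\sigma}{r}\right)^{6} \right],
\qquad
F(r) = -\frac{dV}{dr}
     = \frac{24\varepsilon}{r} \left[ 2\left(\frac{\sigma}{r}\right)^{12}
                                     - \left(\frac{\sigma}{r}\right)^{6} \right]
```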

In summary, this paper makes the following contributions beyond our previous work [6]:

  • We present our new tinyMD distributed-memory parallel implementation based upon AnyDSL and discuss how it differs from miniMD, a typical C++ implementation based upon the Kokkos library to target GPU devices. For example, we use higher-order functions to build array abstractions. Thanks to AnyDSL's partial evaluator, these abstractions incur no overhead (see Section 4).

  • We demonstrate the flexibility of tinyMD's AnyDSL implementation and show how its communication code can be coupled with the waLBerla framework to use waLBerla's load-balancing feature in tinyMD simulations (see Section 5).

  • We show performance and scalability results for various CPU and GPU architectures including multi-CPU results on up to 2048 nodes of the SuperMUC-NG cluster (98 304 cores), and multi-GPU results on up to 1024 nodes of the Piz Daint cluster (see Section 6).
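The higher-order array abstraction mentioned in the first contribution can be sketched as follows. This is a hypothetical Rust analogue (Impala's syntax is Rust-inspired), not tinyMD's actual code; in Impala, AnyDSL's partial evaluator specializes such calls at compile time, while in Rust generics and inlining give a similar zero-overhead effect:

```rust
// Hypothetical Rust analogue of a tinyMD-style higher-order loop
// abstraction (illustrative only; the real code is written in Impala).
// The traversal strategy is hidden behind the function; callers only
// supply the per-particle body as a closure.
fn for_each_particle<F: FnMut(usize)>(n: usize, mut body: F) {
    for i in 0..n {
        body(i);
    }
}

fn main() {
    let positions = vec![0.0f64, 1.0, 2.0, 3.0];
    let mut sum = 0.0;
    // The closure is inlined after monomorphization, so this costs the
    // same as a hand-written loop.
    for_each_particle(positions.len(), |i| sum += positions[i]);
    println!("{}", sum); // prints "6"
}
```

Because the body is a generic parameter rather than a function pointer, the compiler can generate a specialized loop per call site, which mirrors what partial evaluation achieves in Impala.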

In order to make this paper as self-contained as possible, Section 3 provides necessary background for both AnyDSL and MD simulations after discussing related work in Section 2.

Section snippets

Related work

There is a broad effort to port MD simulations to different target architectures while delivering good performance and scalability. Most of the developed frameworks and applications follow the traditional approach of implementing the simulation code in a general-purpose language.

GROMACS [7], [8], [9] is a versatile MD package used primarily for dynamical simulations of bio-molecules. It is implemented in C/C++ and supports most commonly used CPUs and GPUs. The package was

AnyDSL

AnyDSL [24] is a compiler framework designed to speed up the development of domain-specific libraries. It consists of three major components: the frontend Impala, its intermediate representation (IR) Thorin [25], and a runtime system. The syntax of Impala is inspired by Rust and allows both imperative and functional programming.

The tinyMD library

In this section we introduce and discuss tinyMD.2 We focus on the main differences in writing portable code with AnyDSL as opposed to traditional C/C++ implementations. We explore the benefits of using higher-order functions to map code to different target devices and data layouts, and to implement flexible code for MPI communication. In the following, we use the term particle to refer to atoms.
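The data-layout point can be illustrated with a hypothetical sketch (in Rust rather than Impala, and not tinyMD's actual code): the same index-based accessor interface can be backed by either array-of-structures (AoS) or structure-of-arrays (SoA) storage, so a kernel is written once and the layout is chosen per target:

```rust
// Hypothetical sketch of the data-layout idea (not tinyMD's code).
struct Aos { data: Vec<[f64; 3]> }                    // array of structures
struct Soa { x: Vec<f64>, y: Vec<f64>, z: Vec<f64> }  // structure of arrays

// One accessor interface hides the storage layout from kernels.
trait Positions {
    fn get(&self, i: usize) -> [f64; 3];
}

impl Positions for Aos {
    fn get(&self, i: usize) -> [f64; 3] { self.data[i] }
}

impl Positions for Soa {
    fn get(&self, i: usize) -> [f64; 3] { [self.x[i], self.y[i], self.z[i]] }
}

// A kernel written once against the abstraction.
fn norm_sq<P: Positions>(p: &P, i: usize) -> f64 {
    let [x, y, z] = p.get(i);
    x * x + y * y + z * z
}

fn main() {
    let aos = Aos { data: vec![[1.0, 2.0, 2.0]] };
    let soa = Soa { x: vec![1.0], y: vec![2.0], z: vec![2.0] };
    // Both layouts yield the same result through the same kernel.
    assert_eq!(norm_sq(&aos, 0), norm_sq(&soa, 0));
    println!("ok");
}
```

Because the kernel is generic, each layout gets its own monomorphized copy and no dynamic dispatch remains at run time, which mirrors the effect of partial evaluation in Impala.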

Coupling tinyMD with waLBerla

In this section, we briefly present the fundamental concepts behind waLBerla needed to understand its load-balancing mechanism. Its most important characteristic is domain partitioning using a forest of octrees called a block forest. This kind of partitioning allows us to refine blocks in order to manage and distribute regions at a finer granularity. Furthermore, we explain how this block forest feature, written in C++, is integrated into our tinyMD Impala code.

waLBerla is a modern multi-physics

Evaluation

We evaluated tinyMD as well as miniMD on several CPU and GPU architectures. We chose the following CPUs:

  • Cascade Lake: Intel(R) Xeon(R) Gold 6246 CPU @ 3.30 GHz
  • Skylake: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40 GHz
  • Broadwell: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30 GHz

And the following GPUs:

  • Pascal: GeForce GTX 1080 (8 GB memory)
  • Turing: GeForce RTX 2080 Ti (11 GB memory)
  • Volta: Tesla V100-PCIe-32 GB (32 GB memory)

We ran each simulation for 100 time steps with a step size of 0.005. We performed particle distribution
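For context, one MD time step with the velocity-Verlet scheme, a standard MD integrator (shown here generically in Rust, not as tinyMD's exact code), looks like:

```rust
// Generic velocity-Verlet step for one particle in 1D, illustrating
// what "one time step with dt = 0.005" involves. `force` stands in for
// any force evaluation, e.g. a pairwise-accumulated Lennard-Jones force.
fn verlet_step(x: &mut f64, v: &mut f64, f: &mut f64, m: f64, dt: f64,
               force: impl Fn(f64) -> f64) {
    *v += 0.5 * dt * *f / m; // half-kick with the old force
    *x += dt * *v;           // drift to the new position
    *f = force(*x);          // recompute the force there
    *v += 0.5 * dt * *f / m; // half-kick with the new force
}

fn main() {
    // Harmonic oscillator (f = -x) as a stand-in force model.
    let (mut x, mut v, m, dt) = (1.0, 0.0, 1.0, 0.005);
    let mut f = -x;
    for _ in 0..100 {
        verlet_step(&mut x, &mut v, &mut f, m, dt, |x| -x);
    }
    // After 100 steps of dt = 0.005 (t = 0.5), x is close to cos(0.5).
    assert!((x - 0.5f64.cos()).abs() < 1e-4);
    println!("ok");
}
```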

Conclusion

This paper presents tinyMD: an efficient, portable, and scalable implementation of an MD application using the AnyDSL partial evaluation framework. To evaluate tinyMD, we compare it with miniMD, a C++ implementation that relies on the Kokkos library for portability to GPU accelerators. We discuss the implementation differences regarding code portability, data layout, and MPI communication.

To achieve performance-portability on most recent processors and supercomputers, we provide abstractions in

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work is supported by the Federal Ministry of Education and Research (BMBF) as part of the HP-DLF, MetaDL, Metacca, and ProThOS projects. We are grateful to the Leibniz-Rechenzentrum Garching for providing computational resources.


References (31)

  • Cundall, P.A., et al., A discrete numerical model for granular assemblies, Géotechnique (1979)

  • Schmitt, J., et al., Unified code generation for the parallel computation of pairwise interactions using partial evaluation

  • van der Spoel, D., et al., GROMACS: Fast, flexible, and free, J. Comput. Chem. (2005)

  • Páll, S., et al., Tackling exascale software challenges in molecular dynamics simulations with GROMACS

  • Li, M., et al., Scalable miniMD design with hybrid MPI and OpenSHMEM


    Rafael Ravedutti Lucio Machado obtained his bachelor and masters degrees in computer science at the Federal University of Parana (UFPR) in Curitiba, Brazil.

    He is currently a research assistant at the Chair for System Simulation at the University of Erlangen–Nuremberg in Germany, where he works with performance modeling, programming models and code generation tools for particle simulations, with focus on molecular dynamics.

    His research interests include performance analysis, low-level code optimization and compiler techniques to generate efficient, scalable and portable simulation applications to execute on heterogeneous parallel hardware and HPC clusters.

    Jonas Schmitt received his Bachelor’s and Master’s degree in computer science at Friedrich-Alexander University Erlangen-Nürnberg in 2015 and 2017, respectively. He is currently a Ph.D. student at the Department of Computer Science at Friedrich-Alexander University Erlangen-Nürnberg.

    His primary interest is the application of evolutionary computation and machine learning to the automated design and optimization of numerical methods for the solution of sparse linear systems.

    Sebastian Eibl (M.Sc.) is a research assistant at the chair for system simulation. He studied physics at the Friedrich-Alexander University Erlangen-Nürnberg and received his master’s degree in 2015. Currently he is doing his Ph.D. in highly parallel rigid body dynamics simulations. The focus of his work lies in the development of scalable algorithms, software design, granular matter dynamics and collision models for rigid body dynamics simulations.

    Jan Eitzinger [formerly Treibig] studied chemical engineering at the University of Erlangen–Nuremberg and holds a Ph.D. in Computer Science from the University of Erlangen–Nuremberg. He is the head of Software & Tools at the Erlangen National High Performance Computing Center (NHR@FAU). Apart from software and tool development he is also interested in architecture-specific and low-level optimization for current processor architectures, and performance modeling on processor and system level. He is the creator of LIKWID, a collection of lightweight performance tools, and contributed the foundations of the ECM model. Jan Eitzinger is also active in teaching and training.

    Roland Leißa is a postdoctoral researcher at the Compiler Design Lab, Saarland University. After receiving his M.Sc. (Dipl.-Inf.) in Computer Science from the University of Münster in 2010, he joined this lab to research programming models and compiler support for various forms of parallelization.

    He is particularly interested in domain-specific languages, the design of compiler intermediate representations, and SIMD vectorization. During his Ph.D., which he completed in 2018, he developed the AnyDSL framework, which he maintains to this day.

    Sebastian Hack is a professor of computer science at Saarland University. His work focuses on compiler construction, especially code generation, automatic vectorization and parallelization. Before, he was an assistant professor at Saarland University, a Post-Doc at EPFL, Switzerland in the LAMP lab and a Post-Doc at ENS Lyon, France in the COMPSYS project. He received his Ph.D. in 2006 from Karlsruhe University, Germany and his Diploma degree also from Karlsruhe University in 2004. From 2012 to 2014 he served as the dean of study affairs, and from 2018 to 2020 as the dean of the department of mathematics and computer science of Saarland University.

    Arsène Pérard-Gayot is a Post-Doctoral Researcher at the Computer Graphics Lab of Saarland University.

    He obtained his Master’s degree in 2014 from the French ENSIMAG Grande École, and did an internship at Dassault Systèmes on point cloud rendering. In 2020, he defended his Ph.D. on the topic of generating renderers using partial evaluation, under the supervision of Prof. Dr-Ing. Philipp Slusallek. His research interests include compilers, rendering, and high-performance computing.

    Richard Membarth is a professor for system on a chip and AI for edge computing at the Technische Hochschule Ingolstadt, Germany. He holds a diploma degree and a Ph.D. in Computer Science from the Friedrich-Alexander University Erlangen-Nürnberg, Germany, as well as a postgraduate diploma in Computer and Information Sciences from the Auckland University of Technology, New Zealand.

    His research interests include parallel computer architectures and programming models with a focus on automatic code generation for a variety of architectures ranging from embedded systems to HPC installations for applications from image processing, computer graphics, scientific computing, and deep learning.

    Prof. Dr. Harald Köstler received his Ph.D. in computer science in 2008 on variational models and parallel multigrid methods in medical image processing. In 2014 he finished his habilitation on Efficient Numerical Algorithms and Software Engineering for High Performance Computing. Currently, he works at the Chair for System Simulation at the University of Erlangen–Nuremberg in Germany.

    His research interests include software engineering concepts especially using code generation for simulation software on HPC clusters, multigrid methods, and programming techniques for parallel hardware, especially GPUs. The application areas are computational fluid dynamics, rigid body dynamics, and medical imaging.
