ABSTRACT
Molecular dynamics (MD) simulations are computationally expensive and, hence, very time-consuming. This holds in particular for molecular-continuum simulations in fluid dynamics, which rely on ensembles of MD simulations coupled to computational fluid dynamics (CFD) solvers. Massively parallel implementations of the MD simulations and the respective ensembles are therefore of utmost importance.
However, the more processors a molecular-continuum simulation uses, the higher the probability that a software- or hardware-induced failure or malfunction on one of them crashes the entire simulation. To avoid long re-computation times, a fault tolerance mechanism is required, especially for such simulations carried out at exascale.
In this paper, we introduce a fault tolerance method for molecular-continuum simulations implemented in the macro-micro-coupling tool (MaMiCo), an open-source coupling tool for such multiscale simulations that allows the re-use of one’s favorite MD and CFD solvers. The method builds on a dynamic ensemble handling approach that has previously been used to estimate statistical errors due to thermal fluctuations in the MD ensemble. The dynamic ensemble is always homogeneously distributed and, thus, balanced across the computational resources to minimize the overall induced overhead. The method further relies on an MPI implementation with fault tolerance support. We report scalability results with and without modeled system failures on three TOP500 supercomputers—Fugaku/RIKEN with ARM technology, Hawk/HLRS with AMD EPYC technology, and HSUper/Helmut Schmidt University with Intel Ice Lake processors—to demonstrate the feasibility of our approach.
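The homogeneous redistribution of the dynamic ensemble can be illustrated with a minimal sketch. This is not MaMiCo’s actual implementation; the function `redistribute` and the round-robin mapping are assumptions chosen for illustration. The idea is that after a rank failure, the surviving ranks re-partition the MD instances so that every rank holds an (almost) equal share, keeping the ensemble balanced:

```python
def redistribute(num_instances, alive_ranks):
    """Map MD instance ids homogeneously onto the surviving ranks.

    Illustrative round-robin assignment: instance i goes to the
    (i mod k)-th alive rank, so instance counts per rank differ
    by at most one.
    """
    assignment = {rank: [] for rank in alive_ranks}
    for i in range(num_instances):
        rank = alive_ranks[i % len(alive_ranks)]
        assignment[rank].append(i)
    return assignment

# Example: 12 MD instances on 4 ranks; rank 2 is then detected as failed
# (e.g. via a fault-tolerant MPI mechanism) and removed from the mapping.
before = redistribute(12, [0, 1, 2, 3])  # 3 instances per rank
after = redistribute(12, [0, 1, 3])      # 4 instances per surviving rank
```

In a real fault-tolerant MPI setting, the list of surviving ranks would be obtained from the runtime after shrinking the failed communicator; the sketch only captures the balancing invariant.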
Index Terms
- Fault Tolerance for Ensemble-based Molecular-Continuum Flow Simulations