DNS of hydrodynamically interacting droplets in turbulent clouds: Parallel implementation and scalability analysis using 2D domain decomposition

https://doi.org/10.1016/j.cpc.2014.09.005

Abstract

The study of turbulent collision of cloud droplets requires simultaneous consideration of transport by the background air turbulence (i.e., the geometric collision rate) and the influence of droplet disturbance flows (i.e., the collision efficiency). In recent years, this multiscale problem has been addressed through a hybrid direct numerical simulation (HDNS) approach (Ayala et al., 2007). This approach, while currently the only viable tool to quantify the effects of air turbulence on collision statistics, is computationally expensive. In order to extend the HDNS approach to higher flow Reynolds numbers, here we developed a highly scalable implementation of the approach using 2D domain decomposition. The scalability of the parallel implementation was studied on several parallel computers, at 512³ and 1024³ grid resolutions with O(10⁶) to O(10⁷) droplets. It was found that the execution time scaled almost linearly with the number of processors until performance saturated and then deteriorated due to communication latency.

To better understand the scalability, we developed a complexity analysis by partitioning the execution tasks into computation, communication, and data copy. Using this complexity analysis, we were able to predict the scalability performance of our parallel code. Furthermore, the theory was used to estimate the maximum number of processors below which the approximately linear scalability is sustained. We showed theoretically that problems of up to 8192³ grid points could be solved efficiently with O(100,000) processors. The complexity analysis revealed that the pseudo-spectral simulation of the background turbulent flow for a dilute droplet suspension typical of cloud conditions takes about 80% of the total execution time, except when the droplets are small (less than 5 μm in radius, in a flow with an energy dissipation rate of 400 cm²/s³ and a liquid water content of 1 g/m³), in which case the droplet–droplet hydrodynamic interactions become the bottleneck. The complexity analysis was also used to explore alternative methods for handling the FFT calculations within the flow simulation and for advancing droplets less than 5 μm in radius, for better computational efficiency. Finally, preliminary results are reported to shed light on the Reynolds-number dependence of the collision kernel of non-interacting droplets.

Introduction

Dispersed flows consist of a carrier fluid and at least one discrete phase such as droplets, bubbles, or particles. They are present in many industrial processes and environmental settings, such as pneumatic conveying systems, coal combustion, sediment transport by wind and water waves, and clouds with precipitation via rain and snow. Despite this industrial and environmental relevance, fundamental understanding of the mechanisms governing dispersed flows is incomplete due to the host of length and time scales associated with the dispersed phase and its interactions with the carrier phase. Typically, the number of variables characterizing a dispersed flow is considerably larger than that of the corresponding single-phase flow. In certain applications, such as the turbulent collision of dispersed particles, direct experimental measurements of both the carrier phase and the dispersed phase encompassing all relevant scales are often difficult.

In recent years, due to advances in both computing power and algorithms, numerical simulations of turbulent dispersed flows based on first principles have been established as a third pillar of scientific discovery, in addition to theory and experimentation. In the so-called point-particle direct numerical simulations, the carrier phase is solved through a set of continuum differential equations while each particle in the dispersed phase is tracked through a modeled equation of motion. This Eulerian–Lagrangian approach has now been advanced to the point that various particle–fluid, particle–particle, and particle–wall interactions can be included (e.g., Elghobashi [1], [2], Sommerfeld [3], Wang et al. [4]). This flexibility makes the Eulerian–Lagrangian approach a major research tool for studying turbulent dispersed flows. However, the approach is computationally expensive when the number of discrete particles is large and most scales of the carrier-phase turbulent flow must be resolved.

In order to realize the full potential of the Eulerian–Lagrangian approach for dispersed flows, it is necessary to employ highly scalable computation using a large number of processors (e.g., 10,000 to 100,000) typically available on the current petascale supercomputers. The distributed-memory, many-core architecture on these computers calls for parallel implementation based on the divide-and-conquer approach so simulations of large problem sizes can be performed efficiently.

The Eulerian flow simulation of the carrier phase can be efficiently scaled using a domain decomposition (dd) approach, where the domain is explicitly decomposed and mapped onto the processors in such a way that load balancing is maintained and communication between processors is minimized and well structured. Depending on whether the boundaries of the domain are moving or not, the decomposition can be dynamic or static, i.e., the domain covered by each processor can change with time or remain fixed. Alfonsi [5] surveyed a compendium of parallel implementations of direct numerical simulations of turbulent flows, all based on dd, using a variety of flow solvers including the pseudo-spectral method [6], [7], [8], [9], [10], [11], [12], the lattice Boltzmann method [13], [14], the finite element method [15], and the finite difference method [16], [17], [18], [19]. Owing to different machine architectures in terms of memory (distributed memory, shared memory, or a combination), the parallel implementation techniques range from OpenMP to MPI to a hybrid of the two.

There are alternative strategies for parallel implementation of the dispersed phase, and the optimal approach may depend on the nature of coupling interactions considered (i.e., one-, two-, or four-way interactions)  [20], [21]. Parallel implementation strategies in a multi-core machine  [22], [23] can be categorized into: (1) domain decomposition, (2) particle decomposition, (3) particle-interaction decomposition, and (4) replicated data method. In the domain decomposition (dd) method, the physical space is partitioned and each processor handles particles spatially located in a subdomain assigned to the processor  [24], [25], [26]. The advantage of dd is that it is fairly simple to implement; however, the disadvantage is that a good load balance could be hard to achieve if particles are not evenly distributed in space. The communication of the particle data takes place between neighboring processors only. Onishi et al.  [26] allowed two processors to share a halo region where the same particles were tracked. This reduces the communication need but increases computation.
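
As a minimal illustration of the domain decomposition idea for the dispersed phase, the sketch below assigns each particle to the rank that owns the subdomain containing it. It is not the authors' implementation: the processor-grid shape (py, pz), the box size, and the function name owner_rank are illustrative assumptions, chosen only to show how ownership follows from the particle coordinates.

```python
# Minimal sketch (not the authors' code): map particles to ranks of a
# hypothetical py x pz processor grid over a periodic cube of side `box`.
# The decomposition is in y and z only, as in a pencil (2D) decomposition.
import numpy as np

def owner_rank(pos, box=2.0 * np.pi, py=4, pz=4):
    """Return the owning rank for each particle position in pos (N, 3)."""
    iy = np.floor((pos[:, 1] % box) / (box / py)).astype(int)  # subdomain index in y
    iz = np.floor((pos[:, 2] % box) / (box / pz)).astype(int)  # subdomain index in z
    return iy * pz + iz                                        # row-major rank id

# Example: ten random droplets in a (2*pi)^3 periodic box
rng = np.random.default_rng(0)
particles = rng.uniform(0.0, 2.0 * np.pi, size=(10, 3))
print(owner_rank(particles))
```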

In the particle decomposition, the particles are evenly distributed among groups and each group is assigned to a processor [22]. A good load balance is achieved relatively easily, but each processor has to store the entire flow field, or else an additional communication step is needed to make the flow field data available to a given processor.

The particle-interaction decomposition is a scheme used when particles interact among themselves, in addition to the fluid–particle interactions, such as in N-body problems [27]. In this case, the pair interactions within the N-body problem are divided among all the processors. This is usually done for relatively small systems.

Lastly, in the replicated data method, particle coordinates (or other particle information) are replicated on each processor, usually through a broadcast operation. Its use is usually combined with the other schemes [20], [14]. The obvious drawbacks of this method are poor scalability and the memory limitation it may impose.

In recent years, considerable attention has been paid to parallelization strategies for dispersed flow computations using the Eulerian–Lagrangian approach, where the carrier flow is solved by the Reynolds-averaged Navier–Stokes equations [17], [18], [28], [29], large-eddy simulation [20], [30], or direct numerical simulation [26]. To the best of our knowledge, none of these efforts was intended for large parallel computers. In addition, there has been no systematic complexity analysis of parallel codes for dispersed flow; the few available scalability studies focused primarily on the flow simulation.

Our study is motivated by an important problem in warm-rain cloud microphysics, namely, the turbulent collision–coalescence of cloud droplets [31], [32], [33], [34], [35]. Here the dispersed phase is made of small water droplets. It is well established that small-scale turbulent eddies down to the Kolmogorov scale affect collision statistics while large-scale turbulent eddies mix and disperse the droplets [35]. The study of turbulent collision of cloud droplets therefore requires direct numerical simulation of the carrier flow in which all scales of the undisturbed fluid motion are numerically resolved.

Under the working hypothesis that the small-scale fluid motion dominates the pair collision statistics, the Eulerian–Lagrangian approach with a simulated DNS flow field was first applied to study the geometric collision of cloud droplets [36], [37], [38]. In order to study the effect of turbulence on the collision efficiency, droplet–droplet hydrodynamic interactions must also be considered. By embedding Stokes disturbance flows due to the droplets, Wang et al. [39] and Ayala et al. [40] introduced the hybrid DNS (HDNS) scheme. Since the droplet volume fraction in clouds is small, we assume one-way interaction between the dispersed phase and the background turbulent flow. The disturbance flows due to the presence of droplets are treated separately, only for the purpose of incorporating droplet–droplet hydrodynamic interactions at sub-Kolmogorov scales of the background turbulent flow.

A key aspect of the HDNS approach is the determination of the disturbance flow velocity at the location of each particle, which requires an iterative algorithm to solve a large linear system at each time step. This adds yet another level of computational complexity. The initial implementation [40] was carried out with loop-level parallelization using OpenMP, which is limited to processors within a single computational node. To extend the HDNS to a larger domain (or a larger flow Reynolds number), Rosa and Wang [41] considered an MPI implementation using dd in one spatial dimension (1D). Their MPI implementation did not include droplet hydrodynamic interactions.
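
To illustrate why an iterative solution is needed, the following is a minimal serial sketch of a Jacobi-type fixed-point iteration for the disturbance velocities, written under strong simplifying assumptions (no periodic images, no truncation radius, no parallel data exchange, and purely illustrative function names and inputs). It is only a schematic of the coupling, not the algorithm of [40]: droplet j induces a Stokes disturbance driven by its velocity relative to the composite local flow, which itself contains the unknown disturbances, so the system is implicit.

```python
# Schematic (not the HDNS production algorithm): fixed-point iteration for the
# disturbance velocity u' felt by each droplet from superposed Stokes flows.
import numpy as np

def stokes_disturbance(a, v_rel, r_vec):
    """Velocity induced at separation r_vec by a sphere of radius a
    translating with velocity v_rel relative to the local fluid (Stokes flow)."""
    r = np.linalg.norm(r_vec)
    n = r_vec / r
    vn = np.dot(v_rel, n)
    return (3.0 * a / (4.0 * r)) * (v_rel + vn * n) \
         + (a**3 / (4.0 * r**3)) * (v_rel - 3.0 * vn * n)

def disturbance_velocities(pos, vel, u_fluid, radii, tol=1e-8, max_iter=100):
    """Jacobi-type iteration; O(N^2) per sweep, un-optimized on purpose."""
    u_pert = np.zeros_like(vel)
    for _ in range(max_iter):
        u_new = np.zeros_like(vel)
        for k in range(len(radii)):
            for j in range(len(radii)):
                if j == k:
                    continue
                # droplet j moves relative to the composite flow it sees
                v_rel = vel[j] - (u_fluid[j] + u_pert[j])
                u_new[k] += stokes_disturbance(radii[j], v_rel, pos[k] - pos[j])
        if np.max(np.abs(u_new - u_pert)) < tol:
            return u_new
        u_pert = u_new
    return u_pert
```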

In this paper, we report an MPI implementation of the HDNS code based on dd in two spatial dimensions (2D) for dilute droplet suspensions typical of cloud conditions. The first objective is to discuss the implementation details that lead to a highly scalable HDNS code for turbulent collision of hydrodynamically interacting cloud droplets. The second objective is to develop an extensive complexity analysis to predict the execution time. We must stress that the basic concepts invoked in our complexity analysis are not necessarily new, but previous related studies on parallel implementation did not provide such an analysis in a quantitative manner. In addition, a detailed complexity analysis of such a complex computational problem is not a trivial task. Our systematic complexity analysis therefore fills a gap in the scalable computation of turbulent dispersed flows.

The paper is organized as follows. In Section 2, we review the HDNS approach. Section 3 describes in detail the parallel implementation of the approach using a 2D dd strategy. We will also develop a complexity analysis to understand the parallel performance of the code. Section 4 contains evaluation and testing results of the implementation, using several state-of-the-art supercomputers (Stampede, Yellowstone, and Kraken). Subsequently, after the complexity analysis is validated, it is applied to explore alternative treatments for some components of the code. To demonstrate the capability of the code, we discuss in Section 5 the Reynolds number dependence of the collision kernel. Finally, conclusions are summarized in Section 6.

Section snippets

Model formulation and numerical solution method

We only provide a brief description of the HDNS approach here as the basic ideas and algorithms have been presented in [40]. We consider a dilute suspension of droplets in a background turbulent air flow U(x,t). Under one-way coupling, the background flow is solved by a pseudo-spectral method in a periodic domain, using the Navier–Stokes equation

$$\frac{\partial \mathbf{U}}{\partial t} = \mathbf{U} \times \boldsymbol{\omega} - \nabla\!\left(\frac{P}{\rho} + \frac{1}{2}|\mathbf{U}|^2\right) + \nu \nabla^2 \mathbf{U} + \mathbf{f}(\mathbf{x},t),$$

together with the continuity equation $\nabla \cdot \mathbf{U} = 0$.

Here $\boldsymbol{\omega} \equiv \nabla \times \mathbf{U}$ is the vorticity vector, P is the pressure, ρ is the
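
To make the structure of the flow solver concrete, the following is a minimal serial sketch of one explicit Euler step of the rotational-form pseudo-spectral method. It is not the authors' production MPI code: forcing, de-aliasing, and the actual time-integration scheme of [40] are omitted, and the function names and argument layout are assumptions made only for illustration.

```python
# Minimal serial sketch of a rotational-form pseudo-spectral step on an n^3
# periodic box of side 2*pi (illustrative only; no forcing, no de-aliasing).
import numpy as np

def wavenumbers(n):
    """Integer wavenumber arrays for the real-to-complex transform layout."""
    k = np.fft.fftfreq(n, 1.0 / n)      # 0, 1, ..., n/2-1, -n/2, ..., -1
    kz = np.fft.rfftfreq(n, 1.0 / n)    # 0, 1, ..., n/2 (half-complex axis)
    kk = np.array(np.meshgrid(k, k, kz, indexing="ij"))
    return kk, np.sum(kk * kk, axis=0)

def spectral_euler_step(u_hat, nu, dt, kk, k2):
    """Advance the Fourier coefficients u_hat (3, n, n, n//2+1) by one step."""
    # vorticity in spectral space: omega_hat = i k x u_hat
    w_hat = 1j * np.cross(kk, u_hat, axis=0)
    # nonlinear term U x omega evaluated in physical space (pseudo-spectral part)
    u = np.fft.irfftn(u_hat, axes=(1, 2, 3))
    w = np.fft.irfftn(w_hat, axes=(1, 2, 3))
    nl_hat = np.fft.rfftn(np.cross(u, w, axis=0), axes=(1, 2, 3))
    # project onto divergence-free modes; this removes the gradient term
    nl_hat -= kk * (np.sum(kk * nl_hat, axis=0) / np.where(k2 == 0, 1.0, k2))
    # explicit Euler step with the viscous term also treated explicitly
    return u_hat + dt * (nl_hat - nu * k2 * u_hat)
```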

Parallel implementation using 2D dd and a complexity analysis

Our MPI implementation is based on a 2D dd previously adopted for the efficient parallel implementation of the Fast Fourier Transform (FFT) in pseudo-spectral simulations of fluid turbulence by Ayala and Wang [44]. In their work, they argued that the performance of a 3D dd tends to saturate and deteriorate with excessive communication, leaving 2D dd as the best option for parallelizing the FFT on a large number of processors, since 1D dd is limited to a few processors (the number of processors has to be less
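
As an illustration of the data movement implied by a 2D (pencil) decomposition, the serial sketch below mimics the transpose-based 3D FFT: 1D FFTs are always taken along the locally complete direction, and the two global transposes stand in for the all-to-all exchanges within the rows and columns of a hypothetical Py × Pz process grid. This is an assumption-laden serial analogue, not the parallel FFT of [44].

```python
# Serial analogue of a pencil-decomposed 3D FFT: the swapaxes calls play the
# role of the row-wise and column-wise all-to-all transposes of the 2D dd.
import numpy as np

def fft3d_pencil_style(u):
    u_hat = np.fft.rfft(u, axis=0)        # x-pencils: real-to-complex FFT in x
    u_hat = np.swapaxes(u_hat, 0, 1)      # "row transpose": switch to y-pencils
    u_hat = np.fft.fft(u_hat, axis=0)     # 1D FFT in y
    u_hat = np.swapaxes(u_hat, 0, 2)      # "column transpose": switch to z-pencils
    u_hat = np.fft.fft(u_hat, axis=0)     # 1D FFT in z
    # restore the (kx, ky, kz) axis ordering of the spectral array
    return np.swapaxes(np.swapaxes(u_hat, 0, 2), 0, 1)

# quick check against numpy's multi-axis transform (x is the half-complex axis)
u = np.random.default_rng(1).standard_normal((8, 8, 8))
assert np.allclose(fft3d_pencil_style(u), np.fft.rfftn(u, axes=(2, 1, 0)))
```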

Scalability

The resulting MPI code was run on Stampede at the Texas Advanced Computing Center (TACC), a Dell Linux cluster based on 6,400+ Dell PowerEdge server nodes with 16 cores per node (http://www.tacc.utexas.edu/resources/hpc/stampede). Each core runs at 2.7 GHz and supports 8 floating-point operations per clock cycle, giving a peak performance of 21.6 GFLOPS per core. Each node also has an Intel Xeon Phi coprocessor (MIC architecture), which was not used for performance testing since our code is
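
The quoted per-core peak follows directly from the clock rate and the stated number of floating-point operations per cycle:

2.7 GHz × 8 FLOP/cycle = 21.6 GFLOP/s per core.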

Application to turbulent collision of cloud droplets

The 2D dd implementation offers a unique opportunity to study the sensitivity of the collision statistics of inertial particles to the range of scales resolved in DNS and, equivalently, to the flow Reynolds number. Using petascale computers, we were able to increase the scale separation in the flow and obtain collision statistics at a higher Taylor-microscale Reynolds number (Rλ ≈ 400). Simulations at even higher Rλ will be achievable with the use of larger grid resolutions and sufficiently large
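
For context, the collision kernel discussed here is commonly expressed in this literature through the kinematic formulation below (quoted as background, not as a result of this section), where R = a₁ + a₂ is the geometric collision radius, ⟨|w_r(R)|⟩ the mean radial relative velocity at contact, and g₁₂(R) the radial distribution function at contact:

$$\Gamma_{12} = 2\pi R^2 \,\langle |w_r(R)| \rangle \, g_{12}(R).$$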

Summary and conclusions

In this paper, we have presented the details of a massively parallel implementation of turbulent particle-laden flow computation. The implementation is carried out in the context of cloud microphysics but could be extended to other applications depending on fluid and particle properties. We simulate the flow using a pseudo-spectral DNS tool and incorporate droplets that are affected by Stokes drag, gravity, and hydrodynamic interactions. The hydrodynamic interactions have been

Acknowledgments

This work was made possible by support from the National Science Foundation through grants OCI-0904534, ATM-0730766, and CRI-0958512 and by the National Center for Atmospheric Research (NCAR). NCAR is sponsored by the National Science Foundation. Computing resources in the US were provided by: (1) National Center for Atmospheric Research through CISL-35751010, CISL-35751014 and CISL-35751015, and (2) Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National

References (53)

  • L.P. Wang et al., Int. J. Multiphase Flow (2009)
  • R.B. Pelz, J. Comput. Phys. (1991)
  • P. Mininni et al., Parallel Comput. (2011)
  • K. Itakura et al., Parallel Comput. (2004)
  • V. Heuveline et al., Comput. Math. Appl. (2009)
  • H. Gao et al., Comput. Math. Appl. (2013)
  • K. Nakajima, Appl. Numer. Math. (2005)
  • A. Gorobets et al., Comput. & Fluids (2011)
  • H. Enwald et al., Chem. Eng. Sci. (1999)
  • H. Lindborg et al., Comput. Chem. Eng. (2004)
  • R. Onishi et al., J. Comput. Phys. (2011)
  • D. Darmana et al., J. Comput. Phys. (2006)
  • B.P.B. Hoomans et al., Chem. Eng. Sci. (1996)
  • W.M. Charles et al., J. Parallel Distrib. Comput. (2008)
  • M. Winkel et al., Comput. Phys. Comm. (2012)
  • R. Capuzzo-Dolcetta et al., J. Comput. Phys. (2013)
  • P. Pepiot et al., Powder Technol. (2012)
  • B. Nkonga et al., Parallel Comput. (2002)
  • L.P. Wang et al., Atmos. Sci. Lett. (2009)
  • O. Ayala et al., J. Comput. Phys. (2007)
  • O. Ayala et al., Parallel Comput. (2013)
  • S. Balachandar et al., J. Comput. Phys. (1989)
  • S. Elghobashi, Appl. Sci. Res. (1991)
  • S. Elghobashi, Appl. Sci. Res. (1994)
  • M. Sommerfeld
  • G. Alfonsi, Appl. Mech. Rev. (2011)