
A parallel lattice Boltzmann method for large eddy simulation on multiple GPUs


Abstract

To improve the efficiency of simulating turbulent fluid flows at high Reynolds numbers with large eddy dynamics, a CUDA-based lattice Boltzmann method for large eddy simulation (LES) on multiple graphics processing units (GPUs) is proposed. Our solution adopts the "collision after propagation" lattice evolution scheme and performs the misaligned propagation phase during the global memory read process. The latest GPU platform allows a single CPU thread to control up to four GPUs running in parallel. To make use of multiple GPUs, the whole working set is evenly partitioned into sub-domains. We implement the Smagorinsky model and the Vreman model to verify our multi-GPU solution. The two LES models compute the relaxation time differently and therefore lead to different CUDA implementation characteristics. The implementation based on the Smagorinsky model achieves a 190-fold speedup over the sequential CPU implementation, while the implementation based on the Vreman model achieves more than a 90-fold speedup. The experimental results show that the parallel performance of our multi-GPU solution scales very well across multiple GPUs. Large-scale (up to 10,240 × 10,240 lattices) LES–LBM simulation therefore becomes possible at low cost, even with double-precision floating-point arithmetic.
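The abstract's multi-GPU strategy rests on one host thread selecting each device in turn, allocating its sub-domain, and launching kernels on it. The following is a minimal, self-contained sketch of that pattern, not the paper's code: the kernel stepKernel, the array names, and the problem size are illustrative assumptions.

#include <cuda_runtime.h>

// Illustrative stand-in for the per-sub-domain LBM update (not the paper's kernel).
__global__ void stepKernel(double* f, int nx, int subNy)
{
    int k = nx*blockIdx.y + blockIdx.x*blockDim.x + threadIdx.x;
    if (k < nx*subNy)
        f[k] *= 1.0;  // placeholder update
}

int main()
{
    const int nx = 1024, ny = 1024;     // assumed lattice size
    int gpuCount = 0;
    cudaGetDeviceCount(&gpuCount);
    if (gpuCount < 1) return 0;
    if (gpuCount > 4) gpuCount = 4;     // the paper drives up to four GPUs
    const int subNy = ny / gpuCount;    // even 1D decomposition into sub-domains

    double* f[4] = {nullptr, nullptr, nullptr, nullptr};
    for (int g = 0; g < gpuCount; ++g) {    // a single host thread owns every device
        cudaSetDevice(g);
        cudaMalloc(&f[g], sizeof(double)*nx*subNy);
    }

    dim3 block(256, 1);
    dim3 grid(nx/256, subNy);
    for (int g = 0; g < gpuCount; ++g) {
        cudaSetDevice(g);                             // launches are asynchronous,
        stepKernel<<<grid, block>>>(f[g], nx, subNy); // so the sub-domains advance in parallel
    }
    for (int g = 0; g < gpuCount; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
        cudaFree(f[g]);
    }
    return 0;
}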





Acknowledgments

This project was supported by the Aeronautical Science Fund of China (Grant No. 20111453012) and the National Defense Pre-Research Foundation of China (Grant No. 9140A13040111HK0329).

Author information


Corresponding author

Correspondence to Chengwen Zhong.

Appendices

Appendix A: Table 3

Table 3 Notation definitions

Appendix B: Kernel function LBCollProp (Smagorinsky model)

__global__ static void LBCollProp( int nx, int* geoD, double tauf, double prefix,
                                   double* fr0, ..., double* fse0,
                                   double* fr1, ..., double* fse1 )
{
  // collision and propagation of the fluid nodes
  // registers for kernel processing
  double ux, uy, uv, rho, tau;
  double fr, fe, fn, fw, fs, fne, fnw, fsw, fse; // f*
  double f1, f2, f3, f4, f5, f6, f7, f8;         // feq

  // index k into the 1D arrays
  int k = nx*blockIdx.y + blockIdx.x*blockDim.x + threadIdx.x;

  if(geoD[k] == FLUID) // inner fluid nodes to update
  {
    // "propagation" is performed by the global memory read transactions.
    // Note: some global memory read transactions remain misaligned.
    fr = fr0[k];
    fe = fe0[k-1];
    fn = fn0[k-nx];
    ...

    // macroscopic properties: velocity (ux, uy) and density (rho)
    rho = fr + fe + fn + fw + fs + fne + fnw + fsw + fse;
    uv = 1/rho;
    ux = (fe + fne + fse - fw - fnw - fsw);
    ux *= uv;
    uy = (fn + fne + fnw - fs - fsw - fse);
    uy *= uv;

    // equilibrium distributions feq after propagation
    tau = rho/9.0;
    uv = 1.0 - 1.5*(ux*ux + uy*uy);
    f1 = tau*( 3.0*ux + 4.5*ux*ux + uv);
    ...
    tau = tau*0.25;
    f5 = tau*( 3.0*(ux+uy) + 4.5*(ux+uy)*(ux+uy) + uv);
    ...
    uv = 16*tau*uv; // i.e. f0

    // single relaxation time according to the Smagorinsky model;
    // only local quantities (the nonequilibrium stress tensor) are needed.
    ux  = (fe-f1) + (fne-f5) + (fse-f8) + (fnw-f6) + (fw-f3) + (fsw-f7);
    uy  = (fne-f5) + (fn-f2) + (fnw-f6) + (fsw-f7) + (fs-f4) + (fse-f8);
    tau = (fne-f5) + (fsw-f7) + (f6-fnw) + (f8-fse);
    tau = sqrt(2*(ux*ux + uy*uy + 2*tau*tau))*prefix;
    tau = (sqrt(tauf*tauf + tau/rho) + tauf)*0.5;

    // LBM iteration: f = f* - (f* - feq)/tau  ->  f = (1 - 1/tau)*f* + feq/tau
    // write data back to device memory with coalesced global memory accesses
    tau = 1/tau;
    ux  = 1 - tau;
    fr1[k] = ux*fr + tau*uv;
    fe1[k] = ux*fe + tau*f1;
    fn1[k] = ux*fn + tau*f2;
    ...
  }
}
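For reference, the relaxation-time update at the end of LBCollProp is the usual Smagorinsky subgrid closure for the LBM, evaluated from the nonequilibrium momentum flux. Written out (our reading of the kernel; identifying prefix with 18 C_s^2 Δ^2 is an assumption, not stated in the excerpt):

\[
Q_{ij} = \sum_{\alpha} e_{\alpha i}\, e_{\alpha j}\left(f_\alpha^{*} - f_\alpha^{eq}\right),
\qquad
|Q| = \sqrt{2\,Q_{ij}Q_{ij}},
\]
\[
\tau_{\text{total}}
  = \frac{1}{2}\left(\tau_0 + \sqrt{\tau_0^{2} + \frac{\mathit{prefix}\cdot|Q|}{\rho}}\right),
\qquad \mathit{prefix} \approx 18\,C_s^{2}\,\Delta^{2},
\]

which matches tau = (sqrt(tauf*tauf + tau/rho) + tauf)*0.5 in the kernel, with ux, uy, tau holding Q_xx, Q_yy, Q_xy just before that line.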

Appendix C: Kernel functions LBProp and LBColl (Vreman model)

__global__ static void LBProp( const int nx, const int* geoD,
                               double* dev_ux, double* dev_uy, double* dev_rho,
                               double* fr0, ..., double* fse0,
                               double* fr1, ..., double* fse1 )
{
  // this kernel is responsible for propagation of the fluid nodes
  double ux, uy, rho, tmp;
  double fr, fe, fn, fw, fs, fne, fnw, fsw, fse; // f*

  // index k into the 1D arrays
  int k = nx*blockIdx.y + blockIdx.x*blockDim.x + threadIdx.x;

  if(geoD[k] == FLUID) // inner fluid nodes to update
  {
    // "propagation" is performed by the global memory read transactions.
    // Note: some global memory read transactions remain misaligned.
    fr = fr0[k];
    fe = fe0[k-1];
    fn = fn0[k-nx];
    ...

    // compute and store all updated macroscopic properties to global memory
    // for the sake of synchronization.
    rho = fr + fe + fn + fw + fs + fne + fnw + fsw + fse;
    dev_rho[k] = rho;
    tmp = 1/rho;
    ux = (fe + fne + fse - fw - fnw - fsw);
    dev_ux[k] = ux*tmp;
    uy = (fn + fne + fnw - fs - fsw - fse);
    dev_uy[k] = uy*tmp;

    fr1[k] = fr;
    fe1[k] = fe;
    fn1[k] = fn;
    ...
  }
}
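The Vreman model needs each node's neighbours' updated velocities to form the velocity gradient tensor, so the per-step update must be split across a grid-wide barrier; in CUDA that barrier is the kernel boundary itself. A minimal host-side sketch of the resulting two-launch sequence follows; the grid/block names and the elided argument lists are placeholders, not the paper's launcher code.

// one LES-LBM time step with the Vreman model (sketch)
LBProp<<<dimGrid, dimBlock>>>(nx, geoD, dev_ux, dev_uy, dev_rho,
                              fr0, /* ... */ fse0, fr1, /* ... */ fse1);
// Launches on the same stream execute in order, so LBColl only starts after
// LBProp has finished and dev_ux/dev_uy/dev_rho are complete when it reads them.
LBColl<<<dimGrid, dimBlock>>>(nx, geoD, tauf, prefix,
                              dev_ux, dev_uy, dev_rho, fr1, /* ... */ fse1);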

__global__ static void LBColl( int nx, int* geoD, double tauf, double prefix,
                               double* dev_ux, double* dev_uy, double* dev_rho,
                               double* fr1, ..., double* fse1 )
{
  // this kernel is responsible for collision of the fluid nodes
  double ux, uy, uv, rho, tmp1, tmp2;
  double delta_Uxx, delta_Uxy, delta_Uyx, delta_Uyy, up, down;
  double tauf0, tauf1, tauf2, tauf3, tauf4, tauf5;

  // index k into the 1D arrays
  int k = nx*blockIdx.y + blockIdx.x*blockDim.x + threadIdx.x;

  if(geoD[k] == FLUID)
  {
    ux  = dev_ux[k];
    uy  = dev_uy[k];
    rho = dev_rho[k];

    // resolved velocity gradient tensor gij (central differences)
    delta_Uxx = 0.5*(dev_ux[k+1]  - dev_ux[k-1]);  // g11
    delta_Uxy = 0.5*(dev_ux[k+nx] - dev_ux[k-nx]); // g12
    delta_Uyx = 0.5*(dev_uy[k+1]  - dev_uy[k-1]);  // g21
    delta_Uyy = 0.5*(dev_uy[k+nx] - dev_uy[k-nx]); // g22

    // numerator: (g12*g21 - g11*g22)^2
    up = (delta_Uxy*delta_Uyx - delta_Uxx*delta_Uyy)
       * (delta_Uxy*delta_Uyx - delta_Uxx*delta_Uyy);

    // denominator: gij*gij = g11*g11 + g12*g12 + g21*g21 + g22*g22
    // (a small epsilon avoids division by zero)
    down = delta_Uxx*delta_Uxx + delta_Uxy*delta_Uxy
         + delta_Uyx*delta_Uyx + delta_Uyy*delta_Uyy + 1e-15;

    // single relaxation time according to Equation (6)
    tauf0 = tauf + 3.0*prefix*sqrt(up/down);

    // LBM iteration: f = f* - (f* - feq)/tau  ->  f = (1 - 1/tau)*f* + feq/tau
    // read f* and write f back with coalesced global memory accesses
    tauf1 = 1.0/tauf0;
    tauf2 = 1.0 - tauf1;
    tauf3 = (4.0/9.0)*tauf1;
    tauf4 = 0.25*tauf3;
    tauf5 = 0.25*tauf4;

    uv   = 1.0 - 1.5*(ux*ux + uy*uy);
    tmp1 = 4.5*ux*ux;
    tmp2 = 4.5*uy*uy;
    fr1[k] = tauf2*fr1[k] + tauf3*rho*uv;
    fe1[k] = tauf2*fe1[k] + tauf4*rho*( 3.0*ux + tmp1 + uv);
    fn1[k] = tauf2*fn1[k] + tauf4*rho*( 3.0*uy + tmp2 + uv);
    ...
    tmp1 = 4.5*(ux + uy)*(ux + uy) + uv;
    tmp2 = 4.5*(uy - ux)*(uy - ux) + uv;
    fne1[k] = tauf2*fne1[k] + tauf5*rho*( 3.0*(ux + uy) + tmp1);
    fnw1[k] = tauf2*fnw1[k] + tauf5*rho*( 3.0*(uy - ux) + tmp2);
    ...
  }
}
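For reference, in two dimensions and with a uniform filter width the Vreman eddy viscosity reduces to the expression LBColl evaluates. Written out (our reading of the kernel; interpreting prefix as c Δ^2, with c the Vreman constant, is an assumption, not stated in the excerpt):

\[
\alpha_{ij} = g_{ij},\qquad
\beta_{ij} = \Delta^{2}\,\alpha_{mi}\alpha_{mj},\qquad
B_\beta = \beta_{11}\beta_{22} - \beta_{12}^{2}\quad(\text{2D}),
\]
\[
\nu_t = c\,\sqrt{\frac{B_\beta}{\alpha_{ij}\alpha_{ij}}}
      = c\,\Delta^{2}\,\frac{\left|g_{11}g_{22} - g_{12}g_{21}\right|}{\sqrt{g_{mn}g_{mn}}},
\qquad
\tau_{\text{total}} = \tau_0 + 3\,\nu_t,
\]

which matches tauf0 = tauf + 3.0*prefix*sqrt(up/down), with up and down holding the squared determinant and the gradient norm computed just above.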

Appendix D: An excerpt of the evolution loop of the multi-GPU implementation

void evolution(int steps)
{
  // Excerpt: nx, ny, gpu_count, gpuid_tesla[], height_r[], pitch, width,
  // src_offset, dest_offset and the distribution arrays are defined elsewhere.
  dim3 dimBlock( THREAD_NUM, 1 );
  dim3 dimGrid1( nx / THREAD_NUM );   // .y is set per sub-domain before launch
  dim3 dimGrid2( 0, 2 );              // .x is set per sub-domain before launch
  dim3 dimGrid3( nx / THREAD_NUM, 1 );
  bool isEvenStep = true;             // switch between the two sets of arrays

  for(int i = 0; i < steps; i++){
    isEvenStep = (i % 2 == 0);

    switch(gpu_count){
    case QUADRUPLE_GPU:
      cudaSetDevice(gpuid_tesla[3]);
      dimGrid1.y = height_r[3];
      dimGrid2.x = (height_r[3] + THREAD_NUM - 1) / THREAD_NUM;
      launch_LBCollProp(...);   // dimGrid1, inner fluid nodes
      launch_LBBC_LR(...);      // dimGrid2, left and right boundaries
      launch_LBBC_UP(...);      // dimGrid3
      ...                       // Note: no "break" here - intentional fall-through!
    case SINGLE_GPU:
      cudaSetDevice(gpuid_tesla[0]);
      dimGrid1.y = height_r[0];
      dimGrid2.x = (height_r[0] + THREAD_NUM - 1) / THREAD_NUM;
      launch_LBCollProp(...);
      launch_LBBC_LR(...);
      launch_LBBC_DOWN(...);    // dimGrid3
      if(gpu_count == SINGLE_GPU)
        launch_LBBC_UP(...);
      break;
    default:
      break;
    }

    // GPU-to-GPU communication between adjacent sub-domains
    if(gpu_count >= DOUBLE_GPU){   // GPU1 - GPU2
      if(isEvenStep){
        src_offset  = nx*(ny/gpu_count - 1);
        dest_offset = 0;
        // transfer 3 distributions into the corresponding "ghost layer"
        packedTransfer(src_offset, dest_offset, fn1, fne1, fnw1, fn3, fne3, fnw3,
                       pitch, width, 1, cudaMemcpyDefault);
        src_offset  = nx;
        dest_offset = nx*(ny/gpu_count);
        packedTransfer(src_offset, dest_offset, fs3, fsw3, fse3, fs1, fsw1, fse1,
                       pitch, width, 1, cudaMemcpyDefault);
      }else{
        src_offset  = nx*(ny/gpu_count - 1);
        dest_offset = 0;
        packedTransfer(src_offset, dest_offset, fn0, fne0, fnw0, fn2, fne2, fnw2,
                       pitch, width, 1, cudaMemcpyDefault);
        src_offset  = nx;
        dest_offset = nx*(ny/gpu_count);
        packedTransfer(src_offset, dest_offset, fs2, fsw2, fse2, fs0, fsw0, fse0,
                       pitch, width, 1, cudaMemcpyDefault);
      }
    }
    if(gpu_count >= TRIPLE_GPU)    {...}   // GPU2 - GPU3
    if(gpu_count >= QUADRUPLE_GPU) {...}   // GPU3 - GPU4

    cudaThreadSynchronize();
  }
}
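The excerpt relies on a helper packedTransfer to move three distribution arrays into the neighbouring sub-domain's ghost layer in one call. Below is a minimal sketch of what such a helper could look like, assuming linear row-major allocations whose row stride in bytes equals pitch and a payload of width bytes per row; the paper's actual helper may differ.

#include <cuda_runtime.h>

// Hypothetical helper, not the authors' implementation.
// Copies `rows` boundary rows of three distribution arrays from one sub-domain
// into the ghost layer of its neighbour. With unified virtual addressing,
// cudaMemcpyDefault lets the runtime infer the device-to-device copy direction.
static void packedTransfer(size_t srcOffset, size_t destOffset,
                           const double* srcA, const double* srcB, const double* srcC,
                           double* dstA, double* dstB, double* dstC,
                           size_t pitch, size_t width, size_t rows,
                           cudaMemcpyKind kind)
{
    cudaMemcpy2D(dstA + destOffset, pitch, srcA + srcOffset, pitch, width, rows, kind);
    cudaMemcpy2D(dstB + destOffset, pitch, srcB + srcOffset, pitch, width, rows, kind);
    cudaMemcpy2D(dstC + destOffset, pitch, srcC + srcOffset, pitch, width, rows, kind);
}

A natural refinement, outside this excerpt, would be cudaMemcpy2DAsync on per-pair streams so the halo exchange overlaps with the interior kernels.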


Cite this article

Li, Q., Zhong, C., Li, K. et al. A parallel lattice Boltzmann method for large eddy simulation on multiple GPUs. Computing 96, 479–501 (2014). https://doi.org/10.1007/s00607-013-0356-7
