
A parallel lattice Boltzmann method for large eddy simulation on multiple GPUs


Abstract

To improve the efficiency of simulating turbulent fluid flows at high Reynolds numbers with large eddy dynamics, a CUDA-based lattice Boltzmann method for large eddy simulation (LES) on multiple graphics processing units (GPUs) is proposed. Our solution adopts the "collision after propagation" lattice evolution scheme and performs the misaligned propagation phase during the global memory read process. The latest GPU platform allows a single CPU thread to control up to four GPUs running in parallel. To make use of multiple GPUs, the whole working set is evenly partitioned into sub-domains. We implement the Smagorinsky model and the Vreman model to verify our multi-GPU solution. The two LES models compute the relaxation time differently and therefore lead to different CUDA implementation characteristics. The implementation based on the Smagorinsky model achieves a 190-fold speedup over the sequential CPU implementation, while the implementation based on the Vreman model achieves more than a 90-fold speedup. The experimental results show that the parallel performance of our multi-GPU solution scales very well across multiple GPUs. Large-scale (up to 10,240 × 10,240 lattices) LES–LBM simulation therefore becomes possible at low cost, even with double-precision floating-point arithmetic.
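The abstract's multi-GPU strategy rests on one host thread selecting each device in turn, allocating its sub-domain, and launching kernels on it. The following is a minimal, self-contained sketch of that pattern, not the paper's code: the kernel stepKernel, the array names, and the problem size are illustrative assumptions.

#include <cuda_runtime.h>

// Illustrative stand-in for the per-sub-domain LBM update (not the paper's kernel).
__global__ void stepKernel(double* f, int nx, int subNy)
{
    int k = nx*blockIdx.y + blockIdx.x*blockDim.x + threadIdx.x;
    if (k < nx*subNy)
        f[k] *= 1.0;  // placeholder update
}

int main()
{
    const int nx = 1024, ny = 1024;     // assumed lattice size
    int gpuCount = 0;
    cudaGetDeviceCount(&gpuCount);
    if (gpuCount < 1) return 0;
    if (gpuCount > 4) gpuCount = 4;     // the paper drives up to four GPUs
    const int subNy = ny / gpuCount;    // even 1D decomposition into sub-domains

    double* f[4] = {nullptr, nullptr, nullptr, nullptr};
    for (int g = 0; g < gpuCount; ++g) {    // a single host thread owns every device
        cudaSetDevice(g);
        cudaMalloc(&f[g], sizeof(double)*nx*subNy);
    }

    dim3 block(256, 1);
    dim3 grid(nx/256, subNy);
    for (int g = 0; g < gpuCount; ++g) {
        cudaSetDevice(g);                             // launches are asynchronous,
        stepKernel<<<grid, block>>>(f[g], nx, subNy); // so the sub-domains advance in parallel
    }
    for (int g = 0; g < gpuCount; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
        cudaFree(f[g]);
    }
    return 0;
}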





Acknowledgments

This project was supported by the Aeronautical Science Fund of China (Grant No. 20111453012) and the National Defense Pre-Research Foundation of China (Grant No. 9140A13040111HK0329).

Author information


Corresponding author

Correspondence to Chengwen Zhong.

Appendices

Appendix A: Table 3

Table 3 Notation definitions

Appendix B: Kernel function LBCollProp (Smagorinsky model)

__global__ static void LBCollProp( int nx, int* geoD, double tauf, double prefix,
                                   double* fr0, ..., double* fse0,
                                   double* fr1, ..., double* fse1 )
{
  // collision and propagation of the fluid nodes
  // registers for kernel processing
  double ux, uy, uv, rho, tau;
  double fr, fe, fn, fw, fs, fne, fnw, fsw, fse; // f*
  double f1, f2, f3, f4, f5, f6, f7, f8;         // feq

  // index k into the 1D arrays
  int k = nx*blockIdx.y + blockIdx.x*blockDim.x + threadIdx.x;

  if(geoD[k] == FLUID) // inner fluid nodes to update
  {
    // "propagation" is performed by the global memory read transactions.
    // Note: some global memory read transactions remain misaligned.
    fr = fr0[k];
    fe = fe0[k-1];
    fn = fn0[k-nx];
    ...

    // macroscopic properties: velocity (ux, uy) and density (rho)
    rho = fr + fe + fn + fw + fs + fne + fnw + fsw + fse;
    uv = 1/rho;
    ux = (fe + fne + fse - fw - fnw - fsw);
    ux *= uv;
    uy = (fn + fne + fnw - fs - fsw - fse);
    uy *= uv;

    // equilibrium distributions feq after propagation
    tau = rho/9.0;
    uv = 1.0 - 1.5*(ux*ux + uy*uy);
    f1 = tau*( 3.0*ux + 4.5*ux*ux + uv);
    ...
    tau = tau*0.25;
    f5 = tau*( 3.0*(ux+uy) + 4.5*(ux+uy)*(ux+uy) + uv);
    ...
    uv = 16*tau*uv; // i.e. f0

    // single relaxation time according to the Smagorinsky model;
    // only local quantities (the nonequilibrium stress tensor) are needed.
    ux  = (fe-f1) + (fne-f5) + (fse-f8) + (fnw-f6) + (fw-f3) + (fsw-f7);
    uy  = (fne-f5) + (fn-f2) + (fnw-f6) + (fsw-f7) + (fs-f4) + (fse-f8);
    tau = (fne-f5) + (fsw-f7) + (f6-fnw) + (f8-fse);
    tau = sqrt(2*(ux*ux + uy*uy + 2*tau*tau))*prefix;
    tau = (sqrt(tauf*tauf + tau/rho) + tauf)*0.5;

    // LBM iteration: f = f* - (f* - feq)/tau  ->  f = (1 - 1/tau)*f* + feq/tau
    // write data back to device memory with coalesced global memory accesses
    tau = 1/tau;
    ux  = 1 - tau;
    fr1[k] = ux*fr + tau*uv;
    fe1[k] = ux*fe + tau*f1;
    fn1[k] = ux*fn + tau*f2;
    ...
  }
}
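For reference, the relaxation-time update at the end of LBCollProp is the usual Smagorinsky subgrid closure for the LBM, evaluated from the nonequilibrium momentum flux. Written out (our reading of the kernel; identifying prefix with 18 C_s^2 Δ^2 is an assumption, not stated in the excerpt):

\[
Q_{ij} = \sum_{\alpha} e_{\alpha i}\, e_{\alpha j}\left(f_\alpha^{*} - f_\alpha^{eq}\right),
\qquad
|Q| = \sqrt{2\,Q_{ij}Q_{ij}},
\]
\[
\tau_{\text{total}}
  = \frac{1}{2}\left(\tau_0 + \sqrt{\tau_0^{2} + \frac{\mathit{prefix}\cdot|Q|}{\rho}}\right),
\qquad \mathit{prefix} \approx 18\,C_s^{2}\,\Delta^{2},
\]

which matches tau = (sqrt(tauf*tauf + tau/rho) + tauf)*0.5 in the kernel, with ux, uy, tau holding Q_xx, Q_yy, Q_xy just before that line.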

Appendix C: Kernel functions LBProp and LBColl (Vreman model)

__global__ static void LBProp( const int nx, const int* geoD,
                               double* dev_ux, double* dev_uy, double* dev_rho,
                               double* fr0, ..., double* fse0,
                               double* fr1, ..., double* fse1 )
{
  // this kernel is responsible for propagation of the fluid nodes
  double ux, uy, rho, tmp;
  double fr, fe, fn, fw, fs, fne, fnw, fsw, fse; // f*

  // index k into the 1D arrays
  int k = nx*blockIdx.y + blockIdx.x*blockDim.x + threadIdx.x;

  if(geoD[k] == FLUID) // inner fluid nodes to update
  {
    // "propagation" is performed by the global memory read transactions.
    // Note: some global memory read transactions remain misaligned.
    fr = fr0[k];
    fe = fe0[k-1];
    fn = fn0[k-nx];
    ...

    // compute and store all updated macroscopic properties to global memory
    // for the sake of synchronization.
    rho = fr + fe + fn + fw + fs + fne + fnw + fsw + fse;
    dev_rho[k] = rho;
    tmp = 1/rho;
    ux = (fe + fne + fse - fw - fnw - fsw);
    dev_ux[k] = ux*tmp;
    uy = (fn + fne + fnw - fs - fsw - fse);
    dev_uy[k] = uy*tmp;

    fr1[k] = fr;
    fe1[k] = fe;
    fn1[k] = fn;
    ...
  }
}
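The Vreman model needs each node's neighbours' updated velocities to form the velocity gradient tensor, so the per-step update must be split across a grid-wide barrier; in CUDA that barrier is the kernel boundary itself. A minimal host-side sketch of the resulting two-launch sequence follows; the grid/block names and the elided argument lists are placeholders, not the paper's launcher code.

// one LES-LBM time step with the Vreman model (sketch)
LBProp<<<dimGrid, dimBlock>>>(nx, geoD, dev_ux, dev_uy, dev_rho,
                              fr0, /* ... */ fse0, fr1, /* ... */ fse1);
// Launches on the same stream execute in order, so LBColl only starts after
// LBProp has finished and dev_ux/dev_uy/dev_rho are complete when it reads them.
LBColl<<<dimGrid, dimBlock>>>(nx, geoD, tauf, prefix,
                              dev_ux, dev_uy, dev_rho, fr1, /* ... */ fse1);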

__global__ static void LBColl( int nx, int* geoD, double tauf, double prefix,
                               double* dev_ux, double* dev_uy, double* dev_rho,
                               double* fr1, ..., double* fse1 )
{
  // this kernel is responsible for collision of the fluid nodes
  double ux, uy, uv, rho, tmp1, tmp2;
  double delta_Uxx, delta_Uxy, delta_Uyx, delta_Uyy, up, down;
  double tauf0, tauf1, tauf2, tauf3, tauf4, tauf5;

  // index k into the 1D arrays
  int k = nx*blockIdx.y + blockIdx.x*blockDim.x + threadIdx.x;

  if(geoD[k] == FLUID)
  {
    ux  = dev_ux[k];
    uy  = dev_uy[k];
    rho = dev_rho[k];

    // resolved velocity gradient tensor gij (central differences)
    delta_Uxx = 0.5*(dev_ux[k+1]  - dev_ux[k-1]);  // g11
    delta_Uxy = 0.5*(dev_ux[k+nx] - dev_ux[k-nx]); // g12
    delta_Uyx = 0.5*(dev_uy[k+1]  - dev_uy[k-1]);  // g21
    delta_Uyy = 0.5*(dev_uy[k+nx] - dev_uy[k-nx]); // g22

    // numerator: (g12*g21 - g11*g22)^2
    up = (delta_Uxy*delta_Uyx - delta_Uxx*delta_Uyy)
       * (delta_Uxy*delta_Uyx - delta_Uxx*delta_Uyy);

    // denominator: gij*gij = g11*g11 + g12*g12 + g21*g21 + g22*g22
    // (a small epsilon avoids division by zero)
    down = delta_Uxx*delta_Uxx + delta_Uxy*delta_Uxy
         + delta_Uyx*delta_Uyx + delta_Uyy*delta_Uyy + 1e-15;

    // single relaxation time according to Equation (6)
    tauf0 = tauf + 3.0*prefix*sqrt(up/down);

    // LBM iteration: f = f* - (f* - feq)/tau  ->  f = (1 - 1/tau)*f* + feq/tau
    // read f* and write f back with coalesced global memory accesses
    tauf1 = 1.0/tauf0;
    tauf2 = 1.0 - tauf1;
    tauf3 = (4.0/9.0)*tauf1;
    tauf4 = 0.25*tauf3;
    tauf5 = 0.25*tauf4;

    uv   = 1.0 - 1.5*(ux*ux + uy*uy);
    tmp1 = 4.5*ux*ux;
    tmp2 = 4.5*uy*uy;
    fr1[k] = tauf2*fr1[k] + tauf3*rho*uv;
    fe1[k] = tauf2*fe1[k] + tauf4*rho*( 3.0*ux + tmp1 + uv);
    fn1[k] = tauf2*fn1[k] + tauf4*rho*( 3.0*uy + tmp2 + uv);
    ...
    tmp1 = 4.5*(ux + uy)*(ux + uy) + uv;
    tmp2 = 4.5*(uy - ux)*(uy - ux) + uv;
    fne1[k] = tauf2*fne1[k] + tauf5*rho*( 3.0*(ux + uy) + tmp1);
    fnw1[k] = tauf2*fnw1[k] + tauf5*rho*( 3.0*(uy - ux) + tmp2);
    ...
  }
}
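For reference, in two dimensions and with a uniform filter width the Vreman eddy viscosity reduces to the expression LBColl evaluates. Written out (our reading of the kernel; interpreting prefix as c Δ^2, with c the Vreman constant, is an assumption, not stated in the excerpt):

\[
\alpha_{ij} = g_{ij},\qquad
\beta_{ij} = \Delta^{2}\,\alpha_{mi}\alpha_{mj},\qquad
B_\beta = \beta_{11}\beta_{22} - \beta_{12}^{2}\quad(\text{2D}),
\]
\[
\nu_t = c\,\sqrt{\frac{B_\beta}{\alpha_{ij}\alpha_{ij}}}
      = c\,\Delta^{2}\,\frac{\left|g_{11}g_{22} - g_{12}g_{21}\right|}{\sqrt{g_{mn}g_{mn}}},
\qquad
\tau_{\text{total}} = \tau_0 + 3\,\nu_t,
\]

which matches tauf0 = tauf + 3.0*prefix*sqrt(up/down), with up and down holding the squared determinant and the gradient norm computed just above.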

Appendix D: An excerpt of the evolution loop of the multi-GPU implementation

void evolution(int steps)
{
  // Excerpt: nx, ny, gpu_count, gpuid_tesla[], height_r[], pitch, width,
  // src_offset, dest_offset and the distribution arrays are defined elsewhere.
  dim3 dimBlock( THREAD_NUM, 1 );
  dim3 dimGrid1( nx / THREAD_NUM );   // .y is set per sub-domain before launch
  dim3 dimGrid2( 0, 2 );              // .x is set per sub-domain before launch
  dim3 dimGrid3( nx / THREAD_NUM, 1 );
  bool isEvenStep = true;             // switch between the two sets of arrays

  for(int i = 0; i < steps; i++){
    isEvenStep = (i % 2 == 0);

    switch(gpu_count){
    case QUADRUPLE_GPU:
      cudaSetDevice(gpuid_tesla[3]);
      dimGrid1.y = height_r[3];
      dimGrid2.x = (height_r[3] + THREAD_NUM - 1) / THREAD_NUM;
      launch_LBCollProp(...);   // dimGrid1, inner fluid nodes
      launch_LBBC_LR(...);      // dimGrid2, left and right boundaries
      launch_LBBC_UP(...);      // dimGrid3
      ...                       // Note: no "break" here - intentional fall-through!
    case SINGLE_GPU:
      cudaSetDevice(gpuid_tesla[0]);
      dimGrid1.y = height_r[0];
      dimGrid2.x = (height_r[0] + THREAD_NUM - 1) / THREAD_NUM;
      launch_LBCollProp(...);
      launch_LBBC_LR(...);
      launch_LBBC_DOWN(...);    // dimGrid3
      if(gpu_count == SINGLE_GPU)
        launch_LBBC_UP(...);
      break;
    default:
      break;
    }

    // GPU-to-GPU communication between adjacent sub-domains
    if(gpu_count >= DOUBLE_GPU){   // GPU1 - GPU2
      if(isEvenStep){
        src_offset  = nx*(ny/gpu_count - 1);
        dest_offset = 0;
        // transfer 3 distributions into the corresponding "ghost layer"
        packedTransfer(src_offset, dest_offset, fn1, fne1, fnw1, fn3, fne3, fnw3,
                       pitch, width, 1, cudaMemcpyDefault);
        src_offset  = nx;
        dest_offset = nx*(ny/gpu_count);
        packedTransfer(src_offset, dest_offset, fs3, fsw3, fse3, fs1, fsw1, fse1,
                       pitch, width, 1, cudaMemcpyDefault);
      }else{
        src_offset  = nx*(ny/gpu_count - 1);
        dest_offset = 0;
        packedTransfer(src_offset, dest_offset, fn0, fne0, fnw0, fn2, fne2, fnw2,
                       pitch, width, 1, cudaMemcpyDefault);
        src_offset  = nx;
        dest_offset = nx*(ny/gpu_count);
        packedTransfer(src_offset, dest_offset, fs2, fsw2, fse2, fs0, fsw0, fse0,
                       pitch, width, 1, cudaMemcpyDefault);
      }
    }
    if(gpu_count >= TRIPLE_GPU)    {...}   // GPU2 - GPU3
    if(gpu_count >= QUADRUPLE_GPU) {...}   // GPU3 - GPU4

    cudaThreadSynchronize();
  }
}
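The excerpt relies on a helper packedTransfer to move three distribution arrays into the neighbouring sub-domain's ghost layer in one call. Below is a minimal sketch of what such a helper could look like, assuming linear row-major allocations whose row stride in bytes equals pitch and a payload of width bytes per row; the paper's actual helper may differ.

#include <cuda_runtime.h>

// Hypothetical helper, not the authors' implementation.
// Copies `rows` boundary rows of three distribution arrays from one sub-domain
// into the ghost layer of its neighbour. With unified virtual addressing,
// cudaMemcpyDefault lets the runtime infer the device-to-device copy direction.
static void packedTransfer(size_t srcOffset, size_t destOffset,
                           const double* srcA, const double* srcB, const double* srcC,
                           double* dstA, double* dstB, double* dstC,
                           size_t pitch, size_t width, size_t rows,
                           cudaMemcpyKind kind)
{
    cudaMemcpy2D(dstA + destOffset, pitch, srcA + srcOffset, pitch, width, rows, kind);
    cudaMemcpy2D(dstB + destOffset, pitch, srcB + srcOffset, pitch, width, rows, kind);
    cudaMemcpy2D(dstC + destOffset, pitch, srcC + srcOffset, pitch, width, rows, kind);
}

A natural refinement, outside this excerpt, would be cudaMemcpy2DAsync on per-pair streams so the halo exchange overlaps with the interior kernels.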


Cite this article

Li, Q., Zhong, C., Li, K. et al. A parallel lattice Boltzmann method for large eddy simulation on multiple GPUs. Computing 96, 479–501 (2014). https://doi.org/10.1007/s00607-013-0356-7
