Discrete dipole approximation simulations on GPUs using OpenCL—Application on cloud ice particles

https://doi.org/10.1016/j.jocs.2011.05.011Get rights and content

Abstract

The global distribution and climatology of ice clouds are among the main uncertainties in climate modeling and prediction. In order to retrieve ice cloud properties from remote sensing measurements, the scattering properties of all cloud ice particle types must be known. The discrete dipole approximation (DDA) simulates scattering of radiation by arbitrarily shaped particles and is thus suitable for cloud ice crystals. The DDA models the particle as a collection of equal dipoles on a lattice, and is computationally much more expensive than approximations restricted to more regularly shaped particles. On a single computer the calculation for an ice particle of a specific size, for a given scattering plane at one specific wavelength can take several days. We have ported the core routines of the scattering suite “ADDA” to the open computing language (OpenCL), a framework for programming parallel devices like PC graphics cards (graphics processing units, GPUs) or multi-core CPUs. In a typical case we can achieve a speed-up on a GPU as compared to a CPU by a factor of 5 in double precision and a factor of 15 in single precision. Spreading the work load over multiple GPUs will allow calculating the scattering properties even of large cloud ice particles.

Highlights

► We analyse the use of graphic cards in discrete dipole approximation calculations. ► We have ported core routine for matrix vector multiplication to OpenCL language. ► A speedup by factor of 5 (15) per iteration is achieved in double (single) precision. ► We compare the calculation speed of graphic card and CPU in several problem sizes.

Introduction

The global distribution and climatology of ice clouds represent one of the main uncertainties in climate analysis and prediction [1]. For their daily global observation, spaceborne microwave millimeter and submillimeter radiometers have been suggested [2] because of their ability to sense the interior of cloud. Many algorithms to retrieve geophysical properties of ice clouds assume a spherical shape of the single crystals when calculating their scattering properties. While this simplifying assumption is appropriate in the Rayleigh regime, i.e., for particles small compared to the wavelength (>3 mm or ν < 100 GHz), for higher frequencies the aspherical shape of the ice crystals more and more determines their scattering properties which in turn are required for the forward radiative transfer modeling of ice clouds. The main properties of the scattered radiation depend on the single scattering properties of every involved particle. There are no sufficient possibilities to measure directly the distributions of size, shape and orientation of the ice crystals in ice clouds and convective systems. However, a possible approach is to build a database of the scattering properties of different shapes and sizes of ice crystals based on forward scattering calculations [3]. There are several different methods to calculate the single scattering properties of ice crystals.

The T-matrix method [4] is an accurate method for calculating scattering properties of particles without distinctive concavity, so it can be used for many different crystal shapes. The FDTD method [5] is computationally expensive on particle sizes which exceed twenty times the wavelength. However, the most flexible method is the DDA method [6] suited for arbitrary shapes. It is an accurate method to solve the Maxwell equations for single particle scattering. But DDA is computationally expensive and has a high memory consumption especially for lager particles. The field of application of FDTD and DDA is similar but DDA is better suited for small refractive indices [7]. In this study we use the ADDA program package [8] because it is written in the C programming language where an interface to OpenCL is provided.

In view of the growing number of multi-core CPU, the growing numbers of cores per CPU and the reduction of clock rate it seems to be the right time to rethink the algorithmic approaches which have been developed for serial computers. A well suited application for parallel computers are for example linear filters applied to large datasets like images. That kind of task is nowadays often done by GPU with up to 1600 scalar processing cores, in contrast to CPU with up to 6 cores. On the other hand the instruction set of a CPU is a lot bigger than that of a GPU, i.e., a CPU can perform more complex mathematical operations in fewer cycles than a GPU. As a consequence, GPU appear attractive to apply mathematically simple function on large datasets, as it is the case for DDA where a system of linear equations is solved. However, FDTD was the first method implemented on GPU by using the graphics API and low level programming [9].

A unified standard framework for programming multi-core CPU, GPU and also other parallel computing devices is the OpenCL by the Khronos group in December 2008 [10]. The GPU vendor Nvidia released an example with their SDK for OpenCL which does an FDTD calculation [11]. In this report, the DDA method, as realized in the ADDA software package, is analysed for the possibility to implement it in OpenCL, and the possible speedup over the CPU implementation will be estimated.

Section snippets

Overview

The DDA approach for calculating single scattering properties of particles approximates the shape of an ice crystal by a number of small dipoles on a cubic lattice. An incident wave is scattered by all dipoles. Every dipole will scatter the incident wave into all directions, among them the directions of all other dipoles so that not only the incident wave arrives at one specific dipole but also the scattered waves from all other dipoles. Therefore the self consistent solution for the dipole

Overview over open computing language (OpenCL)

OpenCL is the name of a parallelization software framework. It is an API and a programming language at the same time. It is used to distribute computational work load over several platforms and devices. The best value of computational power per money and computational power per power consumption available nowadays are high end gaming GPU which are basically highly parallel computation units. OpenCL offers a way to program such a GPU. The OpenCL language is similar to the programming language C

Used hardware

All timing results were obtained on ordinary consumer computer hardware. The relevant details are listed in Table 2. The CPU is a dual core CPU, but for the timing of ADDA just one core is used since we compiled ADDA to sequential mode.

The Nvidia Geforce GTX 260 used in this study is a consumer gaming graphics card. Its fast SP capabillities (about 800 Giga FLOPS) makes it well suited for scientific calculations. However, the DP performance is just about 65 Giga FLOPS. It was one of the first

Splitting into slices

The algorithm of the matrix vector multiplication in ADDA is designed to be executed in parallel by multiple CPU by splitting the matrix into independent slices. We take a closer look at the possibility to use this approach for distributing the work onto the GPU cores. In this approach, there will be as many slices as the x-dimension of the crystal-grid has elements. For some GPU architectures this will result in a low occupancy of the cores when there are more cores than slices. But even with

Runtime

We decided to use the FFT implementation of Apple (Section 5.4) because it showed the best performance. The arithmetic routines were converted to OpenCL without GPU architecture dependent customizations by using the guideline of Section 3. Fig. 3 compares the runtime of our GPU implementations of the matrix–vector multiplication of ADDA with the original GPU implementation on a sphere in a grid of 128 × 128 × 128. We use a power-of-two grid to have comparable results since non-power-of-two grids

Application example

We demonstrate the functionality of the implementation by aplying it to a 6-branch-bullet-rosette-shaped crystals in a grid of 76 × 76 × 76 dipoles where the wavelength is one per crystal dimension. As a validation criterion Draine and Flatau pointed out that one needs at least 10|m| dipoles per wavelength to obtain an accurate result [6] with m the complex refractive index. In our in our example we fulfill this condition with 76 dipoles per wavelength. The crystal consists of 4746 non-void dipoles

Conclusions and outlook

In this study, the potential to use GPU for light scattering calculations with DDA has been investigated. The time consuming matrix vector multiplication of the DDA was reimplemented in OpenCL. The ADDA scattering program by Yurkin and Hoekstra was modified to use the OpenCL implementation of the matrix vector multiplication. The OpenCL programming example from Apple for FFT was modified to handle DP and used to calculate the FFT needed for the matrix vector multiplication. The other parts were

Acknowledgements

Fruitful discussions with Thomas Wriedt on scattering and the suggestions of Maxim Yurkin on the possibility to contribute GPU calculations to the DDA are gratefully acknowledged. We also want to thank the anonymous reviewers for their fruitful comments and suggestions.

Marcus Huntemann studies physics at the University of Bremen. He works at the Institute of Environmental Physics, University of Bremen. His interests lie in high performance computing and general-purpose computing on graphics processing units. The context of his work is the retrieval of geophysical parameters from satellite data.

References (40)

  • M.A. Yurkin et al.

    User Manual for the Discrete Dipole Approximation Code ADDA v. 0. 79

    (2009)
  • S. Adams et al.

    Finite difference time domain (fdtd) simulations using graphics processors

  • P.J. Flatau

    Improvements in the discrete-dipole approximation method of computing scattering and absorption

    Opt. Lett.

    (1997)
  • P.J. Flatau et al.

    Light scattering by rectangular solids in the discrete-dipole approximation: a new algorithm exploiting the block-toeplitz structure

    J. Opt. Soc. Am. A

    (1990)
  • J.J. Goodman et al.

    Application of fast-Fourier-transform techniques to the discrete-dipole approximation

    Opt. Lett.

    (1991)
  • E. Collett

    Polarized Light

    (1993)
  • Cited by (12)

    • Discrete dipole approximation

      2023, Light, Plasmonics and Particles
    • “pyGDM” - new functionalities and major improvements to the python toolkit for nano-optics full-field simulations

      2022, Computer Physics Communications
      Citation Excerpt :

      From the technical side, we plan to implement a conjugate gradients solver to avoid full inversion of systems with large amounts of mesh-cells [3]. Furthermore, such iterative solver could be strongly accelerated using GPU-based fast Fourier transforms [80,81]. More flexible meshing is also a feature that we plan to implement in the future [44,20,45].

    • MATLAB package for discrete dipole approximation by graphics processing unit: Fast Fourier Transform and Biconjugate Gradient

      2021, Journal of Quantitative Spectroscopy and Radiative Transfer
      Citation Excerpt :

      However, in other electromagnetic scattering problems, the biconjugate gradient (BCG) technique has been used as an iterative method, which is three- to six- times faster than the corresponding CCG [25,26]. Several attempts have been made to increase the performance of the DDA [22–24,27–31]. One of the pioneering works was carried out by Draine and Flatau, who popularized the DDA by releasing an open-source package written in FORTRAN, named DDSCAT [22,27].

    • Graphics Processing Units and Open Computing Language for parallel computing

      2014, Computers and Electrical Engineering
      Citation Excerpt :

      Some of these work are for image processing, the very application GPUs were initially designed for (see for example, [13–19]). But more are for modeling and simulation in many diverse fields, including molecular dynamics [20–25], computational fluid dynamics [26–28], climate physics modeling [29,30], astrophysics and other physics studies [31–35], computer aided design [36], fashion design [37], automotive manufacturing [38], speech recognition [39], and social systems [40]. Of the many modeling studies, n-body simulations are seen in all fields of physics and engineering.

    • Light scattering theory and programs: Discussion of latest advances and open problems

      2012, Journal of Quantitative Spectroscopy and Radiative Transfer
      Citation Excerpt :

      A suitable parallelization software framework is OpenCL. This framework has recently been applied to write a parallel version of A-DDA [38]. The time consuming matrix vector multiplication of the A-DDA was reimplemented in OpenCL.

    View all citing articles on Scopus

    Marcus Huntemann studies physics at the University of Bremen. He works at the Institute of Environmental Physics, University of Bremen. His interests lie in high performance computing and general-purpose computing on graphics processing units. The context of his work is the retrieval of geophysical parameters from satellite data.

    Georg Heygster received the diploma degree in 1976 in solid state physics and the PhD in 1979 in digital image processing, both in physics from the University of Göttingen, Germany. He served as a consultant at the Computer Center of the University of Bremen, Germany, from 1979 to 1988. Since then, after one year working for one year on the imaging mechanisms of scanning acoustic microscopes, he has been head of the group Geophysical Analysis of Satellite Images at the Institute of Environmental Physics, University of Bremen. His research activities include passive and active microwave remote sensing, especially of both surface and atmospheric parameters in the high latitudes, various aspects of the hydrological cycle, long-term trends and retrieval techniques. He was or still is Principal Investigator or co-Investigator of many research projects funded by EU, ESA, DFG and JAXA. These projects include the development of sensor soft- and hardware, conducting campaigns, the final data analysis from multi- and single-sensor data to geophysical parameters, and the interpretation and application of these results in many areas such as meteorology, climatology and oceanography.

    Gang Hong received the B.S. degree in atmospheric sciences from Nanjing Institute of Meteorology, Nanjing, China in 1995 and the PhD degree in environmental physics and remote sensing from the University of Bremen, Germany in 2004. He is currently a Senior Research Scientist with Science Systems and Applications, Inc., Hampton, VA, and is a member of the NASA Langley Cloud and Radiation Group. His research is focused on the remote sensing of cloud and aerosol properties and investigating their radiative properties and climatologies. He has also worked on investigating the optical properties of nonspherical cloud ice crystals and aerosols.

    View full text