Practical aspects
Efficient filtering techniques for finite-difference atmospheric general circulation models on parallel processors¹

https://doi.org/10.1016/S0167-8191(98)00012-X

Abstract

Finite-difference-based atmospheric general circulation models using latitude/longitude meshes typically invoke filtering to suppress numerical instabilities occurring near the poles. We present a more efficient methodology for performing this filtering on parallel processors. This involves redistribution of the computational workload among all available processing elements, along with either the invocation of fast Fourier transforms to replace the slower numerical convolutions, or the smoothing of the transfer functions to enable use of fewer terms in the convolution sum. Results are presented on contemporary parallel architectures such as the Cray-T3D.

Introduction

One of the difficulties with finite-difference-based general circulation models that use a latitude–longitude mesh is the convergence of lines of constant longitude (meridians) in the neighborhood of the poles. This has a negative effect on performance in at least two ways. First, there is an unnecessarily high resolution in the polar regions, leading to wasted computations at the extra meshpoints (see Fig. 1). Second, when using an explicit difference method, the allowable time step for numerical stability is extremely small. This second shortfall is commonly addressed with longitudinal filtering [1], whereby the allowable timestep is typically increased by nearly an order of magnitude while maintaining numerical stability. This filtering alleviates the need for a small time step by damping the high wave number (short wavelength) modes that would otherwise lead to numerical instability.
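The stability argument above can be made concrete with a small sketch. Assuming a fixed longitudinal spacing in degrees, the physical zonal grid spacing shrinks as the cosine of latitude, and so does the largest stable explicit time step for a given wave speed. The function name, wave speed, and grid spacing below are illustrative choices, not values from the paper.

```python
import math

def zonal_cfl_dt(lat_deg, dlon_deg=1.25, c=300.0, radius=6.371e6):
    """Largest stable explicit time step (s) implied by the zonal grid
    spacing at a given latitude, for a signal speed c (m/s).  The
    physical spacing shrinks as cos(latitude), so dt shrinks with it.
    All parameter values here are illustrative, not from the paper."""
    dx = radius * math.cos(math.radians(lat_deg)) * math.radians(dlon_deg)
    return dx / c

dt_equator = zonal_cfl_dt(0.0)
dt_polar = zonal_cfl_dt(88.0)
# dt_equator / dt_polar == 1 / cos(88 deg), roughly a factor of 29,
# which is why an unfiltered explicit scheme is throttled by the poles.
```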

In the parallel LLNL/UCLA atmospheric general circulation model (AGCM) [2,3], the filtering procedure aims at eliminating length scales below that of the latitudinal mesh spacing. This spacing is usually constant throughout the mesh and is typically chosen to be 80% of the longitudinal spacing at the equator, resulting in approximately square grid cells at the midlatitudes. Hence, by damping modes of wavelengths shorter than this characteristic length scale, the effect of the filter is to reduce the effective resolution in the high-latitude regions to that of the midlatitudes. As a result, the controlling explicit time step limit is determined by the latitudinal mesh spacing. In a certain sense, there is little loss of information content with this procedure, as high-frequency modes would eventually propagate out of the polar regions and be damped out anyway.
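A simplified stand-in for such a latitude-dependent transfer function illustrates the intended behavior: zonal wavenumbers whose physical wavelength at a given latitude falls below what the same wavenumber would have at a reference midlatitude are progressively damped. This is only a sketch of the idea; the exact Arakawa–Lamb response is given in Section 2 of the paper.

```python
import numpy as np

def idealized_response(nx, lat_deg, lat_ref_deg=45.0):
    """Simplified stand-in for a high-latitude filter transfer
    function: leave zonal wavenumbers k below a latitude-dependent
    cutoff untouched and damp the rest in proportion to k.  The
    cutoff scales with cos(latitude), mimicking the reduction of
    effective polar resolution to that of the midlatitudes.  Not
    the actual Arakawa-Lamb form, which the paper derives."""
    k = np.arange(nx // 2 + 1)                    # zonal wavenumbers
    k_c = (nx / 2) * np.cos(np.radians(lat_deg)) / np.cos(np.radians(lat_ref_deg))
    s = np.minimum(1.0, k_c / np.maximum(k, 1))
    s[0] = 1.0                                    # never damp the zonal mean
    return s

# At the equator nothing is damped; near the pole high k are strongly damped.
s_equator = idealized_response(64, 0.0)
s_polar = idealized_response(64, 85.0)
```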

The filtering algorithm designed by Arakawa and Lamb [4] is of global extent in the longitudinal direction and can be written in the form f̄ = f*g. Here f is the function to be filtered, g is the filtering function (the Fourier transform of which is denoted the transfer function), * denotes convolution (with respect to longitude), and f̄ is the end result. Hence, the calculation of such a Fourier filter involves sums over all longitudinal points. The number of arithmetic operations required to perform this convolution depends on nx²×ny, where nx is the number of longitudinal grid points and ny is the number of latitudinal grid points. By contrast, the amount of arithmetic in the remainder of the AGCM algorithm depends on nx×ny. The poor scaling of the filters becomes evident as resolution is increased. At a grid resolution of 1° longitude by 1.25° latitude and nine vertical levels, the computation of the filtered variables consumes more than half of the execution time on a single processor of a Cray-C90. Our strategy for reducing the filter computation time is based in part on either invoking fast Fourier transforms in place of the convolutions, or reducing the longitudinal range of the filter sums.
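Applied along one latitude row, the convolution f̄ = f*g amounts to multiplying the zonal Fourier coefficients of f by the transfer function and transforming back. A minimal sketch, assuming periodic longitude and a caller-supplied transfer function (the function and variable names here are our own):

```python
import numpy as np

def filter_latitude_row(f, response):
    """Apply a zonal (longitudinal) filter to one latitude row.
    Equivalent to the circular convolution f_bar = f * g, where the
    filter g is specified through its transfer function `response`
    (real, length nx//2 + 1).  Using an FFT makes the cost
    O(nx log nx) instead of the O(nx^2) of the direct sum."""
    F = np.fft.rfft(f)
    return np.fft.irfft(F * response, n=len(f))

nx = 128
x = np.linspace(0.0, 2.0 * np.pi, nx, endpoint=False)
f = np.cos(3 * x) + np.cos(40 * x)                  # low + high wavenumber modes
response = (np.arange(nx // 2 + 1) <= 10).astype(float)  # keep only k <= 10
f_bar = filter_latitude_row(f, response)            # the k = 40 mode is removed
```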

The parallel LLNL/UCLA AGCM, developed at Lawrence Livermore National Laboratory in collaboration with UCLA, is based on the original UCLA AGCM [4]. Parallelization is accomplished using two-dimensional domain decomposition (see Fig. 1) and explicit message-passing [5]. The code runs on virtually all contemporary massively parallel processors and distributed systems (as well as shared-memory systems), with portability accomplished through the macro interface for communication and allocation (MICA) [5,6]. This domain decomposition technique scales well with both processor count and mesh resolution. In the limit of high resolution, and assuming perfect load balance, the parallel speed-up of the algorithm, excluding filtering, scales approximately linearly with the processor count np at low-to-medium processor counts, with the scaling gradually degrading as np increases [7]. When filtering is included, the parallel speed-up is distinctly worse than linear, even at low processor counts. This is because the processors responsible for the equatorial subdomains sit idle while those responsible for the polar regions perform the high-latitude filtering. Thus, the other part of our strategy is to redistribute the filtering workload among all available processing elements.
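The index bookkeeping behind such a two-dimensional block decomposition can be sketched as follows. This is only an illustration of how an nx-by-ny mesh maps onto an np_x-by-np_y processor grid; the AGCM's actual decomposition and message-passing layer are described in [5].

```python
def subdomain(rank, np_x, np_y, nx, ny):
    """Index ranges owned by `rank` in an np_x-by-np_y processor grid
    over an nx-by-ny longitude/latitude mesh (block decomposition;
    remainder points go to the low-index processors).  A sketch of
    the bookkeeping only, not the AGCM's actual implementation."""
    px, py = rank % np_x, rank // np_x          # processor coordinates

    def block(n, p, i):
        base, rem = divmod(n, p)
        lo = i * base + min(i, rem)
        return lo, lo + base + (1 if i < rem else 0)

    return block(nx, np_x, px), block(ny, np_y, py)

# Example: 8 processors in a 4x2 grid over a 10x6 mesh.
x_range, y_range = subdomain(0, 4, 2, 10, 6)
```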

In Section 2 we summarize the Arakawa–Lamb high-latitude filtering algorithm. In Section 3 we describe our strategy for redistributing the filtering workload among all processors. In Section 4 we discuss the invocation of fast Fourier transforms. In Section 5 we derive the limited-width approximations to the Arakawa–Lamb filters and discuss their validation. In Section 6 we present a performance comparison of the various filtering methods. Conclusions and future directions are presented in Section 7.

Section snippets

High-latitude filtering

The temporal and spatial transport of mass, momentum and trace gases within the atmosphere is modeled in the AGCM by a set of equations derived from the Navier–Stokes equations using the hydrostatic approximation [8]. This approximation, which assumes that the gravitational force is roughly balanced by the vertical gradient of pressure, allows reduction from a fully three-dimensional system to a set of coupled two-dimensional equations. The finite-difference mesh is staggered in both the …

Filter redistribution

A serious load-imbalance occurs during the invocation of the filtering due to the parallelization being based on latitude/longitude domain decomposition. This is particularly apparent for the weak filters, which are applied only in a small neighborhood of the poles. The way we remedy this is to redistribute the workload among all available computational processors.

Our redistribution procedure is modular, flexible and efficient. Unlike that of Lou and Farrara [10], who decompose in …
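The essence of the redistribution idea can be sketched in a few lines: latitude rows that require filtering, which are concentrated on the polar processors, are reassigned evenly across all processors. The round-robin assignment below is a hypothetical illustration, not the paper's actual scheme, and it omits the message passing needed to move the data.

```python
def redistribute(filtered_rows, n_procs):
    """Assign each latitude row needing filtering to a processor in
    round-robin order, so the filtering work is spread over all
    processors instead of only those owning polar subdomains.
    Illustrative only; the paper's actual redistribution (and the
    communication behind it) is described in Section 3."""
    return {row: i % n_procs for i, row in enumerate(filtered_rows)}

# Example: the 16 polar rows of a 64-row mesh, spread over 16 processors
# instead of the 4 that would own them under the domain decomposition.
rows = list(range(0, 8)) + list(range(56, 64))
owners = redistribute(rows, 16)
```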

Fast Fourier transforms

The original method used by UCLA to compute the filters was to perform the numerical convolutions directly. This was likely because the earliest computations were done on relatively coarse meshes, and because fast Fourier transforms for arbitrary numbers of points were not available when the filters were first developed. It is well known that a numerical convolution can be computed by first performing a Fourier transform, then multiplying …
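The equivalence being invoked here is the circular convolution theorem: transforming both sequences, multiplying pointwise, and transforming back reproduces the direct O(nx²) convolution sum at O(nx log nx) cost. A minimal check of that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
nx = 64
f = rng.standard_normal(nx)
g = rng.standard_normal(nx)

# Direct circular convolution: O(nx^2) operations per latitude row.
direct = np.array([sum(f[m] * g[(n - m) % nx] for m in range(nx))
                   for n in range(nx)])

# Same result via FFTs: transform, multiply pointwise, transform back.
via_fft = np.fft.irfft(np.fft.rfft(f) * np.fft.rfft(g), n=nx)
```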

Finite-width filters

In this approach we seek to reduce the operation count by performing the convolution with a function having narrow, rather than global, support. One first observes that for each latitude, there is a critical wave number at which the transfer function of the original filters has a discontinuous derivative. This discontinuity plays a large role in the filters having a full longitudinal extent in the first place. Hence, the first step in our procedure is to smooth this discontinuity. This is …
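The mechanism at work can be demonstrated numerically: a transfer function with a sharp cutoff yields real-space filter weights that decay slowly (like 1/n), while smoothing the cutoff makes the weights decay much faster, so the convolution sum can be truncated to a narrow window with little error. The particular linear roll-off below is an illustrative choice; the smoothing actually used by the paper differs.

```python
import numpy as np

nx = 128
k = np.arange(nx // 2 + 1)

# A transfer function with a sharp cutoff at k = 20 (slope discontinuity),
# and one with a gentle linear roll-off between k = 20 and k = 30.
sharp = (k <= 20).astype(float)
smooth = np.clip((30 - k) / 10.0, 0.0, 1.0)

# Corresponding real-space filter weights, centered for inspection.
g_sharp = np.abs(np.fft.fftshift(np.fft.irfft(sharp)))
g_smooth = np.abs(np.fft.fftshift(np.fft.irfft(smooth)))

# Weight magnitude outside a window of +-10 points about the center:
# the smoothed response leaves far less mass in the tails, so its
# convolution sum can safely be truncated to that narrow window.
center, width = nx // 2, 10
tail_sharp = g_sharp[:center - width].sum() + g_sharp[center + width + 1:].sum()
tail_smooth = g_smooth[:center - width].sum() + g_smooth[center + width + 1:].sum()
```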

Code performance

We measure and analyze code performance primarily on the Cray-T3D machine at Lawrence Livermore National Laboratory. We also cite results from the Cray-T3E at the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory.

To properly assess the impact of filtering on code performance, one must consider all of its phases along with associated tasks. We usually think of the filtering as comprising the convolution plus overhead, with the overhead …

Conclusions

We have presented various methods for improving the performance of the high-latitude filtering in finite-difference-based atmospheric general circulation models. This is a particularly important issue at high resolution because of the adverse scaling of the filtering with grid size. We find that at moderately high resolution (1°×1.25°×9 levels), implementation of fast Fourier transforms and redistribution of the workload can speed up the filtering roughly ninefold, so that it comprises as little as 9% …

Acknowledgements

This work was performed under the auspices of the USDOE by Lawrence Livermore National Laboratory under Contract No. W-7405-ENG-48. Partial support was provided by the Computer Hardware, Advanced Mathematics, and Model Physics (CHAMMP) Program of the Department of Energy.


¹ This is LLNL Report Number UCRL-JC-126640.
