Automatic computation of hierarchical biquadratic smoothing splines with minimum GCV

https://doi.org/10.1016/j.cageo.2005.10.010Get rights and content

Abstract

A computationally efficient numerical strategy for fitting approximate minimum GCV bivariate thin plate smoothing splines to large noisy data sets was developed. The procedure discretises the bivariate thin plate smoothing spline equations using biquadratic B-splines and uses a nested grid SOR iterative strategy to solve the discretised system. For efficient optimisation, the process incorporates a double iteration that simultaneously updates both the discretised solution and the estimate of the minimum GCV smoothing parameter. The GCV was estimated using a minimum variance stochastic estimator of the trace of the influence matrix associated with the fitted spline surface. A Taylor series expansion was used to estimate the smoothing parameter that minimises the GCV estimate. The computational cost of the procedure is optimal in the sense that it is proportional to the number of grid points supporting the fitted biquadratic spline. Convergence was improved by adding a first order correction to the solution estimate after each smoothing parameter update. The algorithm was tested on several simulated data sets with varying spatial complexity and noise level. An accurate approximation to the analytic minimum GCV thin plate smoothing spline was obtained in all cases.

Introduction

Thin plate smoothing splines are commonly used to fit smooth surfaces to noisy data. In environmental modelling applications, they have been used to construct surfaces representing surface climate processes (Hutchinson, 1995, Zheng and Basher, 1995), topography (Hutchinson, 1989a), remotely sensed data (Berman, 1994), pollutant dispersion (Ionescu et al., 2000) and plankton distributions (Wood and Horwood, 1995). They are also popular in other fields such as image analysis (Bhaskaran and Konstantinides, 1996), medical research (Lapeer and Prager, 2000) and data mining (Hegland et al., 1998a, Hegland et al., 1998b). For practical spatial interpolation problems, surface smoothness is a central issue given that the data observations contain a significant noise component. The data model underlying splines is therefore a statistical decomposition of the observed data into a spatially coherent signal and spatially discontinuous noise. The smoothing spline model is given byzi=g(xi)+εi,xiRd,i=1,,n,where zi are a set of n observations, g is a suitably continuous function of d predictor variables, and εi are realisations of a random variable ε. The errors εi are generally assumed to be independent with mean zero and variance σ2. They are assumed to be due to measurement error, and short range microscale variation that occurs over a range below the resolution of the data set (Hutchinson, 1993).

Thin plate smoothing spline functions are designed to approximate the spatially coherent signal, g, by effectively smoothing the data. A thin plate smoothing spline is defined to be the minimiser of1ni=1n(zi-f(xi))2+λJmd(f)over functions fX, where X is a space of functions whose partial derivatives of total order m are in L2(Rd), λ is a positive smoothing parameter, and Jmd(f) is a measure of the roughness of the function f in terms of mth order partial derivatives. Calculation of Jmd(f) depends on m, the order of the partial derivatives, and the number of independent variables d. For m=2 and d=2, the case considered hereJ22(f)=-(fx1x12+2fx1x22+fx2x22)dx1dx2(Wahba, 1990).

Minimising expression (2) represents a trade-off between fitting the data as closely as possible whilst maintaining surface smoothness. The smoothing parameter λ controls the separation of signal and noise. If λ=0 the function f exactly interpolates the data, implying zero noise, whereas if λ is very large the function approaches a plane.

Minimising the generalised cross-validation (GCV) has been shown by Craven and Wahba (1979) to be an accurate method of estimating the λ corresponding to the spline function f that best represents the underlying process g. The GCV is a measure of predictive error, that is defined by removing each data point in turn and summing, with appropriate weighting, the square of the discrepancy of each omitted data point from a surface fitted to all other data points. The GCV is actually calculated implicitly by the formulaGCV=nRT2,where R is the residual sum of squares i=1n(zi-f(xi))2, and T=tr(I-A(λ)), where A(λ) is the influence matrix associated with the fitted spline (Craven and Wahba, 1979, Wahba, 1990). The quantity tr(A(λ)), termed the signal by Wahba (1990), is the effective number of parameters of the fitted thin plate spline. Case studies by Hutchinson, 1993, Hutchinson and Gessler, 1994 have shown the signal to be a useful statistic in its own right. A signal exceeding n/2 can indicate insufficient data or short range correlation in the data (Hutchinson, 1998). Optimising λ by minimising the GCV gives an estimate of the unknown error variance σ2, stated in Wahba (1990) asσ^2=RT.

Minimum GCV thin plate smoothing splines have been widely used in spatial interpolation applications e.g. (Hutchinson and Bischof, 1983, Hutchinson, 1998, Hutchinson, 1995, Zheng and Basher, 1995, Price et al., 2000, Jeffrey et al., 2001). However, analytic procedures for calculating thin plate smoothing splines are O(n3), making them computationally impractical for large data sets. A number of strategies for improving the computational efficiency of analytic thin plate smoothing spline calculation exist (Beatson and Newman, 1992, Beatson and Powell, 1994, Beatson et al., 1996). Finite element techniques have also been developed (Hutchinson, 1989a, Hegland et al., 1998a, Ramsay, 2002). These methods tend to focus on the numeric-analytic properties of thin plate smoothing splines rather than their statistical properties, and do not incorporate an automatic mechanism for optimising smoothness. The method presented in this study is designed to efficiently compute accurate finite element approximations to minimum GCV bivariate thin plate smoothing splines. Another option for obtaining a computationally efficient approximation to minimum GCV thin plate splines is to use the method developed by Bates and Wahba (1982), which approximates thin plate splines for large data sets using knots.

Hutchinson, 1989a, Hutchinson, 2000 developed a simple multigrid based strategy based on SOR iteration that calculates finite element approximations to bivariate thin plate smoothing splines for elevation data in O(N) operations, where N is the number of grid points. This method uses a simple Newton iteration to optimise the smoothing parameter to yield a specified residual sum of squares. This criterion is appropriate in the context of interpolating topography, where an estimate of the amount of noise is available (Hutchinson, 1989a). The methodology presented here is based on a variation on the Hutchinson (2000) method, in that it seeks to minimise the GCV rather than achieve a prescribed residual sum of squares. No more than two dimensions were considered, as this is the minimum number required for spatial interpolation.

The procedure in Hutchinson (2000) uses a double iteration to produce increasingly accurate estimates of the optimal smoothing parameter λ whilst iteratively converging to the solution. The converged solution is therefore that corresponding to the optimal smoothing parameter, making the process of optimising smoothness more efficient. The algorithm developed in this study used a similar iterative framework to obtain λ updates that converged to a minimum GCV solution. The process of minimising GCV is less robust than the process of obtaining a prescribed residual sum of squares, as it depends on a larger number of estimated inputs. The investigations conducted during this study therefore raised a number of issues regarding the convergence of the double iteration. These mainly related to poor conditioning of the discretised thin plate smoothing spline system on fine resolutions, which can cause slow convergence of the basic iteration, and lead to poor synchronisation of the double iteration. This paper presents a procedure that is efficient and converges to an accurate representation of the analytic solution at a grid resolution appropriate to the scale of the data generation process. The development of the procedure discussed here was guided by preliminary results presented in Hancock and Hutchinson (2003). This earlier study applied the above procedures to approximate univariate smoothing splines.

Section snippets

Methods

A robust, efficient procedure for extending the method in Hutchinson (2000) to calculate finite element minimum GCV bivariate thin plate smoothing splines was constructed. The GCV was calculated using the minimum variance stochastic estimator of the trace of the influence matrix developed by Hutchinson (1989b). Applications of this trace estimator to data smoothing problems have been described by Golub and Matt (1997). The nested grid multigrid procedure, using SOR iteration, was adopted.

Discretisation

To obtain the system of equations for the discretised bivariate thin plate smoothing spline on an N×M grid, the discretised approximation,f(x,y)=IJαIJBI,J,2(x,y)is substituted into expression (2). The coefficients αIJ are then chosen to minimise the discretised form of expression (2). The discretised bivariate roughness penalty is obtained by calculating integrated products of first and second derivatives of biquadratic B-spline basis elements with respect to x and y. Calculation is simplified

An iterative procedure for optimising λ to minimise GCV

The techniques used in this study for optimising the parameter λ in Eq. (17) were based on an adaptive iterative strategy described in Hutchinson (2000). The process involves double iteration to produce increasingly accurate estimates of both the solution and the smoothing parameter. The method uses nested grid to iteratively solve Eq. (17) whilst periodically updating the estimate of λ using the current solution estimate.

Practical considerations

The MINGCV algorithm described above was tested on a number of data sets, with varying complexity, spatial distribution and noise level. This led to the emplacement of a number of controls on the behaviour of the algorithm to maintain stability in the face of situations presented by different types of data sets. The resulting MINGCV algorithm gave an accurate representation of the analytic solution for all test data sets at a resolution appropriate to the scale of the data generation process.

Performance

The most difficult test for the MINGCV algorithm is a data set featuring fine scale structure, and large areas of data sparsity. These characteristics are also common to spatial environmental data sets, which are the intended application for this algorithm. For example, surface climate data sets for the Australian continent generally contain areas of clumped points, corresponding to more heavily populated areas in the spatial region, and also large areas where data measurements are very sparse.

Computational efficiency

To give a practical perspective on the computational efficiency of the MINGCV method, we compare the number of operations required by this method to that required for a non-automated procedure for approximating finite element minimum GCV thin plate smoothing splines. The automated minimisation of the GCV has two main sources of computational workload; the solution of the equations for du/dθ and the extra iteration required to adjust to periodic changing of the smoothing parameter. The first

Conclusion

The MINGCV algorithm is a robust procedure that can generate accurate biquadratic spline approximations to thin plate smoothing splines fitted to large noisy data sets with uneven spatial coverage. The procedure is computationally efficient, with both space and time requirements proportional to the number of grid points used to support the biquadratic spline. The resolution of the final grid can be determined objectively to match the scale of the estimated data generation process. The procedure

Acknowledgements

We would like to thank two anonymous reviewers for their helpful comments on this manuscript.

References (34)

  • M. Berman

    Automated smoothing of image and other regularly spaced data

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (1994)
  • Bhaskaran, V., Konstantinides, K., 1996. Image and Video Compression Standards: Algorithms and Architectures. Kluwer...
  • de Boor, C., 1978. A Practical Guide to Splines. Springer, Berlin, 372...
  • P. Craven et al.

    Smoothing noisy data with spline functions

    Numerische Mathematik

    (1979)
  • H. Curry et al.

    On Pólya frequency functions IVthe fundamental spline functions and their limits

    Journal D’Analyse Mathematique

    (1966)
  • D. Forsey et al.

    Hierarchical B-spline refinement

    Computer Graphics

    (1988)
  • G. Golub et al.

    Generalized cross-validation for large scale problemsrevised version

    Journal of Computational and Graphical Statistics

    (1997)
  • Cited by (0)

    View full text