Computational aspects of nonparametric smoothing with illustrations from the sm library

https://doi.org/10.1016/S0167-9473(02)00118-4

Abstract

Smoothing techniques such as density estimation and nonparametric regression are widely used in applied work and the basic estimation procedures can be implemented relatively easily in standard statistical computing environments. However, computationally efficient procedures quickly become necessary with large datasets, many evaluation points or more than one covariate. Further computational issues arise in the use of smoothing techniques for inferential, rather than simply descriptive, purposes. These issues are addressed in two ways: (i) by deriving efficient matrix formulations of nonparametric smoothing methods and (ii) by describing further simple modifications to these for the use of ‘binned’ data when sample sizes are large. The implications for other graphical and inferential aspects of the estimators are also discussed. These issues are dealt with in an algorithmic manner, to allow implementation in any programming environment, but particularly those which are geared towards vector and matrix representations of data. Specific examples of S-Plus code from the sm library of Bowman and Azzalini (Applied Smoothing Techniques for Data Analysis: the Kernel Approach With S-Plus Illustrations, Oxford University Press, Oxford, 1997) are given in an appendix as illustrations.

Introduction

Smoothing techniques such as density estimation and nonparametric regression have become established tools in applied statistics. There is now a wide variety of texts which describe these methods and a huge literature of research papers. Recent texts include Green and Silverman (1994), Wand and Jones (1995), Fan and Gijbels (1996), Simonoff (1996), Bowman and Azzalini (1997) and Schimek (2000a). A broader framework for the case of regression, known as generalized additive models, is also described by Hastie and Tibshirani (1990). Some statistical computing environments such as S-Plus, Genstat and XLispStat include some general procedures for smoothing techniques and Simonoff (1996) and Schimek (2000b) give extensive references to more specialist software. However, the simplest forms of nonparametric smoothing techniques can be implemented relatively easily through elementary programming techniques.

Modern statistical computing environments are generally geared towards vector and matrix representations of data. It is therefore a principal aim of this paper to provide simple matrix formulations of smoothing techniques which allow efficient implementation in this type of environment. A second aim of the paper is to address the computational issues which arise when nonparametric methods are applied to large datasets. This issue is also dealt with in the context of vector and matrix representations of the estimators.

Efficient matrix formulations of smoothing techniques are discussed in Section 2 of the paper, for both density estimation and regression. The issue of binning and its implementation within a matrix framework is described in Section 3. Computational issues also arise when techniques other than straightforward construction of the estimators for descriptive purposes are required. Bowman and Azzalini (1997, Chapters 4 and 5) describe a variety of testing procedures based on quadratic form calculations and the implementation of these for binned data is described in Section 4. Some graphical issues and other topics are also discussed in that section. Further discussion is given in Section 5.

The computational methods discussed in the paper have been implemented in the sm library written to accompany the book by Bowman and Azzalini (1997). This library contains an extensive collection of functions for different data structures and statistical problems and version 2 now includes the binning techniques described in the paper. Examples of S-Plus code from the core instructions of the two principal functions, sm.density and sm.regression, are given in an appendix to illustrate the ideas described in the paper. The principal emphasis is on the algebraic expressions provided in the main body of the paper, to allow readers to implement the ideas in other programming environments. In addition, they also provide specific guidance to those who wish to amend the existing sm library functions or to implement the techniques themselves in S-Plus.

Density estimation

The formula for computing a density estimate at the evaluation point $z$ from a sample $y = (y_1, \ldots, y_n)^T$ of univariate data is

$$\hat{f}(z) = \frac{1}{n} \sum_{j=1}^{n} w(z - y_j; h),$$

where the kernel function $w(x; h)$ is itself a density function which is symmetric with mean 0. The degree of smoothing applied to the data is controlled by the scale parameter $h$, which is conveniently parameterised as the standard deviation of the kernel function. The detailed shape of the kernel function is relatively unimportant, and for convenience a
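As a minimal sketch of the density estimate defined above, the following Python code builds the matrix of kernel weights with one row per evaluation point and one column per observation, then averages each row. It is not the sm library's S-Plus code. The normal kernel, the function names `normal_kernel` and `density_estimate`, and the small data vector are assumptions made purely for illustration.

```python
import math

def normal_kernel(x, h):
    """Normal density with mean 0 and standard deviation h (assumed kernel)."""
    return math.exp(-0.5 * (x / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def density_estimate(z_points, y, h):
    """Evaluate f_hat(z) = (1/n) * sum_j w(z - y_j; h) at each z in z_points."""
    n = len(y)
    # W[i][j] = w(z_i - y_j; h): rows index evaluation points, columns index data.
    W = [[normal_kernel(z - yj, h) for yj in y] for z in z_points]
    # The estimate at z_i is the i-th row sum divided by n.
    return [sum(row) / n for row in W]

# Hypothetical data and evaluation grid, for illustration only.
y = [1.2, 1.9, 2.1, 2.4, 3.7]
z_points = [i * 0.05 for i in range(-60, 161)]  # grid from -3 to 8
f_hat = density_estimate(z_points, y, h=0.5)
```

The weight matrix has as many columns as there are observations, so its size grows with the sample size n as well as with the number of evaluation points.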

Binning for large datasets

The matrix formulae described in Section 2 provide compact computational expressions for nonparametric estimators. However, the size of many of the matrices is dependent on the sample size n and this becomes inefficient as n increases. A number of highly efficient smoothing algorithms are available, as discussed by Fan and Marron (1994), for example. In particular, the concept of binning, where the raw data are reduced to frequencies over a fine grid, is a practical solution. Härdle and Scott
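The binning idea described above can be sketched as follows: the raw data are reduced to frequencies over a fine grid, and the density estimate becomes a frequency-weighted kernel sum over the bin centres. This is an illustrative sketch, not the sm library implementation. The normal kernel, the equal-width bins, the use of bin midpoints as locations, and the names `bin_data` and `weighted_density_estimate` are all assumptions.

```python
import math
import random
from collections import Counter

def normal_kernel(x, h):
    """Normal density with mean 0 and standard deviation h (assumed kernel)."""
    return math.exp(-0.5 * (x / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def bin_data(y, lower, width):
    """Reduce raw data to (bin centre, frequency) pairs over an equal-width grid."""
    counts = Counter(math.floor((yj - lower) / width) for yj in y)
    return [(lower + (k + 0.5) * width, c) for k, c in sorted(counts.items())]

def weighted_density_estimate(z_points, bins, h):
    """f_hat(z) = (1/n) * sum_b n_b * w(z - x_b; h) over bins (x_b, n_b)."""
    n = sum(c for _, c in bins)
    return [sum(c * normal_kernel(z - x, h) for x, c in bins) / n
            for z in z_points]

# Hypothetical simulated sample, for illustration only.
rng = random.Random(1)
y = [rng.gauss(0.0, 1.0) for _ in range(2000)]
bins = bin_data(y, lower=min(y), width=0.05)

z_points = [i * 0.1 for i in range(-30, 31)]
f_binned = weighted_density_estimate(z_points, bins, h=0.5)
# The unbinned estimate is the special case of one observation per "bin".
f_exact = weighted_density_estimate(z_points, [(yj, 1) for yj in y], h=0.5)
```

In this sketch the work per evaluation point depends on the number of occupied bins rather than on n, and the unbinned estimator is recovered by giving every observation its own bin with frequency 1.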

Inferential issues in nonparametric regression

The standard error of a nonparametric regression curve at its evaluation points provides a very useful indication of the variability of the estimator. Since the vector of estimated values can be expressed as $Sy$, the estimated variance is simply

$$\mathrm{var}(\hat{m}) = \mathrm{diag}(S P^{-1} S^{T})\,\hat{\sigma}^2,$$

where $P$ is the diagonal matrix with the bin frequencies as its diagonal vector. This refers to the case of binned data. The formula which applies in the absence of binning is obtained by setting $b = n$ and $P = I$.
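The variance formula above can be illustrated with a short Python sketch that computes pointwise standard errors from a smoothing matrix S and the bin frequencies. The snippet does not specify the form of S, so a local-constant smoother with a normal kernel is assumed here purely for brevity. The function names, the bin centres, the frequencies, and the value used for the error variance are likewise hypothetical.

```python
import math

def normal_kernel(x, h):
    """Normal density with mean 0 and standard deviation h (assumed kernel)."""
    return math.exp(-0.5 * (x / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def smoothing_matrix(z_points, x, freq, h):
    """Local-constant smoother (an assumption): row i holds the weights that
    turn the bin means into the fitted value at z_i; each row sums to 1."""
    S = []
    for z in z_points:
        w = [f * normal_kernel(z - xb, h) for xb, f in zip(x, freq)]
        total = sum(w)
        S.append([wb / total for wb in w])
    return S

def standard_errors(S, freq, sigma2):
    """Square root of diag(S P^{-1} S^T) * sigma2, where P = diag(freq)."""
    return [math.sqrt(sigma2 * sum(s * s / f for s, f in zip(row, freq)))
            for row in S]

# Hypothetical binned data: bin centres and bin frequencies.
x = [0.1 * k for k in range(11)]
freq = [3, 5, 8, 12, 15, 14, 15, 11, 9, 5, 3]
z_points = [0.0, 0.25, 0.5, 0.75, 1.0]

S = smoothing_matrix(z_points, x, freq, h=0.15)
se = standard_errors(S, freq, sigma2=0.04)  # sigma2 is an assumed value

# Unbinned case: every frequency is 1, so P = I and the variance is diag(S S^T) * sigma2.
ones = [1] * len(x)
se_unbinned = standard_errors(smoothing_matrix(z_points, x, ones, h=0.15), ones, sigma2=0.04)
```

Because P is diagonal, only the row-wise sums of squared weights divided by the frequencies are needed, so the full product of S with its transpose is never formed.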

Bowman and Young

Discussion

The methods described in this paper provide efficient vector–matrix implementations of standard smoothing procedures. There are, of course, many other types of smoothing procedure in common use. Local logistic regression for binary data (Fan et al., 1995) and quantile smoothing for survival data (Bowman and Wright, 2000) are particularly important cases. Vector–matrix implementations of the algorithms for these cases follow similar principles to those described above but the details have not

Software

The sm library is available from the following Web sites:

http://www.stats.gla.ac.uk/~adrian/sm and http://www.stat.unipd.it/~azzalini/Book_sm/

Version 2 of the library includes the binning procedures discussed in the paper.

References (27)

  • A. Azzalini et al. On the use of nonparametric regression for checking linear relationships. J. Roy. Statist. Soc. Ser. B (1993)
  • A.W. Bowman et al. Applied Smoothing Techniques for Data Analysis: the Kernel Approach With S-Plus Illustrations (1997)
  • A.W. Bowman et al. Graphical exploration of covariate effects on survival data through nonparametric quantile curves. Biometrics (2000)
  • A.W. Bowman et al. Graphical comparison of nonparametric curves. Appl. Statist. (1996)
  • W.S. Cleveland. Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. (1979)
  • J. Fan. Local linear regression smoothers and their minimax efficiencies. Ann. Statist. (1993)
  • J. Fan et al. Variable bandwidth and local linear regression smoothers. Ann. Statist. (1992)
  • J. Fan et al. Local Polynomial Modelling and Its Applications (1996)
  • J. Fan et al. Fast implementations of nonparametric curve estimators. J. Comput. Graph. Statist. (1994)
  • J. Fan et al. Local polynomial kernel regression for generalized linear models and quasi likelihood functions. J. Amer. Statist. Assoc. (1995)
  • T. Gasser et al. Residual variance and residual pattern in nonlinear regression. Biometrika (1986)
  • P.J. Green et al. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach (1994)
  • W. Härdle. Algorithm AS 222: Resistant smoothing using the fast Fourier transform. Appl. Statist. (1987)