Computational aspects of nonparametric smoothing with illustrations from the sm library
Introduction
Smoothing techniques such as density estimation and nonparametric regression have become established tools in applied statistics. There is now a wide variety of texts which describe these methods and a huge literature of research papers. Recent texts include Green and Silverman (1994), Wand and Jones (1995), Fan and Gijbels (1996), Simonoff (1996), Bowman and Azzalini (1997) and Schimek (2000a). A broader framework for the case of regression, known as generalized additive models, is described by Hastie and Tibshirani (1990). Some statistical computing environments, such as S-Plus, Genstat and XLispStat, include general procedures for smoothing, and Simonoff (1996) and Schimek (2000b) give extensive references to more specialist software. However, the simplest forms of nonparametric smoothing can be implemented relatively easily through elementary programming techniques.
Modern statistical computing environments are generally geared towards vector and matrix representations of data. It is therefore a principal aim of this paper to provide simple matrix formulations of smoothing techniques which allow efficient implementation in this type of environment. A second aim of the paper is to address the computational issues which arise when nonparametric methods are applied to large datasets. This issue is also dealt with in the context of vector and matrix representations of the estimators.
Efficient matrix formulations of smoothing techniques are discussed in Section 2 of the paper, for both density estimation and regression. The issue of binning and its implementation within a matrix framework is described in Section 3. Computational issues also arise when techniques other than straightforward construction of the estimators for descriptive purposes are required. Bowman and Azzalini (1997, Chapters 4 and 5) describe a variety of testing procedures based on quadratic form calculations and the implementation of these for binned data is described in Section 4. Some graphical issues and other topics are also discussed in that section. Further discussion is given in Section 5.
The computational methods discussed in the paper have been implemented in the sm library written to accompany the book by Bowman and Azzalini (1997). This library contains an extensive collection of functions for different data structures and statistical problems, and version 2 now includes the binning techniques described in the paper. Examples of S-Plus code from the core instructions of the two principal functions, sm.density and sm.regression, are given in an appendix to illustrate the ideas described in the paper. The principal emphasis is on the algebraic expressions provided in the main body of the paper, which allow readers to implement the ideas in other programming environments. The code examples additionally provide specific guidance to those who wish to amend the existing sm library functions or to implement the techniques themselves in S-Plus.
Section snippets
Density estimation
The formula for computing a density estimate at the evaluation point $z$ from a sample $y = (y_1, \ldots, y_n)^{\mathsf T}$ of univariate data is
$$\hat{f}(z) = \frac{1}{n} \sum_{i=1}^{n} w(z - y_i; h),$$
where the kernel function $w(x; h)$ is itself a density function which is symmetric with mean 0. The degree of smoothing applied to the data is controlled by the scale parameter $h$, which is conveniently parameterised as the standard deviation of the kernel function. The detailed shape of the kernel function is relatively unimportant, and for convenience a
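The estimator described above can be sketched in matrix terms as follows. This is a minimal illustration in Python, not the S-Plus code of the sm library; the normal kernel, the function names and the sample values are assumptions made for the example. The matrix W holds one row per evaluation point and one column per observation, so the estimate is simply the vector of row means.

```python
import math


def normal_kernel(x, h):
    """Normal density with mean 0 and standard deviation h (assumed kernel)."""
    return math.exp(-0.5 * (x / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def density_estimate(y, eval_points, h):
    """Kernel density estimate at each evaluation point.

    W[j][i] = w(z_j - y_i; h) is the matrix of kernel weights; the
    estimate at z_j is the mean of row j.
    """
    n = len(y)
    W = [[normal_kernel(z - yi, h) for yi in y] for z in eval_points]
    return [sum(row) / n for row in W]


# Illustrative data only.
y = [1.2, 1.9, 2.4, 3.1, 4.0]
grid = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
f_hat = density_estimate(y, grid, h=0.5)
```

In a vector and matrix environment the double loop would normally be replaced by a single outer-difference operation; nested lists are used here only to keep the sketch free of external dependencies.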
Binning for large datasets
The matrix formulae described in Section 2 provide compact computational expressions for nonparametric estimators. However, the size of many of the matrices is dependent on the sample size n and this becomes inefficient as n increases. A number of highly efficient smoothing algorithms are available, as discussed by Fan and Marron (1994), for example. In particular, the concept of binning, where the raw data are reduced to frequencies over a fine grid, is a practical solution. Härdle and Scott
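The idea of reducing the raw data to frequencies over a fine grid can be illustrated with a short sketch. It assumes simple equal-width bins represented by their centres and a normal kernel; the function names and the data are hypothetical, and the paper's own binning scheme may differ in detail. The kernel matrix then has one column per bin, not one per observation, and each column is weighted by its bin frequency.

```python
import math


def normal_kernel(x, h):
    """Normal density with mean 0 and standard deviation h (assumed kernel)."""
    return math.exp(-0.5 * (x / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def bin_data(y, lo, hi, n_bins):
    """Reduce data in [lo, hi] to bin centres and bin frequencies."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for yi in y:
        # Clamp the index so that values on the boundary fall in an end bin.
        j = max(0, min(int((yi - lo) / width), n_bins - 1))
        counts[j] += 1
    centres = [lo + (j + 0.5) * width for j in range(n_bins)]
    return centres, counts


def binned_density(centres, counts, eval_points, h):
    """Density estimate computed from bin centres weighted by frequency."""
    n = sum(counts)
    return [
        sum(c * normal_kernel(z - x, h) for x, c in zip(centres, counts) if c > 0) / n
        for z in eval_points
    ]


# Illustrative data only: 200 values between 0 and 9.9.
y = [((i * 37) % 100) / 10.0 for i in range(200)]
centres, counts = bin_data(y, 0.0, 10.0, 200)
grid = [0.5 * k for k in range(21)]
f_binned = binned_density(centres, counts, grid, h=0.5)
```

The cost of each evaluation now depends on the number of non-empty bins and no longer on n, which is the source of the saving for large datasets.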
Inferential issues in nonparametric regression
The standard error of a nonparametric regression curve at its evaluation points provides a very useful indication of the variability of the estimator. Since the vector of estimated values $\hat{m}$ can be expressed as $Sy$, the estimated variance is simply
$$\widehat{\mathrm{var}}\{\hat{m}\} = S P^{-1} S^{\mathsf T} \hat{\sigma}^2,$$
where $P$ is the $b \times b$ diagonal matrix with the bin frequencies as its diagonal vector and $\hat{\sigma}^2$ is an estimate of the error variance. This refers to the case of binned data. The formula which applies in the absence of binning is obtained by setting $b = n$ and $P = I$.
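A hedged sketch of this calculation follows. It assumes a local constant (kernel-weighted mean) smoother for S, a first-difference estimator of the error variance, and a variance of the form S P^{-1} S^T sigma^2, of which only the diagonal is needed for standard errors. These choices, along with all function names and the data, are assumptions for illustration and are not taken from the sm library. The example uses unbinned data, so P = I.

```python
import math


def smoothing_matrix(x, eval_points, h):
    """Rows of local constant smoothing weights; each row sums to 1."""
    S = []
    for z in eval_points:
        w = [math.exp(-0.5 * ((z - xi) / h) ** 2) for xi in x]
        total = sum(w)
        S.append([wi / total for wi in w])
    return S


def first_difference_sigma2(y):
    """Error variance estimate from successive differences (x sorted)."""
    n = len(y)
    return sum((y[i + 1] - y[i]) ** 2 for i in range(n - 1)) / (2.0 * (n - 1))


def standard_errors(S, sigma2, freq=None):
    """Square roots of the diagonal of S P^{-1} S^T sigma2.

    freq holds the bin frequencies (the diagonal of P); None means P = I.
    For binned data, S must be the smoothing matrix acting on bin means.
    """
    se = []
    for row in S:
        p = freq if freq is not None else [1] * len(row)
        se.append(math.sqrt(sigma2 * sum(s * s / pi for s, pi in zip(row, p))))
    return se


# Illustrative data only: a sine curve with a small alternating disturbance.
x = [i / 10.0 for i in range(21)]
y = [math.sin(xi) + 0.1 * (-1) ** i for i, xi in enumerate(x)]
grid = [0.0, 0.5, 1.0, 1.5, 2.0]

S = smoothing_matrix(x, grid, h=0.3)
m_hat = [sum(s * yi for s, yi in zip(row, y)) for row in S]  # m_hat = S y
sigma2 = first_difference_sigma2(y)
se = standard_errors(S, sigma2)
```

Only the diagonal of the variance matrix is formed, which avoids constructing the full matrix product when the number of evaluation points is large.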
Bowman and Young
Discussion
The methods described in this paper provide efficient vector–matrix implementations of standard smoothing procedures. There are, of course, many other types of smoothing procedure in common use. Local logistic regression for binary data (Fan et al., 1995) and quantile smoothing for survival data (Bowman and Wright, 2000) are particularly important cases. Vector–matrix implementations of the algorithms for these cases follow similar principles to those described above but the details have not
Software
The sm library is available from the following Web sites: http://www.stats.gla.ac.uk/~adrian/sm and http://www.stat.unipd.it/~azzalini/Book_sm/
Version 2 of the library includes the binning procedures discussed in the paper.
References (27)
- et al. On the use of nonparametric regression for checking linear relationships. J. Roy. Statist. Soc. Ser. B (1993)
- et al. Applied Smoothing Techniques for Data Analysis: the Kernel Approach With S-Plus Illustrations (1997)
- et al. Graphical exploration of covariate effects on survival data through nonparametric quantile curves. Biometrics (2000)
- et al. Graphical comparison of nonparametric curves. Appl. Statist. (1996)
- Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. (1979)
- Local linear regression smoothers and their minimax efficiencies. Ann. Statist. (1993)
- et al. Variable bandwidth and local linear regression smoothers. Ann. Statist. (1992)
- et al. Local Polynomial Modelling and Its Applications (1996)
- et al. Fast implementations of nonparametric curve estimators. J. Comput. Graph. Statist. (1994)
- et al. Local polynomial kernel regression for generalized linear models and quasi likelihood functions. J. Amer. Statist. Assoc. (1995)
- Residual variance and residual pattern in nonlinear regression. Biometrika
- Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach
- Algorithm AS 222: resistant smoothing using the fast Fourier transform. Appl. Statist.