Robust bandwidth selection in semiparametric partly linear regression models: Monte Carlo study and influential analysis

https://doi.org/10.1016/j.csda.2007.10.017

Abstract

In this paper, under a semiparametric partly linear regression model with fixed design, we introduce a family of robust procedures to select the bandwidth parameter. The robust plug-in proposal is based on nonparametric robust estimates of the νth derivatives and, under mild conditions, converges to the optimal bandwidth. A robust cross-validation bandwidth is also considered, and the performance of the different proposals is compared through a Monte Carlo study. We define an empirical influence measure for data-driven bandwidth selectors and use it to study their sensitivity. It appears that the robust selector compares favorably to its classical competitor, despite the need to select a pilot bandwidth when considering plug-in bandwidths. Moreover, the plug-in procedure seems to be less sensitive than cross-validation, in particular when several outliers are introduced. When combined with the three-step procedure proposed by Bianco and Boente [2004. Robust estimators in semiparametric partly linear regression models. J. Statist. Plann. Inference 122, 229–252], the robust selectors lead to robust data-driven estimates of both the regression function and the regression parameter.

Introduction

Partly linear models have become an important tool when modelling biometric data, since they combine the flexibility of nonparametric models and the simple interpretation of the linear ones. These models assume that we have a response $y_i \in \mathbb{R}$ and covariates or design points $(x_i^T, t_i)^T \in \mathbb{R}^{p+1}$ satisfying
$$y_i = x_i^T \beta + g(t_i) + \varepsilon_i, \qquad 1 \le i \le n, \tag{1}$$
with the errors $\varepsilon_i$ independent and independent of $(x_i^T, t_i)^T$. The semiparametric nature of model (1) offers more flexibility than the standard linear model when modelling a complicated relationship between the response variable and one of the covariates. At the same time, these models keep a simple functional form in the other covariates, avoiding the “curse of dimensionality” of nonparametric regression.

In many situations, it seems reasonable to suppose that a relationship between the covariates $x$ and $t$ exists, so as in Speckman (1988), Linton (1995) and Aneiros-Pérez and Quintela del Río (2002), we will assume that for $1 \le j \le p$
$$x_{ij} = \varphi_j(t_i) + \eta_{ij}, \qquad 1 \le i \le n,$$
where the errors $\eta_{ij}$ are independent. Moreover, the design points $t_i$ will be assumed to be fixed.
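To make the two model equations concrete, the following minimal sketch simulates data from a partly linear model with a fixed, equally spaced design. The function name `simulate_partly_linear`, the choices of `g` and `phi`, the Gaussian errors and all numerical values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_partly_linear(n, beta, g, phi, sigma_eps=1.0, sigma_eta=1.0, seed=0):
    """Draw (t, x, y) with y_i = x_i' beta + g(t_i) + eps_i and x_ij = phi_j(t_i) + eta_ij.

    Hypothetical helper: Gaussian errors and an equally spaced design are assumptions.
    """
    rng = np.random.default_rng(seed)
    beta = np.atleast_1d(np.asarray(beta, dtype=float))
    p = beta.size
    # Fixed design points in [0, 1] (equally spaced midpoints).
    t = (np.arange(1, n + 1) - 0.5) / n
    # Covariates: smooth function of t plus independent errors eta_ij.
    eta = sigma_eta * rng.standard_normal((n, p))
    x = np.column_stack([phi_j(t) for phi_j in phi]) + eta
    # Response: linear part + smooth part + independent errors eps_i.
    eps = sigma_eps * rng.standard_normal(n)
    y = x @ beta + g(t) + eps
    return t, x, y

# Illustrative choices (assumptions): p = 1, g(t) = sin(2 pi t), phi_1(t) = t^2.
t, x, y = simulate_partly_linear(
    n=100,
    beta=[2.0],
    g=lambda s: np.sin(2 * np.pi * s),
    phi=[lambda s: s**2],
    sigma_eps=0.5,
    sigma_eta=0.3,
)
```

Keeping the design points deterministic while drawing only the errors at random mirrors the fixed-design assumption stated above.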

Several authors have considered the semiparametric model (1). See, for instance, Denby (1986), Rice (1986), Robinson (1988), Speckman (1988) and Härdle et al. (2000) among others.

All these estimators, like most nonparametric estimators, depend on a smoothing parameter that must be chosen by the practitioner. As is well known, large bandwidths produce estimators with small variance but high bias, while small values produce more wiggly curves. This trade-off between bias and variance has led to several proposals to select the smoothing parameter, such as cross-validation procedures and plug-in methods. Linton (1995), using local polynomial regression estimators, obtained an asymptotic expression for the optimal bandwidth in the sense that it minimizes a second order approximation of the mean square error of the least squares estimate, $\hat\beta_{LS}(h)$, of $\beta$. This expression depends on the regression function we are estimating and on parameters which are unknown, such as the standard deviation of the errors. More precisely, for any $c \in \mathbb{R}^p$, let $\sigma^2 = \sigma_\varepsilon^2\, c^T \Sigma_\eta^{-1} c$ be the asymptotic variance of $U = c^T n^{1/2}(\hat\beta_{LS}(h) - \beta)$, and $n\,\mathrm{MSE}(h) = E U^2/\sigma^2$ its standardized mean square error. For the sake of simplicity, assume that the smoothing procedure corresponds to local means and that the design points are almost uniform design points, i.e., $\{t_i\}_{i=1}^n$ are fixed design points in $[0,1]$, $0 \le t_1 \le \dots \le t_n \le 1$, such that $t_0 = 0$, $t_{n+1} = 1$ and $\max_{1 \le i \le n+1} |(t_i - t_{i-1}) - 1/n| = O(n^{-\delta})$ for some $\delta > 1$. Then, under general conditions, we have that, for $\nu \ge 2$,
$$\mathrm{MSE}(h) = n^{-1}\left\{1 + (nh)^{-1} A_2 + o(n^{-2\mu}) + \left(n^{1/2} h^{2\nu} A_1 + o(n^{-\mu})\right)^2\right\},$$
where $\mu = (4\nu - 1)/(2(4\nu + 1))$, $\varphi^{(\nu)}(t) = (\varphi_1^{(\nu)}(t), \dots, \varphi_p^{(\nu)}(t))^T$, $\alpha_\nu(K) = \int u^\nu K(u)\,du$, $K^*(u) = K * K(u) - 2K(u)$ and
$$A_1 = \alpha_\nu^2(K)\,(\nu!)^{-2}\,\sigma^{-1}\, c^T \Sigma_\eta^{-1} \int_0^1 g^{(\nu)}(t)\,\varphi^{(\nu)}(t)\,dt, \qquad A_2 = \int K^{*2}(u)\,du.$$
Therefore, the optimal bandwidth, in the sense of minimizing the asymptotic $\mathrm{MSE}(h)$, is given by $h_{\mathrm{opt}} = A_0\, n^{-\pi}$, with $\pi = 2/(4\nu + 1)$ and $A_0 = (A_2/(4\nu A_1^2))^{\pi/2}$, i.e.,
$$A_0 = \left\{\frac{\int K^{*2}(u)\,du}{4\nu \left[\sigma^{-1}\, c^T \Sigma_\eta^{-1}\, \alpha_\nu^2(K)\,(\nu!)^{-2} \int_0^1 g^{(\nu)}(t)\,\varphi^{(\nu)}(t)\,dt\right]^2}\right\}^{\pi/2}.$$
Linton (1995) considered a plug-in approach to estimate the optimal bandwidth and showed that it converges to the optimal one, while Aneiros-Pérez and Quintela del Río (2002) studied the case of dependent errors.
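The form of the optimal bandwidth can be checked directly from the expansion of MSE(h). The following worked step is a sketch of the calculus only: it drops the o(·) terms, writes M(h) (a symbol introduced here for illustration) for the h-dependent leading part of n·MSE(h), and minimizes it over h.

```latex
\begin{aligned}
M(h) &= (nh)^{-1} A_2 + n\, h^{4\nu} A_1^2, \\
M'(h) &= -\frac{A_2}{n h^{2}} + 4\nu\, n\, h^{4\nu - 1} A_1^2 = 0
  \;\Longrightarrow\;
  h^{4\nu + 1} = \frac{A_2}{4\nu A_1^2}\, n^{-2}, \\
h_{\mathrm{opt}} &= \left(\frac{A_2}{4\nu A_1^2}\right)^{1/(4\nu + 1)} n^{-2/(4\nu + 1)}
  = A_0\, n^{-\pi},
  \qquad \pi = \frac{2}{4\nu + 1},
  \quad A_0 = \left(\frac{A_2}{4\nu A_1^2}\right)^{\pi/2}.
\end{aligned}
```

For instance, with ν = 2 this gives π = 2/9 and μ = 7/18, so that both (nh)^{-1} and n h^{4ν} are of order n^{-7/9} = n^{-2μ} at the optimal bandwidth, which agrees with the order of the remainder terms in the expansion.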

It is well known that, both in linear regression and in nonparametric regression, least squares estimators can be seriously affected by anomalous data. The same statement holds for partly linear models, where large values of the response variable $y_i$ can cause a peak in the estimates of the smooth function $g$ in the neighborhood of $t_i$. Moreover, large values of the response variable $y_i$ combined with high leverage points $x_i$ also cause, as in linear regression, breakdown of the classical estimates of the regression parameter $\beta$. To overcome that problem, Bianco and Boente (2004) considered a three-step robust estimate for the regression parameter and the regression function. Besides, for the nonparametric regression setting, i.e., when $\beta = 0$, the sensitivity of the classical bandwidth selectors to anomalous data was discussed by several authors, such as Leung et al. (1993), Wang and Scott (1994), Boente et al. (1997), Cantoni and Ronchetti (2001) and Leung (2005).
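The peak caused by a single large response can be illustrated with a small numerical sketch in the purely nonparametric setting (β = 0). It compares a local-mean (Nadaraya-Watson) smoother with a local median, which is used here only as a simple stand-in for a robust smoother and is not the three-step estimator of Bianco and Boente (2004). The kernel, bandwidth, regression function and outlier size are all illustrative assumptions.

```python
import numpy as np

def local_mean(t, y, h):
    """Nadaraya-Watson smoother with an Epanechnikov kernel, evaluated at the design points."""
    u = (t[:, None] - t[None, :]) / h
    w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    return (w @ y) / w.sum(axis=1)

def local_median(t, y, h):
    """Median of the responses whose design points lie within distance h of each t_i."""
    return np.array([np.median(y[np.abs(t - t0) <= h]) for t0 in t])

n = 100
t = (np.arange(1, n + 1) - 0.5) / n              # fixed, equally spaced design
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(n)

# Contaminate a single response with a large value (assumed outlier size).
y_out = y.copy()
y_out[50] += 20.0

h = 0.1
# Largest change in each fitted curve caused by the single outlier.
shift_mean = np.max(np.abs(local_mean(t, y_out, h) - local_mean(t, y, h)))
shift_median = np.max(np.abs(local_median(t, y_out, h) - local_median(t, y, h)))
print(f"max shift, local mean:   {shift_mean:.3f}")
print(f"max shift, local median: {shift_median:.3f}")
```

The local mean moves by an amount proportional to the outlier size divided by the effective number of points in the window, while the local median can move by at most the gap between neighbouring order statistics in the window.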

In this paper, we consider a robust plug-in selector for the bandwidth under the partly linear model (1), which converges to the optimal one and leads to robust data-driven estimates of the regression function $g$ and the regression parameter $\beta$. We derive an expression analogous to (3) for the optimal bandwidth of the three-step estimator introduced in Bianco and Boente (2004). As for its linear relative, this expression will depend on the derivatives of the functions $g$ and $\varphi$. In Section 2, we review some of the proposals given to estimate robustly the derivatives of the regression function under a nonparametric regression model. The robust plug-in bandwidth selector for the partly linear model is introduced in Section 3 together with a robust cross-validation procedure. In Section 4, for small samples, the behavior of the classical and resistant selectors is compared through a Monte Carlo study under normality and contamination. Finally, in Section 5 an empirical influence measure for the bandwidth selector is introduced. We use this measure to study the sensitivity of the proposed plug-in and cross-validation selectors on some generated examples.


Robust estimation of the derivative of order ν

In this section, we review some of the approaches given to provide robust estimators of the $\nu$th derivative of the regression function under a fully nonparametric regression model. Let $z_i \in \mathbb{R}$ be independent observations such that
$$z_i = \phi(t_i) + u_i, \qquad 1 \le i \le n,$$
with the errors $u_i$ independent and identically distributed with symmetric common distribution $F(\cdot/\sigma_u)$ and $0 \le t_1 \le \dots \le t_n \le 1$ fixed design points.
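As an illustration of how a derivative of order ν might be estimated robustly in this setting, the sketch below fits a local polynomial by iteratively reweighted least squares with Huber weights and reads the derivative off the fitted coefficient. This is a generic local polynomial M-type fit written under stated assumptions (Epanechnikov kernel, Huber constant 1.345, MAD-type residual scale, polynomial degree ν + 1); it is not claimed to be any of the specific estimators reviewed in the paper, and the function names are hypothetical.

```python
import numpy as np
from math import factorial

def huber_weights(r, c=1.345):
    """Huber weights psi(r)/r: 1 for small residuals, c/|r| for large ones."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

def robust_local_poly_derivative(t, z, t0, h, nu=2, degree=None, n_iter=50):
    """Estimate the nu-th derivative at t0 by a Huber-weighted local polynomial fit."""
    degree = nu + 1 if degree is None else degree
    u = t - t0
    k = np.where(np.abs(u) <= h, 0.75 * (1 - (u / h) ** 2), 0.0)  # kernel weights
    keep = k > 0
    X = np.vander(u[keep], degree + 1, increasing=True)  # columns 1, u, u^2, ...
    zz, kk = z[keep], k[keep]
    w = kk.copy()                                         # start from the kernel-weighted LS fit
    for _ in range(n_iter):
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(X * sw[:, None], zz * sw, rcond=None)
        r = zz - X @ coef
        scale = max(np.median(np.abs(r)) / 0.6745, 1e-12)  # MAD-type residual scale
        w = kk * huber_weights(r / scale)
    # The coefficient of u^nu estimates the nu-th derivative divided by nu!.
    return factorial(nu) * coef[nu]

# Illustrative data (assumptions): quadratic trend, small noise, five gross outliers.
n = 200
t = (np.arange(1, n + 1) - 0.5) / n
rng = np.random.default_rng(1)
z = 3.0 * t**2 + 0.01 * rng.standard_normal(n)
z[95:100] += 5.0
d2 = robust_local_poly_derivative(t, z, t0=0.5, h=0.3, nu=2)
print(f"estimated second derivative at 0.5: {d2:.3f} (true value 6)")
```

Fitting a polynomial one degree higher than the derivative order is a common convention in local polynomial smoothing; other degrees can be passed through the `degree` argument.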

Robust estimates for the first derivative of the regression function have been introduced by Härdle and Gasser (1985)

Resistant choice of the smoothing parameter

As is well known, an important issue in any smoothing procedure is the choice of the smoothing parameter. As mentioned in the Introduction, under a nonparametric regression model, two commonly used approaches are cross-validation and plug-in. However, these procedures may not be robust and their sensitivity to anomalous data was discussed by several authors, including Leung et al. (1993), Wang and Scott (1994), Boente et al. (1997), Cantoni and Ronchetti (2001) and Leung (2005). Wang and Scott (1994)
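One simple way to make cross-validation less sensitive to anomalous data is to combine a resistant smoother with a criterion that does not square the prediction errors. The sketch below does this in the purely nonparametric setting, using a leave-one-out local median and the mean absolute leave-one-out residual. Both choices are assumptions made for illustration, and the criterion is not claimed to be the robust cross-validation procedure defined in the paper; the function names and the bandwidth grid are hypothetical.

```python
import numpy as np

def loo_local_median(t, y, h):
    """Leave-one-out local median: predict y_i from neighbours within distance h, excluding i."""
    n = len(t)
    fit = np.empty(n)
    for i in range(n):
        mask = np.abs(t - t[i]) <= h
        mask[i] = False                       # leave observation i out
        fit[i] = np.median(y[mask])
    return fit

def l1_cross_validation(t, y, grid):
    """Return the grid bandwidth minimising the mean absolute leave-one-out residual."""
    scores = np.array([np.mean(np.abs(y - loo_local_median(t, y, h))) for h in grid])
    return float(grid[int(np.argmin(scores))]), scores

# Illustrative data (assumptions): sine trend, Gaussian noise, a few large outliers.
n = 100
t = (np.arange(1, n + 1) - 0.5) / n
rng = np.random.default_rng(2)
y = np.sin(2 * np.pi * t) + 0.2 * rng.standard_normal(n)
y[[20, 21, 60]] += 15.0

# The smallest grid value is chosen so that every window contains at least one neighbour.
grid = np.linspace(0.03, 0.3, 10)
h_cv, scores = l1_cross_validation(t, y, grid)
print(f"selected bandwidth: {h_cv:.3f}")
```

With an absolute-value criterion each outlier contributes linearly rather than quadratically to the score, so a handful of contaminated responses has less leverage on which bandwidth is selected.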

Monte Carlo study

This section contains the results of a simulation study, in dimension p=1, designed to evaluate the performance, under a partly linear model, of the robust bandwidth selectors defined in Section 3. For the plug-in bandwidth, we have used both the differentiation approach and the local polynomial approximation to estimate the derivatives of the regression functions. The aims of this study are

  • to compare the behavior of the bandwidth selectors and of the regression estimators under contamination

Empirical influence of bandwidth selectors

One of the aims of a robust procedure is to produce estimates less sensitive to outliers than the classical ones. The influence function is a measure of robustness with respect to single outliers. Statistical diagnostics and graphical displays for detecting outliers can be built based on empirical influence functions. In parametric models this topic is widely developed; however, less attention has been given to it in the nonparametric literature. A smoothed functional approach to nonparametric kernel
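The idea of measuring how much a data-driven bandwidth reacts to a single observation can be sketched as follows. The quantity computed here is the change in the selected bandwidth when one response is replaced by a value `y0`. This is one plausible form of an empirical influence measure, stated as an assumption: the measure actually defined in the paper is not reproduced in this excerpt. The selector used is classical least squares cross-validation with a local-mean smoother, and all names are hypothetical.

```python
import numpy as np

def loo_local_mean(t, y, h):
    """Leave-one-out Nadaraya-Watson fit with an Epanechnikov kernel."""
    u = (t[:, None] - t[None, :]) / h
    w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    np.fill_diagonal(w, 0.0)                  # drop each point from its own prediction
    return (w @ y) / w.sum(axis=1)

def cv_bandwidth(t, y, grid):
    """Classical least squares cross-validation over a grid of bandwidths."""
    scores = [np.mean((y - loo_local_mean(t, y, h)) ** 2) for h in grid]
    return float(grid[int(np.argmin(scores))])

def empirical_influence(selector, t, y, index, y0, grid):
    """Change in the selected bandwidth when y[index] is replaced by y0."""
    y_new = y.copy()
    y_new[index] = y0
    return selector(t, y_new, grid) - selector(t, y, grid)

# Illustrative data (assumptions): sine trend with Gaussian noise.
n = 100
t = (np.arange(1, n + 1) - 0.5) / n
rng = np.random.default_rng(3)
y = np.sin(2 * np.pi * t) + 0.2 * rng.standard_normal(n)
grid = np.linspace(0.03, 0.3, 10)

# Influence of moving the 51st response over a range of values.
for y0 in (y[50], y[50] + 5.0, y[50] + 20.0):
    infl = empirical_influence(cv_bandwidth, t, y, 50, y0, grid)
    print(f"y0 - y[50] = {y0 - y[50]:5.1f} -> change in bandwidth = {infl:+.3f}")
```

Because `empirical_influence` takes the selector as an argument, the same function can be applied to a robust selector to compare the two sensitivity curves.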

Concluding remarks

Selection of the smoothing parameter is an important step in any nonparametric analysis, even when robust estimates are used. The classical procedures based on least squares cross-validation or on a plug-in rule turn out to be non-robust, since they lead to over- or undersmoothing, as noted for nonparametric regression by Leung et al. (1993), Wang and Scott (1994), Boente et al. (1997), Cantoni and Ronchetti (2001) and Leung (2005). The same conclusions hold under a partly linear regression model.

Acknowledgments

The authors would like to thank the referee for valuable comments and suggestions, which led to an improved presentation of the paper. This research was partially supported by Grants X-094 from the Universidad de Buenos Aires, PID 5505 from CONICET and PAV 120 and PICT 21407 from ANPCYT, Argentina.

References (28)

  • Th. Gasser et al. A flexible and fast method for automatic smoothing. J. Amer. Statist. Assoc. (1991)
  • W. Härdle et al. On robust kernel estimation of derivatives of regression functions. Scand. J. Statist. (1985)
  • Härdle, W., Liang, H., Gao, J., 2000. Partially Linear Models....
  • J. Jiang et al. Robust local polynomial regression for dependent data. Statist. Sinica (2001)