Robust estimation in partially linear errors-in-variables models

https://doi.org/10.1016/j.csda.2016.09.002

Abstract

In many applications of regression analysis, some covariates are measured with error. A robust family of estimators of the parametric and nonparametric components of a structural partially linear errors-in-variables model is introduced. The proposed estimators are based on a three-step procedure in which robust orthogonal regression estimators are combined with robust smoothing techniques. Under regularity conditions, it is proved that the resulting estimators are consistent. The robustness of the proposal is studied by means of the empirical influence function when the linear parameter is estimated using the orthogonal M-estimator. A simulation study compares the behaviour of the robust estimators with that of their classical counterparts, and a real data example is analysed to illustrate the performance of the proposal.

Introduction

Two important branches of regression analysis arise from parametric and nonparametric models. Fully parametric models are readily interpretable, but they can be severely affected by misspecification. Nonparametric models, on the other hand, are very flexible in assessing the relationship among variables, but they suffer from the well-known curse of dimensionality. In recent decades, semiparametric models, which combine these two branches, have received a lot of attention. They take the best and avoid the worst of the parametric and nonparametric models. Among them, partially linear models have been extensively studied in recent years. Let $(y, x^T, t)$ be the observation on a subject or experimental unit, where $y$ is the response, which is related to the covariates $(x^T, t) \in \mathbb{R}^p \times \mathbb{R}$. The partially linear model assumes that $y = x^T \beta + g(t) + e$, where the error $e$ is independent of the covariates $(x^T, t)$. By means of a nonparametric component, partially linear models are flexible enough to cover many situations; indeed, they can be a suitable choice when one suspects that the response $y$ depends linearly on $x$ but is nonlinearly related to $t$. An extensive description of the different results obtained in partially linear regression models can be found in Härdle et al. (2000). In the robust literature, He et al. (2002) consider M-type estimates for repeated measurements using B-splines, and Bianco and Boente (2004) introduce a kernel-based stepwise procedure to define robust estimates under a partially linear model.

In practice, however, covariates are often measured with error. This is a common situation in economics, medicine and the social sciences. Errors-in-variables (EV) models have drawn a lot of attention and generated a wide literature, surveyed in Fuller (1987) and Carroll et al. (1995). The effect of measurement errors is well known: they can cause biased and inconsistent parameter estimators. Two approaches are adopted to overcome these difficulties, according to the nature of the problem: functional and structural modelling. In the functional model it is assumed that the covariates are deterministic, while in the structural model, which is treated in this paper, the covariates are considered as random variables. In our setting, we assume that we cannot observe $x$ directly, but instead we observe a surrogate variable $v$ which is related to $x$ through the equation $v = x + e_x$. In other words, the response and the vector of covariates $x$ are observed with errors, while the scalar variable $t$ is observable, that is, we assume the partially linear errors-in-variables (PLEV) model given by $y = \beta^T x + g(t) + e$, $v = x + e_x$, where the vector of measurement errors $\epsilon = (e, e_x^T)^T$ is independent of $(x^T, t)$.
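The bias caused by measurement error can be seen in a small simulation. The sketch below is illustrative only: the scalar covariate, the normal distributions, the choice g(t) = sin(2πt), the parameter values and the function name `simulate_plev` are all assumptions made here, not settings taken from the paper. To isolate the effect of the measurement error, the true g is subtracted before regressing on the surrogate v.

```python
import numpy as np

def simulate_plev(n, beta=2.0, sigma_err=0.5, seed=0):
    """Draw (y, v, t) from y = beta*x + g(t) + e, v = x + e_x (scalar x)."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 1.0, n)
    x = rng.normal(0.0, 1.0, n)            # unobservable covariate
    g = np.sin(2.0 * np.pi * t)            # illustrative choice of g
    e = rng.normal(0.0, sigma_err, n)      # model error
    e_x = rng.normal(0.0, sigma_err, n)    # measurement error, same scale as e
    y = beta * x + g + e
    v = x + e_x                            # observed surrogate
    return y, v, t, g

y, v, t, g = simulate_plev(20000)

# Oracle step, for illustration only: remove the true g, then regress on v.
r = y - g
naive_slope = np.sum((v - v.mean()) * (r - r.mean())) / np.sum((v - v.mean()) ** 2)

# Under these assumptions least squares converges to
# beta * var(x) / (var(x) + var(e_x)) rather than to beta.
attenuated_target = 2.0 * 1.0 / (1.0 + 0.5 ** 2)
print(naive_slope, attenuated_target)
```

Under these assumptions the least squares slope settles near 1.6 rather than the true value 2, which is the attenuation effect that the corrections discussed in the literature aim to remove.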

In order to correct for measurement error, some additional information or data is usually required. In the classical approach there are two variants at this point. In the first one, it is assumed that the covariance matrix of the measurement errors, $\Sigma_{e_x}$, is known, and the approach is a correction for attenuation. Following these ideas, Liang et al. (1999) adapt the estimators of Severini and Staniswalis (1994), which combine local smoothers and linear parametric techniques, by including an attenuation term based on $\Sigma_{e_x}$ that makes it possible to adjust the regression coefficients for the effects of measurement error. If $\Sigma_{e_x}$ is unknown, it can be estimated when replicates are available. In the second variant, it is assumed that the ratio between the variance of the model error $e$ and that of the measurement errors $e_x$ is known. This assumption allows for identification of the model. In this case, Liang et al. (1999) propose to estimate $\beta$ by the total least squares method.
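Both classical variants can be illustrated in the purely linear case with a scalar covariate. The sketch below is a minimal example under assumptions made here (normal errors with equal variances, no nonparametric component, the helper name `tls_slope`); it is not the semiparametric estimator of Liang et al. (1999).

```python
import numpy as np

def tls_slope(v, y):
    """Total least squares slope for scalar v, assuming equal error variances."""
    z = np.column_stack([v - v.mean(), y - y.mean()])
    cov = z.T @ z / len(v)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    a = eigvecs[:, 0]                        # normal vector of the fitted line
    return -a[0] / a[1]

rng = np.random.default_rng(1)
n, beta, sigma = 20000, 2.0, 0.5
x = rng.normal(size=n)
y = beta * x + rng.normal(scale=sigma, size=n)
v = x + rng.normal(scale=sigma, size=n)

cov_vy = np.mean((v - v.mean()) * (y - y.mean()))
var_v = np.var(v)

ols = cov_vy / var_v                      # attenuated towards zero
corrected = cov_vy / (var_v - sigma**2)   # first variant: known error variance
tls = tls_slope(v, y)                     # second variant: known variance ratio (here 1)
print(ols, corrected, tls)
```

Both corrections recover a slope close to 2 here, but each is built from sample second moments, which is why a small fraction of atypical observations can affect them severely.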

Although in practice the feasibility of any of these conditions depends on the problem, in the robust framework assumptions involving the existence of first or second moments of the errors are generally avoided and replaced by weaker conditions on the error distribution, such as symmetry. So, in this paper, we will extend the second variant by assuming that the vector of errors $\epsilon$ follows a spherically symmetric distribution, which is a standard assumption in errors-in-variables models. In this case, if $\epsilon$ has a density, it is of the form $\phi(\|u\|)$ for some non-negative function $\phi$. Spherical symmetry implies that $e$ and each component of $e_x$ have the same distribution. Cui and Kong (2006) justify this assumption by noting that in some situations the response $y$ and the covariate $x$ are measured in the same way or, even more, the response and the non-observable covariate are two methods that measure the same quantity. As a motivating example, we can consider the problem of predicting cholesterol serum level (CS) from a previous record of CS and age, which corresponds to the real dataset we analyse below. First, it is sensible to assume that both cholesterol serum variables (the response and the covariate) are affected by an error, which justifies fitting an EV model. Second, since both measures are of the same nature, it seems natural to assume that the errors of the response and the covariate follow the same distribution, which makes the sphericity assumption reasonable.
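The claim that spherical symmetry forces the model error and each measurement error component to share a distribution follows from the standard definition of spherical symmetry. In the short derivation below, the error vector is written as $\epsilon = (e, e_x^T)^T$, and the symbols $\Gamma$, $P_{1j}$ and $e_{x,j}$ (the $j$-th component of $e_x$) are notation introduced here for illustration.

```latex
\begin{aligned}
&\epsilon \text{ is spherically symmetric} \iff \Gamma\epsilon \overset{d}{=} \epsilon
  \quad \text{for every orthogonal } \Gamma \in \mathbb{R}^{(p+1)\times(p+1)}.\\
&\text{Take } \Gamma = P_{1j}, \text{ the permutation matrix that swaps coordinates } 1 \text{ and } j+1.\\
&\text{Then } (P_{1j}\epsilon)_1 = e_{x,j}, \text{ and since } P_{1j}\epsilon \overset{d}{=} \epsilon,
  \quad e_{x,j} \overset{d}{=} e, \qquad 1 \le j \le p.
\end{aligned}
```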

Within the literature on partially linear EV models, we can highlight the contributions of several authors. As mentioned, Liang et al. (1999) introduce a semiparametric version of the parametric correction for attenuation, while He and Liang (2000) consider consistent regression quantile estimates of $\beta$. Partially linear models with measurement errors have also been studied by Ma and Carroll (2006), who propose locally efficient estimators in semiparametric models, Liang et al. (2007), who consider responses missing not at random, Pan et al. (2008), who deal with longitudinal data, and Liang and Li (2009), who focus on variable selection. As mentioned, we deal with the case in which the variable $t$ is observable. Measurement errors in both the parametric and the nonparametric part represent a much more complicated problem and would require a different approach, which is beyond the scope of this paper. In the classical setting, Liang (2000) and Zhu and Cui (2003), who deal with an unobservable variable $t$ in the context of a partially linear model, consider deconvolution techniques to handle this type of situation.

However, if the smoothers involved in the estimation process are not resistant to outliers, then the resulting estimators can be severely affected by a relatively small fraction of atypical observations. The same can be said of the estimation of the regression parameter when it is estimated by total least squares or by least squares corrected for attenuation. For this reason, in this paper we consider an intuitively appealing way to obtain robust estimators for model (1) with spherically symmetric errors, which combines robust univariate smoothers with robust parametric estimators for a linear EV model. It is expected that the good robustness properties of estimates for linear EV models, such as the M-orthogonal estimators or the weighted orthogonal estimators introduced by Zamar (1989) and Fekri and Ruiz-Gazen (2004), respectively, combined with local smoothers, such as local medians or local M-type estimators, would result in estimators with good robustness properties as well. In what follows, we introduce a three-step procedure that yields robust and consistent estimators. We also derive the empirical influence function of the proposal when M-orthogonal estimators are used to estimate the regression parameter. The simulation results show that, regardless of the presence of outliers in the sample, the proposed estimators of the parametric and nonparametric components are very stable, which makes clear the advantage of using this kind of procedure.
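The sensitivity of non-robust smoothers to outliers can be illustrated by comparing a local mean with a local median. The sketch below uses a uniform window of half-width h and a contamination scheme chosen here for illustration; the helper name `local_smoother` and all numerical settings are assumptions, not the smoothers or the design used in the paper.

```python
import numpy as np

def local_smoother(t, y, grid, h, stat):
    """Apply `stat` (e.g. np.mean or np.median) to y over the window |t - tau| <= h."""
    out = np.empty(len(grid))
    for k, tau in enumerate(grid):
        window = y[np.abs(t - tau) <= h]
        out[k] = stat(window) if window.size else np.nan
    return out

rng = np.random.default_rng(2)
n = 2000
t = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2.0 * np.pi * t) + rng.normal(scale=0.3, size=n)

# Replace 10% of the responses by a gross outlier.
bad = rng.choice(n, size=n // 10, replace=False)
y[bad] = 20.0

grid = np.linspace(0.1, 0.9, 9)
truth = np.sin(2.0 * np.pi * grid)
err_mean = np.max(np.abs(local_smoother(t, y, grid, 0.05, np.mean) - truth))
err_median = np.max(np.abs(local_smoother(t, y, grid, 0.05, np.median) - truth))
print(err_mean, err_median)
```

In this setting the local mean is pulled far from the true curve, while the local median stays close to it.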

The outline of the paper is as follows. In Section 2 we review the classical estimators and outline the three-step procedure for robust estimation in the partially linear EV model. In Section 3 we prove the consistency of the proposal. In Section 4 we derive the empirical influence function in order to study the sensitivity of the parametric component of the model to outlying observations when the linear parameter is estimated using the orthogonal M-estimator. The robustness and finite-sample performance of the proposal are studied by means of a numerical study in Section 5, and a real data set is analysed in Section 6. Proofs are relegated to Appendix A.

Estimators

In this section we consider the estimation of $\beta$ and $g$ in a partially linear EV model, where, for $1 \le i \le n$, $y_i = \beta^T x_i + g(t_i) + e_i$, $v_i = x_i + e_{x_i}$.
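A three-step scheme of the kind described in the abstract can be sketched as follows for a scalar covariate. This is a hedged illustration, not the authors' estimator: it assumes the steps are (1) robust smoothing of y and v on t, (2) a robust orthogonal fit on the residuals and (3) a plug-in estimate of g. It uses local medians as the smoother and a simplified grid-search orthogonal fit with a Tukey bisquare loss as a stand-in for the orthogonal M-estimator. All function names and tuning constants are hypothetical.

```python
import numpy as np

def local_median(t, values, points, h):
    """Local median of `values` over |t - tau| <= h, for each tau in `points`."""
    return np.array([np.median(values[np.abs(t - tau) <= h]) for tau in points])

def tukey_rho(u, c):
    """Bounded bisquare loss: 1 - (1 - (u/c)^2)^3 for |u| <= c, and 1 otherwise."""
    w = np.clip(u / c, -1.0, 1.0)
    return 1.0 - (1.0 - w**2) ** 3

def orthogonal_m_slope(rv, ry, n_angles=901, c=4.685):
    """Grid search over line angles, minimising a bounded loss of orthogonal residuals."""
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_angles, endpoint=False)
    # Signed orthogonal distance of each point to the line through 0 with angle theta.
    resid = np.cos(thetas)[:, None] * ry - np.sin(thetas)[:, None] * rv
    # Preliminary scale: smallest normalised median absolute residual over the grid.
    s = (np.median(np.abs(resid), axis=1) / 0.6745).min()
    loss = tukey_rho(resid / s, c).mean(axis=1)
    return np.tan(thetas[np.argmin(loss)])

def three_step(y, v, t, grid, h):
    # Step 1: robust smoothers of the response and of the surrogate on t.
    nu_y, nu_v = local_median(t, y, t, h), local_median(t, v, t, h)
    # Step 2: robust orthogonal regression on the residuals.
    beta_hat = orthogonal_m_slope(v - nu_v, y - nu_y)
    # Step 3: plug-in estimate of g on the grid.
    g_hat = local_median(t, y, grid, h) - beta_hat * local_median(t, v, grid, h)
    return beta_hat, g_hat

rng = np.random.default_rng(3)
n, beta = 2000, 2.0
t = rng.uniform(0.0, 1.0, n)
x = t + rng.normal(size=n)                      # covariate related to t
e, e_x = rng.normal(scale=0.5, size=(2, n))     # errors with a common scale
y = beta * x + np.sin(2.0 * np.pi * t) + e
v = x + e_x
y[:100] = 30.0                                  # 5% gross outliers in the response

grid = np.linspace(0.2, 0.8, 7)
beta_hat, g_hat = three_step(y, v, t, grid, h=0.1)
g_err = np.max(np.abs(g_hat - np.sin(2.0 * np.pi * grid)))
print(beta_hat, g_err)
```

The residual step relies on the fact that, after subtracting the conditional locations given t, the pair of residuals follows a linear EV model with the same slope, so a robust orthogonal fit targets $\beta$ directly.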

Consistency

We will assume a set of conditions in order to derive the consistency of the proposed estimators for the regression parameter and the nonparametric function g.

    H1.

    $\psi_1$ is an odd function, strictly increasing, bounded and continuously differentiable, such that $z\psi_1'(z) \le \psi_1(z)$.

    H2.

    $F_o(y \mid t = \tau)$ and $F_j(v \mid t = \tau)$, $1 \le j \le p$, are symmetric around $\nu_o(\tau)$ and $\nu_j(\tau)$, respectively.

    H3.

    For any compact set $C \subset \mathbb{R}$, the density $f_t$ of $t$ is bounded on $C$ and $\inf_{\tau \in C} f_t(\tau) > 0$.

    H4.

    $F_o(y \mid t = \tau)$ and $F_j(v \mid t = \tau)$, $1 \le j \le p$, are continuous functions in t

Empirical influence curve

In this section we derive the empirical influence function of the regression parameter estimator when an orthogonal M-estimator is used in Step 2. The empirical influence function (EIF), introduced by Tukey (1977), is a useful measure of the robustness of an estimator with respect to a single outlier. In fact, it reflects the effect on a given estimator of adding to the sample an arbitrary datum, which may not follow the central model. Mallows (1974) considers a finite version of the influence
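The idea of measuring the effect of one added observation can be illustrated with a simple finite-sample quantity. The sketch below uses one common version, often called the sensitivity curve, and applies it to the sample mean and the sample median. It is not the EIF derived in the paper for the orthogonal M-estimator, and the scaling by n + 1 and the function name `sensitivity` are assumptions made here.

```python
import numpy as np

def sensitivity(estimator, sample, z):
    """(n + 1) * (T(sample with z added) - T(sample)) for a scalar estimator T."""
    n = len(sample)
    return (n + 1) * (estimator(np.append(sample, z)) - estimator(sample))

rng = np.random.default_rng(4)
sample = rng.normal(size=99)

for z in (1.0, 10.0, 1000.0):
    print(z, sensitivity(np.mean, sample, z), sensitivity(np.median, sample, z))
```

For the mean this quantity equals z minus the sample mean, so it grows without bound as the added point moves away, whereas for the median it stops changing once z lies beyond the central order statistics.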

Monte Carlo study

A Monte Carlo study was carried out to illustrate the behaviour of the proposed estimators and to compare them with the classical ones under different models and contamination schemes. For the numerical experiment we revisit model (19), which was inspired by the simulation study given in Zhu and Cui (2003) and was adapted to our framework, where the variable t is assumed to be observable. The sine function in the nonparametric component challenges the ability of the estimators of g to capture the

Example: LA data

Afifi and Azen (1979) consider an epidemiological heart disease study in LA County based on 200 employees. Among other variables, age and serum cholesterol levels in 1950 and 1962 were recorded. Buonaccorsi (2010) considers the regression of serum cholesterol level in 1962 (CS62) on age (Age) and serum cholesterol in 1950 (CS50), assuming that CS50 and CS62 are measured with error, while Age is measured without error.

The left panel of Fig. 6 presents a kernel fit of the response variable CS

Acknowledgements

The authors thank the anonymous referee and the Associate Editor for their comments and suggestions, which truly helped to improve the paper. This research was partially supported by Grants 20120130100241BA from Universidad de Buenos Aires, PIP 112-2011-01-00339 from CONICET and PICT 2014-0351 from ANPCYT, Argentina.

References (32)

  • G. Boente et al. Strong uniform convergence rates for some robust equivariant nonparametric regression estimates for mixing processes. Internat. Statist. Rev. (1991)
  • J.P. Buonaccorsi. Measurement Error: Models, Methods and Applications. (2010)
  • R.J. Carroll et al. Measurement Error in Nonlinear Models. (1995)
  • C. Croux et al. Fast and robust estimation of the multivariate errors in variables model. TEST (2010)
  • H. Cui et al. Empirical likelihood confidence region for parameters in semi-linear errors-in-variables models. Scand. J. Stat. (2006)
  • W.A. Fuller. Measurement Error Models. (1987)