Modal regression statistical inference for longitudinal data semivarying coefficient models: Generalized estimating equations, empirical likelihood and variable selection

https://doi.org/10.1016/j.csda.2018.10.010

Abstract

Modal regression is a good alternative to mean regression because it combines robustness with high inference efficiency. This paper is concerned with modal regression based statistical inference for semivarying coefficient models with longitudinal data, which includes modal regression generalized estimating equations, a modal regression empirical likelihood inference procedure for the parametric component, and smooth-threshold modal regression generalized estimating equations for variable selection. These methods can incorporate the correlation structure of the longitudinal data and inherit the robustness and efficiency advantages of modal regression by choosing an appropriate data-adaptive tuning parameter. Under mild conditions, the large sample theoretical properties are established. Simulation studies and real data analysis are also included to illustrate the finite sample performance.

Introduction

Semivarying coefficient models are widely used in real data analysis because of their flexibility, dimensionality and interpretability. Recently, there has been a rapid growth of interest in this model, e.g., Wang et al. (2009), Wang and Lin (2015), Zhao et al. (2014), Zhou and Liang (2009) and Xue and Qu (2012). Longitudinal data arise frequently in many subject-matter studies, such as medical and public health studies. Let $\{(Y_{ij}, X_{ij}, Z_{ij}, U_{ij}),\ 1 \le i \le n,\ 1 \le j \le m_i\}$ be the $j$th observation of the $i$th subject, where $Y_{ij}$ is the response variable, $X_{ij} = (X_{ij,1}, \ldots, X_{ij,p})^T$ is a $p$-vector of covariates, $Z_{ij} = (Z_{ij,1}, \ldots, Z_{ij,q})^T$ is a $q$-vector of covariates, and the index variable satisfies $U_{ij} \in [0,1]$ without loss of generality. We consider the semivarying coefficient model for this kind of data, which is given by
$$Y_{ij} = \mu_{ij} + \epsilon_{ij} = X_{ij}^T\beta + \sum_{k=1}^{q} Z_{ij,k}\,\alpha_k(U_{ij}) + \epsilon_{ij}, \quad i = 1, \ldots, n,\ j = 1, \ldots, m_i, \tag{1.1}$$
where $\beta$ is the regression parameter and $\alpha_k(\cdot)$, $k = 1, \ldots, q$, are smooth but unknown functions.
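To make the data structure of the semivarying coefficient model above concrete, the following is a minimal simulation sketch. The sample sizes, the value of β, the two coefficient functions, the balanced design (the same number of visits for every subject) and the exchangeable error correlation are all hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: n subjects, m visits each (m_i = m), p and q covariates.
n, m, p, q = 50, 4, 2, 2
beta = np.array([1.0, -0.5])                    # hypothetical parametric part
alphas = [lambda u: np.sin(2.0 * np.pi * u),    # hypothetical alpha_1(u)
          lambda u: 2.0 * u * (1.0 - u)]        # hypothetical alpha_2(u)

X = rng.normal(size=(n, m, p))                  # X_ij, p-vector per observation
Z = rng.normal(size=(n, m, q))                  # Z_ij, q-vector per observation
U = rng.uniform(size=(n, m))                    # index variable U_ij in [0, 1]

# Assumed exchangeable within-subject correlation rho, induced by a
# shared subject-level random effect.
rho = 0.5
subject_effect = rng.normal(size=(n, 1))
eps = np.sqrt(rho) * subject_effect + np.sqrt(1.0 - rho) * rng.normal(size=(n, m))

# mu_ij = X_ij^T beta + sum_k Z_ij,k * alpha_k(U_ij)
alpha_U = np.stack([a(U) for a in alphas], axis=-1)   # shape (n, m, q)
mu = X @ beta + np.sum(Z * alpha_U, axis=-1)
Y = mu + eps                                          # shape (n, m)
```

Each row of `Y` holds the repeated measurements of one subject, so the within-subject correlation discussed in the text shows up as correlation between the columns of `eps`.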

A major aspect of longitudinal data is the within-subject correlation, and ignoring this correlation may cause a loss of efficiency. This motivated Liang and Zeger (1986) to develop the generalized estimating equations (GEE), which incorporate the correlation through a working correlation matrix. They showed that the GEE estimators remain consistent even if the working correlation matrix is misspecified. Recent research on the GEE includes Wang et al. (2005), Wang (2011), Wang et al. (2012), Li et al. (2013), Lian et al. (2014) and so on.

How to construct confidence regions for parameters is an important issue. A convenient choice is to use the asymptotic normal distribution. However, this method requires a plug-in estimator of the limiting variance. The empirical likelihood (EL; Owen (1990) and Owen (2001)) avoids this problem. It has several advantages over the normal approximation-based method: the shape of the confidence regions is determined entirely by the data, no plug-in estimation of the limiting variance is involved, and it can yield better coverage probability for small samples. For independent data, You and Zhou (2006), Yang and Li (2010), Li et al. (2012) and Fan et al. (2016) all considered the EL for semivarying coefficient models. Furthermore, many EL based methods for longitudinal data have been proposed. An incomplete list of recent results includes Xue and Zhu (2007a), Xue and Zhu (2007b), Zhao and Xue (2009), Bai et al. (2010), Wang et al. (2010), Tang and Leng (2011), Wang and Zhu (2011), Li and Pan (2013), Tang and Zhao (2013), Han et al. (2014) and Qiu and Wu (2015).

However, the GEE method is in principle similar to weighted least squares, which is not robust. Furthermore, the EL method may also be influenced by outliers because of its close relationship with maximum likelihood, and Owen (2001) pointed out that EL confidence regions may be greatly lengthened in the direction of the outliers. In longitudinal data, one outlier at the subject level may generate a set of outliers because of repeated measurements. Hence, robustness is very important in longitudinal studies.

Recently, a large literature has been devoted to constructing robust GEE and EL methods, e.g., Fan et al. (2012), He et al. (2005), Qin and Zhu (2007), Qin et al. (2009), Qin et al. (2012), Wang et al. (2005) and Zheng et al. (2014). All of these papers apply Huber’s score function to the Pearson residuals to dampen the effect of outliers.

Although Huber’s score function is robust, it has limitations in terms of efficiency. To address this issue, Yao et al. (2012) and Yao and Li (2014) investigated a new modal regression estimation procedure. Specifically, for the linear regression model $y_i = x_i^T\beta + \varepsilon_i$, modal regression estimates the parameters by maximizing
$$Q_h(\beta) = \frac{1}{n}\sum_{i=1}^{n} \phi_h\left(y_i - x_i^T\beta\right), \tag{1.2}$$
where $\phi_h(\cdot) = h^{-1}\phi(\cdot/h)$, $\phi(\cdot)$ is a kernel density function and $h$ is a bandwidth that determines the degree of robustness and efficiency. Maximizing the objective function (1.2) is equivalent to solving the estimating equations
$$\sum_{i=1}^{n} x_i\,\phi_h'\left(y_i - x_i^T\beta\right) = 0,$$
where $\phi_h'(\cdot)$ is the first derivative of $\phi_h(\cdot)$. In contrast to other estimation methods, modal regression treats $\phi_h(\cdot)$ as a loss function. Yao et al. (2012) and Yao and Li (2014) showed that, since modal regression can estimate the “most likely” conditional values, it can provide more robust and efficient estimation than other existing methods when an appropriate bandwidth $h$ is chosen. Similar conclusions have been further confirmed in Zhang et al. (2013), Zhao et al. (2014) and Liu et al. (2013).
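As an illustration of the modal regression objective for the independent-data linear model described above, the sketch below solves the estimating equations by iteratively reweighted least squares. It assumes a Gaussian kernel, for which the derivative of the kernel is proportional to the residual times the kernel value, so each update is a weighted least squares fit with weights proportional to the kernel evaluated at the current residuals. The bandwidth is fixed by hand rather than chosen adaptively, the data are simulated, and the function name `modal_linear_regression` is hypothetical; this is not presented as the authors' implementation.

```python
import numpy as np

def modal_linear_regression(X, y, h, max_iter=200, tol=1e-10):
    """Maximize (1/n) * sum_i phi_h(y_i - x_i^T beta) for a Gaussian kernel.

    Each iteration reweights observations by the kernel value at their
    current residual and refits by weighted least squares.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # ordinary least squares start
    for _ in range(max_iter):
        r = y - X @ beta
        w = np.exp(-0.5 * (r / h) ** 2)              # proportional to phi_h(r_i)
        XtW = X.T * w                                # X^T W with W = diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Simulated example (all values are assumptions made for illustration).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])                 # intercept and one covariate
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + 0.5 * rng.normal(size=n)
y[:20] += 10.0                                       # contaminate 10% of the responses

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_modal = modal_linear_regression(X, y, h=1.0)
```

With a small bandwidth the contaminated responses receive weights close to zero, which is the source of the robustness; as the bandwidth grows the weights become nearly equal and the fit approaches least squares, which is why the choice of $h$ trades robustness against efficiency.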

However, the new modal regression approach was only considered for independent data. The first goal of this paper is to propose new modal regression based GEE and EL statistical inference for longitudinal data semivarying coefficient models. Specifically, (i) we propose a robust and efficient modal regression based GEE, which can use Mallows-type weights to downweight the effect of leverage points and applies the score function of $\phi_h(\cdot)$ to the Pearson residuals to dampen the effect of outliers in the response. (ii) A robust EL statistical inference method for the parametric component in the model (1.1) is proposed by constructing robust modal regression auxiliary random vectors. (iii) Both the new modal regression based GEE and EL can incorporate the working correlation matrix automatically to account for the correlations within subjects.

Moreover, for high dimensional data, variable selection is important. Recently, various penalty functions have been proposed, such as Lasso (Tibshirani, 1996), adaptive Lasso (Zou, 2006), SCAD (Fan and Li, 2001) and so on. However, these procedures require convex optimization, which incurs a computational burden. To overcome this problem, Ueki (2009) developed a new variable selection procedure called the smooth-threshold estimating equations. Recently, Lai et al. (2012), Li et al. (2013) and Lv et al. (2015) extended this method to single index models and generalized linear models.

As the second goal of this paper, a new smooth-threshold modal regression based GEE variable selection procedure for longitudinal data semivarying coefficient models is proposed, which can select the nonparametric and parametric parts simultaneously. Theoretically, the variable selection procedure enjoys consistency in variable selection and the oracle property in estimation. By inheriting the properties of the proposed modal regression based GEE, the new variable selection procedure has good robustness and efficiency, and can incorporate the correlation structure of longitudinal data.

The outline of this paper is as follows. Section 2 introduces the modal regression based GEE. Section 3 gives the modal regression based EL. The smooth-threshold modal regression based GEE variable selection procedure is introduced in Section 4. Numerical studies and real data analysis are reported in Section 5. Concluding remarks are given in Section 6. All the proofs are provided in the Appendix.

Section snippets

Estimating equation and main results

Following Huang et al. (2010), we use B-splines to approximate the $\alpha_k(\cdot)$'s. Let $0 = \tau_0 < \tau_1 < \cdots < \tau_{K_n} < \tau_{K_n+1} = 1$ be a partition of $[0,1]$ into $K_n + 1$ subintervals $I_{nj} = [\tau_j, \tau_{j+1})$, $j = 0, \ldots, K_n - 1$, and $I_{nK_n} = [\tau_{K_n}, \tau_{K_n+1}]$, where $K_n = n^{\vartheta}$ with $0 < \vartheta < 0.5$ is a positive integer such that $\max_{1 \le j \le K_n+1} |\tau_j - \tau_{j-1}| = O(n^{-\vartheta})$. Let $F_n$ be the space of polynomial splines of degree $D \ge 1$ consisting of functions $f$ satisfying: (i) the restriction of $f$ to $I_{nj}$ is a polynomial of degree $D$ for $0 \le j \le K_n$; (ii) $f$ is $(D-1)$-times continuously differentiable on [0,

Modal regression based empirical likelihood inference for β

In real applications, the primary research interest may be statistical inference on the regression coefficient $\beta$. For semiparametric models, in order to conduct EL inference on the parametric part, the nonparametric part is often regarded as a nuisance (e.g., Xue and Zhu (2007a) and Qin et al. (2012)). After obtaining the modal regression GEE estimators $\hat{\alpha}_k(u) = B(u)^T\hat{\gamma}_k$, $k = 1, \ldots, q$, we first absorb them by projection to improve the inferences on $\beta$, then by considering working correlation to improve

Variable selection via smooth-threshold modal regression based GEE

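Variable selection is important for high dimensional data. Motivated by Ueki (2009) and the modal regression based GEE in Section 2, we propose the following smooth-threshold modal regression based GEE
$$(I_s - \Lambda)\,U\big(\beta, \gamma, \delta(\beta, \gamma)\big) + \Lambda\,(\beta^T, \gamma^T)^T = 0, \tag{4.1}$$
where $I_s$ is a $(p + q d_n)$-dimensional identity matrix, $\Lambda = \mathrm{diag}(\Lambda_1, \Lambda_2)$ is a block diagonal matrix, $\Lambda_1 = \mathrm{diag}(\delta_{1,1}, \ldots, \delta_{1,p})$ and $\Lambda_2 = \mathrm{diag}\big(\underbrace{\delta_{2,1}, \ldots, \delta_{2,1}}_{d_n}, \ldots, \underbrace{\delta_{2,q}, \ldots, \delta_{2,q}}_{d_n}\big)$.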

Remark 4.1

Note that in Eq. (4.1), if $X_{ij,k}$ is an irrelevant variable, then $\delta_{1,k} = 1$ will reduce the solution to $\tilde{\beta}_k = 0$, and similarly, δ

Numerical experiments

In this section, Experiment 1 examines the consistency and asymptotic normality of the modal regression based GEE estimators, Experiment 2 demonstrates the variable selection results of the smooth-threshold modal regression based GEE, and Experiment 3 investigates the modal regression based empirical likelihood inference procedure.

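Experiment 1. We consider the following model
$$Y_{ij} = \sum_{k=1}^{3} X_{ij,k}\,\beta_k + \sum_{k=1}^{4} Z_{ij,k}\,\alpha_k(U_{ij}) + \epsilon_{ij}, \quad i = 1, \ldots, n,\ j = 1, \ldots, 5, \tag{5.1}$$
and we generate 500 data sets from (5.1) with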

Concluding remarks

In this paper, based on modal regression, we propose robust and efficient statistical inference methods for semivarying coefficient models with longitudinal data, which include modal regression generalized estimating equations, a modal regression empirical likelihood inference procedure for the parametric component and smooth-threshold modal regression generalized estimating equations for variable selection. These methods can incorporate the correlation structure of the longitudinal

Acknowledgments

The first author’s research was supported by NNSF, China project (71673171, 11571204 and 11231005), NSF project (ZR2017BA002) of Shandong Province of China.

References (47)

  • Qin G. et al. Robust estimation in generalized semiparametric mixed models for longitudinal data. J. Multivariate Anal. (2007)
  • Qin G. et al. Robust estimation of covariance parameters in partial linear model for longitudinal data. J. Statist. Plann. Inference (2009)
  • Wang H. et al. Empirical likelihood for quantile regression models with longitudinal data. J. Statist. Plann. Inference (2011)
  • Yang H. et al. Empirical likelihood for semiparametric varying coefficient partially linear models with longitudinal data. Statist. Probab. Lett. (2010)
  • You J. et al. Empirical likelihood for semiparametric varying-coefficient partially linear regression models. Statist. Probab. Lett. (2006)
  • Fan J. et al. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. (2001)
  • Han P. et al. Longitudinal data analysis using the conditional empirical likelihood method. Canad. J. Statist. (2014)
  • He X. et al. Robust estimation in generalized partial linear models for clustered data. J. Amer. Statist. Assoc. (2005)
  • Huang J. et al. Variable selection in nonparametric additive models. Ann. Statist. (2010)
  • Lian H. et al. Generalized additive partial linear models for clustered data with diverging number of covariates using GEE. Statist. Sinica (2014)
  • Liang K. et al. Longitudinal data analysis using generalized linear models. Biometrika (1986)
  • Owen A. Empirical likelihood ratio confidence regions. Ann. Statist. (1990)
  • Owen A. Empirical Likelihood (2001)