Modal regression statistical inference for longitudinal data semivarying coefficient models: Generalized estimating equations, empirical likelihood and variable selection
Introduction
Semivarying coefficient models are widely used in real data analysis because of their flexibility, dimensionality and interpretability. Recently, there has been a rapid growth of interest in this model, e.g., Wang et al. (2009), Wang and Lin (2015), Zhao et al. (2014), Zhou and Liang (2009) and Xue and Qu (2012). Longitudinal data arise frequently in many subject-matter studies, such as medical and public health studies. Let $(Y_{ij}, X_{ij}, Z_{ij}, U_{ij})$ be the $j$th observation of the $i$th subject, where $Y_{ij}$ is the response variable, $X_{ij}$ is a $p$-vector of covariates, $Z_{ij}$ is a $q$-vector of covariates, and assume that the index variable $U_{ij} \in [0, 1]$ without loss of generality. We consider the semivarying coefficient model for this kind of data, which is given by $Y_{ij} = X_{ij}^{\top}\beta + Z_{ij}^{\top}\alpha(U_{ij}) + \varepsilon_{ij}$ (1.1), where $\beta$ is the regression parameter and $\alpha(\cdot) = (\alpha_1(\cdot), \ldots, \alpha_q(\cdot))^{\top}$ are smooth but unknown functions.
A major aspect of longitudinal data is the within-subject correlation, and ignoring this correlation may cause a loss of efficiency. This motivated Liang and Zeger (1986) to develop the generalized estimating equations (GEE), which incorporate the correlation through a working correlation matrix. They showed that the GEE estimators remain consistent even if the working correlation matrix is misspecified. Recent research on the GEE includes Wang et al. (2005), Wang (2011), Wang et al. (2012), Li et al. (2013), Lian et al. (2014) and so on.
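To make the role of the working correlation matrix concrete, the following is a minimal sketch, not the estimator studied in this paper. It assumes a linear marginal model with identity link and a fixed, user-supplied working correlation, so the estimating equation $\sum_i X_i^{\top} R^{-1}(Y_i - X_i\beta) = 0$ has a closed-form solution. The function names `exchangeable` and `gee_linear`, the simulated random-intercept data and all parameter values are illustrative assumptions.

```python
import numpy as np

def exchangeable(m, rho):
    """m x m working correlation matrix with common off-diagonal value rho."""
    return (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))

def gee_linear(X, Y, R):
    """Solve sum_i X_i^T R^{-1} (Y_i - X_i beta) = 0 for beta.

    X has shape (n, m, p), Y has shape (n, m), R is an m x m working correlation.
    """
    R_inv = np.linalg.inv(R)
    A = np.einsum("imp,mk,ikq->pq", X, R_inv, X)   # sum_i X_i^T R^{-1} X_i
    b = np.einsum("imp,mk,ik->p", X, R_inv, Y)     # sum_i X_i^T R^{-1} Y_i
    return np.linalg.solve(A, b)

# Simulated longitudinal data (assumed setup): n subjects, m repeated measurements.
rng = np.random.default_rng(0)
n, m = 400, 4
beta_true = np.array([1.0, -0.5])
x = rng.normal(size=(n, m))
X = np.stack([np.ones((n, m)), x], axis=-1)        # shape (n, m, 2)
subject_effect = rng.normal(size=(n, 1))           # induces within-subject correlation 0.5
Y = X @ beta_true + subject_effect + rng.normal(size=(n, m))

beta_indep = gee_linear(X, Y, exchangeable(m, 0.0))  # misspecified: independence
beta_exch = gee_linear(X, Y, exchangeable(m, 0.5))   # matches the true correlation
print(beta_indep, beta_exch)
```

Both working correlations give estimates near the true coefficients in this simulation, in line with the consistency property noted above; the choice of working correlation mainly affects efficiency.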
How to construct confidence regions for parameters is an important issue. A convenient choice is to use the asymptotic normal distribution. However, this method requires a plug-in estimator of the limiting variance. The empirical likelihood (EL, Owen (1990) and Owen (2001)) avoids this problem. It has several advantages over the normal approximation-based method: the shape of the confidence regions is determined entirely by the data, no plug-in estimate of the limiting variance is needed, and it can yield better coverage probability for small samples. For independent data, You and Zhou (2006), Yang and Li (2010), Li et al. (2012) and Fan et al. (2016) all considered the EL for semivarying coefficient models. Furthermore, many EL-based methods for longitudinal data have been proposed. An incomplete list of recent results includes Xue and Zhu (2007a), Xue and Zhu (2007b), Zhao and Xue (2009), Bai et al. (2010), Wang et al. (2010), Tang and Leng (2011), Wang and Zhu (2011), Li and Pan (2013), Tang and Zhao (2013), Han et al. (2014) and Qiu and Wu (2015).
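As a small illustration of why EL needs no variance estimate, the sketch below computes the empirical log-likelihood ratio statistic for a scalar mean, the simplest textbook case rather than the semivarying coefficient setting of this paper. The function name `el_ratio_stat`, the bisection solver for the Lagrange multiplier and the toy data are assumptions made for illustration.

```python
import math

def el_ratio_stat(x, mu, n_bisect=200):
    """-2 log empirical likelihood ratio for the hypothesis E[X] = mu."""
    z = [xi - mu for xi in x]
    z_min, z_max = min(z), max(z)
    if not (z_min < 0.0 < z_max):
        return math.inf  # mu lies outside the convex hull of the data

    # The multiplier lam solves g(lam) = sum z_i / (1 + lam * z_i) = 0, and
    # g is strictly decreasing on the interval where all 1 + lam * z_i > 0.
    lo = (-1.0 + 1e-10) / z_max
    hi = (-1.0 + 1e-10) / z_min

    def g(lam):
        return sum(zi / (1.0 + lam * zi) for zi in z)

    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log1p(lam * zi) for zi in z)

data = [1.2, 0.4, 2.9, 1.7, 0.8, 2.2, 1.5, 3.1, 0.9, 1.9]
mean = sum(data) / len(data)
for mu in (mean, 1.8, 2.0):
    print(mu, el_ratio_stat(data, mu))
```

A 95% confidence interval for the mean collects all values of `mu` whose statistic does not exceed 3.84, the 0.95 quantile of the chi-square distribution with one degree of freedom; no variance estimate is plugged in at any step.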
However, the GEE method is in principle similar to the weighted least squares, which does not possess robustness. Furthermore, the EL method may also be influenced by the outliers due to its close relationship with the maximum likelihood, and Owen (2001) pointed out that the EL confidence regions may be greatly lengthened in the direction of the outliers. In longitudinal data, one outlier in the subject level may generate a set of outliers due to repeated measurements. Hence, robustness is very important in longitudinal studies.
Recently, a large literature has been devoted to constructing robust GEE and EL, e.g., Fan et al. (2012), He et al. (2005), Qin and Zhu (2007), Qin et al. (2009), Qin et al. (2012), Wang et al. (2005) and Zheng et al. (2014). All of these papers apply Huber’s score function to the Pearson residuals to dampen the effect of outliers.
Although Huber’s score function is robust, it has limitations in terms of efficiency. To address this issue, Yao et al. (2012) and Yao and Li (2014) investigated a new modal regression estimation procedure. Specifically, for the linear regression model $Y_i = X_i^{\top}\beta + \varepsilon_i$, modal regression estimates the parameters by maximizing $Q_h(\beta) = \frac{1}{n}\sum_{i=1}^{n}\phi_h(Y_i - X_i^{\top}\beta)$ (1.2), where $\phi_h(t) = h^{-1}\phi(t/h)$, $\phi(\cdot)$ is a kernel density function and $h$ is a bandwidth that determines the degree of robustness and efficiency. Maximizing the objective function (1.2) is equivalent to solving the estimating equations $\sum_{i=1}^{n}\phi_h'(Y_i - X_i^{\top}\beta)X_i = 0$, where $\phi_h'(\cdot)$ is the first derivative of $\phi_h(\cdot)$. In contrast to other estimation methods, modal regression treats $-\phi_h(\cdot)$ as a loss function. Yao et al. (2012) and Yao and Li (2014) showed that, since modal regression estimates the “most likely” conditional values, it can provide more robust and efficient estimation than other existing methods when an appropriate bandwidth $h$ is chosen. Similar conclusions have been further confirmed in Zhang et al. (2013), Zhao et al. (2014) and Liu et al. (2013).
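The kernel-based objective for the linear model can be maximized with a simple EM-type fixed-point iteration. The sketch below assumes a standard Gaussian kernel, for which the estimating equations reduce to weighted least squares with weights proportional to the kernel evaluated at the current residuals. The function name `modal_linear_regression`, the fixed bandwidth `h=1.0` and the contaminated toy data are illustrative assumptions; no data-driven bandwidth selection is attempted here.

```python
import numpy as np

def modal_linear_regression(X, y, h, n_iter=200, tol=1e-8):
    """Maximize sum_i phi_h(y_i - x_i^T beta) for a Gaussian kernel phi."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # least squares start
    for _ in range(n_iter):
        resid = y - X @ beta
        w = np.exp(-0.5 * (resid / h) ** 2)          # kernel weights (E-step)
        w /= w.sum()
        XtW = X.T * w                                # X^T W with W = diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ y) # weighted LS (M-step)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Toy data (assumed): 10% of the responses are shifted upward by 20.
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + 0.5 * rng.normal(size=n)
y[:20] += 20.0

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_modal = modal_linear_regression(X, y, h=1.0)
print("OLS:  ", beta_ols)
print("modal:", beta_modal)
```

In this example the outlying responses receive kernel weights that are numerically close to zero, so the modal fit stays near the true coefficients while the least squares intercept is pulled toward the outliers. A smaller `h` increases robustness and a larger `h` moves the fit toward least squares, which is the robustness and efficiency trade-off governed by the bandwidth.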
However, the new modal regression approach has only been considered for independent data. The first goal of this paper is to propose new modal regression based GEE and EL statistical inference for longitudinal data semivarying coefficient models. Specifically, (i) we propose a robust and efficient modal regression based GEE, which uses Mallows-type weights to downweight the effect of leverage points and applies the modal regression score function to the Pearson residuals to dampen the effect of outliers in the response. (ii) A robust EL statistical inference method for the parametric component in the model (1.1) is proposed by constructing robust modal regression auxiliary random vectors. (iii) Both the new modal regression based GEE and EL can incorporate the working correlation matrix automatically to account for the correlations within subjects.
Moreover, variable selection is important for high dimensional data. Various penalty functions have been proposed, such as Lasso (Tibshirani, 1996), adaptive Lasso (Zou, 2006), SCAD (Fan and Li, 2001) and so on. However, these procedures require convex optimization, which incurs a computational burden. To overcome this problem, Ueki (2009) developed a new variable selection procedure called the smooth-threshold estimating equations. Recently, Lai et al. (2012), Li et al. (2013) and Lv et al. (2015) extended this method to single index models and generalized linear models.
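The idea behind smooth-threshold estimating equations can be sketched for ordinary least squares. The version below is an assumption-laden illustration in the spirit of Ueki (2009), not the procedure developed in this paper: it solves $(I - \Delta)X^{\top}(y - X\beta) - \Delta\beta = 0$ with $\Delta = \mathrm{diag}(\delta_1, \ldots, \delta_p)$ and $\delta_j = \min(1, \lambda / |\tilde\beta_j|^{1+\gamma})$, where $\tilde\beta$ is an initial least squares estimate. The sign convention, the function name `smooth_threshold_ls` and the tuning values are all assumed for the example.

```python
import numpy as np

def smooth_threshold_ls(X, y, lam, gamma=1.0):
    """Solve (I - D) X^T (y - X b) - D b = 0 with data-driven thresholds D."""
    p = X.shape[1]
    beta_init = np.linalg.lstsq(X, y, rcond=None)[0]
    delta = np.minimum(1.0, lam / np.abs(beta_init) ** (1.0 + gamma))
    active = delta < 1.0                 # delta_j = 1 forces beta_j = 0 exactly
    beta = np.zeros(p)
    if active.any():
        Xa, da = X[:, active], delta[active]
        # Restricted to the active set, the equation is linear in b:
        # [(I - D) Xa^T Xa + D] b = (I - D) Xa^T y
        A = (1.0 - da)[:, None] * (Xa.T @ Xa) + np.diag(da)
        rhs = (1.0 - da) * (Xa.T @ y)
        beta[active] = np.linalg.solve(A, rhs)
    return beta, delta

# Toy data (assumed): three of the five coefficients are exactly zero.
rng = np.random.default_rng(2)
n = 400
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0])
X = rng.normal(size=(n, 5))
y = X @ beta_true + rng.normal(size=n)

beta_hat, delta_hat = smooth_threshold_ls(X, y, lam=0.1)
print(beta_hat)
print(delta_hat)
```

Only a linear system is solved, with no penalized optimization. Coefficients whose initial estimates are small get $\delta_j = 1$ and are set to zero, while the remaining coefficients are only mildly shrunk.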
As the second goal of this paper, we propose a new smooth-threshold modal regression based GEE variable selection procedure for longitudinal data semivarying coefficient models, which can select the nonparametric and parametric parts simultaneously. Theoretically, the procedure enjoys consistency in variable selection and the oracle property in estimation. By inheriting the properties of the proposed modal regression based GEE, the new variable selection procedure has good robustness and efficiency, and can incorporate the correlation structure of longitudinal data.
The outline of this paper is as follows. Section 2 introduces the modal regression based GEE. Section 3 gives the modal regression based EL. The smooth-threshold modal regression based GEE variable selection procedure is introduced in Section 4. Numerical studies and real data analysis are reported in Section 5. Concluding remarks are given in Section 6. All the proofs are provided in the Appendix.
Estimating equation and main results
Following Huang et al. (2010), we use B-splines to approximate s. Let be a partition of into subintervals , and , where with is a positive integer such that . Let be the space of polynomial splines of degree consisting of functions satisfying: (i) the restriction of to is a polynomial of degree for ; (ii) is -times continuously differentiable on
Modal regression based empirical likelihood inference for
In real applications, the primary research interest may be statistical inference on the regression coefficient. For semiparametric models, in order to conduct EL inference on the parametric part, the nonparametric part is often regarded as a nuisance (e.g., Xue and Zhu (2007a) and Qin et al. (2012)). After obtaining the modal regression GEE estimators, we first absorb them by projection to improve the inferences on the parametric part, then by considering working correlation to improve
Variable selection via smooth-threshold modal regression based GEE
Variable selection is important for high dimensional data, motivated by Ueki (2009) and the modal regression based GEE in Section 2, we propose the following smooth-threshold modal regression based GEE where is a dimensional identity matrix, is a block diagonal matrix, and
Remark 4.1 Note that in Eq. (4.1), if is an irrelevant variable, then will reduce the solution , and similarly,
Numerical experiments
In this section, Experiment 1 shows the consistency and asymptotic normality of the modal regression based GEE estimators, Experiment 2 demonstrates the variable selection results of the smooth-threshold modal regression based GEE, and Experiment 3 investigates the modal regression based empirical likelihood inference procedure.
Experiment 1. We consider the following model and we generate 500 data sets from (5.1) with
Concluding remarks
In this paper, based on modal regression, we propose robust and efficient statistical inference methods for semivarying coefficient models with longitudinal data, which include modal regression generalized estimating equations, a modal regression empirical likelihood inference procedure for the parametric component, and smooth-threshold modal regression generalized estimating equations for variable selection. These methods can incorporate the correlation structure of the longitudinal
Acknowledgments
The first author’s research was supported by the NNSF of China (projects 71673171, 11571204 and 11231005) and the NSF of Shandong Province of China (project ZR2017BA002).
References (47)
- et al. Empirical likelihood inference for longitudinal generalized linear models. J. Statist. Plann. Inference (2010).
- et al. Penalized empirical likelihood for high-dimensional partially linear varying coefficient model with measurement errors. J. Multivariate Anal. (2016).
- et al. Variable selection in robust regression models for longitudinal data. J. Multivariate Anal. (2012).
- et al. Bias-corrected GEE estimation and smooth-threshold GEE variable selection for single-index models with clustered data. J. Multivariate Anal. (2012).
- et al. Automatic variable selection for longitudinal generalized linear models. Comput. Statist. Data Anal. (2013).
- et al. Empirical likelihood for varying coefficient partially linear model with diverging number of parameters. J. Multivariate Anal. (2012).
- et al. Empirical likelihood for generalized linear models with longitudinal data. J. Multivariate Anal. (2013).
- et al. A robust and efficient estimation method for single index models. J. Multivariate Anal. (2013).
- et al. An efficient and robust variable selection method for longitudinal generalized linear models. Comput. Statist. Data Anal. (2015).
- et al. Robust empirical likelihood inference for generalized partial linear models with longitudinal data. J. Multivariate Anal. (2012).
- Robust estimation in generalized semiparametric mixed models for longitudinal data. J. Multivariate Anal.
- Robust estimation of covariance parameters in partial linear model for longitudinal data. J. Statist. Plann. Inference.
- Empirical likelihood for quantile regression models with longitudinal data. J. Statist. Plann. Inference.
- Empirical likelihood for semiparametric varying coefficient partially linear models with longitudinal data. Statist. Probab. Lett.
- Empirical likelihood for semiparametric varying-coefficient partially linear regression models. Statist. Probab. Lett.
- Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc.
- Longitudinal data analysis using the conditional empirical likelihood method. Canad. J. Statist.
- Robust estimation in generalized partial linear models for clustered data. J. Amer. Statist. Assoc.
- Variable selection in nonparametric additive models. Ann. Statist.
- Generalized additive partial linear models for clustered data with diverging number of covariates using GEE. Statist. Sinica.
- Longitudinal data analysis using generalized linear models. Biometrika.
- Empirical likelihood ratio confidence regions. Ann. Statist.
- Empirical Likelihood.