Modal regression statistical inference for longitudinal data semivarying coefficient models: Generalized estimating equations, empirical likelihood and variable selection

doi:10.1016/j.csda.2018.10.010

Computational Statistics & Data Analysis

Volume 133, May 2019, Pages 257-276

https://doi.org/10.1016/j.csda.2018.10.010

Abstract

Modal regression is a good alternative to mean regression because it combines robustness with high inference efficiency. This paper is concerned with modal regression based statistical inference for semivarying coefficient models with longitudinal data, which includes modal regression generalized estimating equations, a modal regression empirical likelihood inference procedure for the parametric component, and smooth-threshold modal regression generalized estimating equations for variable selection. These methods can incorporate the correlation structure of the longitudinal data and, by choosing an appropriate data-adaptive tuning parameter, inherit the robustness and efficiency of modal regression. Under mild conditions, the large-sample theoretical properties are established. Simulation studies and a real data analysis are included to illustrate the finite-sample performance.

Introduction

Semivarying coefficient models are widely used in real data analysis because of their flexibility, dimensionality and interpretability. Interest in these models has grown rapidly in recent years; see, e.g., Wang et al. (2009), Wang and Lin (2015), Zhao et al. (2014), Zhou and Liang (2009) and Xue and Qu (2012). Longitudinal data arise frequently in many subject-matter areas, such as medical and public health studies. Let $\{(Y_{i j}, X_{i j}, Z_{i j}, U_{i j}), 1 \leq i \leq n, 1 \leq j \leq m_{i}\}$ denote the data, where $(Y_{i j}, X_{i j}, Z_{i j}, U_{i j})$ is the $j$th observation of the $i$th subject, $Y_{i j}$ is the response variable, $X_{i j} = {(X_{i j, 1}, \dots, X_{i j, p})}^{T}$ is a $p$-vector of covariates, $Z_{i j} = {(Z_{i j, 1}, \dots, Z_{i j, q})}^{T}$ is a $q$-vector of covariates, and the index variable satisfies $U_{i j} \in [0, 1]$ without loss of generality. We consider the semivarying coefficient model for this kind of data, which is given by $Y_{i j} = μ_{i j} + ϵ_{i j} = X_{i j}^{T} β + \sum_{k = 1}^{q} Z_{i j, k} α_{k} (U_{i j}) + ϵ_{i j}, \quad i = 1, \dots, n, \; j = 1, \dots, m_{i}, \qquad (1.1)$ where $β$ is the regression parameter and $α_{k} (\cdot), k = 1, \dots, q$, are smooth but unknown functions.

A major aspect of longitudinal data is the within-subject correlation, and ignoring this correlation may cause a loss of efficiency. This motivated Liang and Zeger (1986) to develop the generalized estimating equations (GEE), which incorporate the correlation through a working correlation matrix. They showed that the GEE estimators remain consistent even if the working correlation matrix is misspecified. Recent research on the GEE includes Wang et al. (2005), Wang (2011), Wang et al. (2012), Li et al. (2013), Lian et al. (2014) and so on.
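To make the role of the working correlation matrix concrete, the following is a minimal sketch for a linear mean model with constant variance, where the GEE reduce to $\sum_{i} X_{i}^{T} R_{i}^{-1} (Y_{i} - X_{i} β) = 0$ and have a closed-form solution. The helper names `exchangeable_corr` and `gee_linear`, the exchangeable structure, the value `rho = 0.5` and the simulated data are all illustrative assumptions, not taken from the paper.

```python
import numpy as np

def exchangeable_corr(m, rho):
    """m x m working correlation: 1 on the diagonal, rho elsewhere."""
    return (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))

def gee_linear(X_list, y_list, rho):
    """Solve sum_i X_i^T R_i^{-1} (y_i - X_i beta) = 0 for beta."""
    p = X_list[0].shape[1]
    lhs = np.zeros((p, p))
    rhs = np.zeros(p)
    for X_i, y_i in zip(X_list, y_list):
        R_inv = np.linalg.inv(exchangeable_corr(len(y_i), rho))
        lhs += X_i.T @ R_inv @ X_i
        rhs += X_i.T @ R_inv @ y_i
    return np.linalg.solve(lhs, rhs)

# Simulated subjects (hypothetical design): a shared random effect b_i
# induces correlation among the m repeated measurements of each subject.
rng = np.random.default_rng(0)
n, m, beta_true = 200, 4, np.array([1.0, -2.0])
X_list, y_list = [], []
for _ in range(n):
    X_i = rng.normal(size=(m, 2))
    b_i = rng.normal()
    y_list.append(X_i @ beta_true + b_i + rng.normal(size=m))
    X_list.append(X_i)

beta_indep = gee_linear(X_list, y_list, rho=0.0)  # working independence
beta_exch = gee_linear(X_list, y_list, rho=0.5)   # exchangeable working correlation
print(beta_indep, beta_exch)
```

Both choices of `rho` target the same $β$, in line with the consistency under misspecification noted above; the choice of working correlation mainly affects efficiency.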

How to construct confidence regions for parameters is an important issue. A convenient choice is to use the asymptotic normal distribution. However, this method requires a plug-in estimator of the limiting variance. The empirical likelihood (EL; Owen (1990) and Owen (2001)) avoids this problem. It has several advantages over the normal approximation-based method: the shape of the confidence regions is determined entirely by the data, no plug-in estimate of the limiting variance is needed, and it can yield better coverage probability in small samples. For independent data, You and Zhou (2006), Yang and Li (2010), Li et al. (2012) and Fan et al. (2016) all considered the EL for semivarying coefficient models. Furthermore, many EL based methods for longitudinal data have been proposed. An incomplete list of recent results includes Xue and Zhu (2007a), Xue and Zhu (2007b), Zhao and Xue (2009), Bai et al. (2010), Wang et al. (2010), Tang and Leng (2011), Wang and Zhu (2011), Li and Pan (2013), Tang and Zhao (2013), Han et al. (2014) and Qiu and Wu (2015).
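As a minimal illustration of how the EL avoids a variance estimate, the sketch below computes the empirical log-likelihood ratio $-2 \log R(μ)$ for the mean of a univariate sample. It solves $\sum_{i} z_{i} / (1 + λ z_{i}) = 0$ with $z_{i} = x_{i} - μ$ for the Lagrange multiplier $λ$ by bisection. The function name `el_log_ratio` and the toy sample are assumptions made for illustration; the paper's own EL is built on modal regression auxiliary vectors, which are not reproduced here.

```python
import math

def el_log_ratio(x, mu, n_bisect=200):
    """Return -2 log R(mu) for the mean of a univariate sample x."""
    z = [xi - mu for xi in x]
    if min(z) >= 0.0 or max(z) <= 0.0:
        return math.inf  # mu lies outside the convex hull of the data
    # The multiplier must keep every weight positive: 1 + lam * z_i > 0.
    lo = (-1.0 / max(z)) * (1.0 - 1e-10)
    hi = (-1.0 / min(z)) * (1.0 - 1e-10)
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        g = sum(zi / (1.0 + mid * zi) for zi in z)  # decreasing in the multiplier
        if g > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log(1.0 + lam * zi) for zi in z)

sample = [1.2, 0.7, 2.9, 1.8, 2.4, 0.3, 1.5]  # toy data, for illustration only
for mu in (0.8, 1.2, sum(sample) / len(sample)):
    print(mu, el_log_ratio(sample, mu))
```

Values of $μ$ whose statistic falls below the 0.95 quantile of the $χ_{1}^{2}$ distribution (about 3.84) form an approximate 95% confidence interval, so the interval's shape comes directly from the data.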

However, the GEE method is in principle similar to the weighted least squares, which does not possess robustness. Furthermore, the EL method may also be influenced by the outliers due to its close relationship with the maximum likelihood, and Owen (2001) pointed out that the EL confidence regions may be greatly lengthened in the direction of the outliers. In longitudinal data, one outlier in the subject level may generate a set of outliers due to repeated measurements. Hence, robustness is very important in longitudinal studies.

Recently, a large literature has been devoted to constructing robust GEE and EL methods, e.g., Fan et al. (2012), He et al. (2005), Qin and Zhu (2007), Qin et al. (2009), Qin et al. (2012), Wang et al. (2005) and Zheng et al. (2014). All of these papers apply Huber's score function to the Pearson residuals to dampen the effect of outliers.
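For reference, Huber's score function is the identity on $[-c, c]$ and constant outside it, so a large Pearson residual contributes at most $c$ to the estimating equations. The short sketch below uses the conventional tuning constant $c = 1.345$, which is an assumption here (the cited papers may use other values), and the helper names are hypothetical.

```python
def huber_psi(r, c=1.345):
    """Huber's score function: identity on [-c, c], clipped outside."""
    return max(-c, min(c, r))

def huber_weight(r, c=1.345):
    """Implied weight psi(r) / r attached to a Pearson residual r."""
    return 1.0 if r == 0 else huber_psi(r, c) / r

for r in (0.5, 2.0, 10.0):
    print(r, huber_psi(r), huber_weight(r))
```

The implied weight decays only like $c / |r|$, so gross outliers are bounded in influence but never fully discarded.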

Although Huber’s score function is robust, it is limited in terms of efficiency. To address this issue, Yao et al. (2012) and Yao and Li (2014) investigated a new modal regression estimation procedure. Specifically, for the linear regression model $y_{i} = x_{i}^{T} β + ε_{i}$, modal regression estimates the parameters by maximizing $Q_{h} (β) = \frac{1}{n} \sum_{i = 1}^{n} ϕ_{h} (y_{i} - x_{i}^{T} β), \qquad (1.2)$ where $ϕ_{h} (\cdot) = h^{- 1} ϕ (\cdot / h)$, $ϕ (\cdot)$ is a kernel density function and $h$ is a bandwidth that determines the degree of robustness and efficiency. Maximizing the objective function (1.2) is equivalent to solving the estimating equations $\sum_{i = 1}^{n} x_{i} ϕ_{h}' (y_{i} - x_{i}^{T} β) = 0,$ where $ϕ_{h}' (\cdot)$ is the first derivative of $ϕ_{h} (\cdot)$. In contrast to other estimation methods, modal regression treats $- ϕ_{h} (\cdot)$ as a loss function. Yao et al. (2012) and Yao and Li (2014) showed that, because modal regression estimates the “most likely” conditional values, it can provide more robust and efficient estimation than other existing methods when the bandwidth $h$ is chosen appropriately. Similar conclusions have been further confirmed in Zhang et al. (2013), Zhao et al. (2014) and Liu et al. (2013).
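A minimal sketch of the linear modal regression estimator, assuming a standard Gaussian kernel $ϕ$: since $ϕ'(t) = - t ϕ(t)$, the estimating equations become weighted normal equations with weights proportional to $ϕ_{h}$ of the current residuals, which suggests a simple fixed-point (iteratively reweighted least squares) iteration. The function name `modal_linear_regression`, the fixed bandwidth `h = 1.0` and the contaminated toy data are illustrative assumptions; the data-adaptive bandwidth choice discussed in the paper is not reproduced.

```python
import numpy as np

def modal_linear_regression(X, y, h, n_iter=200, tol=1e-10):
    """Fixed-point iteration for maximising mean(phi_h(y - X beta)), Gaussian phi."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares starting value
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.exp(-0.5 * (r / h) ** 2)          # proportional to phi_h(r)
        XtW = X.T * w                            # scales column i of X^T by w_i
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Toy data (hypothetical): 10% of the responses are shifted upwards.
rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + 0.5 * rng.normal(size=n)
y[:30] += 15.0

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
beta_modal = modal_linear_regression(X, y, h=1.0)
print("least squares:", beta_ls)
print("modal        :", beta_modal)
```

Unlike the Huber weight, the Gaussian weight decays exponentially in the squared residual, so gross outliers are effectively ignored, while a larger $h$ flattens the weights and moves the fit towards least squares.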

However, the new modal regression approach has only been considered for independent data. The first goal of this paper is to propose new modal regression based GEE and EL statistical inference for longitudinal data semivarying coefficient models. Specifically, (i) we propose a robust and efficient modal regression based GEE, which uses Mallows-type weights to downweight the effect of leverage points and applies the score function of $ϕ_{h} (\cdot)$ to the Pearson residuals to dampen the effect of outliers in the response; (ii) a robust EL statistical inference method for the parametric component in the model (1.1) is proposed by constructing robust modal regression auxiliary random vectors; (iii) both the new modal regression based GEE and EL can incorporate the working correlation matrix automatically to account for the correlations within subjects.

Moreover, variable selection is important for high-dimensional data. Various penalty functions have been proposed, such as the Lasso (Tibshirani, 1996), the adaptive Lasso (Zou, 2006), SCAD (Fan and Li, 2001) and so on. However, these procedures require convex optimization, which incurs a computational burden. To overcome this problem, Ueki (2009) developed a new variable selection procedure called the smooth-threshold estimating equations. Recently, Lai et al. (2012), Li et al. (2013) and Lv et al. (2015) extended this method to single index models and generalized linear models.

As the second goal of this paper, we propose a new smooth-threshold modal regression based GEE variable selection procedure for longitudinal data semivarying coefficient models, which can select the nonparametric and parametric parts simultaneously. Theoretically, the variable selection procedure enjoys consistency in variable selection and the oracle property in estimation. By inheriting the properties of the proposed modal regression based GEE, the new variable selection procedure has good robustness and efficiency, and can incorporate the correlation structure of longitudinal data.

The outline of this paper is as follows. Section 2 introduces the modal regression based GEE. Section 3 gives the modal regression based EL. The smooth-threshold modal regression based GEE variable selection procedure is introduced in Section 4. Numerical studies and real data analysis are reported in Section 5. Concluding remarks are given in Section 6. All the proofs are provided in the Appendix.

Section snippets

Estimating equation and main results

Following Huang et al. (2010), we use B-splines to approximate the functions $α_{k} (\cdot)$. Let $0 = τ_{0} < τ_{1} < \dots < τ_{K_{n}} < τ_{K_{n} + 1} = 1$ be a partition of $[0, 1]$ into $K_{n} + 1$ subintervals $I_{n j} = [τ_{j}, τ_{j + 1}), j = 0, \dots, K_{n} - 1$, and $I_{n K_{n}} = [τ_{K_{n}}, τ_{K_{n} + 1}]$, where $K_{n} = n^{ϑ}$ with $0 < ϑ < 0.5$ is a positive integer such that $\max_{1 \leq j \leq K_{n} + 1} | τ_{j} - τ_{j - 1} | = O (n^{- ϑ})$. Let $F_{n}$ be the space of polynomial splines of degree $D \geq 1$ consisting of functions $f$ satisfying: (i) the restriction of $f$ to $I_{n j}$ is a polynomial of degree $D$ for $0 \leq j \leq K_{n}$; (ii) $f$ is $(D - 1)$-times continuously differentiable on $[0,$
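The spline approximation can be sketched with the standard Cox-de Boor recursion, which evaluates a B-spline basis of degree $D$ on $[0, 1]$ for given interior knots. With $K_{n}$ simple interior knots the basis has $K_{n} + D + 1$ functions, and a coefficient function is then approximated by a linear combination of them. The function name `bspline_basis`, the equally spaced knots and the coefficient vector `gamma` below are illustrative assumptions rather than the paper's choices.

```python
def bspline_basis(u, interior_knots, degree):
    """Evaluate all B-spline basis functions of the given degree at u in [0, 1]."""
    t = [0.0] * (degree + 1) + list(interior_knots) + [1.0] * (degree + 1)
    n_basis = len(t) - degree - 1
    # Degree-0 functions: indicators of the knot intervals [t_j, t_{j+1}).
    B = [1.0 if t[j] <= u < t[j + 1] else 0.0 for j in range(len(t) - 1)]
    if u == 1.0:
        B[n_basis - 1] = 1.0  # close the last interval on the right
    # Cox-de Boor recursion, raising the degree one step at a time.
    for d in range(1, degree + 1):
        new = [0.0] * (len(t) - 1 - d)
        for j in range(len(new)):
            left = right = 0.0
            if t[j + d] > t[j]:
                left = (u - t[j]) / (t[j + d] - t[j]) * B[j]
            if t[j + d + 1] > t[j + 1]:
                right = (t[j + d + 1] - u) / (t[j + d + 1] - t[j + 1]) * B[j + 1]
            new[j] = left + right
        B = new
    return B

K, D = 3, 3
knots = [(j + 1) / (K + 1) for j in range(K)]      # equally spaced interior knots
gamma = [0.0, 0.5, 1.0, 0.2, -0.3, 0.4, 1.0]       # hypothetical spline coefficients
basis = bspline_basis(0.37, knots, D)
alpha_at_u = sum(b * g for b, g in zip(basis, gamma))
print(len(basis), sum(basis), alpha_at_u)
```

The basis functions are non-negative and sum to one at every point of $[0, 1]$, which gives a quick sanity check on any implementation.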

Modal regression based empirical likelihood inference for $β$

In real applications, the primary research interest may be statistical inference on the regression coefficient $β$. For semiparametric models, in order to conduct EL inference on the parametric part, the nonparametric part is often regarded as a nuisance (e.g., Xue and Zhu (2007a) and Qin et al. (2012)). After obtaining the modal regression GEE estimators ${\hat{α}}_{k} (u) = B {(u)}^{T} {\hat{γ}}_{k}, k = 1, \dots, q$, we first absorb them by projection to improve the inferences on $β$, then by considering working correlation to improve

Variable selection via smooth-threshold modal regression based GEE

Variable selection is important for high-dimensional data. Motivated by Ueki (2009) and the modal regression based GEE in Section 2, we propose the following smooth-threshold modal regression based GEE: $(I - Λ) U (β, γ, δ (β, γ)) + Λ {(β^{T}, γ^{T})}^{T} = 0, \qquad (4.1)$ where $I$ is the $(p + q d_{n})$-dimensional identity matrix, $Λ = diag (Λ_{1}, Λ_{2})$ is a block diagonal matrix, $Λ_{1} = diag (δ_{1, 1}, \dots, δ_{1, p})$ and $Λ_{2} = diag (\underbrace{δ_{2, 1}, \dots, δ_{2, 1}}_{d_{n}}, \dots, \underbrace{δ_{2, q}, \dots, δ_{2, q}}_{d_{n}}) .$

Remark 4.1

Note that in Eq. (4.1), if $X_{i j, k}$ is an irrelevant variable, then $δ_{1, k} = 1$ yields the solution ${\tilde{β}}_{k} = 0$, and similarly, $δ$
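The mechanics of a smooth-threshold equation of the form $(I - Λ) U (β) + Λ β = 0$ can be sketched in a plain linear model. Two simplifying assumptions are made here and are not taken from the paper: the estimating function is the least-squares one, $U (β) = - X^{T} (y - X β)$, rather than the modal regression GEE, and the thresholding weights take the form $δ_{k} = \min \{1, λ / | β_{k}^{(0)} |^{1 + τ}\}$ with an initial estimate $β^{(0)}$, in the spirit of Ueki (2009). The names `smooth_threshold_fit`, `lam` and `tau` and their values are hypothetical.

```python
import numpy as np

def smooth_threshold_fit(X, y, lam, tau=1.0):
    """Illustrative smooth-threshold estimating equations for a linear model."""
    beta_init = np.linalg.lstsq(X, y, rcond=None)[0]
    # Assumed thresholding weights: delta_k = 1 removes the k-th covariate.
    delta = np.minimum(1.0, lam / np.abs(beta_init) ** (1.0 + tau))
    active = delta < 1.0
    beta = np.zeros(X.shape[1])
    if active.any():
        Xa, da = X[:, active], delta[active]
        # Active rows of (I - Lambda) U(beta) + Lambda beta = 0 with
        # U(beta) = -X^T (y - X beta) and the inactive coefficients set to zero:
        # ((I - D) Xa^T Xa + D) beta_a = (I - D) Xa^T y.
        A = (1.0 - da)[:, None] * (Xa.T @ Xa) + np.diag(da)
        b = (1.0 - da) * (Xa.T @ y)
        beta[active] = np.linalg.solve(A, b)
    return beta, delta

# Toy data (hypothetical): covariates 3 and 4 are irrelevant.
rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 5))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 1.0])
y = X @ beta_true + rng.normal(size=n)

beta_hat, delta_hat = smooth_threshold_fit(X, y, lam=0.1)
print(beta_hat)
print(delta_hat)
```

Coefficients with $δ_{k} = 1$ are set exactly to zero without solving a penalized optimization problem, while coefficients with small $δ_{k}$ are shrunk only slightly.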

Numerical experiments

In this section, Experiment 1 illustrates the consistency and asymptotic normality of the modal regression based GEE estimators, Experiment 2 demonstrates the variable selection results of the smooth-threshold modal regression based GEE, and Experiment 3 investigates the modal regression based empirical likelihood inference procedure.

Experiment 1. We consider the following model $Y_{i j} = \sum_{k = 1}^{3} X_{i j, k} β_{k} + \sum_{k = 1}^{4} Z_{i j, k} α_{k} (U_{i j}) + ϵ_{i j}, \quad i = 1, \dots, n, \; j = 1, \dots, 5, \qquad (5.1)$ and we generate 500 data sets from (5.1) with

Concluding remarks

In this paper, based on modal regression, we propose robust and efficient statistical inference methods for semivarying coefficient models with longitudinal data, which include modal regression generalized estimating equations, a modal regression empirical likelihood inference procedure for the parametric component, and smooth-threshold modal regression generalized estimating equations for variable selection. These methods can incorporate the correlation structure of the longitudinal

Acknowledgments

The first author’s research was supported by NNSF, China project (71673171, 11571204 and 11231005), NSF project (ZR2017BA002) of Shandong Province of China.

References (47)

Bai Y. et al. Empirical likelihood inference for longitudinal generalized linear models. J. Statist. Plann. Inference (2010)
Fan G. et al. Penalized empirical likelihood for high-dimensional partially linear varying coefficient model with measurement errors. J. Multivariate Anal. (2016)
Fan Y. et al. Variable selection in robust regression models for longitudinal data. J. Multivariate Anal. (2012)
Lai P. et al. Bias-corrected GEE estimation and smooth-threshold GEE variable selection for single-index models with clustered data. J. Multivariate Anal. (2012)
Li G. et al. Automatic variable selection for longitudinal generalized linear models. Comput. Statist. Data Anal. (2013)
Li G. et al. Empirical likelihood for varying coefficient partially linear model with diverging number of parameters. J. Multivariate Anal. (2012)
Li D. et al. Empirical likelihood for generalized linear models with longitudinal data. J. Multivariate Anal. (2013)
Liu J. et al. A robust and efficient estimation method for single index models. J. Multivariate Anal. (2013)
Lv J. et al. An efficient and robust variable selection method for longitudinal generalized linear models. Comput. Statist. Data Anal. (2015)
Qin G. et al. Robust empirical likelihood inference for generalized partial linear models with longitudinal data. J. Multivariate Anal. (2012)
Qin G. et al. Robust estimation in generalized semiparametric mixed models for longitudinal data. J. Multivariate Anal. (2007)
Qin G. et al. Robust estimation of covariance parameters in partial linear model for longitudinal data. J. Statist. Plann. Inference (2009)
Wang H. et al. Empirical likelihood for quantile regression models with longitudinal data. J. Statist. Plann. Inference (2011)
Yang H. et al. Empirical likelihood for semiparametric varying coefficient partially linear models with longitudinal data. Statist. Probab. Lett. (2010)
You J. et al. Empirical likelihood for semiparametric varying-coefficient partially linear regression models. Statist. Probab. Lett. (2006)
Fan J. et al. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. (2001)
Han P. et al. Longitudinal data analysis using the conditional empirical likelihood method. Canad. J. Statist. (2014)
He X. et al. Robust estimation in generalized partial linear models for clustered data. J. Amer. Statist. Assoc. (2005)
Huang J. et al. Variable selection in nonparametric additive models. Ann. Statist. (2010)
Lian H. et al. Generalized additive partial linear models for clustered data with diverging number of covariates using GEE. Statist. Sinica (2014)
Liang K. et al. Longitudinal data analysis using generalized linear models. Biometrika (1986)
Owen A. Empirical likelihood ratio confidence regions. Ann. Statist. (1990)
Owen A. Empirical Likelihood (2001)

