A split-and-conquer variable selection approach for high-dimensional general semiparametric models with massive data

https://doi.org/10.1016/j.jmva.2022.105128

Abstract

Estimation and variable selection in partially linear models for massive data have been discussed by several authors. However, no established procedure seems to exist for other semiparametric models, such as the semiparametric varying-coefficient linear model, the single index regression model, and the partially linear errors-in-variables model. In this paper, we propose a general procedure for variable selection in high-dimensional general semiparametric models via penalized semiparametric estimating equations. Under some regularity conditions, the oracle property is established, in which the number of parameters is allowed to diverge. Furthermore, we propose a split-and-conquer variable selection procedure for high-dimensional general semiparametric models with massive data. Under some weak regularity conditions, we establish the oracle property of the proposed procedure when the number of subsets does not grow too fast. Moreover, the split-and-conquer procedure enjoys the same oracle property as the penalized estimator computed on the full dataset, and can substantially reduce computing time and computer memory requirements. The performance of the proposed method is illustrated via a real data application and numerical simulations.

Introduction

In the past two decades, revolutions in information technologies have produced many kinds of massive data. It is very challenging to extract useful features from a massive dataset, since many statistics are difficult to compute with traditional algorithms or statistical packages when the dataset is too large to be stored in primary memory. In recent years, there has been active research on developing methods to analyze massive data. Lin and Xi [15] developed an aggregated algorithm for estimating-equation estimation in massive datasets using a divide-and-conquer strategy; Chen and Xie [2] considered a divide-and-conquer approach for generalized linear models where both the sample size n and the number of covariates p are large; Song and Liang [18] proposed a split-and-merge Bayesian variable selection method for ultrahigh dimensional linear regression models; Kleiner et al. [1] proposed the bag of little bootstraps approach for massive data; Liang et al. [14] developed a resampling-based method for big geostatistical data; Schifano et al. [17] presented an online updating strategy for big data arising from online analytical processing; etc.

Variable selection is a fundamental problem in statistical analysis. Variable selection for high-dimensional regression is often treated with penalized likelihood procedures. Popular penalized likelihood procedures include the LASSO (Tibshirani [20]) and the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li [4]), among others. Variable selection procedures for the high-dimensional setting, where the dimension p of the observations increases with the sample size n, are given by Fan and Peng [5], Tang and Leng [19], etc. The penalized approach has also been extended to estimating equations. Fu [8] proposed a generalization of the bridge penalty to generalized estimating equations. Johnson et al. [10] proposed a procedure for variable selection in semiparametric regression models by penalizing appropriate estimating functions. Wang [24] proposed the SCAD-penalized generalized estimating equations procedure, in which the number of parameters is allowed to diverge.

In this paper, motivated by the variable selection procedure for estimating equations in Johnson et al. [10], we first extend the penalized procedure to general semiparametric models based on penalized semiparametric estimating functions, allowing the data dimension p to diverge. Johnson et al. [10] proposed a variable selection procedure for estimating equations. However, their combination of estimating equations and penalty functions cannot be applied directly to semiparametric estimating equations. For example, consider the partially linear model Y = X⊤θ + f(T) + ɛ; the semiparametric estimating function can be taken as g(X, H(T), θ) = {X − E(X|T)}[{Y − E(Y|T)} − {X − E(X|T)}⊤θ]. Since the corresponding semiparametric estimating equations involve the unknown functions E(X|T) and E(Y|T), these functions need to be estimated in advance. We propose a general procedure for variable selection in high-dimensional semiparametric models by penalized semiparametric estimating equations, including partially linear models, semiparametric varying-coefficient linear models, partially linear errors-in-variables models, etc.
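As a rough numerical illustration of this partialling-out idea (not the paper's exact estimator), the sketch below estimates E(X|T) and E(Y|T) by Nadaraya–Watson kernel smoothing and then solves the resulting unpenalized estimating equation for θ; the data-generating design, bandwidth, and seed are all illustrative assumptions.

```python
import numpy as np

def nw_smooth(t, v, h=0.1):
    """Nadaraya-Watson kernel estimate of E[v | T] at each sample point,
    using a Gaussian kernel with (assumed) bandwidth h."""
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    w /= w.sum(axis=1, keepdims=True)   # rows sum to one
    return w @ v                        # works for v of shape (n,) or (n, p)

rng = np.random.default_rng(0)
n, p = 500, 3
theta = np.array([1.5, 0.0, -2.0])      # true linear coefficients
T = rng.uniform(0, 1, n)
X = rng.normal(size=(n, p)) + np.sin(2 * np.pi * T)[:, None]  # X correlated with T
Y = X @ theta + np.cos(2 * np.pi * T) + 0.1 * rng.normal(size=n)

X_tilde = X - nw_smooth(T, X)           # X - Ê(X|T)
Y_tilde = Y - nw_smooth(T, Y)           # Y - Ê(Y|T)
# Summing the estimating function over i and setting it to zero gives a
# profiled least-squares problem in theta:
theta_hat = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y_tilde)
```

With the nuisance functions partialled out by the kernel fits, theta_hat recovers the linear part θ up to smoothing and sampling error.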

Semiparametric models are a useful compromise between parametric and nonparametric models to mitigate the curse of dimensionality but still allow reasonable flexibility to specify functional form; see, e.g., Engle et al. [3]. For massive data setting, Zhao, Cheng and Liu [25] considered a partially linear framework for modelling massive heterogeneous data, and proposed to extract the common feature across all sub-populations while exploring heterogeneity of each sub-population. Lian, Zhao and Lv [13] studied divide-and-conquer methodology for high-dimensional partially linear models, focusing on the estimation and asymptotic distribution of the nonparametric function. Wang, Lian and Liang [21] proposed an additive partially linear framework for modelling massive heterogeneous data. Wang, Zhang and Lian [22] considered divide-and-conquer strategy for partially linear additive models with a high dimensional linear part.

Statistical modeling for massive data has attracted considerable recent research. However, as far as we are aware, semiparametric inference for massive data under models such as the single index regression model, the semiparametric varying-coefficient linear model, and the partially linear errors-in-variables model remains largely untouched. Motivated by the divide-and-conquer approach for generalized linear models with massive data in Chen and Xie [2], we consider a split-and-conquer variable selection procedure for high-dimensional general semiparametric models with massive data, focusing on selection consistency and the oracle property of the proposed procedure. The main contributions of this paper are:

1. We propose a general procedure for variable selection in high-dimensional semiparametric models by penalized semiparametric estimating equations, including partially linear models, single index regression models, semiparametric varying-coefficient linear models, partially linear errors-in-variables models, etc. Under some regularity conditions, the oracle property is established, in which the number of parameters is allowed to diverge.

2. For extraordinarily large data, we propose a split-and-conquer variable selection procedure for general semiparametric models based on penalized semiparametric estimating functions. Under some regularity conditions, we establish the oracle property of the aggregated estimator θˇ when the number of subsets does not grow too fast. Moreover, the split-and-conquer procedure enjoys the same oracle property as the penalized estimator computed on the full dataset, and can substantially reduce computing time and computer memory requirements. The main differences between this second contribution and the related works are: (1) Zhao, Cheng and Liu [25], Lian, Zhao and Lv [13] and Wang, Lian and Liang [21] considered partially linear models with massive heterogeneous data, while we consider general semiparametric models with massive data via semiparametric estimating equations, including partially linear models, semiparametric varying-coefficient linear models, partially linear errors-in-variables models, single index regression models, etc. (2) Lian, Zhao and Lv [13], Wang, Lian and Liang [21] and Wang, Zhang and Lian [22] employed polynomial splines to approximate the nonparametric functions, while we estimate H(T) by kernel estimators.
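The split-and-conquer idea can be sketched as follows. This toy version uses an unpenalized least-squares fit on each subset and aggregates by simple averaging; the paper's actual procedure combines penalized semiparametric estimating-equation estimators, so the per-subset solver and the aggregation rule here are simplified stand-ins.

```python
import numpy as np

def subset_estimate(X, Y):
    """Least-squares fit on one subset -- a stand-in for solving the
    penalized semiparametric estimating equations on that subset."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

def split_and_conquer(X, Y, K):
    """Split the N observations into K subsets, estimate on each, and
    aggregate by simple averaging (an assumed, simplified combiner)."""
    idx = np.array_split(np.arange(len(Y)), K)
    ests = [subset_estimate(X[i], Y[i]) for i in idx]
    return np.mean(ests, axis=0)

rng = np.random.default_rng(1)
n, p, K = 10_000, 5, 10
theta = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
X = rng.normal(size=(n, p))
Y = X @ theta + rng.normal(size=n)
theta_check = split_and_conquer(X, Y, K)   # aggregated estimator
```

Only one subset needs to reside in memory at a time, which is the source of the memory savings; the theory in the paper makes precise when the aggregated estimator matches the full-data penalized estimator.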

The remainder of this paper is organized as follows. In Section 2, we propose a split-and-conquer variable selection procedure for high-dimensional general semiparametric models with massive data, and establish selection consistency and the oracle property of the proposed method. In Section 3, we propose a computational algorithm to solve the penalized semiparametric estimating equations. Simulation results are reported in Section 4, and one real data example is presented in Section 5. Finally, the technical proofs of main results are stated in the Appendix.

Section snippets

Methodology and main results

Consider semiparametric models specified through estimating equations that contain unknown functions: E[g(X, H(T), θ)] = 0, where X = (X, Y, Z) is a random vector, T is an associated variable with bounded support, H(T) ∈ R^k is an unknown smooth function with H(t) = (H_1(t), …, H_k(t))⊤ = E[φ(X)|T = t], φ(·) = (φ_1(·), …, φ_k(·))⊤ are known measurable functions, g is an r-dimensional estimating function for θ, and θ = (θ_1, …, θ_p)⊤ ∈ Θ_θ ⊂ R^p is a vector of unknown parameters. We assume that r = p. The
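To make the role of H concrete, the sketch below estimates H(t) = E[φ(X)|T = t] at a single point by Nadaraya–Watson kernel smoothing, with φ = (φ_1, φ_2) taken as the hypothetical choices φ_1(X) = X and φ_2(X) = X²; the design, bandwidth, and evaluation point are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def khat(T, phi_vals, t0, h=0.1):
    """Nadaraya-Watson estimate of H(t0) = E[phi(X) | T = t0], one entry
    per component of phi, with a Gaussian kernel and assumed bandwidth h."""
    w = np.exp(-0.5 * ((T - t0) / h) ** 2)
    return (w @ phi_vals) / w.sum()

rng = np.random.default_rng(4)
n = 2000
T = rng.uniform(0, 1, n)
X = T + 0.1 * rng.normal(size=n)       # so E[X|T=t] = t, E[X^2|T=t] = t^2 + 0.01
phi = np.column_stack([X, X ** 2])     # phi = (phi_1, phi_2), so k = 2
H_half = khat(T, phi, 0.5)             # estimates H(0.5) = (0.5, 0.26)
```

In the full procedure these plug-in estimates Ĥ(T_i) replace H(T_i) inside the estimating function g before the penalized equations are solved.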

Computational algorithm

Solving the penalized semiparametric estimating equations (4) is obviously difficult. Hunter and Li [9] proposed an MM algorithm for variable selection that can be viewed as a Newton–Raphson-type algorithm for solving the perturbed penalized estimating equations. Inspired by Hunter and Li, we propose the iterative algorithm θˆ(k+1) = argmin_θ ‖(1/N) Σ_{i=1}^{N} g(X_i, Hˆ(T_i), θ) − P_λ(|θ|) sgn(θ)‖, k ≥ 0, to solve the semiparametric estimating equations (4), where θˆ(0) is a minimizer of ‖(1/N) Σ_{i=1}^{N} g(X_i, Hˆ(T_i), θ)‖. The steps outline the
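A minimal sketch of this kind of iteration, in the spirit of Hunter and Li's MM/local-quadratic-approximation scheme, is given below for the special case where g is the linear-model estimating function and the penalty is SCAD; the bandwidth-free linear g, the tuning constants, and the hard threshold at the end are simplifying assumptions, not the paper's exact algorithm.

```python
import numpy as np

def scad_deriv(u, lam, a=3.7):
    """Derivative p'_lam(|u|) of the SCAD penalty (Fan and Li [4])."""
    u = np.abs(u)
    return lam * ((u <= lam).astype(float)
                  + np.maximum(a * lam - u, 0.0) / ((a - 1.0) * lam) * (u > lam))

def penalized_ee(X, Y, lam, n_iter=50, eps=1e-6):
    """Local quadratic approximation iterations for the SCAD-penalized
    least-squares estimating equation
        (1/N) X'(Y - X theta) - p'_lam(|theta|) sgn(theta) = 0.
    Each step solves a ridge-like linear system in which the penalty is
    approximated quadratically around the current iterate."""
    N = X.shape[0]
    theta = np.linalg.solve(X.T @ X, X.T @ Y)      # unpenalized start theta^(0)
    for _ in range(n_iter):
        d = scad_deriv(theta, lam) / (np.abs(theta) + eps)
        theta = np.linalg.solve(X.T @ X / N + np.diag(d), X.T @ Y / N)
    theta[np.abs(theta) < 1e-4] = 0.0              # set tiny coefficients to zero
    return theta

rng = np.random.default_rng(2)
n, p = 400, 8
theta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
X = rng.normal(size=(n, p))
Y = X @ theta_true + rng.normal(size=n)
theta_hat = penalized_ee(X, Y, lam=0.2)
```

The perturbation eps keeps the quadratic approximation well defined at zero, which is the role of the perturbed equations in Hunter and Li's formulation.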

Simulation studies

We conduct simulation studies to illustrate the properties of the proposed procedure for high-dimensional semiparametric models using the estimating-equations framework, and illustrate its usefulness through several examples of semiparametric models, as follows.

Example 1

We consider the varying coefficient partially linear model: Y_i = X_i⊤θ + Z_i⊤u(T_i) + ɛ_i, i ∈ {1, …, n}, where (X_i, Z_i, T_i) are the associated covariates and u(·) = (u_1(·), …, u_q(·))⊤ is a q-dimensional vector of unknown smooth regression
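A data-generating sketch for a model of this form is shown below; the particular sparse θ, the coefficient functions u_1(t) = sin(2πt) and u_2(t) = t², and the noise level are hypothetical choices for illustration, not the paper's simulation settings.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 200, 4, 2
theta = np.array([1.0, 0.0, -0.5, 0.0])          # sparse linear coefficients

def u(t):
    """Hypothetical q = 2 vector of varying coefficient functions u(t)."""
    return np.column_stack([np.sin(2 * np.pi * t), t ** 2])

T = rng.uniform(0, 1, n)
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, q))
eps = 0.2 * rng.normal(size=n)
# Y_i = X_i' theta + Z_i' u(T_i) + eps_i
Y = X @ theta + np.sum(Z * u(T), axis=1) + eps
```

The variable selection task is then to recover the two nonzero entries of θ while the varying-coefficient part Z_i⊤u(T_i) is handled nonparametrically.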

Data application

We apply the proposed split-and-conquer variable selection procedure for general semiparametric models to the airline on-time performance data from the 2009 ASA Data Expo ( http://stat-computing.org/dataexpo/2009/the-data.html). The data is publicly available and has been used for demonstration with big data by Kane, Emerson and Weston [11] and Schifano et al. [17]. It consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008.

Acknowledgments

We are grateful to Editor-in-Chief Professor von Rosen, the Associate Editor and two anonymous referees for their insightful comments and suggestions, which led to substantial improvements. This work was supported by the Scientific Research Fund of Hunan Provincial Education Department (Grant No. 20B139).

References (26)

  • Fang, J.L., et al., Penalized empirical likelihood for semiparametric models with a diverging number of parameters, J. Statist. Plann. Inference (2017)
  • Kleiner, A., et al., A scalable bootstrap for massive data, J. R. Stat. Soc. Ser. B Stat. Methodol. (2014)
  • Chen, X.Y., et al., A split-and-conquer approach for analysis of extraordinarily large data, Statist. Sinica (2014)
  • Engle, R.F., et al., Semiparametric estimates of the relation between weather and electricity sales, J. Amer. Statist. Assoc. (1986)
  • Fan, J.Q., et al., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. (2001)
  • Fan, J.Q., et al., Nonconcave penalized likelihood with a diverging number of parameters, Ann. Statist. (2004)
  • Fan, J.Q., et al., Ultrahigh dimensional feature selection: beyond the linear model, J. Mach. Learn. Res. (2009)
  • Fu, W.J., Penalized estimating equations, Biometrics (2003)
  • Hunter, D.R., et al., Variable selection using MM algorithms, Ann. Statist. (2005)
  • Johnson, B.A., et al., Penalized estimating functions and variable selection in semiparametric regression models, J. Amer. Statist. Assoc. (2008)
  • Kane, M., et al., Scalable strategies for computing with massive data, J. Stat. Softw. (2013)
  • Leng, C.L., et al., Penalized empirical likelihood and growing dimensional general estimating equations, Biometrika (2012)
  • Lian, H., et al., Projected spline estimation of the nonparametric function in high-dimensional partially linear models for massive data, Ann. Statist. (2019)