Abstract
Feature screening is a popular and efficient statistical technique for processing ultrahigh-dimensional data. When a regression model contains both categorical and continuous predictors, a unified feature screening procedure is needed. We therefore propose a unified mean-variance sure independence screening (UMV-SIS) procedure for this setting. The mean-variance (MV), an effective utility for measuring the dependence between two random variables, is widely used in feature screening for discriminant analysis. In this paper, we advocate using the kernel smoothing method to estimate the MV between two continuous variables, thereby extending it to screen categorical and continuous predictors simultaneously. Besides its unified treatment of both types of predictors, UMV-SIS is model-free: it requires no specification of a regression model, which broadens the scope of its application. In theory, we show that the UMV-SIS procedure enjoys the sure screening and ranking consistency properties under mild conditions. To overcome some well-known difficulties of marginal feature screening for the linear model and to further enhance the screening performance of the proposed method, an iterative UMV-SIS procedure is developed. The promising performance of the new method is supported by extensive numerical examples.

References
Cui H, Li R, Zhong W (2015) Model-free feature screening for ultrahigh dimensional discriminant analysis. J Am Statist Assoc 110:630–641
Cui H, Zhong W (2018) A distribution-free test of independence and its application to variable selection. arXiv:1801.10559
Fan J, Fan Y (2008) High dimensional classification using features annealed independence rules. Ann Statist 36:2605–2637
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Statist Assoc 96:1348–1360
Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Statist Soc Ser B (Statist Methodol) 70:849–911
Fan J, Samworth R, Wu Y (2009) Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res 10:1829–1853
He X, Wang L, Hong HG et al (2013) Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann Statist 41:342–369
Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Am Statist Assoc 58:13–30
Hosmer D, Lemeshow S (1989) Applied Logistic Regression. John Wiley, New York
Kong E, Xia Y, Zhong W (2019) Composite coefficient of determination and its application in ultrahigh dimensional variable screening. J Am Statist Assoc 114:1740–1751
Li G, Peng H, Zhang J, Zhu L (2012) Robust rank correlation based screening. Ann Statist 40:1846–1877
Li Q, Racine JS (2007) Nonparametric econometrics: theory and practice. Princeton University Press, Princeton
Li R, Zhong W, Zhu L (2012) Feature screening via distance correlation learning. J Am Statist Assoc 107:1129–1139
Li X, Cheng G, Wang L, Lai P, Song F (2017) Ultrahigh dimensional feature screening via projection. Comput Statist Data Anal 114:88–104
Li X, Li R, Xia Z, Xu C (2020) Distributed feature screening via componentwise debiasing. J Mach Learn Res 21:1–32
Liu J, Li R, Wu R (2014) Feature selection for varying coefficient models with ultrahigh-dimensional covariates. J Am Statist Assoc 109:266–274
Mai Q, Zou H (2012) The Kolmogorov filter for variable screening in high-dimensional binary classification. Biometrika 100:229–234
Mai Q, Zou H (2015) The fused Kolmogorov filter: a nonparametric model-free screening method. Ann Statist 43:1471–1497
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Statist Soc Ser B (Methodol) 58:267–288
Wang H (2009) Forward regression for ultra-high dimensional variable screening. J Am Statist Assoc 104:1512–1524
Xu C, Chen J (2014) The sparse MLE for ultrahigh-dimensional feature screening. J Am Statist Assoc 109:1257–1269
Yan X, Tang N, Xie J, Ding X, Wang Z (2018) Fused mean-variance filter for feature screening. Comput Statist Data Anal 122:18–32
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Statist Soc Ser B (Statist Methodol) 68:49–67
Zhang CH (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Statist 38:894–942
Zhao SD, Li Y (2012) Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J Multivar Anal 105:397–411
Zhou Y, Zhu L (2018) Model-free feature screening for ultrahigh dimensional data through a modified Blum-Kiefer-Rosenblatt correlation. Statistica Sinica 28:1351–1370
Zhu L-P, Li L, Li R, Zhu L-X (2011) Model-free feature screening for ultrahigh-dimensional data. J Am Statist Assoc 106:1464–1475
Zou H (2006) The adaptive lasso and its oracle properties. J Am Statist Assoc 101:1418–1429
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Statist Soc Ser B (Statist Methodol) 67:301–320
Acknowledgements
This work is supported by the Philosophy and Social Science Research Fund of Jiangsu Province for Universities (Grant No. 2019SJA2093). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agency.
Appendix: Proof of the theorems
Leave-one-out least-squares cross-validation (LSCV). Given m candidate values \(\alpha _1,\ldots ,\alpha _m\), LSCV is used to determine the optimal \(\alpha \) for each \(X_j\) in Example 1 and 3.4. The objective function of LSCV is
where
is the leave-one-out kernel estimator of \(E(I(Y_i\le Y_m)|X_j=X_{ij})\), and \(W(\cdot )\) is a nonnegative weight function that excludes unreliable estimates \(\hat{E}_{-i}(I(Y_i\le Y_m)|X_j=X_{ij})\) caused by the so-called boundary effect or the random-denominator issue. With the observations \(X_{1j},\ldots ,X_{nj}\) ordered as \(X_{(1)j},\ldots ,X_{(n)j}\), we can simply set \(W(X_{ij})=I(X_{ij} \in \{X_{(r+1)j},\ldots ,X_{(n-r)j}\})\) for a given r. Given \(\{\alpha _1,\ldots ,\alpha _m\}\), we can determine the optimal \(\alpha \) for \(X_j\) by
Then, the screening index \(\hat{\omega }_j\) is computed with the selected \(\alpha ^{opt}_j\).
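To make this step concrete, the following is a minimal Python sketch, assuming a Gaussian kernel, treating the candidates \(\alpha _1,\ldots ,\alpha _m\) directly as bandwidths, and assuming that the LSCV objective is the \(W\)-weighted squared distance between \(I(Y_i\le Y_m)\) and its leave-one-out estimate; the function names (lscv_bandwidth, umv_index) and the plug-in form of \(\hat{\omega }_j\) are illustrative rather than the authors' implementation.

```python
import numpy as np

def gaussian_kernel(u):
    # A concrete kernel choice; the paper only requires K to satisfy Condition C4.
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def lscv_bandwidth(xj, y, alphas, r=2):
    """Select a bandwidth for predictor X_j by leave-one-out LSCV (a sketch)."""
    n = len(xj)
    ind = (y[:, None] <= y[None, :]).astype(float)      # ind[l, m] = I(Y_l <= Y_m)
    order = np.argsort(xj)
    w = np.zeros(n)
    w[order[r:n - r]] = 1.0                              # W(X_ij): trim r boundary points per side

    best_alpha, best_cv = alphas[0], np.inf
    for alpha in alphas:
        K = gaussian_kernel((xj[None, :] - xj[:, None]) / alpha)
        np.fill_diagonal(K, 0.0)                         # leave observation i out
        denom = np.maximum(K.sum(axis=1, keepdims=True), 1e-12)
        est = (K @ ind) / denom                          # est[i, m] ~ E_{-i}(I(Y <= Y_m) | X_j = X_ij)
        cv = np.sum(w[:, None] * (ind - est) ** 2)       # assumed form of the LSCV criterion
        if cv < best_cv:
            best_alpha, best_cv = alpha, cv
    return best_alpha

def umv_index(xj, y, alpha):
    # Hypothetical plug-in estimate of omega_j, following the decomposition used in the
    # proof of Theorem 1: average of (F_hat(Y_i | X_j = X_mj) - F_hat(Y_i))^2 over (m, i).
    ind = (y[:, None] <= y[None, :]).astype(float)       # ind[l, i] = I(Y_l <= Y_i)
    K = gaussian_kernel((xj[None, :] - xj[:, None]) / alpha)
    cond_cdf = (K @ ind) / np.maximum(K.sum(axis=1, keepdims=True), 1e-12)
    marg_cdf = ind.mean(axis=0)                          # empirical F_hat(Y_i)
    return np.mean((cond_cdf - marg_cdf[None, :]) ** 2)
```

For each continuous predictor \(X_j\), one would call lscv_bandwidth over the candidate grid and then evaluate umv_index with the selected bandwidth to obtain \(\hat{\omega }_j\).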
Lemma 1
(Hoeffding’s inequality; Hoeffding 1963) Let \(X_1,\ldots ,X_n\) be independent random variables. Assume that \(P(X_i\in [a_i,b_i])=1\) for \(1\le i\le n\), where \(a_i\) and \(b_i\) are constants. Let \(\bar{X}=n^{-1}\sum _{i=1}^nX_i\). Then the following inequality holds:
\[ P\left( |\bar{X}-E(\bar{X})|\ge t\right) \le 2\exp \left\{ -\frac{2n^2t^2}{\sum _{i=1}^n(b_i-a_i)^2}\right\} , \]
where t is a positive constant and \(E(\bar{X})\) is the expected value of \(\bar{X}\).
Lemma 2
(Liu et al. 2014) For any random variable X, the following two statements are equivalent:
(A) There exists \(H>0\) such that \(Ee^{tX}<\infty \) for all \(|t|<H\).
(B) There exist \(\eta >0\) and \(\nu >0\) such that \(Ee^{\nu (X-EX)}\le e^{\eta \nu ^2}\).
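Only the implication (A) \(\Rightarrow \) (B), with \(\nu \) taken small, is invoked below. As a brief sketch (assuming \(\mathrm {Var}(X)>0\); the degenerate case is immediate), a second-order expansion of the moment generating function at 0 gives
\[ Ee^{\nu (X-EX)}=1+\tfrac{1}{2}\nu ^2\,\mathrm {Var}(X)+o(\nu ^2)\le 1+\nu ^2\,\mathrm {Var}(X)\le e^{\nu ^2\mathrm {Var}(X)} \]
for all sufficiently small \(\nu \), so that (B) holds with \(\eta =\mathrm {Var}(X)\).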
Lemma 3
(Liu et al. 2014) Suppose that a(x) and b(x) are two uniformly bounded functions; that is, there exist \(M_1>0\) and \(M_2>0\) such that \(\sup _{x\in \mathbb {X}}|a(x)|\le M_1\) and \(\sup _{x\in \mathbb {X}}|b(x)|\le M_2\).
For a given \(x\in \mathbb {X}\), let \(\hat{A}(x)\) and \(\hat{B}(x)\) be estimators of a(x) and b(x) based on a sample of size n. For any small \(\varepsilon \in (0,1)\), suppose that there exist positive constants \(c_1\), \(c_2\) and \(\nu \) such that
Moreover, suppose that b(x) is uniformly bounded away from 0; that is, there exists \(M_3>0\) such that \(\inf _{x\in \mathbb {X}}|b(x)|>M_3\). Then there exists a constant \(C'>0\) such that
Proof of Theorem 1
We first prove the exponential consistency of \(\hat{\omega }_{j}\) in (2.4), corresponding to a continuous predictor. According to the definition of \(\omega _{j}\) in (2.2), we have
The term \(H_{j,1}\) can be further decomposed as
The terms \(H_{j,2}\) and \(I_{j,2}\) can be handled by Lemma 1. Let \(G(X_{mj})=\int _{\mathbb {Y}}(F(y|X_j=X_{mj})-F(y))^2dF(y)\) and \(Q(Y_i,X_{mj})=(F(Y_i|X_j=X_{mj})-F(Y_i))^2\). Obviously, both \(G(X_{mj})\) and \(Q(Y_i,X_{mj})\) are bounded between 0 and 1. It follows that
Similarly,
We then deal with \(I_{j,1}\).
Next, \(D_{j,1}\) is analysed by using techniques similar to those in Liu et al. (2014). Specifically,
where \(m_j(x,y)=E\left( I(Y\le y)|X_j=x\right) \) and \(f_j(x)\) is the density function of \(X_j\). We define
We now work on \(P_{j,1}(x,y)\). For any \(\xi >0\), by Markov’s inequality,
Furthermore,
Set \(\xi =n\nu \) and define \(\psi (\nu )= E\{\exp \left( \nu I(Y_m\le y)K(\frac{X_{mj}-x}{h})\right) \}\). Then (6.8) can be expressed as
Let us now consider the factor \(\exp \{-\nu hf_j(x)m_j(x,y)\}\cdot \psi (\nu )\) in (6.9). It can be further decomposed as
where
When x is close to 0, by Taylor’s expansion, \(\exp (x)\) can be bounded by
Under Conditions C3-C5, we choose such a small \(\nu \) that (6.11) can be applied to bound \(L_{j,1}(x,y)\) as
Denote \(\delta _h(x,y)= E\left\{ I(Y_m\le y)K(\frac{X_{mj}-x}{h})\right\} -hf_j(x)m_j(x,y)\). Recalling that \(m_j(X_{j},y)=E\left( I(Y\le y)|X_{j}\right) \), we have \(\delta _h(x,y)=E\left\{ m_j(X_{j},y)K(\frac{X_{j}-x}{h})\right\} -hf_j(x)m_j(x,y)\). Since \(\int K(t)\, dt=1\), it follows that
By using Condition C4, \(\int tK(t)dt=0\) and \(\int t^2K(t) dt<\infty \). Therefore,
where \(m'_{j}(x,y)=\frac{\partial m_j(x,y)}{\partial x}\) and \(m''_{j}(x,y)=\frac{\partial ^2 m_j(x,y)}{\partial x^2}\). By the dominated convergence theorem, together with \(m''_j(x,y)f_j(x)+2m'_j(x,y)f'_j(x)+m_j(x,y)f''_j(x)\) being uniformly bounded under Conditions C3 and C5, \(h^{-3}\delta _h(x,y)\) is uniformly bounded by some constant C for all \(x\in \mathbb {X}_j\) and \(y\in \mathbb {Y}\). This implies that
Since \(h\rightarrow 0\) as \(n\rightarrow \infty \), it follows that \(\sup _{x\in \mathbb {X}_j, y\in \mathbb {Y}}L_{j,1}(x,y)\le 1+\varepsilon \nu /16\), for large enough n. Then we focus on \(L_{j,2}(x,y)\) in (6.10). According to Lemma 2, \(L_{j,2}(x,y)\) is uniformly bounded by \(\exp (\eta \nu ^2)\) for some constant \(\eta >0\). Using Taylor’s expansion, \(\exp (\eta \nu ^2)\le 1+2\eta \nu ^2<1+\varepsilon \nu /16\), as long as \(\nu \) is close to 0 and satisfies \(0<\nu <\varepsilon /(32\eta )\). Thus, for sufficiently small \(\nu >0\) and large n, (6.10) satisfies
By Taylor’s expansion, (6.9) can be bounded by
Similarly,
It follows that
By setting \(m_j(x,y)=1\), we have \(\sup _{x\in \mathbb {X}_j} P_{j,2}(x)\le (1-\varepsilon \nu /4)^n\) and
Furthermore, by Lemma 3, there exists some \(c_3 > 0\) such that
Thus, (6.7) can be further bounded by
To bound \(D_{j,2}\), we use Hoeffding’s inequality in Lemma 1 again:
Finally, by combining Eqs. (6.4), (6.5), (6.6), (6.12) and (6.13), we conclude that there exists a constant \(c_4\) such that
By a similar argument, for the estimates \(\hat{\omega }_j\) in (2.3) corresponding to categorical predictors, there exists a positive constant \(c_5\) such that
For the detailed proof, we refer the reader to Lemma A.4 in Cui et al. (2015).
The convergence properties of both the continuous and the categorical estimators have now been established in (6.14) and (6.15). Letting \(\varepsilon =cn^{-\tau }\) and recalling the assumption \(h=O(n^{-\theta })\) with \(\tau /3<\theta <1/3\), we have \(h^3=o(\varepsilon )\), so that \(L_{j,1}(x,y)\) can be easily bounded. Therefore, there exists a positive constant \(C_1\) such that
The theorem is proved. \(\square \)
Proof of Theorem 2
If \(\mathcal {M}\nsubseteq \widetilde{\mathcal {M}}\), there must exist some \(j\in \mathcal {M}\) such that \(\widetilde{\omega }_j<cn^{-\tau }\). Moreover, Condition C6 assumes that \( \min \limits _{j \in \mathcal {M}} \omega _j \ge 2cn^{-\tau }\). Thus, \(\mathcal {M}\nsubseteq \widetilde{\mathcal {M}}\) implies \(|\widetilde{\omega }_j-\omega _j|>cn^{-\tau }\) for some \(j\in \mathcal {M}\). Therefore, by (6.16), we have
where d is the cardinality of \(\mathcal {M}\). The sure screening property is proved.
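The display above follows from a union bound over the active set together with the exponential bound in (6.16); schematically, writing the right-hand side of (6.16) as \(\Delta _n\),
\[ P(\mathcal {M}\nsubseteq \widetilde{\mathcal {M}})\le \sum _{j\in \mathcal {M}}P\left( |\widetilde{\omega }_j-\omega _j|>cn^{-\tau }\right) \le d\max _{j\in \mathcal {M}}P\left( |\widetilde{\omega }_j-\omega _j|>cn^{-\tau }\right) \le d\,\Delta _n , \]
so that \(P(\mathcal {M}\subseteq \widetilde{\mathcal {M}})\ge 1-d\,\Delta _n\).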
By the assumption in Theorem 2 that \(\kappa =\min \limits _{j\in \mathcal {M}}\omega _j-\max \limits _{j\in \mathcal {M}^c}\omega _j>0\), there exists a positive constant \(C_2\) such that
The last inequality follows directly from Eq. (6.16), and it tends to 0 as \(n\rightarrow \infty \) since \(\log (p)=O(n^a)\) for some \(a<1\). The ranking consistency property is therefore proved. \(\square \)