
Unified mean-variance feature screening for ultrahigh-dimensional regression


Abstract

Feature screening is a popular and efficient statistical technique for processing ultrahigh-dimensional data. When a regression model contains both categorical and continuous predictors, a unified feature screening procedure is needed. We therefore propose a unified mean-variance sure independence screening (UMV-SIS) procedure for this setting. The mean-variance (MV), an effective utility for measuring the dependence between two random variables, is widely used in feature screening for discriminant analysis. In this paper, we advocate using kernel smoothing to estimate the MV between two continuous variables, thereby extending it to screen categorical and continuous predictors simultaneously. Beyond this uniformity in screening, UMV-SIS is a model-free procedure requiring no specification of a regression model, which broadens the scope of its application. In theory, we show that the UMV-SIS procedure has the sure screening and ranking consistency properties under mild conditions. To overcome some difficulties of marginal feature screening in the linear model and to further enhance the screening performance of the proposed method, an iterative UMV-SIS procedure is developed. The promising performance of the new method is supported by extensive numerical examples.


References

  • Cui H, Li R, Zhong W (2015) Model-free feature screening for ultrahigh dimensional discriminant analysis. J Am Statist Assoc 110:630–641

  • Cui H, Zhong W (2018) A distribution-free test of independence and its application to variable selection. arXiv:1801.10559

  • Fan J, Fan Y (2008) High dimensional classification using features annealed independence rules. Ann Statist 36:2605–2637

  • Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Statist Assoc 96:1348–1360

  • Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Statist Soc Ser B (Statist Methodol) 70:849–911

  • Fan J, Samworth R, Wu Y (2009) Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res 10:1829–1853

  • He X, Wang L, Hong HG (2013) Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann Statist 41:342–369

  • Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Am Statist Assoc 58:13–30

  • Hosmer D, Lemeshow S (1989) Applied logistic regression. John Wiley, New York

  • Kong E, Xia Y, Zhong W (2019) Composite coefficient of determination and its application in ultrahigh dimensional variable screening. J Am Statist Assoc 114:1740–1751

  • Li G, Peng H, Zhang J, Zhu L (2012) Robust rank correlation based screening. Ann Statist 40:1846–1877

  • Li Q, Racine JS (2007) Nonparametric econometrics: theory and practice. Princeton University Press, Princeton

  • Li R, Zhong W, Zhu L (2012) Feature screening via distance correlation learning. J Am Statist Assoc 107:1129–1139

  • Li X, Cheng G, Wang L, Lai P, Song F (2017) Ultrahigh dimensional feature screening via projection. Comput Statist Data Anal 114:88–104

  • Li X, Li R, Xia Z, Xu C (2020) Distributed feature screening via componentwise debiasing. J Mach Learn Res 21:1–32

  • Liu J, Li R, Wu R (2014) Feature selection for varying coefficient models with ultrahigh-dimensional covariates. J Am Statist Assoc 109:266–274

  • Mai Q, Zou H (2012) The Kolmogorov filter for variable screening in high-dimensional binary classification. Biometrika 100:229–234

  • Mai Q, Zou H (2015) The fused Kolmogorov filter: a nonparametric model-free screening method. Ann Statist 43:1471–1497

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Statist Soc Ser B (Methodol) 58:267–288

  • Wang H (2009) Forward regression for ultra-high dimensional variable screening. J Am Statist Assoc 104:1512–1524

  • Xu C, Chen J (2014) The sparse MLE for ultrahigh-dimensional feature screening. J Am Statist Assoc 109:1257–1269

  • Yan X, Tang N, Xie J, Ding X, Wang Z (2018) Fused mean-variance filter for feature screening. Comput Statist Data Anal 122:18–32

  • Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Statist Soc Ser B (Statist Methodol) 68:49–67

  • Zhang CH (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Statist 38:894–942

  • Zhao SD, Li Y (2012) Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J Multivar Anal 105:397–411

  • Zhou Y, Zhu L (2018) Model-free feature screening for ultrahigh dimensional data through a modified Blum-Kiefer-Rosenblatt correlation. Statistica Sinica 28:1351–1370

  • Zhu L-P, Li L, Li R, Zhu L-X (2011) Model-free feature screening for ultrahigh-dimensional data. J Am Statist Assoc 106:1464–1475

  • Zou H (2006) The adaptive lasso and its oracle properties. J Am Statist Assoc 101:1418–1429

  • Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Statist Soc Ser B (Statist Methodol) 67:301–320


Acknowledgements

This work is supported by the Philosophy and Social Science Research Fund of Jiangsu Province for Universities under Grant No. 2019SJA2093. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agency.

Author information


Corresponding author

Correspondence to Xingxiang Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Proof of the theorems


Leave-one-out least-squares cross-validation (LSCV). Given m candidate values \(\alpha _1,...,\alpha _m\), LSCV is used to determine the optimal \(\alpha \) for each \(X_j\) in Examples 1 and 3.4. The objective function of LSCV is

$$\begin{aligned} \text {CV}(h(\alpha ))=\frac{1}{n^2}\sum _{i=1}^n\sum _{m=1}^n [I(Y_i\le Y_m)-\hat{E}_{-i}(I(Y_i\le Y_m)|X_j=X_{ij})]^2W(X_{ij}), \end{aligned}$$
(6.1)

where

$$\begin{aligned} \hat{E}_{-i}(I(Y_i\le Y_m)|X_j=X_{ij})=\frac{\sum _{l\ne i}^n I(Y_l\le Y_m)K_h(X_{lj}-X_{ij})}{\sum _{{l\ne i}}^nK_h(X_{lj}-X_{ij})} \end{aligned}$$
(6.2)

is the leave-one-out kernel estimator of \(E(I(Y_i\le Y_m)|X_j=X_{ij})\), and \(W(\cdot )\) is a nonnegative weight function that truncates unreliable estimates \(\hat{E}_{-i}(I(Y_i\le Y_m)|X_j=X_{ij})\) caused by the so-called boundary effect or the random-denominator issue. With the observations \(X_{1j},...,X_{nj}\) ordered as \(X_{(1)j},...,X_{(n)j}\), we can simply set \(W(X_{ij})=I(X_{ij} \in \{X_{(r+1)j},...,X_{(n-r)j}\})\) for a given r. Given \(\{\alpha _1,...,\alpha _m\}\), the optimal \(\alpha \) for \(X_j\) is determined by

$$\begin{aligned} \alpha ^{opt}_j = \mathop {\text {argmin}}_{\alpha \in \{\alpha _1,...,\alpha _m\}} \text {CV}(h(\alpha )). \end{aligned}$$

The screening index \(\hat{\omega }_j\) is then computed with \(\alpha ^{opt}_j\).
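For concreteness, the following is a minimal Python sketch of this bandwidth-selection step and of the resulting screening index for a continuous predictor, written from the criterion (6.1)–(6.2) and the first term of the decomposition (6.4). The Gaussian kernel, the grid of \(\alpha \) values, the assumed bandwidth form \(h(\alpha )=\mathrm{sd}(X_j)\,n^{-\alpha }\), the trimming parameter r, and all function names are illustrative assumptions, not the exact specification used in the paper.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel K(u) (illustrative choice; any kernel satisfying C4 would do)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def cond_cdf_hat(y_grid, x0, Y, Xj, h):
    """Kernel (Nadaraya-Watson) estimate of F(y | X_j = x0) over a grid of y values."""
    w = gaussian_kernel((Xj - x0) / h)
    return np.array([(w * (Y <= y)).sum() / w.sum() for y in y_grid])

def lscv(Y, Xj, h, r=2):
    """Leave-one-out least-squares CV criterion (6.1) for a given bandwidth h."""
    n = len(Y)
    keep = np.zeros(n, dtype=bool)
    keep[np.argsort(Xj)[r:n - r]] = True      # weight W(.) drops r extreme design points per side
    cv = 0.0
    for i in range(n):
        if not keep[i]:
            continue
        mask = np.arange(n) != i              # leave observation i out, as in (6.2)
        fhat = cond_cdf_hat(Y, Xj[i], Y[mask], Xj[mask], h)
        cv += np.sum(((Y[i] <= Y) - fhat) ** 2)
    return cv / n ** 2

def screening_index(Y, Xj, h):
    """hat_omega_j for a continuous X_j:
    n^{-1} sum_m n^{-1} sum_i (Fhat(Y_i | X_j = X_mj) - Fhat(Y_i))^2, cf. (6.4)."""
    n = len(Y)
    F_marg = np.array([(Y <= y).mean() for y in Y])        # empirical marginal CDF at each Y_i
    return np.mean([np.mean((cond_cdf_hat(Y, Xj[m], Y, Xj, h) - F_marg) ** 2)
                    for m in range(n)])

def umv_index_with_lscv(Y, Xj, alphas=(0.20, 0.25, 0.30, 0.35, 0.40), r=2):
    """Pick alpha^opt_j by LSCV over a small grid, then compute hat_omega_j."""
    n = len(Y)
    hs = [np.std(Xj) * n ** (-a) for a in alphas]          # assumed form of h(alpha)
    h_opt = hs[int(np.argmin([lscv(Y, Xj, h, r=r) for h in hs]))]
    return screening_index(Y, Xj, h_opt)
```

In an actual screening run, one would apply such a routine to every continuous predictor, rank the resulting indices, and retain the highest-ranked features.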

Lemma 1

(Hoeffding’s inequality; Hoeffding 1963) Let \(X_1,\ldots ,X_n\) be independent random variables. Assume that \(P(X_i\in [a_i,b_i])=1\) for \(1\le i\le n\), where \(a_i\) and \(b_i\) are constants. Let \(\bar{X}=n^{-1}\sum _{i=1}^nX_i\). Then the following inequality holds

$$\begin{aligned} P(|\bar{X}-E(\bar{X})|\ge t)\le 2 \text {exp}\left( -\frac{2n^{2}t^{2}}{\sum _{i=1}^n(b_i-a_i)^{2}}\right) , \end{aligned}$$

where t is a positive constant and \(E(\bar{X})\) is the expected value of \(\bar{X}\).
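As a quick numerical illustration of Lemma 1 (not part of the original proof), the following sketch compares the empirical tail probability of a sample mean of bounded variables with the Hoeffding bound; the uniform distribution, n, t and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo illustration of Lemma 1 for i.i.d. Uniform[0, 1] variables,
# where a_i = 0, b_i = 1 and the bound reduces to 2 * exp(-2 * n * t^2).
rng = np.random.default_rng(0)
n, t, reps = 50, 0.15, 100_000
Xbar = rng.uniform(0.0, 1.0, size=(reps, n)).mean(axis=1)
empirical = np.mean(np.abs(Xbar - 0.5) >= t)     # empirical P(|Xbar - E(Xbar)| >= t)
bound = 2.0 * np.exp(-2.0 * n * t ** 2)          # Hoeffding upper bound
print(f"empirical tail probability: {empirical:.5f}, Hoeffding bound: {bound:.5f}")
```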

Lemma 2

(Liu et al. 2014) For any random variable X, the following two statements are equivalent:

(A) There exists \(H>0\) such that \(Ee^{tX}<\infty \) for all \(|t|<H\).

(B) There exist \(\eta >0\) and \(\nu >0\) such that \(Ee^{\nu (X-EX)}\le e^{\eta \nu ^2}\).

Lemma 3

(Liu et al. 2014) Suppose that a(x) and b(x) are two uniformly-bounded functions, that is, there exist \(M_1>0\) and \(M_2>0\) such that

$$\begin{aligned} \sup _{x\in \mathbb {X}}|a(x)|\le M_1, \ \sup _{x\in \mathbb {X}}|b(x)|\le M_2. \end{aligned}$$

For a given \(x\in \mathbb {X}\), \(\hat{A}(x)\) and \(\hat{B}(x)\) are estimators of a(x) and b(x) with sample size n. For any small \(\varepsilon \in (0,1)\), suppose that there exist positive constants \(c_1\), \(c_2\) and \(\nu \), such that

$$\begin{aligned} \sup _{x\in \mathbb {X}} P(|\hat{A}(x)-a(x)|\ge \varepsilon )\le \frac{c_1}{2}\left( 1-\frac{\varepsilon \nu }{c_1}\right) ^n, \nonumber \\ \sup _{x\in \mathbb {X}} P(|\hat{B}(x)-b(x)|\ge \varepsilon )\le \frac{c_2}{2}\left( 1-\frac{\varepsilon \nu }{c_2}\right) ^n. \end{aligned}$$
(6.3)

Moreover, suppose that b(x) is uniformly bounded away from 0, that is, there exists \(M_3>0\) such that \(\inf _{x\in \mathbb {X}}|b(x)|>M_3\). Then there exists a constant \(C'>0\) such that

$$\begin{aligned} \sup _{x\in \mathbb {X}} P(|\hat{A}(x)/\hat{B}(x)-a(x)/b(x)|\ge \varepsilon )\le \frac{C'}{2}\left( 1-\frac{\varepsilon \nu }{C'}\right) ^n. \end{aligned}$$

Proof of Theorem 1

We first prove the exponential consistency of \(\hat{\omega }_{j}\) in (2.4) corresponding to a continuous predictor. According to the definition of \(\omega _{j}\) in (2.2), we have

$$\begin{aligned} \hat{\omega }_{j}-\omega _{j}= & {} \frac{1}{n}\sum _{m=1}^{n}\frac{1}{n} \sum _{i=1}^{n}\left( \hat{F}(Y_i|X_j=X_{mj})-\hat{F}(Y_i)\right) ^2 \nonumber \\&-\int _{\mathbb {X}_j}\int _{\mathbb {Y}} \left( F(y|X_j=x)-F(y) \right) ^2dF(y)dF_j(x) \nonumber \\= & {} \left[ \frac{1}{n}\sum _{m=1}^{n}\left( \frac{1}{n}\sum _{i=1}^{n}(\hat{F}(Y_i|X_j=X_{mj})-\hat{F}(Y_i))^2\right. \right. \nonumber \\&\left. \left. -\int _{\mathbb {Y}}(F(y|X_j=X_{mj})-F(y))^2dF(y)\right) \right] + \nonumber \\&\left[ \frac{1}{n}\sum _{m=1}^{n}\int _{\mathbb {Y}}(F(y|X_j=X_{mj})-F(y))^2dF(y)\right. \nonumber \\&\left. -\int _{\mathbb {X}_j}\int _{\mathbb {Y}}[F(y|X_j=x)-F(y)]^2dF(y)dF_j(x)\right] \nonumber \\=: & {} H_{j,1}+H_{j,2}. \end{aligned}$$
(6.4)

The term \(H_{j,1}\) can be further decomposed as

$$\begin{aligned} H_{j,1}= & {} \frac{1}{n}\sum _{m=1}^{n}\left\{ \frac{1}{n}\sum _{i=1}^{n}(\hat{F}(Y_i|X_j=X_{mj})-\hat{F}(Y_i))^2\right. \\&\left. -\frac{1}{n}\sum _{i=1}^{n}(F(Y_i|X_j=X_{mj})-F(Y_i))^2 \right\} \nonumber \\&+\frac{1}{n}\sum _{m=1}^{n}\left\{ \frac{1}{n}\sum _{i=1}^{n}(F(Y_i|X_j=X_{mj})-F(Y_i))^2\right. \\&\left. -\int _{\mathbb {Y}}[F(y|X_j=X_{mj})-F(y)]^2dF(y)\right\} \nonumber \\=: & {} I_{j,1}+I_{j,2}. \end{aligned}$$

The terms \(H_{j,2}\) and \(I_{j,2}\) can be handled by Lemma 1. Let \(G(X_{mj})=\int _{\mathbb {Y}}(F(y|X_j=X_{mj})-F(y))^2dF(y)\) and \(Q(Y_i,X_{mj})=(F(Y_i|X_j=X_{mj})-F(Y_i))^2\). Clearly, both \(G(X_{mj})\) and \(Q(Y_i,X_{mj})\) are bounded between 0 and 1. It follows that

$$\begin{aligned} P\left\{ |H_{j,2}|\ge \varepsilon \right\}= & {} P\left\{ \left| \frac{1}{n}\sum _{m=1}^{n}G(X_{mj})-E(G(X_j))\right| \ge \varepsilon \right\} \nonumber \\\le & {} 2\exp \{-2n\varepsilon ^2\}. \end{aligned}$$
(6.5)

Similarly,

$$\begin{aligned} P\left\{ |I_{j,2}|\ge \varepsilon \right\}= & {} P\left\{ \left| \frac{1}{n}\sum _{m=1}^{n}\left( \frac{1}{n}\sum _{i=1}^{n}Q(Y_i,X_{mj})-E_Y(Q(Y,X_{mj}))\right) \right| \ge \varepsilon \right\} \nonumber \\\le & {} P\left\{ \sum _{m=1}^{n}\left| \frac{1}{n}\sum _{i=1}^{n}Q(Y_i,X_{mj})-E_Y(Q(Y,X_{mj}))\right| \ge n\varepsilon \right\} \nonumber \\\le & {} \sum _{m=1}^{n}P\left\{ \left| \frac{1}{n}\sum _{i=1}^{n} Q(Y_i,X_{mj})-E_Y(Q(Y,X_{mj}))\right| \ge \varepsilon \right\} \nonumber \\\le & {} 2n\exp \{-2n\varepsilon ^2\}. \end{aligned}$$
(6.6)

We next deal with \(I_{j,1}\):

$$\begin{aligned} |I_{j,1}|= & {} \frac{1}{n^2}\sum _{m=1}^{n}\sum _{i=1}^{n}\left\{ (\hat{F}(Y_i|X_j=X_{mj})-\hat{F}(Y_i))^2-(F(Y_i|X_j=X_{mj})-F(Y_i))^2\right\} \nonumber \\\le & {} \frac{2}{n^2}\sum _{m=1}^{n}\sum _{i=1}^{n}\left\{ |\hat{F}(Y_i|X_j=X_{mj})-F(Y_i|X_j=X_{mj})|+|\hat{F}(Y_i)-F(Y_i)|\right\} \nonumber \\=: & {} 2(D_{j,1}+D_{j,2}). \end{aligned}$$

The term \(D_{j,1}\) is analysed using a technique similar to that of Liu et al. (2014). Specifically,

$$\begin{aligned}&P\left( D_{j,1}\ge \varepsilon \right) =P\left( \frac{1}{n^2}\sum _{m=1}^{n}\sum _{i=1}^{n}\left| \hat{F}(Y_i|X_j=X_{mj})-F(Y_i|X_j=X_{mj})\right| \ge \varepsilon \right) \nonumber \\\le & {} \sum _{m=1}^{n}\sum _{i=1}^{n}P\left( \left| \hat{F}(Y_i|X_j=X_{mj})-F(Y_i|X_j=X_{mj})\right| \ge \varepsilon \right) \nonumber \\\le & {} n^2\sup _{x\in \mathbb {X}_j, y\in \mathbb {Y}}P\left( \left| \hat{F}(y|X_j=x)-F(y|X_j=x)\right| \ge \varepsilon \right) \nonumber \\= & {} n^2\sup _{x\in \mathbb {X}_j, y\in \mathbb {Y}}P\left( \left| \frac{\frac{1}{n}\sum _{m=1}^n I(Y_m\le y)K(\frac{X_{mj}-x}{h})}{\frac{1}{n}\sum _{m=1}^nK(\frac{X_{mj}-x}{h})} -E(I(Y\le y)|X_j=x) \right| \ge \varepsilon \right) \nonumber \\=: & {} n^2\sup _{x\in \mathbb {X}_j, y\in \mathbb {Y}}P\left( \left| \frac{Z_{j,1}(x,y)}{Z_{j,2}(x)}- \frac{hf_j(x)m_j(x,y)}{hf_j(x)}\right| \ge \varepsilon \right) , \end{aligned}$$
(6.7)

where \(m_j(x,y)=E\left( I(Y\le y)|X_j=x\right) \) and \(f_j(x)\) is the density function of \(X_j\). We define

$$\begin{aligned} P_{j,1}(x,y)= & {} P\left( Z_{j,1}(x,y)-hf_j(x)m_j(x,y)\ge \varepsilon \right) , \nonumber \\ P_{j,2}(x)= & {} P\left( Z_{j,2}(x)-hf_j(x)\ge \varepsilon \right) . \end{aligned}$$

We now work on \(P_{j,1}(x,y)\). For any \(\xi >0\), by Markov’s inequality,

$$\begin{aligned} P_{j,1}(x,y)\le & {} P\left( \exp \{\xi (Z_{j,1}(x,y)-hf_j(x)m_j(x,y))\}\ge \exp (\xi \varepsilon )\right) \nonumber \\\le & {} E[\exp \{\xi Z_{j,1}(x,y)-\xi hf_j(x)m_j(x,y)\}]/\exp (\xi \varepsilon ) \nonumber \\= & {} \exp (-\xi \varepsilon )\cdot \exp \{-\xi hf_j(x)m_j(x,y)\}\cdot E\{\exp (\xi Z_{j,1}(x,y))\}.\qquad \end{aligned}$$
(6.8)

Furthermore,

$$\begin{aligned} E\{\exp (\xi Z_{j,1}(x,y))\}= & {} E\left[ \exp \left\{ \xi \cdot \frac{1}{n}\sum _{m=1}^{n}I(Y_m\le y)K(\frac{X_{mj}-x}{h})\right\} \right] \nonumber \\= & {} E\left[ \prod _{m=1}^{n}\exp \left\{ \frac{\xi }{n}I(Y_m\le y)K(\frac{X_{mj}-x}{h})\right\} \right] \nonumber \\= & {} \left[ E\left\{ \exp \left( \frac{\xi }{n}I(Y_m\le y)K(\frac{X_{mj}-x}{h})\right) \right\} \right] ^n. \end{aligned}$$

Set \(\xi =n\nu \) and define \(\psi (\nu )= E\{\exp \left( \nu I(Y_m\le y)K(\frac{X_{mj}-x}{h})\right) \}\). Then (6.8) can be expressed as

$$\begin{aligned} P_{j,1}(x,y)\le & {} [\exp (-\nu \varepsilon )\cdot \exp \{-\nu hf_j(x)m_j(x,y)\}\cdot \psi (\nu )]^n. \end{aligned}$$
(6.9)

Let us work on the factor \(\exp \{-\nu hf_j(x)m_j(x,y)\}\cdot \psi (\nu )\) in (6.9). It can be further decomposed as

$$\begin{aligned}&\exp \{-\nu hf_j(x)m_j(x,y)\}\cdot \psi (\nu )\nonumber \\= & {} E\left[ \exp \left\{ \nu \left( I(Y_m\le y)K(\frac{X_{mj}-x}{h})-hf_j(x)m_j(x,y)\right) \right\} \right] \nonumber \\=: & {} L_{j,1}(x,y)L_{j,2}(x,y), \end{aligned}$$
(6.10)

where

$$\begin{aligned} L_{j,1}(x,y)= & {} \exp \left\{ \nu \left( E\left[ I(Y_m\le y)K(\frac{X_{mj}-x}{h})\right] -hf_j(x)m_j(x,y)\right) \right\} , \nonumber \\ L_{j,2}(x,y)= & {} E\left[ \exp \left\{ \nu \left( I(Y_m\le y)K(\frac{X_{mj}-x}{h})\right. \right. \right. \nonumber \\&\left. \left. \left. -E\{I(Y_m\le y) K(\frac{X_{mj}-x}{h})\}\right) \right\} \right] . \end{aligned}$$

When x is close to 0, by Taylor’s expansion, \(\exp (x)\) can be bounded by

$$\begin{aligned} \exp (x)=1+x+o(|x|)\le 1+x+|x|\le 1+2|x|. \end{aligned}$$
(6.11)

Under Conditions C3–C5, we can choose \(\nu \) small enough that (6.11) applies, which bounds \(L_{j,1}(x,y)\) as

$$\begin{aligned} L_{j,1}(x,y)\le 1+2\nu \left| E\left\{ I(Y_m\le y)K(\frac{X_{mj}-x}{h})\right\} -hf_j(x)m_j(x,y)\right| , \forall x\in \mathbb {X}_j. \end{aligned}$$

Denote \(\delta _h(x,y)= E\left\{ I(Y_m\le y)K(\frac{X_{mj}-x}{h})\right\} -hf_j(x)m_j(x,y)\). Recalling that \(m_j(X_{j},y)=E\left( I(Y\le y)|X_{j}\right) \), we have \(\delta _h(x,y)=E\left\{ m_j(X_{j},y)K(\frac{X_{j}-x}{h})\right\} -hf_j(x)m_j(x,y)\). Since \(\int K(t) dt=1\), it follows that

$$\begin{aligned} h^{-1}\delta _h(x,y)=\int \{m_j(x+th,y)f_j(x+th)-f_j(x)m_j(x,y)\}K(t)dt. \end{aligned}$$

By using Condition C4, \(\int tK(t)dt=0\) and \(\int t^2K(t) dt<\infty \). Therefore,

$$\begin{aligned}&\lim _{h\rightarrow 0} h^{-2}[m_j(x+th,y)f_j(x+th)-f_j(x)m_j(x,y)\nonumber \\&\quad -\{f'_j(x)m_j(x,y)+f_j(x)m'_{j}(x,y)\}th] \nonumber \\&\quad = \{m''_j(x,y)f_j(x)+2f'_j(x)m'_j(x,y)+f''_j(x)m_j(x,y)\}t^2/2, \end{aligned}$$

where \(m'_{j}(x,y)=\frac{\partial m_j(x,y)}{\partial x}\) and \(m''_{j}(x,y)=\frac{\partial ^2 m_j(x,y)}{\partial x^2}\). By the dominated convergence theorem, together with \(m''_j(x,y)f_j(x)+2f'_j(x)m'_j(x,y)+f''_j(x)m_j(x,y)\) being uniformly bounded under Conditions C3 and C5, \(h^{-3}\delta _h(x,y)\) is uniformly bounded by some constant C for all \(x\in \mathbb {X}_j\). This implies that

$$\begin{aligned} \sup _{x\in \mathbb {X}_j, y\in \mathbb {Y}}L_{j,1}(x,y)\le 1+2\nu Ch^3, \ \ \text {as} \ \ h\rightarrow 0. \end{aligned}$$

Since \(h\rightarrow 0\) as \(n\rightarrow \infty \), it follows that \(\sup _{x\in \mathbb {X}_j, y\in \mathbb {Y}}L_{j,1}(x,y)\le 1+\varepsilon \nu /16\), for large enough n. Then we focus on \(L_{j,2}(x,y)\) in (6.10). According to Lemma 2, \(L_{j,2}(x,y)\) is uniformly bounded by \(\exp (\eta \nu ^2)\) for some constant \(\eta >0\). Using Taylor’s expansion, \(\exp (\eta \nu ^2)\le 1+2\eta \nu ^2<1+\varepsilon \nu /16\), as long as \(\nu \) is close to 0 and satisfies \(0<\nu <\varepsilon /(32\eta )\). Thus, for sufficiently small \(\nu >0\) and large n, (6.10) satisfies

$$\begin{aligned} \sup _{x\in \mathbb {X}_j, y\in \mathbb {Y}}\exp \{-\nu hf_j(x)m_j(x,y)\}\cdot \psi (\nu )\le & {} \sup _{x\in \mathbb {X}_j, y\in \mathbb {Y}}L_{j,1}(x,y)\cdot \sup _{x\in \mathbb {X}_j, y\in \mathbb {Y}}L_{j,2}(x,y) \nonumber \\< & {} (1+\varepsilon \nu /16)^2<1+\varepsilon \nu /4. \end{aligned}$$

By Taylor’s expansion, (6.9) can be bounded

$$\begin{aligned} \sup _{x\in \mathbb {X}_j, y\in \mathbb {Y}}P_{j,1}(x,y)\le & {} \{\exp (-\varepsilon \nu )(1+\varepsilon \nu /4)\}^n\\\le & {} (1-\varepsilon \nu +\varepsilon \nu /2)^n(1+\varepsilon \nu /4)^n\le (1-\varepsilon \nu /4)^n. \end{aligned}$$

Similarly,

$$\begin{aligned} \sup _{x\in \mathbb {X}_j, y\in \mathbb {Y}}P\left( Z_{j,1}(x,y)-hf_j(x)m_j(x,y)\le -\varepsilon \right) \le (1-\varepsilon \nu /4)^n. \end{aligned}$$

It follows that

$$\begin{aligned} \sup _{x\in \mathbb {X}_j, y\in \mathbb {Y}}P\left( |Z_{j,1}(x,y)-hf_j(x)m_j(x,y)|\ge \varepsilon \right) \le 2(1-\varepsilon \nu /4)^n. \end{aligned}$$

By setting \(m_j(x,y)=1\), we have \(\sup _{x\in \mathbb {X}_j} P_{j,2}(x)\le (1-\varepsilon \nu /4)^n\) and

$$\begin{aligned}&\sup _{x\in \mathbb {X}_j}P\left( |Z_{j,2}(x)-hf_j(x)|\ge \varepsilon \right) \le 2(1-\varepsilon \nu /4)^n. \end{aligned}$$

Furthermore, by Lemma 3, there exists some \(c_3 > 0\) such that

$$\begin{aligned} \sup _{x\in \mathbb {X}_j, y\in \mathbb {Y}} P\left( \left| \frac{Z_{j,1}(x,y)}{Z_{j,2}(x)}-\frac{hf_j(x)m_j(x,y)}{hf_j(x)}\right| \ge \varepsilon \right) \le \frac{c_3}{2}(1-\varepsilon \nu /c_3)^n. \end{aligned}$$

Thus, (6.7) can be further bounded by

$$\begin{aligned} P\left( D_{j,1}\ge \varepsilon \right)\le & {} n^2\sup _{x\in \mathbb {X}_j, y\in \mathbb {Y}} P\left( \left| \frac{Z_{j,1}(x,y)}{Z_{j,2}(x)}- \frac{hf_j(x)m_j(x,y)}{hf_j(x)}\right| \ge \varepsilon \right) \nonumber \\\le & {} \frac{c_3 n^2}{2}(1-\varepsilon \nu /c_3)^n. \end{aligned}$$
(6.12)

To bound \(D_{j,2}\), we apply Hoeffding's inequality in Lemma 1 again:

$$\begin{aligned} P(D_{j,2}\ge \varepsilon )\le & {} P(\frac{1}{n^2}\sum _{m=1}^{n}\sum _{i=1}^{n}|\hat{F}(Y_i)-F(Y_i)| \ge \varepsilon )\nonumber \\\le & {} \sum _{m=1}^{n}\sum _{i=1}^{n}P(|\hat{F}(Y_i)-F(Y_i)|\ge \varepsilon )\nonumber \\\le & {} 2n^2\exp (-2n\varepsilon ^2). \end{aligned}$$
(6.13)

Finally, combining (6.4), (6.5), (6.6), (6.12) and (6.13), there exists a constant \(c_4>0\) such that

$$\begin{aligned}&P(\left| \hat{\omega }_{j}-\omega _{j}\right| \ge \varepsilon ) \nonumber \\&\quad \le P(\left| I_{j,1}\right| +\left| I_{j,2}\right| +\left| H_{j,2}\right| \ge \varepsilon ) \nonumber \\&\quad \le P(2(D_{j,1}+D_{j,2})\ge 2\varepsilon /3)+P(\left| I_{j,2}\right| \ge \varepsilon /6)+P(\left| H_{j,2}\right| \ge \varepsilon /6) \nonumber \\&\quad \le P(D_{j,1}\ge \varepsilon /6)+P(D_{j,2}\ge \varepsilon /6)+P(\left| I_{j,2}\right| \ge \varepsilon /6)+P(\left| H_{j,2}\right| \ge \varepsilon /6) \nonumber \\&\quad \le (2n^2+2n+2)\exp (-\frac{n\varepsilon ^2}{18})+\frac{c_3}{2}n^2(1-\frac{\varepsilon \nu }{6c_3})^n \nonumber \\&\quad \le O(n^2\exp (-c_4n\varepsilon ^2)). \end{aligned}$$
(6.14)

By a similar argument, for the estimators \(\hat{\omega }_j\) in (2.3) corresponding to categorical predictors, there exists a positive constant \(c_5\) such that

$$\begin{aligned} P(\left| \hat{\omega }_{j}-\omega _{j}\right| \ge \varepsilon )\le & {} O(nK_j\exp (-{c_5n\varepsilon ^2}/{K_j})). \end{aligned}$$
(6.15)

For a detailed proof, we refer to Lemma A.4 of Cui et al. (2015).
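Although Eq. (2.3) is not reproduced in this appendix, a natural categorical analogue consistent with the MV utility of Cui et al. (2015) replaces the outer integral over \(\mathbb {X}_j\) by a sum over the \(K_j\) categories weighted by their sample frequencies. The code below is a minimal sketch under that assumption; it is not necessarily the exact form of (2.3).

```python
import numpy as np

def screening_index_categorical(Y, Xj):
    """Assumed categorical analogue of hat_omega_j (cf. the MV index of Cui et al. 2015):
    sum_k phat_k * n^{-1} sum_i (Fhat(Y_i | X_j = k) - Fhat(Y_i))^2."""
    n = len(Y)
    F_marg = np.array([(Y <= y).mean() for y in Y])              # empirical marginal CDF at each Y_i
    omega = 0.0
    for k in np.unique(Xj):
        in_k = (Xj == k)
        F_cond = np.array([(Y[in_k] <= y).mean() for y in Y])    # within-category conditional CDF
        omega += in_k.mean() * np.mean((F_cond - F_marg) ** 2)
    return omega
```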

The convergence properties of the categorical and continuous estimators have been established in (6.14) and (6.15). Letting \(\varepsilon =cn^{-\tau }\) and recalling the assumption \(h=O(n^{-\theta })\) with \(\tau /3<\theta <1/3\), we have \(h^3=o(\varepsilon )\), so that \(L_{j,1}(x,y)\) can be bounded as required. Therefore, there exists a positive constant \(C_1\) such that

$$\begin{aligned} P\left( \max _{1\le j\le p}\left| \hat{\omega }_j-\omega _j\right| \ge cn^{-\tau } \right)\le & {} \sum _{j=1}^{p}P\left( \left| \hat{\omega }_j-\omega _j\right| \ge cn^{-\tau } \right) \nonumber \\\le & {} O(n^2p\exp (-C_1n^{1-2\tau })), \end{aligned}$$
(6.16)

The theorem is proved. \(\square \)

Proof of Theorem 2

If \(\mathcal {M}\nsubseteq \widetilde{\mathcal {M}}\), there must exist some \(j\in \mathcal {M}\) such that \(\hat{\omega }_j<cn^{-\tau }\). By Condition C6, \( \min \limits _{j \in \mathcal {M}} \omega _j \ge 2cn^{-\tau }\). Thus, \(\mathcal {M}\nsubseteq \widetilde{\mathcal {M}}\) implies \(|\hat{\omega }_j-\omega _j|>cn^{-\tau }\) for some \(j\in \mathcal {M}\). Therefore, by (6.16), we have

$$\begin{aligned} P\{ \mathcal {M}\subseteq \widetilde{\mathcal {M}} \}\ge & {} P(\max \limits _{j \in \mathcal {M}} \left| \hat{\omega }_j-\omega _j\right| \le cn^{-\tau }) \nonumber \\\ge & {} 1-P(\max \limits _{j \in \mathcal {M}} |\hat{\omega }_j-\omega _j|> cn^{-\tau }) \nonumber \\\ge & {} 1-d\cdot P(|\hat{\omega }_j-\omega _j|> cn^{-\tau }) \nonumber \\\ge & {} 1-O(n^2d\exp (-C_1n^{1-2\tau })), \end{aligned}$$

where d is the cardinality of \(\mathcal {M}\). The sure screening property is proved.

By the assumption in Theorem 2 that \(\kappa =\min \limits _{j\in \mathcal {M}}\omega _j-\max \limits _{j\in \mathcal {M}^c}\omega _j>0\), there exists a positive constant \(C_2\) such that

$$\begin{aligned}&P\left( \min \limits _{j\in \mathcal {M}}\hat{\omega }_j \le \max \limits _{j\in \mathcal {M}^c}\hat{\omega }_j \right) =P\left( \min \limits _{j\in \mathcal {M}}\hat{\omega }_j-\min \limits _{j\in \mathcal {M}}\omega _j+\kappa \le \max \limits _{j\in \mathcal {M}^c}\hat{\omega }_j- \max \limits _{j\in \mathcal {M}^c}\omega _j\right) \nonumber \\&\quad = P\left( \{ \max \limits _{j\in \mathcal {M}^c}\hat{\omega }_j- \max \limits _{j\in \mathcal {M}^c}\omega _j\}-\{\min \limits _{j\in \mathcal {M}}\hat{\omega }_j-\min \limits _{j\in \mathcal {M}}\omega _j\} \ge \kappa \right) \nonumber \\&\quad \le P\left( \max \limits _{j\in \mathcal {M}^c}|\hat{\omega }_j-\omega _j|+\max \limits _{j\in \mathcal {M}}|\hat{\omega }_j-\omega _j| \ge \kappa \right) \nonumber \\&\quad \le P\left( \max \limits _{1\le j\le p}|\hat{\omega }_j-\omega _j| \ge \kappa /2 \right) \nonumber \\&\quad \le O(n^2p\exp (-C_2 n\kappa ^2)), \end{aligned}$$

The last inequality follows directly from (6.16), and the bound tends to 0 as \(n\rightarrow \infty \) provided \(\log (p)=O(n^a)\) with some \(a<1\). The ranking consistency property is therefore proved. \(\square \)


Cite this article

Wang, L., Li, X., Wang, X. et al. Unified mean-variance feature screening for ultrahigh-dimensional regression. Comput Stat 37, 1887–1918 (2022). https://doi.org/10.1007/s00180-021-01184-2
