Abstract
Interactions have greatly influenced recent scientific discoveries, but the identification of interactions is challenging in ultra-high dimensions. In this study, we propose an interaction identification method for classification with ultra-high dimensional discrete features. We utilize clique sets to capture interactions among features, where features in a common clique have interactions that can be used for classification. The number of features related to the interaction is the size of the clique. Hence, our method can consider interactions caused by more than two feature variables. We propose a Kullback-Leibler divergence-based approach to correctly identify the clique sets with a probability that tends to 1 as the sample size tends to infinity. A clique screening method is then proposed to filter out clique sets that are useless for classification, and the strong sure screening property can be guaranteed. Finally, a clique naïve Bayes classifier is proposed for classification. Numerical studies demonstrate that our proposed approach performs very well.
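As a concrete illustration of the kind of statistic the method builds on, the sketch below estimates a class-weighted KL-divergence measure of dependence between two binary features (the joint distribution against the product of marginals within each class). The function name and the exact form of the statistic are ours and may differ in detail from the paper's definition of kl(j, k).

```python
import numpy as np

def pairwise_kl(xj, xk, y):
    """Plug-in estimate of a class-weighted KL dependence statistic for two
    binary features: sum over classes of pi_y * KL(joint || product of
    marginals), computed within class y. A sketch only; the paper's
    kl(j, k) may differ in detail."""
    xj, xk, y = np.asarray(xj), np.asarray(xk), np.asarray(y)
    stat = 0.0
    for cls in (0, 1):
        idx = (y == cls)
        pi_y = idx.mean()                    # estimated class proportion
        a, b = xj[idx], xk[idx]
        kl = 0.0
        for l in (0, 1):
            for s in (0, 1):
                p_joint = ((a == l) & (b == s)).mean()
                p_prod = (a == l).mean() * (b == s).mean()
                if p_joint > 0:              # 0 * log 0 = 0 convention
                    kl += p_joint * np.log(p_joint / p_prod)
        stat += pi_y * kl
    return stat
```

Independent feature pairs give values near zero, while dependent pairs give strictly positive values; this separation is what makes thresholding such a statistic meaningful for identifying cliques of interacting features.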


Funding
The research of Baiguo An is partially supported by the National Natural Science Foundation of China (No. 12071308, No. 11601349) and the Scientific Research Planned Project of the National Bureau of Statistics of China (No. 2017LZ15). The research of Guozhong Feng is supported by the National Natural Science Foundation of China (No. 11501095). The research of Jianhua Guo is supported by the National Key Research and Development Program of China (No. 2020YFA0714102) and the National Natural Science Foundation of China (No. 11631003, No. 11690012).
Appendices
Appendix A: Proof of Theorem 1
We prove the claim of Theorem 1 by the following five steps.
Step 1. For j, k = 1, … , p and y ∈ {0, 1}, define \(\pi _{y}^{jk}=\left (\pi _{00y}^{jk},\pi _{01y}^{jk},\pi _{10y}^{jk},\pi _{11y}^{jk}\right )^{\top }\), and \(\hat {\pi }_{y}^{jk}=\left (\hat {\pi }_{00y}^{jk},\hat {\pi }_{01y}^{jk},\hat {\pi }_{10y}^{jk},\hat {\pi }_{11y}^{jk}\right )^{\top }\). Let \(\alpha =\min \limits \{\alpha _{0},\alpha _{1}\}\). In this step, we will prove that \(P\left (\|{\widehat {\pi }}_{y}^{jk}-\pi _{y}^{jk}\|_{1}\geq t\right )\leq 8\exp \left \{-\alpha nt^{2}/8\right \}\) for arbitrary t > 0.
One can see that \(P\left (\|\widehat \pi _{y}^{jk}-\pi _{y}^{jk}\|_{1}\geq t\right )=P\left (|{\widehat {\pi }}_{00y}^{jk}-\pi _{00y}^{jk}|+|{\widehat {\pi }}_{01y}^{jk}-\pi _{01y}^{jk}|+|{\widehat {\pi }}_{10y}^{jk}-\pi _{10y}^{jk}|+|{\widehat {\pi }}_{11y}^{jk}-\pi _{11y}^{jk}|\geq t\right )\leq {\sum }_{l,s\in \{0, 1\}}P\left (|{\widehat {\pi }}_{lsy}^{jk}-\pi _{lsy}^{jk}|\geq t/4\right )\). By Hoeffding's inequality, we have that \(P\left (|{\widehat {\pi }}_{lsy}^{jk}-\pi _{lsy}^{jk}|\geq t/4\right )\leq 2\exp \left \{-n_{y}t^{2}/8\right \}\leq 2\exp \left \{-\alpha nt^{2}/8\right \}\). Hence, \(P\left (\|\widehat \pi _{y}^{jk}-\pi _{y}^{jk}\|_{1}\geq t\right )\leq 8\exp \left \{-\alpha nt^{2}/8\right \}\).
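The concentration bound of Step 1 can be checked numerically. The sketch below uses a toy setup of our own choosing (four equiprobable cells, \(\alpha = n_{y}/n = 0.5\)) and compares the empirical exceedance frequency of the L1 deviation with the stated bound \(8\exp \left \{-\alpha nt^{2}/8\right \}\).

```python
import numpy as np

# Monte Carlo check of the Step 1 concentration bound
#   P(||pi_hat - pi||_1 >= t) <= 8 exp(-alpha n t^2 / 8)
# for a single cell-probability vector pi (toy setup: four
# equiprobable cells, alpha = n_y / n = 0.5).
rng = np.random.default_rng(0)
pi = np.array([0.25, 0.25, 0.25, 0.25])
alpha, n, t = 0.5, 400, 0.4
n_y = int(alpha * n)          # class-y sample size

reps = 2000
exceed = 0
for _ in range(reps):
    pi_hat = rng.multinomial(n_y, pi) / n_y
    exceed += np.abs(pi_hat - pi).sum() >= t
empirical = exceed / reps
bound = 8 * np.exp(-alpha * n * t ** 2 / 8)   # = 8 exp(-4), about 0.147
print(empirical <= bound)  # True
```

The empirical frequency sits far below the bound here, as expected: Hoeffding-type bounds are conservative, and the L1 deviation of a multinomial with 200 draws rarely approaches 0.4.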
Step 2. Let \(r_{n}=\left (|\hat {\pi }_{0}-\pi _{0}|+|\hat {\pi }_{1}-\pi _{1}|\right )\log \left (1/\left (4{\pi _{L}^{2}}\right )\right )\). Define \({\mathcal {E}}=\left \{|\hat {\pi }_{lsy}^{jk}-\pi _{lsy}^{jk}|<\pi _{L}/2 \text { for all } j,k=1,\ldots ,p,\text { and } l,s,y\in \{0, 1\}\right \}\), and \({\mathcal {E}}_{r} = \{r_{n}\leq \nu _{n}/2\}\). In this step, we will prove that \(P({\mathcal {E}})\rightarrow 1\) and \(P({\mathcal {E}}_{r})\rightarrow 1\) as n tends to \(\infty \).
Specifically, \(P(\mathcal {E})\geq 1-\sum \nolimits _{y\in \{0, 1\}}\sum \nolimits _{j\neq k}\sum \nolimits _{l,s\in \{0, 1\}}P\left (|\hat {\pi }_{lsy}^{jk}-\pi _{lsy}^{jk}|\geq \pi _{L}/2\right )\geq 1- \sum \nolimits _{y\in \{0, 1\}}\sum \nolimits _{j\neq k}\sum \nolimits _{l,s\in \{0, 1\}}2\exp \left \{-{\alpha \pi _{L}^{2}}n/2\right \}\geq 1-16p^{2}\exp \left \{-{\alpha \pi _{L}^{2}}n/2\right \}= 1-16\exp \left \{-{\alpha \pi _{L}^{2}}n/2+2Cn^{\xi }\right \} \rightarrow 1\).
By assumption A4, we know that \(n^{\kappa }\nu _{n}=O(1)\) with κ ∈ (0, (1 − ξ)/2). On the other hand, one can see that \(n^{1/2}r_{n}=O_{p}(1)\). Hence, it is easy to show that \(P({\mathcal {E}}_{r})\rightarrow 1\). This completes the proof of Step 2.
Step 3. In this step, we will show that on the events \(\mathcal {E}\) and \(\mathcal {E}_{r}\), \(|{\widehat {\text {kl}}}(j,k)-{\text {kl}}(j,k)|\leq M\left (\|\hat {\pi }_{0}^{jk}-\pi _{0}^{jk}\|_{1}+\|\hat {\pi }_{1}^{jk}-\pi _{1}^{jk}\|_{1}\right )+\nu _{n}/2\) holds with some positive constant M for all j, k.
Recall that \(\widehat {\text {kl}}(j,k)={\sum }_{y}\hat {\pi }_{y}\widehat {\text {kl}}(j,k;y)\), where
Then, we have that
where
and \(\pi _{y}^{jk*}=\left (\pi _{00y}^{jk*},\pi _{01y}^{jk*},\pi _{10y}^{jk*},\pi _{11y}^{jk*}\right )^{\top }=\pi _{y}^{jk}+\zeta \left (\hat {\pi }_{y}^{jk}-\pi _{y}^{jk}\right )\) with some ζ ∈ (0, 1). One can see that for every l, s ∈ {0, 1},
On event \({\mathcal {E}}\), we have that for every l, s, y ∈ {0, 1}, \(\pi _{lsy}^{jk*}=(1-\zeta )\pi _{lsy}^{jk}+\zeta \hat {\pi }_{lsy}^{jk}\geq (1-\zeta )\pi _{L}+\zeta \pi _{L}/2>\pi _{L}/2\). Consequently,
Denote \(1+2\log 2-3\log (\pi _{L}/2)\) by M/2; then we have that
Moreover,
which means that kl(j, k; y) is uniformly bounded for all j, k, y ∈ {0, 1}. Consequently, \(|\widehat {\text {kl}}(j,k)-\text {kl}(j,k)| \leq \hat {\pi }_{0}|{\widehat {\text {kl}}}(j,k;0)-{\text {kl}}\)\((j,k;0)|+\hat {\pi }_{1}|{\widehat {\text {kl}}}(j,k;1)-{\text {kl}}(j,k;1)|+|\hat {\pi }_{0}-\pi _{0}|{\text {kl}}(j,k;0)+|\hat {\pi }_{1}-\pi _{1}|{\text {kl}}(j,k;1)\leq M\left (\|\hat {\pi }_{0}^{jk}-\pi _{0}^{jk}\|_{1}+\|\hat {\pi }_{1}^{jk}-\pi _{1}^{jk}\|_{1}\right )+r_{n},\) where \(r_{n}=\left (|\hat {\pi }_{0}-\pi _{0}|+|\hat {\pi }_{1}-\pi _{1}|\right )\log \frac {1}{4{\pi _{L}^{2}}}\).
Moreover, on the event \(\mathcal {E}_{r}\), rn ≤ νn/2. Consequently, on the events \(\mathcal {E}\) and \(\mathcal {E}_{r}\), we have that \(|\widehat {\text {kl}}(j,k)-\text {kl}(j,k)|\leq M\left (\|\hat {\pi }_{0}^{jk}-\pi _{0}^{jk}\|_{1}+\|\hat {\pi }_{1}^{jk}-\pi _{1}^{jk}\|_{1}\right )+\nu _{n}/2\).
Step 4. In this step, we will prove that with probability tending to 1, \({\widehat {\mathcal {G}}}_{\nu _{n}}\subseteq {\mathcal {G}}\) is true. Specifically,
Step 5. In this step, we will prove that \(P\left (\widehat {\mathcal {G}}_{\nu _{n}}\supseteq \mathcal {G}\right )\rightarrow 1\). Specifically,
Combining the above results, one can see that \(P\left (\widehat {\mathcal {G}}_{\nu _{n}}=\mathcal {G}\right )\rightarrow 1\). This completes the whole proof of Theorem 1.
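The estimator analyzed in Theorem 1 amounts to hard-thresholding the matrix of pairwise statistics at \(\nu _{n}\). The following sketch (our names; kl_hat stands for a symmetric matrix of estimated pairwise statistics) illustrates the thresholding rule.

```python
import numpy as np

def select_pairs(kl_hat, nu_n):
    """Estimated dependence set: all pairs (j, k), j < k, whose
    statistic reaches the threshold nu_n. A sketch (our names) of
    the hard-thresholding rule behind the estimator in Theorem 1."""
    kl_hat = np.asarray(kl_hat)
    p = kl_hat.shape[0]
    return {(j, k) for j in range(p) for k in range(j + 1, p)
            if kl_hat[j, k] >= nu_n}

# Toy symmetric matrix of pairwise statistics: only the pair (0, 1)
# survives the threshold.
kl_hat = np.array([[0.00, 0.30, 0.01],
                   [0.30, 0.00, 0.02],
                   [0.01, 0.02, 0.00]])
print(select_pairs(kl_hat, 0.1))  # {(0, 1)}
```

Theorem 1 says that, with a threshold of the stated order, this selected set equals the true dependence set with probability tending to one.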
Appendix B: Proof of Theorem 2
We prove Theorem 2 by the following two steps.
Step 1. In this step, we will prove that \( P\left (\mathcal {C}_{T}\subset \widehat {\mathcal {C}}(\gamma _{n})\right ) \rightarrow 1\).
One can see that \(P\left (\mathcal {C}_{T}\subset \widehat {\mathcal {C}}(\gamma _{n})\right )= P\left (\bigcap _{m\in {\mathcal {C}}_{T}}\left \{K_{m}^{-1/2}\|\left (\text {diag}\left ({\widehat {\Sigma }}^{(m)}\right )\right )^{-1/2}\left ({{\widehat {\Pi }}_{0}^{m}}-{{\widehat {\Pi }}_{1}^{m}}\right )\|_{2}\geq \gamma _{n}\right \}\right )\geq 1-P\left (\bigcup _{m\in {\mathcal {C}}_{T}}\left \{K_{m}^{-1/2}\|\left (\text {diag}\left ({\widehat {\Sigma }}^{(m)}\right )\right )^{-1/2}\left ({{\widehat {\Pi }}_{0}^{m}}-{{\widehat {\Pi }}_{1}^{m}}\right )\|_{2}\leq \gamma _{n}\right \}\right )\). Under assumption A7 and \(\gamma _{n}=\frac {2}{3}C_{2}n^{-\vartheta }\), we have
We first focus on \(P\left (\left \|\left (\left (\text {diag}\left ({\widehat {\Sigma }}^{(m)}\right )\right )^{-1/2}-\left (\text {diag}\left ({\Sigma }^{(m)}\right )\right )^{-1/2}\right )\cdot \left ({{\widehat {\Pi }}_{0}^{m}}-{{\widehat {\Pi }}_{1}^{m}}\right )\right \|_{1}\right .\) \(\left .\geq 1/6K_{m}C_{2}n^{-\vartheta }\right )\). Define \(\hat {W}^{(m)}=\left (\hat {w}_{1}^{(m)},\ldots ,\hat {w}_{K_{m}}^{(m)}\right )^{\top }\) and \(W^{(m)}=\left (w_{1}^{(m)},\ldots ,w_{K_{m}}^{(m)}\right )^{\top }\) with \(\hat {w}_{k}^{(m)}=\left (\alpha _{0}^{-1}{\widehat {\Pi }}_{k0}^{m}\left (1-{\widehat {\Pi }}_{k0}^{m}\right )+\alpha _{1}^{-1}{\widehat {\Pi }}_{k1}^{m}\left (1-{\widehat {\Pi }}_{k1}^{m}\right )\right )^{-1/2}\), \(w_{k}^{(m)}=\left (\alpha _{0}^{-1}{\Pi }_{k0}^{m}\right .\) \(\left .\left (1-{\Pi }_{k0}^{m}\right )+\alpha _{1}^{-1}{\Pi }_{k1}^{m}\left (1-{\Pi }_{k1}^{m}\right )\right )^{-1/2}\) for k = 1, … , Km.
Then, one can see that
where \({\Pi }_{ky\tau }^{m}=\tau \widehat {\Pi }_{ky}^{m}+(1-\tau ){\Pi }_{ky}^{m}\) for some τ ∈ (0, 1) and y ∈ {0, 1}. One can verify that \(\frac {\partial w_{k}^{(m)}}{\partial {\Pi }_{kl}^{m}}=-\frac {1}{2}\left [\alpha _{0}^{-1}{\Pi }_{k0}^{m}\left (1-{\Pi }_{k0}^{m}\right )+\alpha _{1}^{-1}{\Pi }_{k1}^{m}\left (1-{\Pi }_{k1}^{m}\right )\right ]^{-3/2}\alpha _{l}^{-1}\left (1-2{\Pi }_{kl}^{m}\right )\).
Define \(\mathcal {E}=\left \{|\hat {\Pi }_{ky}^{m}-{\Pi }_{ky}^{m}|<{\Pi }_{L}/2 \text {~for~all~} k, m,\text {~and~} y\in \{0, 1\}\right \}\). Then, by an argument similar to Step 2 in the proof of Theorem 1, one can verify that \(P({\mathcal {E}})\rightarrow 1\). On the event \({\mathcal {E}}\), it is easy to show that \({\Pi }_{L}/2\leq {\Pi }_{ky\tau }^{m}\leq (1-{\Pi }_{L}/2)\) for k = 1, … , Km, y = 0, 1 and all m. Consequently, on the event \({\mathcal {E}}\), one can obtain that
As a result, on the event \(\mathcal {E}\) we have that
Next, we consider \(P\left (\left \|\left (\text {diag}\left ({\Sigma }^{(m)}\right )\right )^{-1/2}\left ({{\widehat {\Pi }}_{y}^{m}}-{{\Pi }_{y}^{m}}\right )\right \|_{1}\geq 1/12K_{m}C_{2}n^{-\vartheta }\right )\) with y = 0, 1. Since \({w_{k}^{m}}=\left (\alpha _{0}^{-1}{\Pi }_{k0}^{m}\left (1-{\Pi }_{k0}^{m}\right )+\alpha _{1}^{-1}{\Pi }_{k1}^{m}\left (1-{\Pi }_{k1}^{m}\right )\right )^{-1/2}\leq {\Pi }_{L}^{-1}\), we have that \(P\left (\left \|\left (\text {diag}\left ({\Sigma }^{(m)}\right )\right )^{-1/2}\left (\widehat {\Pi }_{y}^{m}-{{\Pi }_{y}^{m}}\right )\right \|_{1}\geq 1/12K_{m}C_{2}n^{-\vartheta }\right ) \leq P\left (\max \limits _{k}({w_{k}^{m}})\|\widehat {\Pi }_{y}^{m}-{{\Pi }_{y}^{m}}\|_{1}\geq 1/12K_{m}C_{2}n^{-\vartheta }\right ) \leq P\left (\|\widehat {\Pi }_{y}^{m}-{{\Pi }_{y}^{m}}\|_{1}\geq 1/12{\Pi }_{L}K_{m}C_{2}n^{-\vartheta }\right )\).
Denote \(\min \limits \left \{\frac {1}{96}\frac {{{\Pi }_{L}^{3}}}{1-{\Pi }_{L}}\alpha ,\frac {1}{12}{\Pi }_{L}\right \}\) by M; then, based on the above analysis, we have
Moreover, by the assumptions A2 and A7, we see that \(p=e^{Cn^{\xi }}\) and 𝜗 ∈ (0, (1 − ξ)/2); hence, we can obtain that \(p\exp \left \{-C_{4}n^{1-2\vartheta }\right \}\rightarrow 0\). Combining the fact that \(P({\mathcal {E}}^{C})\rightarrow 0\), we have that \(P\left (\bigcup _{m\in \mathcal {C}_{T}}\left \{K_{m}^{-1/2}\|\left (\text {diag}\left (\widehat {\Sigma }^{(m)}\right )\right )^{-1/2}\left ({\widehat {\Pi }_{0}^{m}}-{\widehat {\Pi }_{1}^{m}}\right )\|_{2}\leq \gamma _{n}\right \}\right )\) \(\rightarrow 0\). Furthermore, \( P\left (\mathcal {C}_{T}\subset \widehat {\mathcal {C}}(\gamma _{n})\right ) \geq 1-P\left (\bigcup _{m\in {\mathcal {C}}_{T}}\{K_{m}^{-1/2}\|\left (\text {diag}\left ({\widehat {\Sigma }}^{(m)}\right )\right )^{-1/2}\right .\) \(\left .\left ({{\widehat {\Pi }}_{0}^{m}}-{{\widehat {\Pi }}_{1}^{m}}\right )\|_{2}\leq \gamma _{n}\}\right )\rightarrow 1\).
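The screening statistic analyzed in this step can be computed directly from the class-conditional cell probabilities of a clique, using the diagonal weights \(\hat {w}_{k}^{(m)}\) defined above. The sketch below is our own minimal implementation; variable names are ours.

```python
import numpy as np

def clique_screening_stat(pi0, pi1, alpha0, alpha1):
    """Screening statistic K_m^{-1/2} ||diag(Sigma)^{-1/2} (Pi0 - Pi1)||_2
    for one clique, with the diagonal weights w_k from the proof of
    Theorem 2. pi0, pi1 hold the K_m class-conditional cell probabilities;
    alpha0, alpha1 are the class proportions. A sketch with our names."""
    pi0 = np.asarray(pi0, dtype=float)
    pi1 = np.asarray(pi1, dtype=float)
    K_m = pi0.size
    # w_k = (alpha0^{-1} pi_k0 (1 - pi_k0) + alpha1^{-1} pi_k1 (1 - pi_k1))^{-1/2}
    w = (pi0 * (1 - pi0) / alpha0 + pi1 * (1 - pi1) / alpha1) ** -0.5
    return np.linalg.norm(w * (pi0 - pi1)) / np.sqrt(K_m)
```

A clique whose two class-conditional cell distributions coincide yields a statistic of exactly zero and is screened out; well-separated distributions yield a large value that survives the threshold \(\gamma _{n}\).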
Step 2. In this step, we will prove that \( P\left (\mathcal {C}_{T}\supset \widehat {\mathcal {C}}(\gamma _{n})\right ) \rightarrow 1\).
One can see that
By a proof similar to Step 1, one can see that \({\sum }_{m}P\left (K_{m}^{-1}\left |\|\left (\text {diag}\left (\widehat {\Sigma }^{(m)}\right )\right )^{-1/2}\left (\widehat {\Pi }_{0}^{m}-\widehat {\Pi }_{1}^{m}\right )\|_{1}-\|\left (\text {diag}\left ({\Sigma }^{(m)}\right )\right )^{-1/2}\left ({{\Pi }_{0}^{m}}-{{\Pi }_{1}^{m}}\right )\|_{1}\right |\geq 1/3C_{2}n^{-\vartheta }\right )\rightarrow 0\). Hence, \(P\left ({\mathcal {C}}_{T}\supset {\widehat {\mathcal {C}}}(\gamma _{n})\right )\rightarrow 1\).
Combining the results of the above two steps, we have that \(P\left (\mathcal {C}_{T}=\widehat {\mathcal {C}}(\gamma _{n})\right )\rightarrow 1\).
This completes the whole proof of Theorem 2.
An, B., Feng, G. & Guo, J. Interaction Identification and Clique Screening for Classification with Ultra-high Dimensional Discrete Features. J Classif 39, 122–146 (2022). https://doi.org/10.1007/s00357-021-09399-0