Abstract
In the era of big data, many sparse linear discriminant analysis methods have been proposed for classification and variable selection with high-dimensional data. To solve the multiclass sparse discriminant problem for high-dimensional data under the Gaussian graphical model, this paper proposes a multiclass sparse discriminant analysis method that incorporates the graphical structure among predictors, termed the IG-MSDA method. The proposed IG-MSDA method estimates all discriminant direction vectors simultaneously. Under certain regularity conditions, we show that IG-MSDA consistently estimates all discriminant directions and the Bayes rule. Further, we establish the convergence rates of the estimators for the discriminant directions and for the conditional misclassification rates. Finally, simulation studies and a real data analysis demonstrate the good performance of the proposed IG-MSDA method.






Data Availability
The IBD dataset analyzed during the current study is available with accession number GDS1615 in the Gene Expression Omnibus database of the National Center for Biotechnology Information (NCBI), https://www.ncbi.nlm.nih.gov/geo/.
Code Availability
The code that supports the findings of this study is available at https://github.com/Luo-jx/IG-MSDA.
References
Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis (3rd ed.). New Jersey: John Wiley & Sons.
Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–202.
Bickel, P. J., & Levina, E. (2004). Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6), 989–1010.
Cai, T., & Liu, W. (2011). A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106, 1566–1577.
Cai, T., & Zhang, L. (2019). High dimensional linear discriminant analysis: optimality, adaptive algorithm and missing data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(4), 675–705.
Cannings, T. I., & Samworth, R. J. (2017). Random-projection ensemble classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 959–1035.
Clemmensen, L., Hastie, T., Witten, D., & Ersbøll, B. (2011). Sparse discriminant analysis. Technometrics, 53, 406–413.
Dudoit, S., Fridlyand, J., & Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457), 77–87.
Fan, J., & Fan, Y. (2008). High-dimensional classification using features annealed independence rules. The Annals of Statistics, 36, 2605–2637.
Fan, J., Feng, Y., & Tong, X. (2012). A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74, 745–771.
Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 9(3), 432–441.
Guo, J. (2010). Simultaneous variable selection and class fusion for high-dimensional linear discriminant analysis. Biostatistics, 11, 599–608.
Jiang, B., Chen, Z., & Leng, C. (2020). Dynamic linear discriminant analysis in high dimensional space. Bernoulli, 26(2), 1234–1268.
Le, K. T., Chaux, C., Richard, F. J., & Guedj, E. (2020). An adapted linear discriminant analysis with variable selection for the classification in high-dimension, and an application to medical data. Computational Statistics and Data Analysis, 152, 107031. https://doi.org/10.1016/j.csda.2020.107031
Lee, J. W., Lee, J. B., Park, M., & Song, S. H. (2005). An extensive comparison of recent classification tools applied to microarray data. Computational Statistics and Data Analysis, 48, 869–885.
Liu, J., Yu, G., & Liu, Y. (2019). Graph-based sparse linear discriminant analysis for high-dimensional classification. Journal of Multivariate Analysis, 171, 250–269.
Mai, Q., Yang, Y., & Zou, H. (2019). Multiclass sparse discriminant analysis. Statistica Sinica, 29, 97–111.
Mai, Q., & Zou, H. (2015). Sparse semiparametric discriminant analysis. Journal of Multivariate Analysis, 135, 175–188.
Mai, Q., Zou, H., & Yuan, M. (2012). A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99, 29–42.
Meinshausen, N., & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34(3), 1436–1462.
Pun, C. S., & Hadimaja, M. Z. (2021). A self-calibrated direct approach to precision matrix estimation and linear discriminant analysis in high dimensions. Computational Statistics and Data Analysis, 155, 107105. https://doi.org/10.1016/j.csda.2020.107105
Sheng, Y., & Wang, Q. (2019). Simultaneous variable selection and class fusion with penalized distance criterion based classifiers. Computational Statistics and Data Analysis, 133, 138–152.
Stephenson, M., Ali, R. A., Darlington, G. A., Schenkel, F. S., & Squires, E. J. (2021). DSLRIG: Leveraging predictor structure in logistic regression. Communications in Statistics - Simulation and Computation, 50(6), 1600–1612.
Wang, Z., Liu, X., Tang, W., & Lin, Y. (2021). Incorporating graphical structure of predictors in sparse quantile regression. Journal of Business & Economic Statistics, 39(3), 783–792.
Witten, D. M., & Tibshirani, R. (2011). Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73, 753–772.
Xu, P., Brock, G. N., & Parrish, R. S. (2009). Modified linear discriminant analysis approaches for classification of high-dimensional microarray data. Computational Statistics and Data Analysis, 53, 1674–1687.
Yu, G., & Liu, Y. (2016). Sparse regression incorporating graphical structure among predictors. Journal of the American Statistical Association, 111(514), 707–720.
Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68, 49–67.
Zhou, Y., Zhang, B. X., Li, G. R., Tong, T. J., & Wan, X. (2017). GD-RDA: A new regularized discriminant analysis for high-dimensional data. Journal of Computational Biology, 24, 1099–1111.
Acknowledgements
The authors sincerely thank the editor, the associate editor, and two reviewers for their constructive comments that have led to a substantial improvement of this paper. This research was supported by the National Natural Science Foundation of China (12271046, 11971001 and 12131006).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: Proof of Main Results
We first provide the subgradient condition of the optimization problem (2.9) and related lemmas.
Proposition 1
The vector \(\varvec{\theta }_g ~ (2\le g\le G)\) is the solution to problem (2.9) if and only if \(\varvec{\theta }_g\) can be decomposed as \({\varvec{\theta }}_g=\displaystyle \sum _{j=1}^{p_n}\varvec{v}^{(j)}_g\) for \(2\le g\le G\), where, for each \(1\le j\le p_n\):
(a) \(\varvec{v}^{(j)}_{\cdot {\mathcal {N}^{(j)}}^{c}}=\varvec{0}\);
(b) Either
or
Here,
where
This subgradient condition is similar to the Karush–Kuhn–Tucker (KKT) condition of the group Lasso in Yuan and Lin (2006). Invoking the KKT condition, it is easy to show that, if \(\widehat{\varvec{v}}^{(j)}_{\cdot }\) is the solution to (2.7) for each \(1\le j\le p_n\), then \(\widehat{\varvec{v}}^{(j)}_g ~ (2\le g\le G)\) is estimated either as \(\varvec{0}\) or as a sparse vector with support set \(\mathcal {N}^{(j)}\). Therefore, \(\displaystyle \widehat{\varvec{\theta }}_g=\sum ^{p_n}_{j=1}\widehat{\varvec{v}}^{(j)}_g ~ (2\le g\le G)\) estimated by the IG-MSDA method satisfies the decomposition (2.5).
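To make this block-wise KKT condition concrete, the following sketch fits a generic group Lasso in the regression form of Yuan and Lin (2006) by proximal gradient descent and then checks both branches of the condition numerically. This is an illustration only, not the IG-MSDA objective; the data, group layout, and tuning parameter are all hypothetical.

```python
import numpy as np

def group_soft_threshold(v, t):
    """Block soft-thresholding: proximal operator of t * ||.||_2."""
    nv = np.linalg.norm(v)
    return np.zeros_like(v) if nv <= t else (1 - t / nv) * v

def group_lasso(X, y, groups, lam, weights, n_iter=5000):
    """Proximal-gradient (ISTA) solver for
       min_b  0.5/n ||y - X b||^2 + lam * sum_g w_g ||b_g||_2."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n      # Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        z = b - grad / L
        for g, w in zip(groups, weights):  # block-wise proximal step
            b[g] = group_soft_threshold(z[g], lam * w / L)
    return b

rng = np.random.default_rng(0)
n = 100
groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]
X = rng.standard_normal((n, 9))
beta_true = np.concatenate([[1.0, -1.0, 0.5], np.zeros(6)])  # only group 0 active
y = X @ beta_true + 0.1 * rng.standard_normal(n)
lam, w = 0.3, [1.0, 1.0, 1.0]
b = group_lasso(X, y, groups, lam, w)

# KKT check: a zero block has gradient norm <= lam * w_g; an active block
# satisfies grad + lam * w_g * b_g / ||b_g||_2 = 0 (cf. condition (b) above).
grad = X.T @ (X @ b - y) / n
for g, wg in zip(groups, w):
    if np.allclose(b[g], 0):
        print("zero block, gradient norm:", np.linalg.norm(grad[g]))
    else:
        print("active block, KKT residual:",
              np.linalg.norm(grad[g] + lam * wg * b[g] / np.linalg.norm(b[g])))
```

An accelerated variant (FISTA, cited in the references) would only change the iteration scheme, not the KKT characterization being verified.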
Lemma 1
Suppose that regularity conditions (C1)–(C3) hold. For a sufficiently large positive constant C, if \(\displaystyle \delta _{\min }\Big /\sqrt{\frac{\log p_n}{n}}\rightarrow \infty \) as \(n\rightarrow \infty \), where \(\delta _{\min }=\min _{j\in \mathcal {A}}\Vert \varvec{\delta }_{\cdot j}\Vert _2\) and \(k_n=|\mathcal {A}|\), then
Proof
Note that
From Lemma 1 (ii) in Sheng and Wang (2019), it can be seen that
Then we have
Moreover, \(\displaystyle \delta _{\min }\Big /\sqrt{\dfrac{\log p_n}{n}}\rightarrow \infty \); thus, \(\displaystyle \min _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}\Vert _{2}\ge \dfrac{\delta _{\min }}{C}\) with probability tending to one. Furthermore, \(\displaystyle \max _{j\in \mathcal {A}}\phi _j=\max _{j\in \mathcal {A}}\dfrac{\sqrt{|{\mathcal {N}^{(j)}}|}}{\Vert \widehat{\varvec{\delta }}_{\cdot j}\Vert _2}\le \dfrac{\sqrt{k_n}}{\min _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}\Vert _2}\). Therefore, Lemma 1 is proved.
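The adaptive weights \(\phi _j=\sqrt{|\mathcal {N}^{(j)}|}\big /\Vert \widehat{\varvec{\delta }}_{\cdot j}\Vert _2\) bounded in Lemma 1 are computed directly from the neighborhood sizes and the estimated mean differences. A small sketch with a hypothetical neighborhood structure and a hypothetical estimate \(\widehat{\varvec{\delta }}\) (both invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
p, G = 6, 3
# Hypothetical neighborhoods N^(j) from a predictor graph (j plus its neighbors).
neighborhoods = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2], 3: [3], 4: [4, 5], 5: [4, 5]}
# Hypothetical estimated mean-difference matrix: column j is delta_hat_{.j} in R^{G-1}.
delta_hat = rng.standard_normal((G - 1, p))

# phi_j = sqrt(|N^(j)|) / ||delta_hat_{.j}||_2: a large weight means a weak
# marginal signal, so the j-th neighborhood group is penalized more heavily.
phi = np.array([np.sqrt(len(neighborhoods[j])) / np.linalg.norm(delta_hat[:, j])
                for j in range(p)])
print(phi)
```

Lemma 1 then says that, on the signal set \(\mathcal {A}\), these weights stay of order \(\sqrt{k_n}/\delta _{\min }\) with probability tending to one.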
Proof of Theorem 1
Under regularity condition (C1), an oracle estimator is introduced, which is defined as
where \(g=2,\ldots ,G\) and \(\mathcal {G_A}\) denotes the subgraph of graph \(\mathcal {G}\) corresponding to \(\mathcal {A}\). In order to prove Theorem 1, the following conclusions need to be proved.
1) \(\underset{j\in \mathcal {A}}{\min }\Vert \widetilde{\varvec{\psi }}_{\cdot j}\Vert _{2}>0\), where \(\widetilde{\varvec{\psi }}_{\cdot j}=(\widetilde{\varvec{\psi }}_{2j},\ldots ,\widetilde{\varvec{\psi }}_{Gj})^{\top }\). For convenience, let \(\mathcal {A}=\{1,2,\ldots ,k_n\}\). By regularity condition (C4), (A.1) is equivalent to \(\displaystyle \widetilde{\varvec{\psi }}_{g}=\sum _{j=1}^{k_n}\widetilde{\varvec{\mu }}_{g}^{(j)}\) and
$$\begin{aligned} \left( \widetilde{\varvec{\mu }}^{(1)}_{\cdot },\ldots ,\widetilde{\varvec{\mu }}^{(k_n)}_{\cdot }\right) =\arg \min _{\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }}\left\{ Q_n\left( \varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }\right) +\lambda _{n}\sum _{j=1}^{k_n}\phi _{j}\Vert \varvec{\mu }^{(j)}_{\cdot }\Vert _2\right\} , \end{aligned}$$(A.2)where
$$\begin{aligned} Q_{n}\left( \varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }\right) =\sum ^{G}_{g=2}\left[ \frac{1}{2}\biggl (\sum _{j=1}^{k_n}\varvec{\mu }^{(j)}_g\biggr )^{\top }\widehat{\varvec{\Sigma }}\biggl (\sum _{j=1}^{k_n}\varvec{\mu }^{(j)}_g\biggr )-\widehat{\varvec{\delta }}^{\top }_g\biggl (\sum _{j=1}^{k_n}\varvec{\mu }_g^{(j)}\biggr )\right] . \end{aligned}$$(A.3)Here, \(\text {supp}({\varvec{\mu }_{g}}^{(j)})\subseteq \mathcal {N}^{(j)}, ~ 2\le g\le G\), \(\varvec{\mu }_{\cdot }^{(j)}=(\varvec{\mu }_{2}^{(j)\top },\ldots ,\varvec{\mu }_{G}^{(j){\top }})^{\top }\) and \(\displaystyle \varvec{\psi }_{g}=\sum _{j=1}^{k_n}{\varvec{\mu }^{(j)}_{g}}\). Since the objective function defined by (A.2) is a convex function, from Proposition 1, it is easy to show that its solution satisfies KKT condition: for all \(j\in \mathcal {A}\), then
(a) \(\varvec{\mu }_{\cdot {\mathcal {N}^{(j)}}^{c}}^{(j)}=\varvec{0}\);
(b) Either
$$\begin{aligned} \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}\ne \varvec{0}, \quad \nabla _{\mathcal {N}^{(j)}}Q_n\left( \varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }\right) +\lambda _{n}\phi _{j}\frac{\varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}}{\Vert \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}\Vert _{2}}=\varvec{0}, \end{aligned}$$or
$$\begin{aligned} \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}=\varvec{0}, \quad \Vert \nabla _{\mathcal {N}^{(j)}}Q_n\left( \varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }\right) \Vert _{2}\le \lambda _n\phi _{j}. \end{aligned}$$Here,
$$\begin{aligned} \small \displaystyle \nabla _{\mathcal {N}^{(j)}}Q_n(\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot })=\Biggl [\biggl (\frac{\partial Q_n (\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot } )}{\partial {\varvec{\mu }^{(j)}_{2\mathcal {N}^{(j)}}}}\biggr )^{\top }, \ldots ,\biggl (\frac{\partial Q_n (\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot } )}{\partial {\varvec{\mu }_{G\mathcal {N}^{(j)}}^{(j)}}}\biggr )^{\top }\Biggr ]^{\top } \end{aligned}$$and
$$\displaystyle \dfrac{\partial Q_n (\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot } )}{\partial {{\varvec{\mu }}^{(j)}_{g\mathcal {N}^{(j)}}}}=(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\varvec{\psi }_g-\widehat{\varvec{\delta }}_{g\mathcal {A}})_{\mathcal {N}^{(j)}}, \quad \displaystyle \varvec{\psi }_g=\sum _{j\in \mathcal {A}}\varvec{\mu }_g^{(j)}, \quad 2\le g\le G. $$
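The quadratic loss \(Q_n\) in (A.3) depends on the \(\varvec{\mu }_{\cdot }^{(j)}\) only through \(\varvec{\psi }_g=\sum _{j}\varvec{\mu }_g^{(j)}\), and without the penalty its minimizer is \(\varvec{\psi }_g=\widehat{\varvec{\Sigma }}^{-1}\widehat{\varvec{\delta }}_g\). A minimal numerical sketch checking both facts, with a hypothetical positive definite \(\widehat{\varvec{\Sigma }}\) and hypothetical \(\widehat{\varvec{\delta }}_g\) standing in for the sample quantities:

```python
import numpy as np

def Q_n(mu_list, Sigma_hat, delta_hat):
    """Quadratic loss (A.3): each entry of mu_list is a (G-1) x k matrix whose
    rows are mu_g^{(j)}, g = 2, ..., G; the loss depends on the mu's only
    through psi_g = sum_j mu_g^{(j)}."""
    psi = sum(mu_list)  # (G-1) x k; row g-2 corresponds to psi_g
    return sum(0.5 * psi[g] @ Sigma_hat @ psi[g] - delta_hat[g] @ psi[g]
               for g in range(psi.shape[0]))

rng = np.random.default_rng(2)
k, G = 4, 3
A = rng.standard_normal((k, k))
Sigma_hat = A @ A.T + k * np.eye(k)          # hypothetical positive definite covariance
delta_hat = rng.standard_normal((G - 1, k))  # hypothetical mean-difference estimates
mu_list = [rng.standard_normal((G - 1, k)) for _ in range(2)]  # two neighborhood blocks

val = Q_n(mu_list, Sigma_hat, delta_hat)
# Unpenalized minimizer: psi_g = Sigma_hat^{-1} delta_hat_g for every g.
psi_star = np.linalg.solve(Sigma_hat, delta_hat.T).T
val_min = Q_n([psi_star], Sigma_hat, delta_hat)
print(val, val_min)
```

The minimum value equals \(-\tfrac{1}{2}\sum _g\widehat{\varvec{\delta }}_g^{\top }\widehat{\varvec{\Sigma }}^{-1}\widehat{\varvec{\delta }}_g\), which is what the penalty in (A.2) trades off against sparsity.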
Then \(\widetilde{\varvec{\psi }}_g ~ (2\le g\le G)\) can be obtained via \(\displaystyle \widetilde{\varvec{\psi }}_{g}=\sum _{j=1}^{k_n}\widetilde{\varvec{\mu }}_{g}^{(j)}\). Let \(\widehat{B}=\{j:\Vert \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}\Vert _{2}\ne 0\}\). For each \(j\in \widehat{B}\), we have
$$ \nabla _{\mathcal {N}^{(j)}}Q_n(\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot })=-\lambda _n\phi _{j}\frac{\varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}}{\left\| \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}\right\| _{2}}. $$For each \(j\notin \widehat{B}\), we have
$$ \nabla _{\mathcal {N}^{(j)}}Q_n(\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot })=-\lambda _n\phi _{j}\varvec{Z}_{\cdot \mathcal {N}^{(j)}}^{(j)}, $$where \(\varvec{Z}_{\cdot }^{(j)}\) is a \(p_n\times (G-1)\) matrix with \(\Vert \varvec{Z}_{\cdot \mathcal {N}^{(j)}}^{(j)}\Vert _{2}\le 1\). Because some variables may belong to multiple neighborhoods, the following conditions need to be satisfied:
(i) For each \({i_1}\in \widehat{B}, {i_2}\in \widehat{B}\) and \(j\in {\mathcal {N}^{(i_1)}\bigcap \mathcal {N}^{(i_2)}}\), \(\phi _{i_1}\varvec{\mu }_{\cdot j}^{(i_1)}\big /\Vert \varvec{\mu }_{\cdot \mathcal {N}^{(i_1)}}^{(i_1)}\Vert _{2}=\phi _{i_2}\varvec{\mu }_{\cdot j}^{(i_2)}\big /\Vert \varvec{\mu }_{\cdot \mathcal {N}^{(i_2)}}^{(i_2)}\Vert _{2}\);
(ii) For each \(i_1\in \widehat{B}, i_2\notin \widehat{B}\) and \(j\in \mathcal {N}^{(i_1)}\bigcap \mathcal {N}^{(i_2)}\), \(\phi _{i_1}\varvec{\mu }_{\cdot j}^{(i_1)} \big /\Vert \varvec{\mu }_{\cdot \mathcal {N}^{(i_1)}}^{(i_1)}\Vert _{2}=\phi _{i_2}\varvec{Z}_{\cdot j}^{(i_2)}\);
(iii) For each \(i_1\notin \widehat{B}, i_2\notin \widehat{B}\) and \(j\in \mathcal {N}^{(i_1)}\bigcap \mathcal {N}^{(i_2)}\), \(\phi _{i_1}\varvec{Z}_{\cdot j}^{(i_1)}=\phi _{i_2}\varvec{Z}_{\cdot j}^{(i_2)}\).
For each \(i\in \mathcal {A}\), define \(\varvec{f}_{i}=(f_{2i},\ldots ,f_{Gi})^{\top }\) as follows: for \(i\in \widehat{B}\), \(\varvec{f}_{i}=\phi _{i}\varvec{\mu }_{\cdot i}^{(i)}/\Vert \varvec{\mu }_{\cdot \mathcal {N}^{(i)}}^{(i)}\Vert _{2}\), i.e., \(f_{gi}=\displaystyle \frac{\phi _{i}\mu _{gi}^{(i)}}{\Vert {\varvec{\mu }}^{(i)}_{\cdot \mathcal {N}^{(i)}}\Vert _{2}}, 2\le g\le G\); for \(i\notin \widehat{B}\), \(\varvec{f}_{i}=\phi _{i}\varvec{Z}_{\cdot i}^{(i)}\), i.e., \(f_{gi}=\phi _{i}Z_{gi}^{(i)}, 2\le g\le G\). Then any solution \(\varvec{\psi }_{g}\) satisfies the equation \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\varvec{\psi }_g-\widehat{\varvec{\delta }}_{g\mathcal {A}}=-\lambda _n\varvec{f}_{g\mathcal {A}}\). From the definition of \(\widetilde{\varvec{\psi }}_{g}\), it can be obtained that \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\widetilde{\varvec{\psi }}_g-\widehat{\varvec{\delta }}_{g\mathcal {A}}=-\lambda _n\widetilde{\varvec{f}}_{g\mathcal {A}}\), i.e., \(\widetilde{\varvec{\psi }}_g=\widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-\lambda _n\widetilde{\varvec{f}}_{g\mathcal {A}})\). Here, \(\widetilde{\varvec{f}}_{g\mathcal {A}}\in \mathbb {R}^{k_n}, 2\le g\le G\). To prove conclusion 1), we can obtain that
$$\begin{aligned}\begin{aligned} \max _{1\le g\le G}\Vert \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\Vert _{2} =&\max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-\lambda _n\widetilde{\varvec{f}}_{g\mathcal {A}})-{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}\varvec{\delta }_{g\mathcal {A}}\Vert _{2}\\ =&\max _{1\le g\le G}\Vert (\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}^{-1}-{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}})\varvec{\delta }_{g\mathcal {A}}+{\widehat{\varvec{\Sigma }}}^{-1}_{\mathcal{A}\mathcal{A}}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-\varvec{\delta }_{g\mathcal {A}})-\lambda _n\widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}\widetilde{\varvec{f}}_{g\mathcal {A}}\Vert _{2}\\ \le&\max _{1\le g\le G}\Vert (\widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}-{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}})\varvec{\delta }_{g\mathcal {A}}\Vert _{2}+\max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}^{-1}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-\varvec{\delta }_{g\mathcal {A}})\Vert _{2} \\&+\max _{1\le g\le G}\Vert \lambda _n\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}^{-1}\widetilde{\varvec{f}}_{g\mathcal {A}}\Vert _{2}\\ =:&I_1+I_2+I_3. \end{aligned}\end{aligned}$$For \(I_1\), we have
$$\begin{aligned} \begin{aligned} I_1&=\max _{1\le g\le G}\Vert ({\widehat{\varvec{\Sigma }}}^{-1}_{\mathcal{A}\mathcal{A}}-{\varvec{\Sigma }}^{{-1}}_{\mathcal{A}\mathcal{A}}){\varvec{\delta }_{g\mathcal {A}}}\Vert _2\\&=\max _{1\le g\le G}\Vert {\widehat{\varvec{\Sigma }}}^{-1}_{\mathcal{A}\mathcal{A}}({\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}-\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}){\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}\varvec{\delta }_{g\mathcal {A}}\Vert _{2}\\&=\max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}(\varvec{\Sigma }_{\mathcal{A}\mathcal{A}}-\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}})\varvec{\theta }_{g\mathcal {A}}\Vert _{2}\\&\le \widehat{\rho }^{-1}_n\max _{1\le g\le G}\Vert (\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}-\varvec{\Sigma }_{\mathcal{A}\mathcal{A}})\varvec{\theta }_{g\mathcal {A}}\Vert _{2}\\&\le \widehat{\rho }^{-1}_n\xi _0\left[ k_n\left( \max _{i,j\in \mathcal {A}}|{\widehat{\sigma }_{ij}-\sigma _{ij}}|\max _{1\le g\le G}\Vert \varvec{\delta }_{g\mathcal {A}}\Vert _{1}\right) ^{2}\right] ^{1/2}, \end{aligned}\end{aligned}$$(A.4)where \(\widehat{\rho }_n\) denotes the minimum eigenvalue of \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\), \(\varvec{\theta }_{g\mathcal {A}}={\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}\varvec{\delta }_{g\mathcal {A}}\), and the last inequality uses \(\Vert \varvec{\theta }_{g\mathcal {A}}\Vert _{1}\le \xi _0\Vert \varvec{\delta }_{g\mathcal {A}}\Vert _{1}\). It is easy to show that \(\widehat{\rho }_n>0\) since \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\) is a positive definite matrix. From regularity condition (C2) and Theorem 13.5.1 in Anderson (2003), it is easy to show that \(\widehat{\rho }_n^{-1}\) is uniformly bounded. Furthermore, invoking Lemma 1 (i) in Sheng and Wang (2019), we can show that \(\displaystyle \max _{i,j\in \mathcal {A}}|{\widehat{\sigma }_{ij}-\sigma _{ij}}|=O_p(\sqrt{{\log p_n}/{n}})\) and that \(\displaystyle \max _{1\le g\le G}\Vert \varvec{\delta }_g\Vert _{1}\) is uniformly bounded. Therefore, \(I_{1}=O_{p}(\displaystyle \sqrt{{k_n\log p_n}/{n}})\). 
For \(I_2\), we have
$$\begin{aligned} \begin{aligned} I_{2}&=\max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-{\varvec{\delta }}_{g\mathcal {A}})\Vert _{2} \le \widehat{\rho }^{-1}_n\max _{1\le g\le G}\Vert \widehat{\varvec{\delta }}_{g\mathcal {A}}-{\varvec{\delta }}_{g\mathcal {A}}\Vert _{2}\\&\le \widehat{\rho }^{-1}_n\sqrt{k_n}\max _{1\le g\le G}\Vert \widehat{\varvec{\delta }}_{g\mathcal {A}}-{\varvec{\delta }}_{g\mathcal {A}}\Vert _{\infty }. \end{aligned}\end{aligned}$$(A.5)From Lemma 1 (ii) in Sheng and Wang (2019), it can be obtained that
$$\Pr \left( \max _{g,j}|{\widehat{\delta }_{gj}-\delta _{gj}}|\le C\sqrt{\frac{\log p_n}{n}}\right) \longrightarrow 1.$$Thus, \(I_{2}=O_{p}(\sqrt{{k_n\log p_n}/{n}})\). For \(I_{3}\), we have
$$\begin{aligned} I_{3}=\lambda _{n}\max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}\widetilde{\varvec{f}}_{g\mathcal {A}}\Vert _{2} \le \lambda _{n}\widehat{\rho }^{-1}_n\max _{1\le g\le G}\Vert \widetilde{\varvec{f}}_{g\mathcal {A}}\Vert _{2} \le \lambda _{n}\widehat{\rho }^{-1}_n\sqrt{k_n}\max _{j\in \mathcal {A}}\phi _{j}. \end{aligned}$$(A.6)Invoking Lemma 1, it is easy to show that
$$\Pr \left( \max _{j\in \mathcal {A}}\phi _j \le \frac{C\sqrt{k_n}}{\delta _{\min }}\right) \longrightarrow 1.$$Since \(\lambda _{n}=O_p(\delta _{\min }\sqrt{{\log p_n}/{(nk_n)}})\), it follows that \(I_{3}=O_p(\sqrt{{k_n\log p_n}/{n}})\). From (A.4), (A.5) and (A.6), we can obtain that \(\underset{g}{\max }\Vert \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\Vert _{2}=O_{p}(\sqrt{{k_n\log p_n}/{n}})\). If G is fixed, then
$$\Vert \widetilde{\varvec{\psi }}_{\cdot \mathcal {A}}-{\varvec{\theta }}_{\cdot \mathcal {A}}\Vert _{2}=\left( \sum _{g=2}^{G}\Vert \widetilde{\varvec{\psi }}_{g}-\varvec{\theta }_{g\mathcal {A}}\Vert ^{2}_{2}\right) ^{{1}/{2}}=O_p\left( \sqrt{\frac{k_n\log p_n}{n}}\right) ,$$and
$$\begin{aligned} \min _{j\in \mathcal {A}}\Vert \widetilde{\varvec{\psi }}_{\cdot j}\Vert _{2}&\ge \min _{j\in \mathcal {A}}\Vert \varvec{\theta }_{\cdot j}\Vert _{2}-\max _{j\in \mathcal {A}}\Vert \widetilde{\varvec{\psi }}_{\cdot j}-\varvec{\theta }_{\cdot j}\Vert _{2}\\&\ge \min _{j\in \mathcal {A}}\Vert \varvec{\theta }_{\cdot j}\Vert _{2}-\Vert \widetilde{\varvec{\psi }}_{\cdot \mathcal {A}}-\varvec{\theta }_{\cdot \mathcal {A}}\Vert _{2} \ge \theta _{\min }-C\sqrt{\frac{k_n\log p_n}{n}}. \end{aligned}$$By condition (C5), \(\displaystyle \min _{j\in \mathcal {A}}\Vert \widetilde{\varvec{\psi }}_{\cdot j}\Vert _{2}>{\theta _{\min }}/{2}>0\) with probability tending to one. Thus, conclusion 1) is proved.
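The closed-form stationarity relation \(\widetilde{\varvec{\psi }}_g=\widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-\lambda _n\widetilde{\varvec{f}}_{g\mathcal {A}})\) used throughout the proof of conclusion 1) can be verified mechanically. A sketch with hypothetical stand-ins for \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\), \(\widehat{\varvec{\delta }}_{g\mathcal {A}}\), \(\widetilde{\varvec{f}}_{g\mathcal {A}}\), and \(\lambda _n\):

```python
import numpy as np

rng = np.random.default_rng(3)
k = 5                               # |A|, the number of signal predictors
A = rng.standard_normal((k, k))
Sigma_AA = A @ A.T + k * np.eye(k)  # hypothetical Sigma_hat_{AA}, positive definite
delta_g = rng.standard_normal(k)    # hypothetical delta_hat_{gA}
f_g = rng.standard_normal(k)
f_g *= 0.8 / np.linalg.norm(f_g)    # hypothetical subgradient block f_tilde_{gA}
lam = 0.1                           # tuning parameter lambda_n

# Closed form implied by the stationarity condition.
psi_g = np.linalg.solve(Sigma_AA, delta_g - lam * f_g)
# The stationarity residual Sigma_AA psi_g - delta_g + lam f_g should vanish.
residual = Sigma_AA @ psi_g - delta_g + lam * f_g
print(np.linalg.norm(residual))
```

As \(\lambda _n\rightarrow 0\) at the rate assumed in Theorem 1, the penalty perturbation \(\lambda _n\widetilde{\varvec{f}}_{g\mathcal {A}}\) vanishes and \(\widetilde{\varvec{\psi }}_g\) approaches the unpenalized estimator \(\widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}\widehat{\varvec{\delta }}_{g\mathcal {A}}\).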
2) For all \(g ~ (2\le g\le G)\), \(\widetilde{\varvec{\theta }}_{g}=(\widetilde{\varvec{\psi }}_{g}^{\top },\varvec{0}^{\top })^{\top }\) minimizes the objective function (2.9). It can be seen from Section 2 that (2.7) and (2.9) are equivalent optimization problems. From Proposition 1, the solution to the optimization problem (2.7) satisfies the KKT condition. For all \(j\in {\mathcal {A}}^{c}\), let \(\widetilde{\varvec{v}}_{g}^{(j)}=\varvec{0}\); for all \(j\in \mathcal {A}, 2\le g\le G\), let \(\widetilde{\varvec{v}}^{(j)}_{g\mathcal {A}}=\widetilde{\varvec{\mu }}^{(j)}_{g}\) and \(\widetilde{\varvec{v}}^{(j)}_{g\mathcal {A}^{c}}=\varvec{0}\); then \(\widetilde{\varvec{\theta }}_{g\mathcal {A}}=\widetilde{\varvec{\psi }}_{g}\) and \( \widetilde{\varvec{\theta }}_{g\mathcal {A}^{c}}=\varvec{0}\). For \(j\in \mathcal {A}\), by the definition of \(\widetilde{\varvec{\psi }}_{g}\), the KKT condition of (2.9) is satisfied. For \(j\in \mathcal {A}^{c}\), we need to show that
$$\Pr \left\{ \forall j\in \mathcal {A}^{c},\left( \sum _{g=2}^{G}\Vert (\widehat{\varvec{\Sigma }}\widetilde{\varvec{\theta }}_{g}-\widehat{\varvec{\delta }}_{g})_{\mathcal {N}^{(j)}}\Vert _2^{2}\right) ^{{1}/{2}}\le \lambda _n\phi _{j}\right\} \longrightarrow 1.$$It is equivalent to proving that
$$\Pr \left\{ \exists j\in \mathcal {A}^{c}, \left( \sum _{g=2}^{G}\Vert (\widehat{\varvec{\Sigma }}\widetilde{\varvec{\theta }}_g-\widehat{\varvec{\delta }}_g)_{\mathcal {N}^{(j)}}\Vert _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _j\right\} \longrightarrow 0.$$Thus, we have
$$\begin{aligned}&\Pr \left\{ \exists j\in {\mathcal {A}^{c}}, \left( \sum _{g=2}^{G}\left\| \left( \widehat{\varvec{\Sigma }}\widetilde{\varvec{\theta }}_g-\widehat{\varvec{\delta }}_{g}\right) _{\mathcal {N}^{(j)}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _j\right\} \\&\quad \le \Pr \left\{ \left( \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\widetilde{\varvec{\psi }}_g-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\min _{j\in \mathcal {A}^{c}}\phi _j\right\} \\&\quad =\Pr \left\{ \left( \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\left( \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\right) +\widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _{*}\right\} \\&\quad \le \Pr \left\{ \left( \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _{*}-\left( \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\left( \widetilde{\varvec{\psi }}_{g}-\varvec{\theta }_{g\mathcal {A}}\right) \right\| _{2}^{2}\right) ^{{1}/{2}}\right\} , \end{aligned}$$where \(\phi _{*}=\min _{j\in \mathcal {A}^{c}}\phi _j\). Furthermore, we have
$$\begin{aligned} \max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}(\widetilde{\varvec{\psi }}_{g}-\varvec{\theta }_{g\mathcal {A}})\Vert _{2}&\le \max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}(\widetilde{\varvec{\theta }}_g-\varvec{\theta }_{g})\Vert _{2}\\&\le \widehat{\varrho }_n\max _{1\le g\le G}\Vert \widetilde{\varvec{\theta }}_{g}-\varvec{\theta }_{g}\Vert _{2} =\widehat{\varrho }_n\max _{1\le g\le G}\Vert \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\Vert _{2}, \end{aligned}$$where \(\widehat{\varrho }_n\) denotes the largest eigenvalue of \(\widehat{\varvec{\Sigma }}\). From the boundedness of \(\widehat{\varrho }_n\), \(\displaystyle \max _{1\le g\le G}\Vert \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\Vert _{2}=O_p(\sqrt{{k_n\log p_n}/{n}})\), regularity condition (C2) and Theorem 13.5.1 in Anderson (2003), we have
$$\begin{aligned} \max _{1\le g\le G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\left( \widetilde{\varvec{\psi }}_{g}-\varvec{\theta }_{g\mathcal {A}}\right) \right\| _{2}=O_p\left( \sqrt{\frac{k_n\log p_n}{n}}\right) . \end{aligned}$$(A.7)Further, it is easy to show that
$$\begin{aligned}&\max _{1\le g\le G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2} =\max _{1\le g\le G}\left\| \left( \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}-\varvec{\Sigma }_{\mathcal {A}^{c}\mathcal {A}}\right) \varvec{\theta }_{g\mathcal {A}}-\left( \widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}-\varvec{\delta }_{g\mathcal {A}^{c}}\right) \right\| _{2}\\&\quad \le \max _{1\le g\le G}\left\| \left( \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}-\varvec{\Sigma }_{\mathcal {A}^{c}\mathcal {A}}\right) \varvec{\theta }_{g\mathcal {A}}\right\| _{2}+\max _{1\le g\le G}\left\| \widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}-\varvec{\delta }_{g\mathcal {A}^{c}}\right\| _{2}\\&\quad \le \left\{ (p_n-k_n)\left( \max _{i\in \mathcal {A}^{c},j\in \mathcal {A}}|{\widehat{\sigma }_{ij}-\sigma _{ij}}|\max _{1\le g\le G}\left\| \varvec{\theta }_{g\mathcal {A}}\right\| _{1}\right) ^{2}\right\} ^{{1}/{2}}+\sqrt{p_n-k_n}\max _{1\le g\le G,j\in \mathcal {A}^{c}}|{\widehat{\delta }_{gj}-\delta _{gj}}|. \end{aligned}$$Again invoking Lemma 1 (i) and (ii) in Sheng and Wang (2019) and regularity conditions (C1)–(C3), we can show that
$$\begin{aligned} \Pr \left( \max _{i\in \mathcal {A}^{c},j\in \mathcal {A}}|{\widehat{\sigma }_{ij}-\sigma _{ij}}|\le C\sqrt{\frac{\log p_n}{n}}\right) \longrightarrow 1 \end{aligned}$$and
$$\begin{aligned} \Pr \left( \max _{g,j\in \mathcal {A}^{c}}|{\widehat{\delta }_{gj}-\delta _{gj}}|\le C\sqrt{\frac{\log p_n}{n}}\right) \longrightarrow 1. \end{aligned}$$Then we have
$$\begin{aligned} \Pr \left( \max _{g}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2} \le C\sqrt{\frac{\left( p_n-k_n\right) \log p_n}{n}}\right) \longrightarrow 1. \end{aligned}$$(A.8)By (A.7), (A.8), Chebyshev’s inequality and \(\displaystyle \lambda _n\phi _{*}\Big /\sqrt{k_n\log p_n/n}\rightarrow \infty \) , we have
$$\begin{aligned}&\Pr \left\{ \exists j\in {\mathcal {A}^{c}}, \left( \sum _{g=2}^{G}\left\| \left( \widehat{\varvec{\Sigma }}\widetilde{\varvec{\theta }}_g-\widehat{\varvec{\delta }}_{g}\right) _{\mathcal {N}^{(j)}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _j\right\} \le \frac{\mathbb {E}\left( \displaystyle \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2}^{2}\right) }{(\lambda _n\phi _{*})^{2}}\\&\quad \le \frac{C''\left( G-1\right) \left( p_n-k_n\right) \log p_n}{\lambda ^{2}_n\phi ^2_{*}n}. \end{aligned}$$By \(\displaystyle \lambda ^2_n\phi ^2_{*}n \big / \left[ \left( p_n-k_n\right) \log p_n \right] \rightarrow \infty \), we can show that conclusion 2) holds.
Summarizing the above results, we finish the proof of Theorem 1.
Proof of Theorem 2
Let \(R_n\) and \(R^\text {Bayes}\) denote the conditional misclassification rates of IG-MSDA and the Bayes rule, respectively. For a sufficiently large constant h, let \(\eta _0=h({k_n\log p_n}/{n})^{{1}/{3}}\). Similar to the proof of Theorem 2 in Mai and Zou (2015), we have
For an observation \(\varvec{x}\), \(D_g^\text {Bayes}(\varvec{x})-D_k^\text {Bayes}(\varvec{x})\) follows a normal distribution with variance \(\Delta \). Since regularity condition (C5) holds and G is fixed, for a sufficiently large positive number M, we have
For \(A_2\), we have
where
For \(g,g'=1,\ldots ,G\), we have
in probability.
By a Taylor expansion and the conditions of Theorems 1 and 2, for any \(g,g'=1,\ldots ,G\), we have
and
with probability tending to one.
Furthermore, we have
and
Since \({k_n\log p_n}/{n}\rightarrow 0\), there exists a large enough positive constant M such that, for \(g,g'=1,\ldots ,G\),
Therefore, there exists a sufficiently large positive constant h such that \(\eta _0/3>M\sqrt{{k_n\log p_n}/{n}}\). Thus, for \(g,g'=1,\ldots ,G\), \(\Pr (\left| \mu _{gg'}\right| <\eta _0/3)\rightarrow 1\). In addition, by Markov’s inequality, we have
From regularity condition (C1) and Theorem 1, for \(g=1,\ldots ,G\), we can show that
Then, we have
holds with probability tending to 1. Summarizing the above results, we have \(|R_n-R^\text {Bayes}|=O_p(({k_n\log p_n}/{n})^{1/3})\). Therefore, we complete the proof of Theorem 2.
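Theorem 2 compares the conditional misclassification rate of the fitted rule with the Bayes rate. The gap can be illustrated in a toy two-class Gaussian model. This sketch uses a hypothetical setting with identity covariance and ordinary LDA in place of IG-MSDA, because the Bayes rate then has the closed form \(\Phi (-\Delta /2)\); all dimensions and parameters are invented for illustration.

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(4)
p, n = 10, 500
mu1, mu2 = np.zeros(p), np.zeros(p)
mu2[:3] = 1.0                          # sparse mean difference
Sigma = np.eye(p)
Delta = np.linalg.norm(mu2 - mu1)      # Mahalanobis distance (Sigma = I)
bayes_rate = Phi(-Delta / 2)           # Bayes misclassification rate, equal priors

# Fit an ordinary LDA direction from data.
X1 = rng.multivariate_normal(mu1, Sigma, n)
X2 = rng.multivariate_normal(mu2, Sigma, n)
S_pooled = 0.5 * (np.cov(X1.T) + np.cov(X2.T))
theta = np.linalg.solve(S_pooled, X2.mean(0) - X1.mean(0))
m = 0.5 * (X1.mean(0) + X2.mean(0))

# Exact conditional error of the rule "classify to class 2 iff theta'(x - m) > 0",
# computed under the true Gaussian model given the fitted (theta, m).
s = sqrt(theta @ Sigma @ theta)
R_n = 0.5 * (Phi(-(theta @ (mu2 - m)) / s) + Phi((theta @ (mu1 - m)) / s))
print(bayes_rate, R_n)
```

By Bayes optimality \(R_n\ge R^\text {Bayes}\) always holds, and as n grows the excess \(R_n-R^\text {Bayes}\) shrinks, mirroring the \(O_p(({k_n\log p_n}/{n})^{1/3})\) rate established above for IG-MSDA.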
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Luo, J., Li, X., Yu, C. et al. Multiclass Sparse Discriminant Analysis Incorporating Graphical Structure Among Predictors. J Classif 40, 614–637 (2023). https://doi.org/10.1007/s00357-023-09451-1