
Multiclass Sparse Discriminant Analysis Incorporating Graphical Structure Among Predictors

Published in: Journal of Classification

Abstract

In the era of big data, many sparse linear discriminant analysis methods have been proposed for classification and variable selection with high-dimensional data. To solve the multiclass sparse discriminant problem for high-dimensional data under the Gaussian graphical model, this paper proposes a multiclass sparse discriminant analysis method that incorporates the graphical structure among predictors, named the IG-MSDA method. The proposed IG-MSDA method estimates the vectors of all discriminant directions simultaneously. Under certain regularity conditions, we show that the proposed IG-MSDA method consistently estimates all discriminant directions and the Bayes rule. Further, we establish the convergence rates of the estimators of the discriminant directions and of the conditional misclassification rates. Finally, simulation studies and a real data analysis demonstrate the good performance of the proposed IG-MSDA method.
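As background for the classification rule studied here, the multiclass linear discriminant (Bayes) rule based on discriminant directions \(\varvec{\theta }_g=\varvec{\Sigma }^{-1}(\varvec{\mu }_g-\varvec{\mu }_1)\) can be sketched numerically. The snippet below is a minimal illustration with synthetic population quantities (our own choices of \(p\), \(G\), covariance, and means), not the IG-MSDA estimator itself:

```python
import numpy as np

# Synthetic population quantities for a G-class Gaussian model with common covariance.
rng = np.random.default_rng(0)
p, G = 10, 3
# AR(1)-type covariance (an illustrative choice, not from the paper)
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
mus = [np.zeros(p)] + [rng.normal(size=p) for _ in range(G - 1)]

# Discriminant directions relative to class 1: theta_g = Sigma^{-1} (mu_g - mu_1)
thetas = [np.linalg.solve(Sigma, mu - mus[0]) for mu in mus]

def bayes_rule(x, priors=None):
    """Assign x to the class with the largest linear discriminant score."""
    priors = priors if priors is not None else [1.0 / G] * G
    scores = [
        thetas[g] @ (x - (mus[g] + mus[0]) / 2) + np.log(priors[g] / priors[0])
        for g in range(G)
    ]
    return int(np.argmax(scores))
```

With equal priors, a point located exactly at the mean of class \(g\) is always assigned to class \(g\), since the score gap equals \(\tfrac{1}{2}(\varvec{\mu }_g-\varvec{\mu }_h)^{\top }\varvec{\Sigma }^{-1}(\varvec{\mu }_g-\varvec{\mu }_h)>0\).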


[Algorithm 1; Figs. 1–5]


Data Availability

The IBD dataset analyzed during the current study is available under accession number GDS1615 in the Gene Expression Omnibus database of the National Center for Biotechnology Information (NCBI), https://www.ncbi.nlm.nih.gov/geo/.

Code Availability

The code that supports the findings of this study is available at https://github.com/Luo-jx/IG-MSDA.

References

  • Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis (3rd ed.). New Jersey: John Wiley & Sons.

  • Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–202.

  • Bickel, P. J., & Levina, E. (2004). Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli, 10(6), 989–1010.

  • Cai, T., & Liu, W. (2011). A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106, 1566–1577.

  • Cai, T., & Zhang, L. (2019). High dimensional linear discriminant analysis: optimality, adaptive algorithm and missing data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(4), 675–705.

  • Cannings, T. I., & Samworth, R. J. (2017). Random-projection ensemble classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 959–1035.

  • Clemmensen, L., Hastie, T., Witten, D., & Ersbøll, B. (2011). Sparse discriminant analysis. Technometrics, 53, 406–413.

  • Dudoit, S., Fridlyand, J., & Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457), 77–87.

  • Fan, J., & Fan, Y. (2008). High-dimensional classification using features annealed independence rules. The Annals of Statistics, 36, 2605–2637.

  • Fan, J., Feng, Y., & Tong, X. (2012). A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74, 745–771.

  • Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 9(3), 432–441.

  • Guo, J. (2010). Simultaneous variable selection and class fusion for high-dimensional linear discriminant analysis. Biostatistics, 11, 599–608.

  • Jiang, B., Chen, Z., & Leng, C. (2020). Dynamic linear discriminant analysis in high dimensional space. Bernoulli, 26(2), 1234–1268.

  • Le, K. T., Chaux, C., Richard, F. J., & Guedj, E. (2020). An adapted linear discriminant analysis with variable selection for the classification in high-dimension, and an application to medical data. Computational Statistics and Data Analysis, 152, 107031. https://doi.org/10.1016/j.csda.2020.107031

  • Lee, J. W., Lee, J. B., Park, M., & Song, S. H. (2005). An extensive comparison of recent classification tools applied to microarray data. Computational Statistics and Data Analysis, 48, 869–885.

  • Liu, J., Yu, G., & Liu, Y. (2019). Graph-based sparse linear discriminant analysis for high-dimensional classification. Journal of Multivariate Analysis, 171, 250–269.

  • Mai, Q., Yang, Y., & Zou, H. (2019). Multiclass sparse discriminant analysis. Statistica Sinica, 29, 97–111.

  • Mai, Q., & Zou, H. (2015). Sparse semiparametric discriminant analysis. Journal of Multivariate Analysis, 135, 175–188.

  • Mai, Q., Zou, H., & Yuan, M. (2012). A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99, 29–42.

  • Meinshausen, N., & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34(3), 1436–1462.

  • Pun, C. S., & Hadimaja, M. Z. (2021). A self-calibrated direct approach to precision matrix estimation and linear discriminant analysis in high dimensions. Computational Statistics and Data Analysis, 155, 107105. https://doi.org/10.1016/j.csda.2020.107105

  • Sheng, Y., & Wang, Q. (2019). Simultaneous variable selection and class fusion with penalized distance criterion based classifiers. Computational Statistics and Data Analysis, 133, 138–152.

  • Stephenson, M., Ali, R. A., Darlington, G. A., Schenkel, F. S., & Squires, E. J. (2021). DSLRIG: Leveraging predictor structure in logistic regression. Communications in Statistics - Simulation and Computation, 50(6), 1600–1612.

  • Wang, Z., Liu, X., Tang, W., & Lin, Y. (2021). Incorporating graphical structure of predictors in sparse quantile regression. Journal of Business & Economic Statistics, 39(3), 783–792.

  • Witten, D. M., & Tibshirani, R. (2011). Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73, 753–772.

  • Xu, P., Brock, G. N., & Parrish, R. S. (2009). Modified linear discriminant analysis approaches for classification of high-dimensional microarray data. Computational Statistics & Data Analysis, 53, 1674–1687.

  • Yu, G., & Liu, Y. (2016). Sparse regression incorporating graphical structure among predictors. Journal of the American Statistical Association, 111(514), 707–720.

  • Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68, 49–67.

  • Zhou, Y., Zhang, B. X., Li, G. R., Tong, T. J., & Wan, X. (2017). GD-RDA: A new regularized discriminant analysis for high-dimensional data. Journal of Computational Biology, 24, 1099–1111.


Acknowledgements

The authors sincerely thank the editor, the associate editor, and two reviewers for their constructive comments that have led to a substantial improvement of this paper. This research was supported by the National Natural Science Foundation of China (12271046, 11971001 and 12131006).

Author information

Corresponding author

Correspondence to Gaorong Li.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Proof of Main Results

We first provide the subgradient condition of the optimization problem (2.9) and related lemmas.

Proposition 1

The vector \(\varvec{\theta }_g ~ (2\le g\le G)\) is the solution to problem (2.9) if and only if \(\varvec{\theta }_g\) can be decomposed as \({\varvec{\theta }}_g=\displaystyle \sum _{j=1}^{p_n}\varvec{v}^{(j)}_g\) such that, for each \(1\le j\le p_n\):

  1. (a)

    \(\varvec{v}^{(j)}_{\cdot {\mathcal {N}^{(j)}}^{c}}=\varvec{0}\);

  2. (b)

    Either

$$\begin{aligned} \varvec{v}_{\cdot \mathcal {N}^{(j)}}^{(j)}\ne \varvec{0}, \quad \nabla _{\mathcal {N}^{(j)}}L_n\left( \varvec{v}^{(1)}_{\cdot },\ldots ,\varvec{v}^{(p_n)}_{\cdot }\right) +\lambda _{n}\phi _{j}\frac{\varvec{v}_{\cdot \mathcal {N}^{(j)}}^{(j)}}{\left\| \varvec{v}_{\cdot \mathcal {N}^{(j)}}^{(j)}\right\| _{2}}=\varvec{0}, \end{aligned}$$

or

$$\begin{aligned} \varvec{v}_{\cdot \mathcal {N}^{(j)}}^{(j)}=\varvec{0}, \quad \left\| \nabla _{\mathcal {N}^{(j)}}L_{n}\left( \varvec{v}^{(1)}_{\cdot },\ldots ,\varvec{v}^{(p_n)}_{\cdot }\right) \right\| _{2}\le \lambda _n\phi _{j}. \end{aligned}$$

Here,

$$\begin{aligned} \displaystyle \nabla _{\mathcal {N}^{(j)}}L_n (\varvec{v}^{(1)}_{\cdot },\ldots ,\varvec{v}^{(p_n)}_{\cdot } )=\Biggl [\biggl (\frac{\partial L_n (\varvec{v}^{(1)}_{\cdot },\ldots ,\varvec{v}^{(p_n)}_{\cdot } )}{\partial {\varvec{v}^{(j)}_{2\mathcal {N}^{(j)}}}}\biggr )^{\top }, \ldots ,\biggl (\frac{\partial L_n (\varvec{v}^{(1)}_{\cdot },\ldots ,\varvec{v}^{(p_n)}_{\cdot } )}{\partial {\varvec{v}_{G\mathcal {N}^{(j)}}^{(j)}}}\biggr )^{\top }\Biggr ]^{\top }, \end{aligned}$$

where

$$\displaystyle \frac{\partial L_n (\varvec{v}^{(1)}_{\cdot },\ldots ,\varvec{v}^{(p_n)}_{\cdot } )}{\partial {\varvec{v}^{(j)}_{g\mathcal {N}^{(j)}}}}=(\widehat{\varvec{\Sigma }}\varvec{\theta }_g-\widehat{\varvec{\delta }}_g)_{\mathcal {N}^{(j)}}, \qquad \displaystyle \varvec{\theta }_g=\sum _{j=1}^{p_n}\varvec{v}_g^{(j)}, \qquad 2\le g\le G.$$

This subgradient condition is similar to the Karush-Kuhn-Tucker (KKT) condition for the group Lasso in Yuan and Lin (2006). Invoking the KKT condition, it is easy to show that, if \(\widehat{\varvec{v}}^{(j)}_{\cdot }\) is the solution to (2.7) for each \(1\le j\le p_n\), then \(\widehat{\varvec{v}}^{(j)}_g ~ (2\le g\le G)\) is estimated either as \(\varvec{0}\) or as a sparse vector with support set \(\mathcal {N}^{(j)}\). Therefore, \(\displaystyle \widehat{\varvec{\theta }}_g=\sum ^{p_n}_{j=1}\widehat{\varvec{v}}^{(j)}_g ~ (2\le g\le G)\), estimated by the IG-MSDA method, satisfies the decomposition (2.5).
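The zero/nonzero dichotomy in the subgradient condition can be seen in a one-group toy problem with an identity quadratic term, where the minimizer is an explicit block soft-thresholding. This is a simplified sketch with our own toy objective (not the IG-MSDA objective (2.7)); the threshold role of \(\lambda _n\phi _j\) is played by `lam`:

```python
import numpy as np

def block_soft_threshold(d, lam):
    """Minimizer of 0.5*||v||^2 - d.v + lam*||v||_2 (one group, identity quadratic)."""
    norm = np.linalg.norm(d)
    if norm <= lam:
        return np.zeros_like(d)       # whole group is zeroed out
    return (1.0 - lam / norm) * d     # group survives, shrunk toward zero

def kkt_satisfied(v, d, lam, tol=1e-10):
    """Check the two branches of the subgradient condition for this toy problem."""
    grad = v - d                       # gradient of the smooth part
    if np.linalg.norm(v) > 0:
        return np.allclose(grad + lam * v / np.linalg.norm(v), 0.0, atol=tol)
    return np.linalg.norm(grad) <= lam + tol

d = np.array([3.0, 4.0])               # ||d||_2 = 5
for lam in (1.0, 5.0, 7.0):
    v = block_soft_threshold(d, lam)
    assert kkt_satisfied(v, d, lam)    # both branches of the KKT condition hold
```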

Lemma 1

Suppose that regularity conditions (C1)–(C3) hold. Let \(\delta _{\min }=\min _{j\in \mathcal {A}}\Vert \varvec{\delta }_{\cdot j}\Vert _2\) and \(k_n=|\mathcal {A}|\), and suppose that \(\displaystyle \delta _{\min }\Big /\sqrt{\frac{\log p_n}{n}}\rightarrow \infty \) as \(n\rightarrow \infty \). Then, for a sufficiently large positive constant C,

$$\begin{aligned} \Pr \left( \max _{j\in \mathcal {A}}\phi _j\le \frac{C\sqrt{k_n}}{\delta _{\min }}\right) \longrightarrow 1. \end{aligned}$$

Proof

Note that

$$\begin{aligned}{\begin{matrix} \min _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}\Vert _{2} &{}\ge \min _{j\in \mathcal {A}}\left\| {\varvec{\delta }}_{\cdot j}\right\| _{2}-\max _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}-\varvec{\delta }_{\cdot j}\Vert _{2}\\ &{}=\delta _{\min }-\max _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}-\varvec{\delta }_{\cdot j}\Vert _{2}. \end{matrix}}\end{aligned}$$

From Lemma 1 (ii) in Sheng and Wang (2019), it can be seen that

$$\Pr \left( \max _{g,j}|\widehat{\delta }_{gj}-\delta _{gj}|\le C\sqrt{\frac{\log p_n}{n}}\right) \longrightarrow 1.$$

Then we have

$$\begin{aligned} \max _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}-{\varvec{\delta }}_{\cdot j}\Vert _{2} \le \sqrt{G}\max _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}-{\varvec{\delta }}_{\cdot j}\Vert _{\infty } =\sqrt{G}\max _{g,j}|{\widehat{\delta }_{gj}-\delta _{gj}}| \le C\sqrt{\frac{\log p_n}{n}}. \end{aligned}$$

Since \(\displaystyle \delta _{\min }\Big /\sqrt{\dfrac{\log p_n}{n}}\rightarrow \infty \), it follows that \(\displaystyle \min _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}\Vert _{2}\ge \dfrac{\delta _{\min }}{C}\) with probability tending to one. Furthermore, \(\displaystyle \max _{j\in \mathcal {A}}\phi _j=\max _{j\in \mathcal {A}}\dfrac{\sqrt{|{\mathcal {N}^{(j)}}|}}{\Vert \widehat{\varvec{\delta }}_{\cdot j}\Vert _2}\le \dfrac{\sqrt{k_n}}{\min _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}\Vert _2}\le \dfrac{C\sqrt{k_n}}{\delta _{\min }}\). Therefore, Lemma 1 is proved.
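The entrywise concentration rate \(\sqrt{\log p_n/n}\) invoked above can be illustrated by a small Monte Carlo experiment on sample means. This is a synthetic sketch (our own sample sizes and seed; the constant 3 plays the role of C):

```python
import numpy as np

# Illustration of entrywise concentration: the maximum over p coordinates of
# |sample mean - true mean| for N(0,1) data scales like sqrt(log p / n).
rng = np.random.default_rng(1)

def max_mean_error(n, p):
    """Max over p coordinates of |sample mean| for n i.i.d. N(0, 1) rows."""
    X = rng.normal(size=(n, p))
    return np.max(np.abs(X.mean(axis=0)))

for n, p in [(200, 50), (800, 400), (3200, 1000)]:
    ratio = max_mean_error(n, p) / np.sqrt(np.log(p) / n)
    assert ratio < 3.0  # the event in Lemma 1 holds here with C = 3
```

The typical value of `ratio` is close to \(\sqrt{2}\), the classical constant for the maximum of p independent standard normal means.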

Proof of Theorem 1

Under regularity condition (C1), we introduce an oracle estimator, defined as

$$\begin{aligned} \left( \widetilde{\varvec{\psi }}_2,\ldots ,\widetilde{\varvec{\psi }}_G\right) =\arg \min _{{\varvec{\psi }}_{g}\in \mathbb {R}^{k_n}}\Biggl \{{\sum _{g=2}^{G}}\biggl (\frac{1}{2}{\varvec{\psi }}_{g}^{\top }\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\varvec{\psi }_{g}-\widehat{\varvec{\delta }}_{g\mathcal {A}}^{\top }\varvec{\psi }_{g}\biggr )+\lambda _{n} \Vert \varvec{\psi }_{2},\ldots ,\varvec{\psi }_{G}\Vert _{{\mathcal {G_A}},\phi _{\mathcal {A}}}\Biggr \}, \end{aligned}$$
(A.1)

where \(g=2,\ldots ,G\) and \(\mathcal {G_A}\) denotes the subgraph of graph \(\mathcal {G}\) corresponding to \(\mathcal {A}\). In order to prove Theorem 1, the following conclusions need to be proved.

  1. 1)

    \(\underset{j\in \mathcal {A}}{\min }\Vert \widetilde{\varvec{\psi }}_{\cdot j}\Vert _{2}>0\), where \(\widetilde{\varvec{\psi }}_{\cdot j}=(\widetilde{\varvec{\psi }}_{2j},\ldots ,\widetilde{\varvec{\psi }}_{Gj})^{\top }\); For convenience, let \(\mathcal {A}=\{1,2,\ldots ,k_n\}\). By regularity condition (C4), (A.1) is equivalent to \(\displaystyle \widetilde{\varvec{\psi }}_{g}=\sum _{j=1}^{k_n}\widetilde{\varvec{\mu }}_{g}^{(j)}\) and

    $$\begin{aligned} \left( \widetilde{\varvec{\mu }}^{(1)}_{\cdot },\ldots ,\widetilde{\varvec{\mu }}^{(k_n)}_{\cdot }\right) =\arg \min _{\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }}\left\{ Q_n\left( \varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }\right) +\lambda _{n}\sum _{j=1}^{k_n}\phi _{j}\Vert \varvec{\mu }^{(j)}_{\cdot }\Vert _2\right\} , \end{aligned}$$
    (A.2)

    where

    $$\begin{aligned} Q_{n}\left( \varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }\right) =\sum ^{G}_{g=2}\left[ \frac{1}{2}\biggl (\sum _{j=1}^{k_n}\varvec{\mu }^{(j)}_g\biggr )^{\top }\widehat{\varvec{\Sigma }}\biggl (\sum _{j=1}^{k_n}\varvec{\mu }^{(j)}_g\biggr )-\widehat{\varvec{\delta }}^{\top }_g\biggl (\sum _{j=1}^{k_n}\varvec{\mu }_g^{(j)}\biggr )\right] . \end{aligned}$$
    (A.3)

    Here, \(\text {supp}({\varvec{\mu }_{g}}^{(j)})\subseteq \mathcal {N}^{(j)}, ~ 2\le g\le G\), \(\varvec{\mu }_{\cdot }^{(j)}=(\varvec{\mu }_{2}^{(j)\top },\ldots ,\varvec{\mu }_{G}^{(j){\top }})^{\top }\) and \(\displaystyle \varvec{\psi }_{g}=\sum _{j=1}^{k_n}{\varvec{\mu }^{(j)}_{g}}\). Since the objective function defined by (A.2) is convex, from Proposition 1 it is easy to show that its solution satisfies the KKT condition: for all \(j\in \mathcal {A}\),

    1. (a)

      \(\varvec{\mu }_{\cdot {\mathcal {N}^{(j)}}^{c}}^{(j)}=\varvec{0}\);

    2. (b)

      Either

      $$\begin{aligned} \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}\ne \varvec{0}, \quad \nabla _{\mathcal {N}^{(j)}}Q_n\left( \varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }\right) +\lambda _{n}\phi _{j}\frac{\varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}}{\Vert \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}\Vert _{2}}=\varvec{0}, \end{aligned}$$

      or

      $$\begin{aligned} \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}=\varvec{0}, \quad \Vert \nabla _{\mathcal {N}^{(j)}}Q_n\left( \varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }\right) \Vert _{2}\le \lambda _n\phi _{j}. \end{aligned}$$

      Here,

      $$\begin{aligned} \small \displaystyle \nabla _{\mathcal {N}^{(j)}}L_n(\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot })=\Biggl [\biggl (\frac{\partial Q_n (\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot } )}{\partial {\varvec{\mu }^{(j)}_{2\mathcal {N}^{(j)}}}}\biggr )^{\top }, \ldots ,\biggl (\frac{\partial Q_n (\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot } )}{\partial {\varvec{\mu }_{G\mathcal {N}^{(j)}}^{(j)}}}\biggr )^{\top }\Biggr ]^{\top } \end{aligned}$$

      and

      $$\displaystyle \dfrac{\partial Q_n (\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot } )}{\partial {{\varvec{\mu }}^{(j)}_{g\mathcal {N}^{(j)}}}}=(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\varvec{\psi }_g-\widehat{\varvec{\delta }}_{g\mathcal {A}})_{\mathcal {N}^{(j)}}, \quad \displaystyle \varvec{\psi }_g=\sum _{j\in \mathcal {A}}\varvec{\mu }_g^{(j)}, \quad 2\le g\le G. $$

    Then \(\widetilde{\varvec{\psi }}_g ~ (2\le g\le G)\) can be obtained via \(\displaystyle \widetilde{\varvec{\psi }}_{g}=\sum _{j=1}^{k_n}\widetilde{\varvec{\mu }}_{g}^{(j)}\). Let \(\widehat{B}=\{j:\Vert \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}\Vert _{2}\ne 0\}\). For each \(j\in \widehat{B}\), we have

    $$ \nabla _{\mathcal {N}^{(j)}}Q_n(\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot })=-\lambda _n\phi _{j}\frac{\varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}}{\left\| \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}\right\| _{2}}. $$

    For each \(j\notin \widehat{B}\), we have

    $$ \nabla _{\mathcal {N}^{(j)}}Q_n(\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot })=-\lambda _n\phi _{j}\varvec{Z}_{\cdot \mathcal {N}^{(j)}}^{(j)}, $$

    where \(\varvec{Z}_{\cdot }^{(j)}\) is a \(p\times (G-1)\) matrix with \(\Vert \varvec{Z}_{\cdot \mathcal {N}^{(j)}}^{(j)}\Vert _{2}\le 1\). Because some variables may belong to multiple neighborhoods, the following conditions need to be satisfied:

    1. (i)

      For each \(i_1\in \widehat{B}, i_2\in \widehat{B}\) and \(j\in {\mathcal {N}^{(i_1)}\bigcap \mathcal {N}^{(i_2)}}\), \(\phi _{i_1}\varvec{\mu }_{\cdot j}^{(i_1)}\big /\bigl \Vert {\varvec{\mu }}_{\cdot {\mathcal {N}}^{(j)}}^{(i_1)}\bigr \Vert _{2}=\phi _{i_2}{\varvec{\mu }}_{\cdot j}^{(i_2)}\big /\bigl \Vert {\varvec{\mu }}_{\cdot {\mathcal {N}}^{(j)}}^{(i_2)}\bigr \Vert _{2}\);

    2. (ii)

      For each \(i_1\in \widehat{B},i_2\notin \widehat{B},j\in \mathcal {N}^{(i_1)}\bigcap \mathcal {N}^{(i_2)}\), \(\phi _{i_1}\varvec{\mu }_{\cdot j}^{(i_1)} \big /\Vert \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(i_1)}\Vert _{2}=\phi _{i_2}\varvec{Z}_{\cdot j}^{(i_2)}\);

    3. (iii)

      For each \(i_1\notin \widehat{B},i_2\notin \widehat{B}\) and \(j\in \mathcal {N}^{(i_1)}\bigcap \mathcal {N}^{(i_2)}\), \(\phi _{i_1}\varvec{Z}_{\cdot j}^{(i_1)}=\phi _{i_2}\varvec{Z}_{\cdot j}^{(i_2)}\).

    Then, for each \(i\in \mathcal {A}\), define \(\varvec{f}_{i}=(f_{2i},\ldots ,f_{Gi})^{\top }\) as follows: for \(i\in \widehat{B}\), \(\varvec{f}_{i}=\phi _{i}\varvec{\mu }_{\cdot i}^{(i)}/\Vert \varvec{\mu }_{\cdot \mathcal {N}^{(i)}}^{(i)}\Vert _{2}\), that is, \(f_{gi}=\displaystyle \frac{\phi _{i}\varvec{\mu }_{gi}^{(i)}}{\Vert {\varvec{\mu }}^{(i)}_{\cdot \mathcal {N}^{(i)}}\Vert _{2}}, 2\le g\le G\); for \(i\notin \widehat{B}\), \(\varvec{f}_{i}=\phi _{i}\varvec{Z}_{\cdot i}^{(i)}\), that is, \(f_{gi}=\phi _{i}\varvec{Z}_{gi}^{(i)}, 2\le g\le G\). Then any solution \(\varvec{\psi }_{g}\) satisfies the equation \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\varvec{\psi }_g-\widehat{\varvec{\delta }}_{g\mathcal {A}}=-\lambda _n\varvec{f}_{g\mathcal {A}}\). From the definition of \(\widetilde{\varvec{\psi }}_{g}\), it can be obtained that \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\widetilde{\varvec{\psi }}_g-\widehat{\varvec{\delta }}_{g\mathcal {A}}=-\lambda _n\widetilde{\varvec{f}}_{g\mathcal {A}}\), i.e., \(\widetilde{\varvec{\psi }}_g=\widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-\lambda _n\widetilde{\varvec{f}}_{g\mathcal {A}})\). Here, \(\widetilde{\varvec{f}}_{g\mathcal {A}}\in \mathbb {R}^{k_n}, 2\le g\le G\). To prove conclusion 1), we bound

    $$\begin{aligned}\begin{aligned} \max _{1\le g\le G}\Vert \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\Vert _{2} =&\max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-\lambda _n\widetilde{\varvec{f}}_{g\mathcal {A}})-{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}\varvec{\delta }_{g\mathcal {A}}\Vert _{2}\\ =&\max _{1\le g\le G}\Vert (\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}^{-1}-{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}})\varvec{\delta }_{g\mathcal {A}}+{\widehat{\varvec{\Sigma }}}^{-1}_{\mathcal{A}\mathcal{A}}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-\varvec{\delta }_{g\mathcal {A}})-\lambda _n\widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}\widetilde{\varvec{f}}_{g\mathcal {A}}\Vert _{2}\\ \le&\max _{1\le g\le G}\Vert (\widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}-{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}})\varvec{\delta }_{g\mathcal {A}}\Vert _{2}+\max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}^{-1}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-\varvec{\delta }_{g\mathcal {A}})\Vert _{2} \\&+\max _{1\le g\le G}\Vert \lambda _n\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}^{-1}\widetilde{\varvec{f}}_{g\mathcal {A}}\Vert _{2}\\ =:&I_1+I_2+I_3. \end{aligned}\end{aligned}$$

    For \(I_1\), we have

    $$\begin{aligned} \begin{aligned} I_1&=\max _{1\le g\le G}\Vert ({\widehat{\varvec{\Sigma }}}^{-1}_{\mathcal{A}\mathcal{A}}-{\varvec{\Sigma }}^{{-1}}_{\mathcal{A}\mathcal{A}}){\varvec{\delta }_{g\mathcal {A}}}\Vert _2\\&=\max _{1\le g\le G}\Vert {\widehat{\varvec{\Sigma }}}^{-1}_{\mathcal{A}\mathcal{A}}({\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}-\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}){\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}\varvec{\delta }_{g\mathcal {A}}\Vert _{2}\\&\le \widehat{\rho }^{-1}_n{\rho }^{-1}_n\max _{1\le g\le G}\Vert (\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}-\varvec{\Sigma }_{\mathcal{A}\mathcal{A}})\varvec{\delta }_{g\mathcal {A}}\Vert _{2}\\&\le \widehat{\rho }^{-1}_n\xi _0\left[ k_n\left( \max _{i,j\in \mathcal {A}}|{\widehat{\sigma }_{ij}-\sigma _{ij}}|\max _{1\le g\le G}\Vert \varvec{\delta }_{g\mathcal {A}}\Vert _{1}\right) ^{2}\right] ^{1/2}, \end{aligned}\end{aligned}$$
    (A.4)

    where \(\widehat{\rho }_n\) and \(\rho _n\) denote the minimum eigenvalues of \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\) and \(\varvec{\Sigma }_{\mathcal{A}\mathcal{A}}\), respectively. Since \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\) is a positive definite matrix, \(\widehat{\rho }_n>0\). From regularity condition (C2) and Theorem 13.5.1 in Anderson (2003), it is easy to show that \(\widehat{\rho }_n^{-1}\) is uniformly bounded. Furthermore, invoking Lemma 1 (i) in Sheng and Wang (2019), we can show that \(\displaystyle \max _{i,j\in \mathcal {A}}|{\widehat{\sigma }_{ij}-\sigma _{ij}}|=O_p(\sqrt{{\log p_n}/{n}})\), and \(\displaystyle \max _{1\le g\le G}\Vert \varvec{\delta }_{g\mathcal {A}}\Vert _{1}\) is uniformly bounded. Therefore, \(I_{1}=O_{p}(\displaystyle \sqrt{{k_n\log p_n}/{n}})\). For \(I_2\), we have

    $$\begin{aligned} \begin{aligned} I_{2}&=\max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-{\varvec{\delta }}_{g\mathcal {A}})\Vert _{2} \le \widehat{\rho }^{-1}_n\max _{1\le g\le G}\Vert \widehat{\varvec{\delta }}_{g\mathcal {A}}-{\varvec{\delta }}_{g\mathcal {A}}\Vert _{2}\\&\le \widehat{\rho }^{-1}_n\sqrt{k_n}\max _{1\le g\le G}\Vert \widehat{\varvec{\delta }}_{g\mathcal {A}}-{\varvec{\delta }}_{g\mathcal {A}}\Vert _{\infty }. \end{aligned}\end{aligned}$$
    (A.5)

    From Lemma 1 (ii) in Sheng and Wang (2019), it can be obtained that

    $$\Pr \left( \max _{g,j}|{\widehat{\delta }_{gj}-\delta _{gj}}|\le C\sqrt{\frac{\log p_n}{n}}\right) \longrightarrow 1.$$

    Thus, \(I_{2}=O_{p}(\sqrt{{k_n\log p_n}/{n}})\). For \(I_{3}\), we have

    $$\begin{aligned} I_{3}=\lambda _{n}\max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}\widetilde{\varvec{f}}_{g\mathcal {A}}\Vert _{2} \le \lambda _{n}\widehat{\rho }^{-1}_n\max _{1\le g\le G}\Vert \widetilde{\varvec{f}}_{g\mathcal {A}}\Vert _{2} \le \lambda _{n}\widehat{\rho }^{-1}_n\sqrt{k_n}\max _{j\in \mathcal {A}}\phi _{j}. \end{aligned}$$
    (A.6)

    Invoking Lemma 1, it is easy to show that

    $$\Pr \left( \max _{j\in \mathcal {A}}\phi _j \le \frac{C\sqrt{k_n}}{\delta _{\min }}\right) \longrightarrow 1.$$

    Since \(\lambda _{n}=O_p(\delta _{\min }\sqrt{{\log p_n}/{(nk_n)}})\), it follows that \(I_{3}=O_p(\sqrt{{k_n\log p_n}/{n}})\). From (A.4), (A.5) and (A.6), we can obtain that \(\underset{g}{\max }\Vert \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\Vert _{2}=O_{p}(\sqrt{{k_n\log p_n}/{n}})\). If G is fixed, then

    $$\Vert \widetilde{\varvec{\psi }}_{\cdot \mathcal {A}}-{\varvec{\theta }}_{\cdot \mathcal {A}}\Vert _{2}=\left( \sum _{g=2}^{G}\Vert \widetilde{\varvec{\psi }}_{g}-\varvec{\theta }_{g\mathcal {A}}\Vert ^{2}_{2}\right) ^{{1}/{2}}=O_p\left( \sqrt{\frac{k_n\log p_n}{n}}\right) ,$$

    and

    $$\begin{aligned}{\begin{matrix} \min _{j\in \mathcal {A}}\Vert \widetilde{\varvec{\psi }}_{\cdot j}\Vert _{2} &{}>\min _{j\in \mathcal {A}}\Vert \varvec{\theta }_{\cdot j}\Vert _{2}-\max _{j\in \mathcal {A}}\Vert \widetilde{\varvec{\psi }}_{\cdot j}-\varvec{\theta }_{\cdot j}\Vert _{2}\\ &{}>\min _{j\in \mathcal {A}}\Vert \varvec{\theta }_{\cdot j}\Vert _{2}-\Vert \widetilde{\varvec{\psi }}_{\cdot \mathcal {A}}-\varvec{\theta }_{\cdot \mathcal {A}}\Vert _{2} >\theta _{\min }-C\sqrt{\frac{k_n\log p_n}{n}}. \end{matrix}}\end{aligned}$$

    By condition (C5), \(\displaystyle \min _{j\in \mathcal {A}}\Vert \widetilde{\varvec{\psi }}_{\cdot j}\Vert _2>{\theta _{\min }}/{2}>0\). Thus, conclusion 1) is proved.
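The bound on \(I_1\) in (A.4) rests on the matrix identity \(\widehat{\varvec{\Sigma }}^{-1}-\varvec{\Sigma }^{-1}=\widehat{\varvec{\Sigma }}^{-1}(\varvec{\Sigma }-\widehat{\varvec{\Sigma }})\varvec{\Sigma }^{-1}\), which can be verified numerically. The matrices below are synthetic stand-ins for \(\varvec{\Sigma }_{\mathcal{A}\mathcal{A}}\) and its estimate (our own illustrative construction):

```python
import numpy as np

# Numerical check of the identity:
#   Sigma_hat^{-1} - Sigma^{-1} = Sigma_hat^{-1} (Sigma - Sigma_hat) Sigma^{-1}.
rng = np.random.default_rng(2)
k = 5
A = rng.normal(size=(k, k))
Sigma = A @ A.T + k * np.eye(k)          # positive definite "population" covariance
E = 0.01 * rng.normal(size=(k, k))
Sigma_hat = Sigma + (E + E.T) / 2        # small symmetric perturbation ("sample")

lhs = np.linalg.inv(Sigma_hat) - np.linalg.inv(Sigma)
rhs = np.linalg.inv(Sigma_hat) @ (Sigma - Sigma_hat) @ np.linalg.inv(Sigma)
assert np.allclose(lhs, rhs)             # identity holds up to floating-point error
```

The identity is exact, so the check passes up to floating-point rounding; in the proof it converts the difference of inverses into a term controlled by \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}-\varvec{\Sigma }_{\mathcal{A}\mathcal{A}}\).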

  2. 2)

    For all \(g ~ (2\le g\le G)\), \(\widetilde{\varvec{\theta }}_{g}=(\widetilde{\varvec{\psi }}_{g}^{\top },\varvec{0}^{\top })^{\top }\) is the solution obtained by minimizing the objective function (2.9). As shown in Section 2, (2.7) and (2.9) are equivalent optimization problems. From Proposition 1, the optimization problem defined by (2.7) satisfies the KKT condition. For all \(j\in {\mathcal {A}}^{c}\), let \(\widetilde{\varvec{v}}_{g}^{(j)}=\varvec{0}\); for all \(j\in \mathcal {A}\) and \(2\le g\le G\), let \(\widetilde{\varvec{v}}^{(j)}_{g\mathcal {A}}=\widetilde{\varvec{\mu }}^{(j)}_{g}\) and \(\widetilde{\varvec{v}}^{(j)}_{g\mathcal {A}^{c}}=\varvec{0}\); then \(\widetilde{\varvec{\theta }}_{g\mathcal {A}}=\widetilde{\varvec{\psi }}_{g}\) and \(\widetilde{\varvec{\theta }}_{g\mathcal {A}^{c}}=\varvec{0}\). For \(j\in \mathcal {A}\), by the definition of \(\widetilde{\varvec{\psi }}_{g}\), the KKT condition of (2.9) is satisfied. For \(j\in \mathcal {A}^{c}\), we need to show that

    $$\Pr \left\{ \forall j\in \mathcal {A}^{c},\left( \sum _{g=2}^{G}\Vert (\widehat{\varvec{\Sigma }}\varvec{\theta }_{g}-\widehat{\varvec{\delta }}_{g})_{\mathcal {N}^{(j)}}\Vert _2^{2}\right) ^{{1}/{2}}\le \lambda _n\phi _{j}\right\} \longrightarrow 1.$$

    Equivalently, it suffices to show that

    $$\Pr \left\{ \exists j\in \mathcal {A}^{c}, \left( \sum _{g=2}^{G}\Vert (\widehat{\varvec{\Sigma }}\varvec{\theta }_g-\widehat{\varvec{\delta }}_g)_{\mathcal {N}^{(j)}}\Vert _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _j\right\} \longrightarrow 0.$$

    Thus, we have

    $$\begin{aligned}{\begin{matrix} &{}\Pr \left\{ \exists j\in {\mathcal {A}^{c}}, \left( \sum _{g=2}^{G}\left\| \left( \widehat{\varvec{\Sigma }}\varvec{\theta }_g-\widehat{\varvec{\delta }}_{g}\right) _{\mathcal {N}^{(j)}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _j\right\} \\ \le &{} \Pr \left\{ \left( \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\widetilde{\varvec{\psi }}_g-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\min _{j\in \mathcal {A}^{c}}\phi _j\right\} \\ =&{}\Pr \left\{ \left( \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\left( \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\right) +\widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _{*}\right\} \\ =&{}\Pr \left\{ \left( \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _{*}-\left( \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\left( \widetilde{\varvec{\psi }}_{g}-\varvec{\theta }_{g\mathcal {A}}\right) \right\| _{2}^{2}\right) ^{{1}/{2}}\right\} , \end{matrix}}\end{aligned}$$

    where \(\phi _{*}=\min _{j\in \mathcal {A}^{c}}\phi _{j}\).

    Furthermore, we have

    $$\begin{aligned} {\begin{matrix} \max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}(\widetilde{\varvec{\psi }}_{g}-\varvec{\theta }_{g\mathcal {A}})\Vert _{2} &{}\le \max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}(\widetilde{\varvec{\theta }}_g-\varvec{\theta }_{g})\Vert _{2}\\ &{}\le \bar{\rho }_n\max _{1\le g\le G}\Vert \widetilde{\varvec{\theta }}_{g}-\varvec{\theta }_{g}\Vert _{2} =\bar{\rho }_n\max _{1\le g\le G}\Vert \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\Vert _{2}, \end{matrix}} \end{aligned}$$

    where \(\bar{\rho }_n\) denotes the largest eigenvalue of \(\widehat{\varvec{\Sigma }}\).

    Since the eigenvalues of \(\widehat{\varvec{\Sigma }}\) are bounded in probability by regularity condition (C2) and Theorem 13.5.1 in Anderson (2003), and \(\displaystyle \max _{1\le g\le G}\Vert \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\Vert _{2}=O_p(\sqrt{{k_n\log p_n}/{n}})\), we have

    $$\begin{aligned} \max _{1\le g\le G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\left( \widetilde{\varvec{\psi }}_{g}-\varvec{\theta }_{g\mathcal {A}}\right) \right\| _{2}=O_p\left( \sqrt{\frac{k_n\log p_n}{n}}\right) . \end{aligned}$$
    (A.7)

    Further, it is easy to show that

    $$\begin{aligned}{\begin{matrix} &{}\max _{1\le g\le G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2} =\max _{1\le g\le G}\left\| \left( \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}-\varvec{\Sigma }_{\mathcal {A}^{c}\mathcal {A}}\right) \varvec{\theta }_{g\mathcal {A}}-\left( \widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}-\varvec{\delta }_{g\mathcal {A}^{c}}\right) \right\| _{2}\\ \le &{}\max _{1\le g\le G}\left\| \left( \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}-\varvec{\Sigma }_{\mathcal {A}^{c}\mathcal {A}}\right) \varvec{\theta }_{g\mathcal {A}}\right\| _{2}+\max _{1\le g\le G}\left\| \widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}-\varvec{\delta }_{g\mathcal {A}^{c}}\right\| _{2}\\ =&{}\left\{ (p_n-k_n)\left( \max _{i\in \mathcal {A}^{c},j\in \mathcal {A}}|{\widehat{\sigma }_{ij}-\sigma _{ij}}|\left\| \varvec{\theta }_{g\mathcal {A}}\right\| _{1}\right) ^{2}\right\} ^{\frac{1}{2}}+\sqrt{p_n-k_n}\max _{1\le g\le G,j\in \mathcal {A}^{c}}|{\widehat{\delta }_{gj}-\delta _{gj}}|. \end{matrix}}\end{aligned}$$

    Again invoking Lemma 1(i) and (ii) in Sheng and Wang (2019) and regularity conditions (C1)-(C3), we can show that,

    $$\begin{aligned} \Pr \left( \max _{i\in \mathcal {A}^{c},j\in \mathcal {A}}|{\widehat{\sigma }_{ij}-\sigma _{ij}}|\le C\sqrt{\frac{\log p_n}{n}}\right) \longrightarrow 1 \end{aligned}$$

    and

    $$\begin{aligned} \Pr \left( \max _{g,j\in \mathcal {A}^{c}}|{\widehat{\delta }_{gj}-\delta _{gj}}|\le C\sqrt{\frac{\log p_n}{n}}\right) \longrightarrow 1. \end{aligned}$$
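These two maximal inequalities say the entrywise errors concentrate at the \(\sqrt{\log p_n/n}\) scale. A Monte Carlo sketch (taking \(\varvec{\Sigma }=\varvec{I}\), with dimensions chosen only for illustration) shows that the ratio of the maximal covariance error to \(\sqrt{\log p/n}\) stays roughly bounded as \(n\) and \(p\) grow, consistent with the stated rate:

```python
import numpy as np

rng = np.random.default_rng(1)

def max_cov_error(n, p, reps=50):
    """Monte Carlo mean of max_ij |sigma_hat_ij - sigma_ij| when Sigma = I."""
    errs = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        S = X.T @ X / n  # sample covariance (population mean is zero)
        errs.append(np.abs(S - np.eye(p)).max())
    return float(np.mean(errs))

# If the concentration result holds, these ratios stay bounded
# (empirically around 2 in this setup) rather than growing with n and p.
for n, p in [(100, 20), (400, 50), (1600, 100)]:
    print(n, p, max_cov_error(n, p) / np.sqrt(np.log(p) / n))
```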

    Then we have

    $$\begin{aligned} \Pr \left( \max _{g}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2} \le C\sqrt{\frac{\left( p_n-k_n\right) \log p_n}{n}}\right) \longrightarrow 1. \end{aligned}$$
    (A.8)

By (A.7), (A.8), Chebyshev’s inequality and \(\displaystyle \lambda _n\phi _{*}\Big /\sqrt{k_n\log p_n/n}\rightarrow \infty \), we have

    $$\begin{aligned}{\begin{matrix} \Pr \left\{ \exists j\in {\mathcal {A}^{c}}, \left( \sum _{g=2}^{G}\left\| \left( \widehat{\varvec{\Sigma }}\varvec{\theta }_g-\widehat{\varvec{\delta }}_{g}\right) _{\mathcal {N}^{(j)}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _j\right\} &{}\le \frac{\mathbb {E}\left( \displaystyle \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2}^{2}\right) }{(\lambda _n\phi _{*})^{2}}\\ &{}\le \frac{C''\left( G-1\right) \left( p_n-k_n\right) \log p_n}{\lambda ^{2}_n\phi ^2_{*}n}. \end{matrix}}\end{aligned}$$

    By \(\displaystyle \lambda ^2_n\phi ^2_{*}n / \left[ \left( p_n-k_n\right) \log p_n \right] \rightarrow \infty \), we can show that conclusion 2) holds.

Summarizing the above results, we complete the proof of Theorem 1.

Proof of Theorem 2

Let \(R_n\) and \(R^\text {Bayes}\) denote the conditional misclassification rates of IG-MSDA and the Bayes rule, respectively. For a sufficiently large constant \(h\), let \(\eta _0=h({k_n\log p_n}/{n})^{{1}/{3}}\). Similar to the proof of Theorem 2 in Mai and Zou (2015), we have

$$\begin{aligned} \begin{aligned} |R_n-R^\text {Bayes}| \le&\Pr \left( \Big |{D_g^\text {Bayes}(\varvec{x})-D_k^\text {Bayes}(\varvec{x})}\Big |\le \eta _0,\exists g\ne k\right) \\&+\Pr \left( \Big |{\widehat{D}_{g}\left( \varvec{x}\right) -D^\text {Bayes}_{g}\left( \varvec{x}\right) }\Big |\ge \frac{\eta _0}{2},\exists g\big |~(\varvec{x}_i,y_i),i=1,\ldots ,n\right) \\ =:&A_1+A_2. \end{aligned}\end{aligned}$$
(A.9)

For an observation \(\varvec{x}\), \(D_g^\text {Bayes}(\varvec{x})-D_k^\text {Bayes}(\varvec{x})\) follows a normal distribution with variance \(\Delta \). Since \(G\) is fixed, by regularity condition (C5), for a sufficiently large positive number \(M\) we have

$$\begin{aligned} \begin{aligned} A_1 \le&\sum _{g'=1}^{G}\Pr \left( \Big |{D_g^\text {Bayes}(\varvec{x})-D_k^\text {Bayes}(\varvec{x})}\Big |\le \eta _0\big |Y=g'\right) \pi _{g'} \\ \le&\frac{MG^2}{\Delta }\eta _0 \le M\left( \frac{k_n\log p_n}{n}\right) ^{{1}/{3}}. \end{aligned} \end{aligned}$$
(A.10)
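One generic fact behind the first inequality in (A.10): the density of a \(\mathcal {N}(m,\Delta )\) variable is bounded by \(1/\sqrt{2\pi \Delta }\), so any interval of length \(2\eta _0\) carries probability at most \(2\eta _0/\sqrt{2\pi \Delta }\), uniformly in the mean \(m\); the constant \(M\) and condition (C5) absorb the dependence on \(\Delta \). A quick standalone check of this anti-concentration bound, evaluated exactly via the normal CDF (illustrative values only):

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_in_band(mu, var, eta):
    """P(|Z| <= eta) for Z ~ N(mu, var)."""
    sd = math.sqrt(var)
    return normal_cdf((eta - mu) / sd) - normal_cdf((-eta - mu) / sd)

# Density bound: P(|Z| <= eta) <= 2*eta / sqrt(2*pi*var), uniformly over the mean mu.
for mu in (-1.0, 0.0, 0.5):
    for var in (0.5, 1.0, 4.0):
        for eta in (0.01, 0.1, 0.3):
            assert prob_in_band(mu, var, eta) <= 2 * eta / math.sqrt(2 * math.pi * var) + 1e-12
print("anti-concentration bound holds")
```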

For \(A_2\), we have

$$\begin{aligned} \widehat{D}_{g}(\varvec{x})-D^\text {Bayes}_{g}(\varvec{x})\big |\left( Y=g',(\varvec{x}_i,y_i),i=1,\ldots ,n\right) \sim \mathcal {N}\left( \mu _{gg'},(\widehat{\varvec{\theta }}_{g}-\varvec{\theta }_{g})^{\top }\varvec{\Sigma }(\widehat{\varvec{\theta }}_{g}-{\varvec{\theta }}_g)\right) , \end{aligned}$$

where

$$ \displaystyle \mu _{gg'}=\log \frac{\widehat{\pi }_g}{\pi _{g}}-\log \frac{\widehat{\pi }_{1}}{\pi _1}+(\widehat{\varvec{\theta }}_g-\varvec{\theta }_{g})^{\top }\varvec{\mu }_{g'}+\frac{1}{2}(\varvec{\mu }_{1}+\varvec{\mu }_{g})^{\top }\varvec{\theta }_g-\frac{1}{2}(\widehat{\varvec{\mu }}_1+\widehat{\varvec{\mu }}_g)^{\top }\widehat{\varvec{\theta }}_{g}. $$

For \(g,g'=1,\ldots ,G\), we have

$$\begin{aligned} \begin{aligned} \left| \mu _{gg'} \right| \le&\Big |\log \frac{\widehat{\pi }_g}{\pi _{g}}\Big |+\Big |\log \frac{\widehat{\pi }_1}{\pi _{1}}\Big |+\Big |(\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)^{\top }\varvec{\mu }_{g'}\Big |+\frac{1}{2}\Big |(\varvec{\mu }_{1}+\varvec{\mu }_{g})^{\top }(\varvec{\theta }_{g}-\widehat{\varvec{\theta }}_{g})\Big |\\&+\frac{1}{2}\Big |(\varvec{\mu }_1+\varvec{\mu }_g-\widehat{\varvec{\mu }}_1-\widehat{\varvec{\mu }}_g)^{\top }(\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)\Big |+\frac{1}{2}\Big |(\varvec{\mu }_1+\varvec{\mu }_g-\widehat{\varvec{\mu }}_1-\widehat{\varvec{\mu }}_g)^{\top }\varvec{\theta }_{g}\Big | \end{aligned} \end{aligned}$$

in probability.

By a Taylor expansion and the conditions of Theorems 1 and 2, for any \(g,g'=1,\ldots ,G\), we have

$$\begin{aligned} \left| \log \widehat{\pi }_g-\log \pi _g\right| \le \frac{M}{\sqrt{n}}, \end{aligned}$$
$$\begin{aligned} \left| (\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)^{\top }\varvec{\mu }_{g'}\right| \le \max _{1\le g'\le G}\Vert \varvec{\mu }_{g'}\Vert _{2}\Vert \widehat{\varvec{\theta }}-\varvec{\theta }\Vert _2\le M\sqrt{\frac{k_n\log p_n}{n}}, \end{aligned}$$

and

$$\begin{aligned} \frac{1}{2}\Big |(\varvec{\mu }_{1}+\varvec{\mu }_{g})^{\top }(\varvec{\theta }_g-\widehat{\varvec{\theta }}_g)\Big |\le \max _{1\le g\le G}\Vert \varvec{\mu }_g\Vert _{2}\Vert \widehat{\varvec{\theta }}-\varvec{\theta }\Vert _{2}\le M\sqrt{\frac{k_n\log p_n}{n}} \end{aligned}$$

with probability tending to one.

Furthermore, we have

$$\begin{aligned} \frac{1}{2}\Big |(\varvec{\mu }_1+\varvec{\mu }_g-\widehat{\varvec{\mu }}_1-\widehat{\varvec{\mu }}_g)^{\top }(\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)\Big |\le \max _{1\le g\le G}\Vert \varvec{\mu }_{g}-\widehat{\varvec{\mu }}_g\Vert _{2}\Vert \widehat{\varvec{\theta }}-\varvec{\theta }\Vert _{2}\le M\frac{k_n\log p_n}{n} \end{aligned}$$

and

$$\begin{aligned} \frac{1}{2}\Big |(\varvec{\mu }_1+\varvec{\mu }_g-\widehat{\varvec{\mu }}_1-\widehat{\varvec{\mu }}_g)^{\top }\varvec{\theta }_{g}\Big |\le \max _{1\le g\le G}\Vert \varvec{\mu }_g-\widehat{\varvec{\mu }}_g\Vert _2\max _{1\le g\le G}\Vert \varvec{\theta }_{g}\Vert _2\le M\sqrt{\frac{k_n\log p_n}{n}}. \end{aligned}$$

Since \({k_n\log p_n}/{n}\rightarrow 0\), there exists a sufficiently large positive constant \(M\) such that, for \(g,g'=1,\ldots ,G\),

$$\begin{aligned} \Pr \left\{ \left| \mu _{gg'}\right| \le M\sqrt{\frac{k_n\log p_n}{n}}\right\} \longrightarrow 1. \end{aligned}$$
(A.11)

Therefore, there exists a sufficiently large positive constant \(h\) such that \(\eta _0/3>M\sqrt{{k_n\log p_n}/{n}}\), and thus \(\Pr (\left| \mu _{gg'}\right| <\eta _0/3)\rightarrow 1\) for \(g,g'=1,\ldots ,G\). In addition, by Chebyshev’s inequality, we have

$$\begin{aligned} \Pr \left( \left| \widehat{D}_g(\varvec{X})-D^\text {Bayes}_{g}(\varvec{X})\right| \ge \frac{\eta _0}{2}\big |Y=g',(\varvec{x}_i,y_i),i=1,\ldots ,n\right) \le \frac{2(\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)^{\top }\varvec{\Sigma }(\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)}{(\eta _0/2-\mu _{gg'})^{2}}. \end{aligned}$$
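The display above is Chebyshev’s inequality applied after centering at \(\mu _{gg'}\), up to a constant. The underlying generic bound, \(\Pr (|X|\ge t)\le \sigma ^2/(t-|\mu |)^2\) for \(X\sim \mathcal {N}(\mu ,\sigma ^2)\) and \(t>|\mu |\), can be checked exactly via the normal CDF (a standalone sketch with illustrative values):

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tail_prob(mu, var, t):
    """P(|X| >= t) for X ~ N(mu, var)."""
    sd = math.sqrt(var)
    return normal_cdf((-t - mu) / sd) + 1.0 - normal_cdf((t - mu) / sd)

# Chebyshev after centering: P(|X| >= t) <= var / (t - |mu|)^2 whenever t > |mu|.
for mu in (0.0, 0.2, -0.5):
    for var in (0.3, 1.0, 2.0):
        for t in (1.5, 2.0, 3.0):
            assert tail_prob(mu, var, t) <= var / (t - abs(mu)) ** 2 + 1e-12
print("Chebyshev tail bound holds")
```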

From regularity condition (C1) and Theorem 1, for \(g=1,\ldots ,G\), we can show that

$$\begin{aligned} \Pr \left( (\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)^{\top }\varvec{\Sigma }(\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)\le M\frac{k_n\log p_n}{n}\right) \longrightarrow 1. \end{aligned}$$

Then, we have

$$\begin{aligned} {\begin{matrix} A_2 &{}=\Pr \biggl (\left| \widehat{D}_{g}(\varvec{X})-D^\text {Bayes}_{g}(\varvec{X})\right| \ge \frac{\eta _0}{2},\exists g\big |(\varvec{x}_i,y_i),i=1,\ldots ,n\biggr )\\ &{}\le \sum _{g'=1}^{G}\pi _{g'}\Pr \biggl (\left| \widehat{D}_{g}(\varvec{X})-D^\text {Bayes}_{g}(\varvec{X})\right| \ge \frac{\eta _0}{2}\Big |Y=g', (\varvec{x}_i,y_i),i=1,\ldots ,n\biggr )\\ &{}\le M\max _{g,g'}\frac{(\widehat{\varvec{\theta }}_g-\varvec{\theta }_{g})^{\top }\varvec{\Sigma }(\widehat{\varvec{\theta }}_g-\varvec{\theta }_{g})}{(\eta _0/2-\mu _{gg'})^2}\le M\left( \frac{k_n\log p_n}{n}\right) ^{1/3} \end{matrix}}\end{aligned}$$
(A.12)

holds with probability tending to 1. Summarizing the above results, we have \(|R_n-R^\text {Bayes}|=O_p(({k_n\log p_n}/{n})^{1/3})\), which completes the proof of Theorem 2.
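To get a feel for the \(({k_n\log p_n}/{n})^{1/3}\) convergence rate, the following sketch evaluates the bound under one hypothetical growth regime, \(k_n=n^{1/4}\) and \(p_n=n\) (these choices are illustrative only, not assumptions of the theorem):

```python
import math

# Hypothetical growth regime, for illustration only: k_n = n^{1/4}, p_n = n.
def rate(n):
    k_n, p_n = n ** 0.25, float(n)
    return (k_n * math.log(p_n) / n) ** (1.0 / 3.0)

for n in (10**2, 10**4, 10**6, 10**8):
    print(n, rate(n))  # the bound shrinks toward zero as n grows
```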

Cite this article

Luo, J., Li, X., Yu, C. et al. Multiclass Sparse Discriminant Analysis Incorporating Graphical Structure Among Predictors. J Classif 40, 614–637 (2023). https://doi.org/10.1007/s00357-023-09451-1
