
Multiclass Sparse Discriminant Analysis Incorporating Graphical Structure Among Predictors

Published in: Journal of Classification

Abstract

In the era of big data, many sparse linear discriminant analysis methods have been proposed for classification and variable selection with high-dimensional data. To solve the multiclass sparse discriminant problem for high-dimensional data under the Gaussian graphical model, this paper proposes a multiclass sparse discriminant analysis method that incorporates the graphical structure among predictors, named the IG-MSDA method. The proposed IG-MSDA method estimates the vectors of all discriminant directions simultaneously. Under certain regularity conditions, we show that the proposed IG-MSDA method consistently estimates all discriminant directions and the Bayes rule. Further, we establish the convergence rates of the estimators of the discriminant directions and of the conditional misclassification rates. Finally, simulation studies and a real data analysis demonstrate the good performance of the proposed IG-MSDA method.
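As background for the classification rule studied here, the multiclass linear discriminant (Bayes) rule based on discriminant directions \(\varvec{\theta }_g=\varvec{\Sigma }^{-1}(\varvec{\mu }_g-\varvec{\mu }_1)\) can be sketched numerically. The snippet below is a minimal illustration with synthetic population quantities (our own choices of \(p\), \(G\), covariance, and means), not the IG-MSDA estimator itself:

```python
import numpy as np

# Synthetic population quantities for a G-class Gaussian model with common covariance.
rng = np.random.default_rng(0)
p, G = 10, 3
# AR(1)-type covariance (an illustrative choice, not from the paper)
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
mus = [np.zeros(p)] + [rng.normal(size=p) for _ in range(G - 1)]

# Discriminant directions relative to class 1: theta_g = Sigma^{-1} (mu_g - mu_1)
thetas = [np.linalg.solve(Sigma, mu - mus[0]) for mu in mus]

def bayes_rule(x, priors=None):
    """Assign x to the class with the largest linear discriminant score."""
    priors = priors if priors is not None else [1.0 / G] * G
    scores = [
        thetas[g] @ (x - (mus[g] + mus[0]) / 2) + np.log(priors[g] / priors[0])
        for g in range(G)
    ]
    return int(np.argmax(scores))
```

With equal priors, a point located exactly at the mean of class \(g\) is always assigned to class \(g\), since the score gap equals \(\tfrac{1}{2}(\varvec{\mu }_g-\varvec{\mu }_h)^{\top }\varvec{\Sigma }^{-1}(\varvec{\mu }_g-\varvec{\mu }_h)>0\).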


[Algorithm 1; Figs. 1–5]


Data Availability

The IBD dataset analyzed during the current study is available under accession number GDS1615 in the Gene Expression Omnibus database of the National Center for Biotechnology Information (NCBI), https://www.ncbi.nlm.nih.gov/geo/.

Code Availability

The code that supports the findings of this study is available at https://github.com/Luo-jx/IG-MSDA.

References

  • Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis (3rd ed.). New Jersey: John Wiley & Sons.

  • Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–202.

  • Bickel, P. J., & Levina, E. (2004). Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli, 10(6), 989–1010.

  • Cai, T., & Liu, W. (2011). A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106, 1566–1577.

  • Cai, T., & Zhang, L. (2019). High dimensional linear discriminant analysis: optimality, adaptive algorithm and missing data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(4), 675–705.

  • Cannings, T. I., & Samworth, R. J. (2017). Random-projection ensemble classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 959–1035.

  • Clemmensen, L., Hastie, T., Witten, D., & Ersbøll, B. (2011). Sparse discriminant analysis. Technometrics, 53, 406–413.

  • Dudoit, S., Fridlyand, J., & Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457), 77–87.

  • Fan, J., & Fan, Y. (2008). High-dimensional classification using features annealed independence rules. The Annals of Statistics, 36, 2605–2637.

  • Fan, J., Feng, Y., & Tong, X. (2012). A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74, 745–771.

  • Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 9(3), 432–441.

  • Guo, J. (2010). Simultaneous variable selection and class fusion for high-dimensional linear discriminant analysis. Biostatistics, 11, 599–608.

  • Jiang, B., Chen, Z., & Leng, C. (2020). Dynamic linear discriminant analysis in high dimensional space. Bernoulli, 26(2), 1234–1268.

  • Le, K. T., Chaux, C., Richard, F. J., & Guedj, E. (2020). An adapted linear discriminant analysis with variable selection for the classification in high-dimension, and an application to medical data. Computational Statistics and Data Analysis, 152, 107031. https://doi.org/10.1016/j.csda.2020.107031

  • Lee, J. W., Lee, J. B., Park, M., & Song, S. H. (2005). An extensive comparison of recent classification tools applied to microarray data. Computational Statistics and Data Analysis, 48, 869–885.

  • Liu, J., Yu, G., & Liu, Y. (2019). Graph-based sparse linear discriminant analysis for high-dimensional classification. Journal of Multivariate Analysis, 171, 250–269.

  • Mai, Q., Yang, Y., & Zou, H. (2019). Multiclass sparse discriminant analysis. Statistica Sinica, 29, 97–111.

  • Mai, Q., & Zou, H. (2015). Sparse semiparametric discriminant analysis. Journal of Multivariate Analysis, 135, 175–188.

  • Mai, Q., Zou, H., & Yuan, M. (2012). A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99, 29–42.

  • Meinshausen, N., & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34(3), 1436–1462.

  • Pun, C. S., & Hadimaja, M. Z. (2021). A self-calibrated direct approach to precision matrix estimation and linear discriminant analysis in high dimensions. Computational Statistics and Data Analysis, 155, 107105. https://doi.org/10.1016/j.csda.2020.107105

  • Sheng, Y., & Wang, Q. (2019). Simultaneous variable selection and class fusion with penalized distance criterion based classifiers. Computational Statistics and Data Analysis, 133, 138–152.

  • Stephenson, M., Ali, R. A., Darlington, G. A., Schenkel, F. S., & Squires, E. J. (2021). DSLRIG: Leveraging predictor structure in logistic regression. Communications in Statistics - Simulation and Computation, 50(6), 1600–1612.

  • Wang, Z., Liu, X., Tang, W., & Lin, Y. (2021). Incorporating graphical structure of predictors in sparse quantile regression. Journal of Business & Economic Statistics, 39(3), 783–792.

  • Witten, D. M., & Tibshirani, R. (2011). Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73, 753–772.

  • Xu, P., Brock, G. N., & Parrish, R. S. (2009). Modified linear discriminant analysis approaches for classification of high-dimensional microarray data. Computational Statistics & Data Analysis, 53, 1674–1687.

  • Yu, G., & Liu, Y. (2016). Sparse regression incorporating graphical structure among predictors. Journal of the American Statistical Association, 111(514), 707–720.

  • Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68, 49–67.

  • Zhou, Y., Zhang, B. X., Li, G. R., Tong, T. J., & Wan, X. (2017). GD-RDA: A new regularized discriminant analysis for high-dimensional data. Journal of Computational Biology, 24, 1099–1111.


Acknowledgements

The authors sincerely thank the editor, the associate editor, and two reviewers for their constructive comments that have led to a substantial improvement of this paper. This research was supported by the National Natural Science Foundation of China (12271046, 11971001 and 12131006).

Author information

Corresponding author

Correspondence to Gaorong Li.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Proof of Main Results

We first provide the subgradient condition of the optimization problem (2.9) and related lemmas.

Proposition 1

The vector \(\varvec{\theta }_g ~ (2\le g\le G)\) is the solution to problem (2.9) if and only if \(\varvec{\theta }_g\) can be decomposed as \({\varvec{\theta }}_g=\displaystyle \sum _{j=1}^{p_n}\varvec{v}^{(j)}_g\) such that, for each \(1\le j\le p_n\):

  1. (a)

    \(\varvec{v}^{(j)}_{\cdot {\mathcal {N}^{(j)}}^{c}}=\varvec{0}\);

  2. (b)

    Either

$$\begin{aligned} \varvec{v}_{\cdot \mathcal {N}^{(j)}}^{(j)}\ne \varvec{0}, \quad \nabla _{\mathcal {N}^{(j)}}L_n\left( \varvec{v}^{(1)}_{\cdot },\ldots ,\varvec{v}^{(p_n)}_{\cdot }\right) +\lambda _{n}\phi _{j}\frac{\varvec{v}_{\cdot \mathcal {N}^{(j)}}^{(j)}}{\left\| \varvec{v}_{\cdot \mathcal {N}^{(j)}}^{(j)}\right\| _{2}}=\varvec{0}, \end{aligned}$$

or

$$\begin{aligned} \varvec{v}_{\cdot \mathcal {N}^{(j)}}^{(j)}=\varvec{0}, \quad \left\| \nabla _{\mathcal {N}^{(j)}}L_{n}\left( \varvec{v}^{(1)}_{\cdot },\ldots ,\varvec{v}^{(p_n)}_{\cdot }\right) \right\| _{2}\le \lambda _n\phi _{j}. \end{aligned}$$

Here,

$$\begin{aligned} \displaystyle \nabla _{\mathcal {N}^{(j)}}L_n (\varvec{v}^{(1)}_{\cdot },\ldots ,\varvec{v}^{(p_n)}_{\cdot } )=\Biggl [\biggl (\frac{\partial L_n (\varvec{v}^{(1)}_{\cdot },\ldots ,\varvec{v}^{(p_n)}_{\cdot } )}{\partial {\varvec{v}^{(j)}_{2\mathcal {N}^{(j)}}}}\biggr )^{\top }, \ldots ,\biggl (\frac{\partial L_n (\varvec{v}^{(1)}_{\cdot },\ldots ,\varvec{v}^{(p_n)}_{\cdot } )}{\partial {\varvec{v}_{G\mathcal {N}^{(j)}}^{(j)}}}\biggr )^{\top }\Biggr ]^{\top }, \end{aligned}$$

where

$$\displaystyle \frac{\partial L_n (\varvec{v}^{(1)}_{\cdot },\ldots ,\varvec{v}^{(p_n)}_{\cdot } )}{\partial {\varvec{v}^{(j)}_{g\mathcal {N}^{(j)}}}}=(\widehat{\varvec{\Sigma }}\varvec{\theta }_g-\widehat{\varvec{\delta }}_g)_{\mathcal {N}^{(j)}}, \qquad \displaystyle \varvec{\theta }_g=\sum _{j=1}^{p_n}\varvec{v}_g^{(j)}, \qquad 2\le g\le G.$$

This subgradient condition is similar to the Karush-Kuhn-Tucker (KKT) condition for the group Lasso in Yuan and Lin (2006). Invoking the KKT condition, it is easy to show that, if \(\widehat{\varvec{v}}^{(j)}_{\cdot }\) is the solution to (2.7) for each \(1\le j\le p_n\), then \(\widehat{\varvec{v}}^{(j)}_g ~ (2\le g\le G)\) is estimated either as \(\varvec{0}\) or as a sparse vector with support set \(\mathcal {N}^{(j)}\). Therefore, \(\displaystyle \widehat{\varvec{\theta }}_g=\sum ^{p_n}_{j=1}\widehat{\varvec{v}}^{(j)}_g ~ (2\le g\le G)\), estimated by the IG-MSDA method, satisfies the decomposition (2.5).
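The zero/nonzero dichotomy in the subgradient condition can be seen in a one-group toy problem with an identity quadratic term, where the minimizer is an explicit block soft-thresholding. This is a simplified sketch with our own toy objective (not the IG-MSDA objective (2.7)); the threshold role of \(\lambda _n\phi _j\) is played by `lam`:

```python
import numpy as np

def block_soft_threshold(d, lam):
    """Minimizer of 0.5*||v||^2 - d.v + lam*||v||_2 (one group, identity quadratic)."""
    norm = np.linalg.norm(d)
    if norm <= lam:
        return np.zeros_like(d)       # whole group is zeroed out
    return (1.0 - lam / norm) * d     # group survives, shrunk toward zero

def kkt_satisfied(v, d, lam, tol=1e-10):
    """Check the two branches of the subgradient condition for this toy problem."""
    grad = v - d                       # gradient of the smooth part
    if np.linalg.norm(v) > 0:
        return np.allclose(grad + lam * v / np.linalg.norm(v), 0.0, atol=tol)
    return np.linalg.norm(grad) <= lam + tol

d = np.array([3.0, 4.0])               # ||d||_2 = 5
for lam in (1.0, 5.0, 7.0):
    v = block_soft_threshold(d, lam)
    assert kkt_satisfied(v, d, lam)    # both branches of the KKT condition hold
```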

Lemma 1

Suppose that regularity conditions (C1)–(C3) hold. Let \(\delta _{\min }=\min _{j\in \mathcal {A}}\Vert \varvec{\delta }_{\cdot j}\Vert _2\) and \(k_n=|\mathcal {A}|\), and suppose that \(\displaystyle \delta _{\min }\Big /\sqrt{\frac{\log p_n}{n}}\rightarrow \infty \) as \(n\rightarrow \infty \). Then, for a sufficiently large positive constant C,

$$\begin{aligned} \Pr \left( \max _{j\in \mathcal {A}}\phi _j\le \frac{C\sqrt{k_n}}{\delta _{\min }}\right) \longrightarrow 1. \end{aligned}$$

Proof

Note that

$$\begin{aligned}{\begin{matrix} \min _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}\Vert _{2} &{}\ge \min _{j\in \mathcal {A}}\left\| {\varvec{\delta }}_{\cdot j}\right\| _{2}-\max _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}-\varvec{\delta }_{\cdot j}\Vert _{2}\\ &{}=\delta _{\min }-\max _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}-\varvec{\delta }_{\cdot j}\Vert _{2}. \end{matrix}}\end{aligned}$$

From Lemma 1 (ii) in Sheng and Wang (2019), it can be seen that

$$\Pr \left( \max _{g,j}|\widehat{\delta }_{gj}-\delta _{gj}|\le C\sqrt{\frac{\log p_n}{n}}\right) \longrightarrow 1.$$

Then we have

$$\begin{aligned} \max _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}-{\varvec{\delta }}_{\cdot j}\Vert _{2} \le \sqrt{G}\max _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}-{\varvec{\delta }}_{\cdot j}\Vert _{\infty } =\sqrt{G}\max _{g,j}|{\widehat{\delta }_{gj}-\delta _{gj}}| \le C\sqrt{\frac{\log p_n}{n}}. \end{aligned}$$

Since \(\displaystyle \delta _{\min }\Big /\sqrt{\dfrac{\log p_n}{n}}\rightarrow \infty \), it follows that \(\displaystyle \min _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}\Vert _{2}\ge \dfrac{\delta _{\min }}{C}\) with probability tending to one. Furthermore, \(\displaystyle \max _{j\in \mathcal {A}}\phi _j=\max _{j\in \mathcal {A}}\dfrac{\sqrt{|{\mathcal {N}^{(j)}}|}}{\Vert \widehat{\varvec{\delta }}_{\cdot j}\Vert _2}\le \dfrac{\sqrt{k_n}}{\min _{j\in \mathcal {A}}\Vert \widehat{\varvec{\delta }}_{\cdot j}\Vert _2}\le \dfrac{C\sqrt{k_n}}{\delta _{\min }}\). Therefore, Lemma 1 is proved.
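The entrywise concentration rate \(\sqrt{\log p_n/n}\) invoked above can be illustrated by a small Monte Carlo experiment on sample means. This is a synthetic sketch (our own sample sizes and seed; the constant 3 plays the role of C):

```python
import numpy as np

# Illustration of entrywise concentration: the maximum over p coordinates of
# |sample mean - true mean| for N(0,1) data scales like sqrt(log p / n).
rng = np.random.default_rng(1)

def max_mean_error(n, p):
    """Max over p coordinates of |sample mean| for n i.i.d. N(0, 1) rows."""
    X = rng.normal(size=(n, p))
    return np.max(np.abs(X.mean(axis=0)))

for n, p in [(200, 50), (800, 400), (3200, 1000)]:
    ratio = max_mean_error(n, p) / np.sqrt(np.log(p) / n)
    assert ratio < 3.0  # the event in Lemma 1 holds here with C = 3
```

The typical value of `ratio` is close to \(\sqrt{2}\), the classical constant for the maximum of p independent standard normal means.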

Proof of Theorem 1

Under regularity condition (C1), we introduce an oracle estimator, defined as

$$\begin{aligned} \left( \widetilde{\varvec{\psi }}_2,\ldots ,\widetilde{\varvec{\psi }}_G\right) =\arg \min _{{\varvec{\psi }}_{g}\in \mathbb {R}^{k_n}}\Biggl \{{\sum _{g=2}^{G}}\biggl (\frac{1}{2}{\varvec{\psi }}_{g}^{\top }\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\varvec{\psi }_{g}-\widehat{\varvec{\delta }}_{g\mathcal {A}}^{\top }\varvec{\psi }_{g}\biggr )+\lambda _{n} \Vert \varvec{\psi }_{2},\ldots ,\varvec{\psi }_{G}\Vert _{{\mathcal {G_A}},\phi _{\mathcal {A}}}\Biggr \}, \end{aligned}$$
(A.1)

where \(g=2,\ldots ,G\) and \(\mathcal {G_A}\) denotes the subgraph of graph \(\mathcal {G}\) corresponding to \(\mathcal {A}\). In order to prove Theorem 1, the following conclusions need to be proved.

  1. 1)

    \(\underset{j\in \mathcal {A}}{\min }\Vert \widetilde{\varvec{\psi }}_{\cdot j}\Vert _{2}>0\), where \(\widetilde{\varvec{\psi }}_{\cdot j}=(\widetilde{\varvec{\psi }}_{2j},\ldots ,\widetilde{\varvec{\psi }}_{Gj})^{\top }\); For convenience, let \(\mathcal {A}=\{1,2,\ldots ,k_n\}\). By regularity condition (C4), (A.1) is equivalent to \(\displaystyle \widetilde{\varvec{\psi }}_{g}=\sum _{j=1}^{k_n}\widetilde{\varvec{\mu }}_{g}^{(j)}\) and

    $$\begin{aligned} \left( \widetilde{\varvec{\mu }}^{(1)}_{\cdot },\ldots ,\widetilde{\varvec{\mu }}^{(k_n)}_{\cdot }\right) =\arg \min _{\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }}\left\{ Q_n\left( \varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }\right) +\lambda _{n}\sum _{j=1}^{k_n}\phi _{j}\Vert \varvec{\mu }^{(j)}_{\cdot }\Vert _2\right\} , \end{aligned}$$
    (A.2)

    where

    $$\begin{aligned} Q_{n}\left( \varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }\right) =\sum ^{G}_{g=2}\left[ \frac{1}{2}\biggl (\sum _{j=1}^{k_n}\varvec{\mu }^{(j)}_g\biggr )^{\top }\widehat{\varvec{\Sigma }}\biggl (\sum _{j=1}^{k_n}\varvec{\mu }^{(j)}_g\biggr )-\widehat{\varvec{\delta }}^{\top }_g\biggl (\sum _{j=1}^{k_n}\varvec{\mu }_g^{(j)}\biggr )\right] . \end{aligned}$$
    (A.3)

    Here, \(\text {supp}({\varvec{\mu }_{g}}^{(j)})\subseteq \mathcal {N}^{(j)}, ~ 2\le g\le G\), \(\varvec{\mu }_{\cdot }^{(j)}=(\varvec{\mu }_{2}^{(j)\top },\ldots ,\varvec{\mu }_{G}^{(j){\top }})^{\top }\) and \(\displaystyle \varvec{\psi }_{g}=\sum _{j=1}^{k_n}{\varvec{\mu }^{(j)}_{g}}\). Since the objective function defined by (A.2) is convex, from Proposition 1 it is easy to show that its solution satisfies the KKT condition: for all \(j\in \mathcal {A}\),

    1. (a)

      \(\varvec{\mu }_{\cdot {\mathcal {N}^{(j)}}^{c}}^{(j)}=\varvec{0}\);

    2. (b)

      Either

      $$\begin{aligned} \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}\ne \varvec{0}, \quad \nabla _{\mathcal {N}^{(j)}}Q_n\left( \varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }\right) +\lambda _{n}\phi _{j}\frac{\varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}}{\Vert \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}\Vert _{2}}=\varvec{0}, \end{aligned}$$

      or

      $$\begin{aligned} \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}=\varvec{0}, \quad \Vert \nabla _{\mathcal {N}^{(j)}}Q_n\left( \varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot }\right) \Vert _{2}\le \lambda _n\phi _{j}. \end{aligned}$$

      Here,

      $$\begin{aligned} \small \displaystyle \nabla _{\mathcal {N}^{(j)}}L_n(\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot })=\Biggl [\biggl (\frac{\partial Q_n (\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot } )}{\partial {\varvec{\mu }^{(j)}_{2\mathcal {N}^{(j)}}}}\biggr )^{\top }, \ldots ,\biggl (\frac{\partial Q_n (\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot } )}{\partial {\varvec{\mu }_{G\mathcal {N}^{(j)}}^{(j)}}}\biggr )^{\top }\Biggr ]^{\top } \end{aligned}$$

      and

      $$\displaystyle \dfrac{\partial Q_n (\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot } )}{\partial {{\varvec{\mu }}^{(j)}_{g\mathcal {N}^{(j)}}}}=(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\varvec{\psi }_g-\widehat{\varvec{\delta }}_{g\mathcal {A}})_{\mathcal {N}^{(j)}}, \quad \displaystyle \varvec{\psi }_g=\sum _{j\in \mathcal {A}}\varvec{\mu }_g^{(j)}, \quad 2\le g\le G. $$

    Then \(\widetilde{\varvec{\psi }}_g ~ (2\le g\le G)\) can be obtained via \(\displaystyle \widetilde{\varvec{\psi }}_{g}=\sum _{j=1}^{k_n}\widetilde{\varvec{\mu }}_{g}^{(j)}\). Let \(\widehat{B}=\{j:\Vert \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}\Vert _{2}\ne 0\}\). For each \(j\in \widehat{B}\), we have

    $$ \nabla _{\mathcal {N}^{(j)}}Q_n(\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot })=-\lambda _n\phi _{j}\frac{\varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}}{\left\| \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(j)}\right\| _{2}}. $$

    For each \(j\notin \widehat{B}\), we have

    $$ \nabla _{\mathcal {N}^{(j)}}Q_n(\varvec{\mu }^{(1)}_{\cdot },\ldots ,\varvec{\mu }^{(k_n)}_{\cdot })=-\lambda _n\phi _{j}\varvec{Z}_{\cdot \mathcal {N}^{(j)}}^{(j)}, $$

    where \(\varvec{Z}_{\cdot }^{(j)}\) is a \(p\times (G-1)\) matrix with \(\Vert \varvec{Z}_{\cdot \mathcal {N}^{(j)}}^{(j)}\Vert _{2}\le 1\). Because some variables may belong to multiple neighborhoods, the following conditions need to be satisfied:

    1. (i)

      For each \(i_1\in \widehat{B}, i_2\in \widehat{B}\) and \(j\in {\mathcal {N}^{(i_1)}\bigcap \mathcal {N}^{(i_2)}}\), \(\phi _{i_1}\varvec{\mu }_{\cdot j}^{(i_1)}\big /\bigl \Vert {\varvec{\mu }}_{\cdot {\mathcal {N}}^{(j)}}^{(i_1)}\bigr \Vert _{2}=\phi _{i_2}{\varvec{\mu }}_{\cdot j}^{(i_2)}\big /\bigl \Vert {\varvec{\mu }}_{\cdot {\mathcal {N}}^{(j)}}^{(i_2)}\bigr \Vert _{2}\);

    2. (ii)

      For each \(i_1\in \widehat{B},i_2\notin \widehat{B},j\in \mathcal {N}^{(i_1)}\bigcap \mathcal {N}^{(i_2)}\), \(\phi _{i_1}\varvec{\mu }_{\cdot j}^{(i_1)} \big /\Vert \varvec{\mu }_{\cdot \mathcal {N}^{(j)}}^{(i_1)}\Vert _{2}=\phi _{i_2}\varvec{Z}_{\cdot j}^{(i_2)}\);

    3. (iii)

      For each \(i_1\notin \widehat{B},i_2\notin \widehat{B}\) and \(j\in \mathcal {N}^{(i_1)}\bigcap \mathcal {N}^{(i_2)}\), \(\phi _{i_1}\varvec{Z}_{\cdot j}^{(i_1)}=\phi _{i_2}\varvec{Z}_{\cdot j}^{(i_2)}\).

    Then, for each \(i\in \mathcal {A}\), define \(\varvec{f}_{i}=(f_{2i},\ldots ,f_{Gi})^{\top }\) as follows: for \(i\in \widehat{B}\), \(\varvec{f}_{i}=\phi _{i}\varvec{\mu }_{\cdot i}^{(i)}/\Vert \varvec{\mu }_{\cdot \mathcal {N}^{(i)}}^{(i)}\Vert _{2}\), that is, \(f_{gi}=\displaystyle \frac{\phi _{i}\varvec{\mu }_{gi}^{(i)}}{\Vert {\varvec{\mu }}^{(i)}_{\cdot \mathcal {N}^{(i)}}\Vert _{2}}, 2\le g\le G\); for \(i\notin \widehat{B}\), \(\varvec{f}_{i}=\phi _{i}\varvec{Z}_{\cdot i}^{(i)}\), that is, \(f_{gi}=\phi _{i}\varvec{Z}_{gi}^{(i)}, 2\le g\le G\). Then any solution \(\varvec{\psi }_{g}\) satisfies the equation \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\varvec{\psi }_g-\widehat{\varvec{\delta }}_{g\mathcal {A}}=-\lambda _n\varvec{f}_{g\mathcal {A}}\). From the definition of \(\widetilde{\varvec{\psi }}_{g}\), it can be obtained that \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\widetilde{\varvec{\psi }}_g-\widehat{\varvec{\delta }}_{g\mathcal {A}}=-\lambda _n\widetilde{\varvec{f}}_{g\mathcal {A}}\), i.e., \(\widetilde{\varvec{\psi }}_g=\widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-\lambda _n\widetilde{\varvec{f}}_{g\mathcal {A}})\). Here, \(\widetilde{\varvec{f}}_{g\mathcal {A}}\in \mathbb {R}^{k_n}, 2\le g\le G\). To prove conclusion 1), we bound

    $$\begin{aligned}\begin{aligned} \max _{1\le g\le G}\Vert \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\Vert _{2} =&\max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-\lambda _n\widetilde{\varvec{f}}_{g\mathcal {A}})-{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}\varvec{\delta }_{g\mathcal {A}}\Vert _{2}\\ =&\max _{1\le g\le G}\Vert (\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}^{-1}-{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}})\varvec{\delta }_{g\mathcal {A}}+{\widehat{\varvec{\Sigma }}}^{-1}_{\mathcal{A}\mathcal{A}}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-\varvec{\delta }_{g\mathcal {A}})-\lambda _n\widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}\widetilde{\varvec{f}}_{g\mathcal {A}}\Vert _{2}\\ \le&\max _{1\le g\le G}\Vert (\widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}-{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}})\varvec{\delta }_{g\mathcal {A}}\Vert _{2}+\max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}^{-1}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-\varvec{\delta }_{g\mathcal {A}})\Vert _{2} \\&+\max _{1\le g\le G}\Vert \lambda _n\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}^{-1}\widetilde{\varvec{f}}_{g\mathcal {A}}\Vert _{2}\\ =:&I_1+I_2+I_3. \end{aligned}\end{aligned}$$

    For \(I_1\), we have

    $$\begin{aligned} \begin{aligned} I_1&=\max _{1\le g\le G}\Vert ({\widehat{\varvec{\Sigma }}}^{-1}_{\mathcal{A}\mathcal{A}}-{\varvec{\Sigma }}^{{-1}}_{\mathcal{A}\mathcal{A}}){\varvec{\delta }_{g\mathcal {A}}}\Vert _2\\&=\max _{1\le g\le G}\Vert {\widehat{\varvec{\Sigma }}}^{-1}_{\mathcal{A}\mathcal{A}}({\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}-\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}){\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}\varvec{\delta }_{g\mathcal {A}}\Vert _{2}\\&\le \widehat{\rho }^{-1}_n{\rho }^{-1}_n\max _{1\le g\le G}\Vert (\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}-\varvec{\Sigma }_{\mathcal{A}\mathcal{A}})\varvec{\delta }_{g\mathcal {A}}\Vert _{2}\\&\le \widehat{\rho }^{-1}_n\xi _0\left[ k_n\left( \max _{i,j\in \mathcal {A}}|{\widehat{\sigma }_{ij}-\sigma _{ij}}|\max _{1\le g\le G}\Vert \varvec{\delta }_{g\mathcal {A}}\Vert _{1}\right) ^{2}\right] ^{1/2}, \end{aligned}\end{aligned}$$
    (A.4)

    where \(\widehat{\rho }_n\) and \(\rho _n\) denote the minimum eigenvalues of \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\) and \(\varvec{\Sigma }_{\mathcal{A}\mathcal{A}}\), respectively. Since \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}\) is a positive definite matrix, \(\widehat{\rho }_n>0\). From regularity condition (C2) and Theorem 13.5.1 in Anderson (2003), it is easy to show that \(\widehat{\rho }_n^{-1}\) is uniformly bounded. Furthermore, invoking Lemma 1 (i) in Sheng and Wang (2019), we can show that \(\displaystyle \max _{i,j\in \mathcal {A}}|{\widehat{\sigma }_{ij}-\sigma _{ij}}|=O_p(\sqrt{{\log p_n}/{n}})\), and \(\displaystyle \max _{1\le g\le G}\Vert \varvec{\delta }_{g\mathcal {A}}\Vert _{1}\) is uniformly bounded. Therefore, \(I_{1}=O_{p}(\displaystyle \sqrt{{k_n\log p_n}/{n}})\). For \(I_2\), we have

    $$\begin{aligned} \begin{aligned} I_{2}&=\max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}(\widehat{\varvec{\delta }}_{g\mathcal {A}}-{\varvec{\delta }}_{g\mathcal {A}})\Vert _{2} \le \widehat{\rho }^{-1}_n\max _{1\le g\le G}\Vert \widehat{\varvec{\delta }}_{g\mathcal {A}}-{\varvec{\delta }}_{g\mathcal {A}}\Vert _{2}\\&\le \widehat{\rho }^{-1}_n\sqrt{k_n}\max _{1\le g\le G}\Vert \widehat{\varvec{\delta }}_{g\mathcal {A}}-{\varvec{\delta }}_{g\mathcal {A}}\Vert _{\infty }. \end{aligned}\end{aligned}$$
    (A.5)

    From Lemma 1 (ii) in Sheng and Wang (2019), it can be obtained that

    $$\Pr \left( \max _{g,j}|{\widehat{\delta }_{gj}-\delta _{gj}}|\le C\sqrt{\frac{\log p_n}{n}}\right) \longrightarrow 1.$$

    Thus, \(I_{2}=O_{p}(\sqrt{{k_n\log p_n}/{n}})\). For \(I_{3}\), we have

    $$\begin{aligned} I_{3}=\lambda _{n}\max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}^{-1}_{\mathcal{A}\mathcal{A}}\widetilde{\varvec{f}}_{g\mathcal {A}}\Vert _{2} \le \lambda _{n}\widehat{\rho }^{-1}_n\max _{1\le g\le G}\Vert \widetilde{\varvec{f}}_{g\mathcal {A}}\Vert _{2} \le \lambda _{n}\widehat{\rho }^{-1}_n\sqrt{k_n}\max _{j\in \mathcal {A}}\phi _{j}. \end{aligned}$$
    (A.6)

    Invoking Lemma 1, it is easy to show that

    $$\Pr \left( \max _{j\in \mathcal {A}}\phi _j \le \frac{C\sqrt{k_n}}{\delta _{\min }}\right) \longrightarrow 1.$$

    Since \(\lambda _{n}=O_p(\delta _{\min }\sqrt{{\log p_n}/{(nk_n)}})\), it follows that \(I_{3}=O_p(\sqrt{{k_n\log p_n}/{n}})\). From (A.4), (A.5) and (A.6), we can obtain that \(\underset{g}{\max }\Vert \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\Vert _{2}=O_{p}(\sqrt{{k_n\log p_n}/{n}})\). If G is fixed, then

    $$\Vert \widetilde{\varvec{\psi }}_{\cdot \mathcal {A}}-{\varvec{\theta }}_{\cdot \mathcal {A}}\Vert _{2}=\left( \sum _{g=2}^{G}\Vert \widetilde{\varvec{\psi }}_{g}-\varvec{\theta }_{g\mathcal {A}}\Vert ^{2}_{2}\right) ^{{1}/{2}}=O_p\left( \sqrt{\frac{k_n\log p_n}{n}}\right) ,$$

    and

    $$\begin{aligned}{\begin{matrix} \min _{j\in \mathcal {A}}\Vert \widetilde{\varvec{\psi }}_{\cdot j}\Vert _{2} &{}>\min _{j\in \mathcal {A}}\Vert \varvec{\theta }_{\cdot j}\Vert _{2}-\max _{j\in \mathcal {A}}\Vert \widetilde{\varvec{\psi }}_{\cdot j}-\varvec{\theta }_{\cdot j}\Vert _{2}\\ &{}>\min _{j\in \mathcal {A}}\Vert \varvec{\theta }_{\cdot j}\Vert _{2}-\Vert \widetilde{\varvec{\psi }}_{\cdot \mathcal {A}}-\varvec{\theta }_{\cdot \mathcal {A}}\Vert _{2} >\theta _{\min }-C\sqrt{\frac{k_n\log p_n}{n}}. \end{matrix}}\end{aligned}$$

    By condition (C5), \(\displaystyle \min _{j\in \mathcal {A}}\Vert \widetilde{\varvec{\psi }}_{\cdot j}\Vert _2>{\theta _{\min }}/{2}>0\). Thus, conclusion 1) is proved.
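The bound on \(I_1\) in (A.4) rests on the matrix identity \(\widehat{\varvec{\Sigma }}^{-1}-\varvec{\Sigma }^{-1}=\widehat{\varvec{\Sigma }}^{-1}(\varvec{\Sigma }-\widehat{\varvec{\Sigma }})\varvec{\Sigma }^{-1}\), which can be verified numerically. The matrices below are synthetic stand-ins for \(\varvec{\Sigma }_{\mathcal{A}\mathcal{A}}\) and its estimate (our own illustrative construction):

```python
import numpy as np

# Numerical check of the identity:
#   Sigma_hat^{-1} - Sigma^{-1} = Sigma_hat^{-1} (Sigma - Sigma_hat) Sigma^{-1}.
rng = np.random.default_rng(2)
k = 5
A = rng.normal(size=(k, k))
Sigma = A @ A.T + k * np.eye(k)          # positive definite "population" covariance
E = 0.01 * rng.normal(size=(k, k))
Sigma_hat = Sigma + (E + E.T) / 2        # small symmetric perturbation ("sample")

lhs = np.linalg.inv(Sigma_hat) - np.linalg.inv(Sigma)
rhs = np.linalg.inv(Sigma_hat) @ (Sigma - Sigma_hat) @ np.linalg.inv(Sigma)
assert np.allclose(lhs, rhs)             # identity holds up to floating-point error
```

The identity is exact, so the check passes up to floating-point rounding; in the proof it converts the difference of inverses into a term controlled by \(\widehat{\varvec{\Sigma }}_{\mathcal{A}\mathcal{A}}-\varvec{\Sigma }_{\mathcal{A}\mathcal{A}}\).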

  2. 2)

    For all \(g ~ (2\le g\le G)\), \(\widetilde{\varvec{\theta }}_{g}=(\widetilde{\varvec{\psi }}_{g}^{\top },\varvec{0}^{\top })^{\top }\) is the solution obtained by minimizing the objective function (2.9). As shown in Section 2, (2.7) and (2.9) are equivalent optimization problems. From Proposition 1, the optimization problem defined by (2.7) satisfies the KKT condition. For all \(j\in {\mathcal {A}}^{c}\), let \(\widetilde{\varvec{v}}_{g}^{(j)}=\varvec{0}\); for all \(j\in \mathcal {A}\) and \(2\le g\le G\), let \(\widetilde{\varvec{v}}^{(j)}_{g\mathcal {A}}=\widetilde{\varvec{\mu }}^{(j)}_{g}\) and \(\widetilde{\varvec{v}}^{(j)}_{g\mathcal {A}^{c}}=\varvec{0}\); then \(\widetilde{\varvec{\theta }}_{g\mathcal {A}}=\widetilde{\varvec{\psi }}_{g}\) and \(\widetilde{\varvec{\theta }}_{g\mathcal {A}^{c}}=\varvec{0}\). For \(j\in \mathcal {A}\), by the definition of \(\widetilde{\varvec{\psi }}_{g}\), the KKT condition of (2.9) is satisfied. For \(j\in \mathcal {A}^{c}\), we need to show that

    $$\Pr \left\{ \forall j\in \mathcal {A}^{c},\left( \sum _{g=2}^{G}\Vert (\widehat{\varvec{\Sigma }}\varvec{\theta }_{g}-\widehat{\varvec{\delta }}_{g})_{\mathcal {N}^{(j)}}\Vert _2^{2}\right) ^{{1}/{2}}\le \lambda _n\phi _{j}\right\} \longrightarrow 1.$$

    Equivalently, it suffices to show that

    $$\Pr \left\{ \exists j\in \mathcal {A}^{c}, \left( \sum _{g=2}^{G}\Vert (\widehat{\varvec{\Sigma }}\varvec{\theta }_g-\widehat{\varvec{\delta }}_g)_{\mathcal {N}^{(j)}}\Vert _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _j\right\} \longrightarrow 0.$$

    Thus, we have

    $$\begin{aligned}{\begin{matrix} &{}\Pr \left\{ \exists j\in {\mathcal {A}^{c}}, \left( \sum _{g=2}^{G}\left\| \left( \widehat{\varvec{\Sigma }}\varvec{\theta }_g-\widehat{\varvec{\delta }}_{g}\right) _{\mathcal {N}^{(j)}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _j\right\} \\ \le &{} \Pr \left\{ \left( \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\widetilde{\varvec{\psi }}_g-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\min _{j\in \mathcal {A}^{c}}\phi _j\right\} \\ =&{}\Pr \left\{ \left( \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\left( \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\right) +\widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _{*}\right\} \\ =&{}\Pr \left\{ \left( \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _{*}-\left( \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\left( \widetilde{\varvec{\psi }}_{g}-\varvec{\theta }_{g\mathcal {A}}\right) \right\| _{2}^{2}\right) ^{{1}/{2}}\right\} , \end{matrix}}\end{aligned}$$

    where \(\phi _{*}=\min _{j\in \mathcal {A}^{c}}\phi _{j}\).

    Furthermore, we have

    $$\begin{aligned} {\begin{matrix} \max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}(\widetilde{\varvec{\psi }}_{g}-\varvec{\theta }_{g\mathcal {A}})\Vert _{2} &{}\le \max _{1\le g\le G}\Vert \widehat{\varvec{\Sigma }}(\widetilde{\varvec{\theta }}_g-\varvec{\theta }_{g})\Vert _{2}\\ &{}\le \bar{\rho }_n\max _{1\le g\le G}\Vert \widetilde{\varvec{\theta }}_{g}-\varvec{\theta }_{g}\Vert _{2} =\bar{\rho }_n\max _{1\le g\le G}\Vert \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\Vert _{2}, \end{matrix}} \end{aligned}$$

    where \(\bar{\rho }_n\) denotes the largest eigenvalue of \(\widehat{\varvec{\Sigma }}\).

    Since the eigenvalues of \(\widehat{\varvec{\Sigma }}\) are bounded in probability by regularity condition (C2) and Theorem 13.5.1 in Anderson (2003), and \(\displaystyle \max _{1\le g\le G}\Vert \widetilde{\varvec{\psi }}_g-\varvec{\theta }_{g\mathcal {A}}\Vert _{2}=O_p(\sqrt{{k_n\log p_n}/{n}})\), we have

    $$\begin{aligned} \max _{1\le g\le G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\left( \widetilde{\varvec{\psi }}_{g}-\varvec{\theta }_{g\mathcal {A}}\right) \right\| _{2}=O_p\left( \sqrt{\frac{k_n\log p_n}{n}}\right) . \end{aligned}$$
    (A.7)

    Further, it is easy to show that

    $$\begin{aligned}{\begin{matrix} &{}\max _{1\le g\le G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2} =\max _{1\le g\le G}\left\| \left( \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}-\varvec{\Sigma }_{\mathcal {A}^{c}\mathcal {A}}\right) \varvec{\theta }_{g\mathcal {A}}-\left( \widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}-\varvec{\delta }_{g\mathcal {A}^{c}}\right) \right\| _{2}\\ \le &{}\max _{1\le g\le G}\left\| \left( \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}-\varvec{\Sigma }_{\mathcal {A}^{c}\mathcal {A}}\right) \varvec{\theta }_{g\mathcal {A}}\right\| _{2}+\max _{1\le g\le G}\left\| \widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}-\varvec{\delta }_{g\mathcal {A}^{c}}\right\| _{2}\\ =&{}\left\{ (p_n-k_n)\left( \max _{i\in \mathcal {A}^{c},j\in \mathcal {A}}|{\widehat{\sigma }_{ij}-\sigma _{ij}}|\left\| \varvec{\theta }_{g\mathcal {A}}\right\| _{1}\right) ^{2}\right\} ^{\frac{1}{2}}+\sqrt{p_n-k_n}\max _{1\le g\le G,j\in \mathcal {A}^{c}}|{\widehat{\delta }_{gj}-\delta _{gj}}|. \end{matrix}}\end{aligned}$$

    Again invoking Lemma 1(i) and (ii) in Sheng and Wang (2019) and regularity conditions (C1)-(C3), we can show that,

    $$\begin{aligned} \Pr \left( \max _{i\in \mathcal {A}^{c},j\in \mathcal {A}}|{\widehat{\sigma }_{ij}-\sigma _{ij}}|\le C\sqrt{\frac{\log p_n}{n}}\right) \longrightarrow 1 \end{aligned}$$

    and

    $$\begin{aligned} \Pr \left( \max _{g,j\in \mathcal {A}^{c}}|{\widehat{\delta }_{gj}-\delta _{gj}}|\le C\sqrt{\frac{\log p_n}{n}}\right) \longrightarrow 1. \end{aligned}$$
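These two maximal inequalities say the entrywise errors concentrate at the \(\sqrt{\log p_n/n}\) scale. A Monte Carlo sketch (taking \(\varvec{\Sigma }=\varvec{I}\), with dimensions chosen only for illustration) shows that the ratio of the maximal covariance error to \(\sqrt{\log p/n}\) stays roughly bounded as \(n\) and \(p\) grow, consistent with the stated rate:

```python
import numpy as np

rng = np.random.default_rng(1)

def max_cov_error(n, p, reps=50):
    """Monte Carlo mean of max_ij |sigma_hat_ij - sigma_ij| when Sigma = I."""
    errs = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        S = X.T @ X / n  # sample covariance (population mean is zero)
        errs.append(np.abs(S - np.eye(p)).max())
    return float(np.mean(errs))

# If the concentration result holds, these ratios stay bounded
# (empirically around 2 in this setup) rather than growing with n and p.
for n, p in [(100, 20), (400, 50), (1600, 100)]:
    print(n, p, max_cov_error(n, p) / np.sqrt(np.log(p) / n))
```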

    Then we have

    $$\begin{aligned} \Pr \left( \max _{g}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2} \le C\sqrt{\frac{\left( p_n-k_n\right) \log p_n}{n}}\right) \longrightarrow 1. \end{aligned}$$
    (A.8)

By (A.7), (A.8), Chebyshev’s inequality and \(\displaystyle \lambda _n\phi _{*}\Big /\sqrt{k_n\log p_n/n}\rightarrow \infty \), we have

    $$\begin{aligned}{\begin{matrix} \Pr \left\{ \exists j\in {\mathcal {A}^{c}}, \left( \sum _{g=2}^{G}\left\| \left( \widehat{\varvec{\Sigma }}\varvec{\theta }_g-\widehat{\varvec{\delta }}_{g}\right) _{\mathcal {N}^{(j)}}\right\| _{2}^{2}\right) ^{{1}/{2}}>\lambda _n\phi _j\right\} &{}\le \frac{\mathbb {E}\left( \displaystyle \sum _{g=2}^{G}\left\| \widehat{\varvec{\Sigma }}_{\mathcal {A}^{c}\mathcal {A}}\varvec{\theta }_{g\mathcal {A}}-\widehat{\varvec{\delta }}_{g\mathcal {A}^{c}}\right\| _{2}^{2}\right) }{(\lambda _n\phi _{*})^{2}}\\ &{}\le \frac{C''\left( G-1\right) \left( p_n-k_n\right) \log p_n}{\lambda ^{2}_n\phi ^2_{*}n}. \end{matrix}}\end{aligned}$$

    By \(\displaystyle \lambda ^2_n\phi ^2_{*}n / \left[ \left( p_n-k_n\right) \log p_n \right] \rightarrow \infty \), we can show that conclusion 2) holds.

Summarizing the above results, we complete the proof of Theorem 1.

Proof of Theorem 2

Let \(R_n\) and \(R^\text {Bayes}\) denote the conditional misclassification rates of IG-MSDA and the Bayes rule, respectively. For a sufficiently large constant \(h\), let \(\eta _0=h({k_n\log p_n}/{n})^{{1}/{3}}\). Similar to the proof of Theorem 2 in Mai and Zou (2015), we have

$$\begin{aligned} \begin{aligned} |R_n-R^\text {Bayes}| \le&\Pr \left( \Big |{D_g^\text {Bayes}(\varvec{x})-D_k^\text {Bayes}(\varvec{x})}\Big |\le \eta _0,\exists g\ne k\right) \\&+\Pr \left( \Big |{\widehat{D}_{g}\left( \varvec{x}\right) -D^\text {Bayes}_{g}\left( \varvec{x}\right) }\Big |\ge \frac{\eta _0}{2},\exists g\big |~(\varvec{x}_i,y_i),i=1,\ldots ,n\right) \\ =:&A_1+A_2. \end{aligned}\end{aligned}$$
(A.9)

For an observation \(\varvec{x}\), \(D_g^\text {Bayes}(\varvec{x})-D_k^\text {Bayes}(\varvec{x})\) follows a normal distribution with variance \(\Delta \). Since \(G\) is fixed, by regularity condition (C5), for a sufficiently large positive number \(M\) we have

$$\begin{aligned} \begin{aligned} A_1 \le&\sum _{g'=1}^{G}\Pr \left( \Big |{D_g^\text {Bayes}(\varvec{x})-D_k^\text {Bayes}(\varvec{x})}\Big |\le \eta _0\big |Y=g'\right) \pi _{g'} \\ \le&\frac{MG^2}{\Delta }\eta _0 \le M\left( \frac{k_n\log p_n}{n}\right) ^{{1}/{3}}. \end{aligned} \end{aligned}$$
(A.10)
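One generic fact behind the first inequality in (A.10): the density of a \(\mathcal {N}(m,\Delta )\) variable is bounded by \(1/\sqrt{2\pi \Delta }\), so any interval of length \(2\eta _0\) carries probability at most \(2\eta _0/\sqrt{2\pi \Delta }\), uniformly in the mean \(m\); the constant \(M\) and condition (C5) absorb the dependence on \(\Delta \). A quick standalone check of this anti-concentration bound, evaluated exactly via the normal CDF (illustrative values only):

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_in_band(mu, var, eta):
    """P(|Z| <= eta) for Z ~ N(mu, var)."""
    sd = math.sqrt(var)
    return normal_cdf((eta - mu) / sd) - normal_cdf((-eta - mu) / sd)

# Density bound: P(|Z| <= eta) <= 2*eta / sqrt(2*pi*var), uniformly over the mean mu.
for mu in (-1.0, 0.0, 0.5):
    for var in (0.5, 1.0, 4.0):
        for eta in (0.01, 0.1, 0.3):
            assert prob_in_band(mu, var, eta) <= 2 * eta / math.sqrt(2 * math.pi * var) + 1e-12
print("anti-concentration bound holds")
```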

For \(A_2\), we have

$$\begin{aligned} \widehat{D}_{g}(\varvec{x})-D^\text {Bayes}_{g}(\varvec{x})\big |\left( Y=g',(\varvec{x}_i,y_i),i=1,\ldots ,n\right) \sim \mathcal {N}\left( \mu _{gg'},(\widehat{\varvec{\theta }}_{g}-\varvec{\theta }_{g})^{\top }\varvec{\Sigma }(\widehat{\varvec{\theta }}_{g}-{\varvec{\theta }}_g)\right) , \end{aligned}$$

where

$$ \displaystyle \mu _{gg'}=\log \frac{\widehat{\pi }_g}{\pi _{g}}-\log \frac{\widehat{\pi }_{1}}{\pi _1}+(\widehat{\varvec{\theta }}_g-\varvec{\theta }_{g})^{\top }\varvec{\mu }_{g'}+\frac{1}{2}(\varvec{\mu }_{1}+\varvec{\mu }_{g})^{\top }\varvec{\theta }_g-\frac{1}{2}(\widehat{\varvec{\mu }}_1+\widehat{\varvec{\mu }}_g)^{\top }\widehat{\varvec{\theta }}_{g}. $$

For \(g,g'=1,\ldots ,G\), we have

$$\begin{aligned} \begin{aligned} \left| \mu _{gg'} \right| \le&\Big |\log \frac{\widehat{\pi }_g}{\pi _{g}}\Big |+\Big |\log \frac{\widehat{\pi }_1}{\pi _{1}}\Big |+\Big |(\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)^{\top }\varvec{\mu }_{g'}\Big |+\frac{1}{2}\Big |(\varvec{\mu }_{1}+\varvec{\mu }_{g})^{\top }(\varvec{\theta }_{g}-\widehat{\varvec{\theta }}_{g})\Big |\\&+\frac{1}{2}\Big |(\varvec{\mu }_1+\varvec{\mu }_g-\widehat{\varvec{\mu }}_1-\widehat{\varvec{\mu }}_g)^{\top }(\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)\Big |+\frac{1}{2}\Big |(\varvec{\mu }_1+\varvec{\mu }_g-\widehat{\varvec{\mu }}_1-\widehat{\varvec{\mu }}_g)^{\top }\varvec{\theta }_{g}\Big | \end{aligned} \end{aligned}$$

in probability.

By a Taylor expansion and the conditions of Theorems 1 and 2, for any \(g,g'=1,\ldots ,G\), we have

$$\begin{aligned} \left| \log \widehat{\pi }_g-\log \pi _g\right| \le \frac{M}{\sqrt{n}}, \end{aligned}$$
$$\begin{aligned} \left| (\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)^{\top }\varvec{\mu }_{g'}\right| \le \max _{1\le g'\le G}\Vert \varvec{\mu }_{g'}\Vert _{2}\Vert \widehat{\varvec{\theta }}-\varvec{\theta }\Vert _2\le M\sqrt{\frac{k_n\log p_n}{n}}, \end{aligned}$$

and

$$\begin{aligned} \frac{1}{2}\Big |(\varvec{\mu }_{1}+\varvec{\mu }_{g})^{\top }(\varvec{\theta }_g-\widehat{\varvec{\theta }}_g)\Big |\le \max _{1\le g\le G}\Vert \varvec{\mu }_g\Vert _{2}\Vert \widehat{\varvec{\theta }}-\varvec{\theta }\Vert _{2}\le M\sqrt{\frac{k_n\log p_n}{n}} \end{aligned}$$

with probability tending to one.

Furthermore, we have

$$\begin{aligned} \frac{1}{2}\Big |(\varvec{\mu }_1+\varvec{\mu }_g-\widehat{\varvec{\mu }}_1-\widehat{\varvec{\mu }}_g)^{\top }(\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)\Big |\le \max _{1\le g\le G}\Vert \varvec{\mu }_{g}-\widehat{\varvec{\mu }}_g\Vert _{2}\Vert \widehat{\varvec{\theta }}-\varvec{\theta }\Vert _{2}\le M\frac{k_n\log p_n}{n} \end{aligned}$$

and

$$\begin{aligned} \frac{1}{2}\Big |(\varvec{\mu }_1+\varvec{\mu }_g-\widehat{\varvec{\mu }}_1-\widehat{\varvec{\mu }}_g)^{\top }\varvec{\theta }_{g}\Big |\le \max _{1\le g\le G}\Vert \varvec{\mu }_g-\widehat{\varvec{\mu }}_g\Vert _2\max _{1\le g\le G}\Vert \varvec{\theta }_{g}\Vert _2\le M\sqrt{\frac{k_n\log p_n}{n}}. \end{aligned}$$

Since \({k_n\log p_n}/{n}\rightarrow 0\), there exists a sufficiently large positive constant \(M\) such that, for \(g,g'=1,\ldots ,G\),

$$\begin{aligned} \Pr \left\{ \left| \mu _{gg'}\right| \le M\sqrt{\frac{k_n\log p_n}{n}}\right\} \longrightarrow 1. \end{aligned}$$
(A.11)

Therefore, there exists a sufficiently large positive constant \(h\) such that \(\eta _0/3>M\sqrt{{k_n\log p_n}/{n}}\), and thus \(\Pr (\left| \mu _{gg'}\right| <\eta _0/3)\rightarrow 1\) for \(g,g'=1,\ldots ,G\). In addition, by Chebyshev’s inequality, we have

$$\begin{aligned} \Pr \left( \left| \widehat{D}_g(\varvec{X})-D^\text {Bayes}_{g}(\varvec{X})\right| \ge \frac{\eta _0}{2}\big |Y=g',(\varvec{x}_i,y_i),i=1,\ldots ,n\right) \le \frac{2(\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)^{\top }\varvec{\Sigma }(\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)}{(\eta _0/2-\mu _{gg'})^{2}}. \end{aligned}$$
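The display above is Chebyshev’s inequality applied after centering at \(\mu _{gg'}\), up to a constant. The underlying generic bound, \(\Pr (|X|\ge t)\le \sigma ^2/(t-|\mu |)^2\) for \(X\sim \mathcal {N}(\mu ,\sigma ^2)\) and \(t>|\mu |\), can be checked exactly via the normal CDF (a standalone sketch with illustrative values):

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tail_prob(mu, var, t):
    """P(|X| >= t) for X ~ N(mu, var)."""
    sd = math.sqrt(var)
    return normal_cdf((-t - mu) / sd) + 1.0 - normal_cdf((t - mu) / sd)

# Chebyshev after centering: P(|X| >= t) <= var / (t - |mu|)^2 whenever t > |mu|.
for mu in (0.0, 0.2, -0.5):
    for var in (0.3, 1.0, 2.0):
        for t in (1.5, 2.0, 3.0):
            assert tail_prob(mu, var, t) <= var / (t - abs(mu)) ** 2 + 1e-12
print("Chebyshev tail bound holds")
```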

From regularity condition (C1) and Theorem 1, for \(g=1,\ldots ,G\), we can show that

$$\begin{aligned} \Pr \left( (\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)^{\top }\varvec{\Sigma }(\widehat{\varvec{\theta }}_g-\varvec{\theta }_g)\le M\frac{k_n\log p_n}{n}\right) \longrightarrow 1. \end{aligned}$$

Then, we have

$$\begin{aligned} {\begin{matrix} A_2 &{}=\Pr \biggl (\left| \widehat{D}_{g}(\varvec{X})-D^\text {Bayes}_{g}(\varvec{X})\right| \ge \frac{\eta _0}{2},\exists g\big |(\varvec{x}_i,y_i),i=1,\ldots ,n\biggr )\\ &{}\le \sum _{g'=1}^{G}\pi _{g'}\Pr \biggl (\left| \widehat{D}_{g}(\varvec{X})-D^\text {Bayes}_{g}(\varvec{X})\right| \ge \frac{\eta _0}{2}\Big |Y=g', (\varvec{x}_i,y_i),i=1,\ldots ,n\biggr )\\ &{}\le M\max _{g,g'}\frac{(\widehat{\varvec{\theta }}_g-\varvec{\theta }_{g})^{\top }\varvec{\Sigma }(\widehat{\varvec{\theta }}_g-\varvec{\theta }_{g})}{(\eta _0/2-\mu _{gg'})^2}\le M\left( \frac{k_n\log p_n}{n}\right) ^{1/3} \end{matrix}}\end{aligned}$$
(A.12)

holds with probability tending to 1. Summarizing the above results, we have \(|R_n-R^\text {Bayes}|=O_p(({k_n\log p_n}/{n})^{1/3})\), which completes the proof of Theorem 2.
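To get a feel for the \(({k_n\log p_n}/{n})^{1/3}\) convergence rate, the following sketch evaluates the bound under one hypothetical growth regime, \(k_n=n^{1/4}\) and \(p_n=n\) (these choices are illustrative only, not assumptions of the theorem):

```python
import math

# Hypothetical growth regime, for illustration only: k_n = n^{1/4}, p_n = n.
def rate(n):
    k_n, p_n = n ** 0.25, float(n)
    return (k_n * math.log(p_n) / n) ** (1.0 / 3.0)

for n in (10**2, 10**4, 10**6, 10**8):
    print(n, rate(n))  # the bound shrinks toward zero as n grows
```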

Cite this article

Luo, J., Li, X., Yu, C. et al. Multiclass Sparse Discriminant Analysis Incorporating Graphical Structure Among Predictors. J Classif 40, 614–637 (2023). https://doi.org/10.1007/s00357-023-09451-1
