
Heterogeneous analysis for clustered data using grouped finite mixture models

  • Original Paper
  • Published in: Statistics and Computing

Abstract

It is common to observe substantial heterogeneity in clustered data across scientific fields. Cluster-wise conditional distributions are widely used to explore variations and relationships within and among clusters. This paper aims to capture such heterogeneity by employing cluster-wise finite mixture models. To address the heterogeneity among clusters, we introduce a latent group structure and incorporate heterogeneous mixing proportions across different groups, accommodating the diverse characteristics observed in the data. The number of groups and their memberships are unknown. To identify the latent group structure, we apply concave penalty functions to the pairwise differences of preliminary consistent estimators of the mixing proportions. This approach enables the automatic division of clusters into finitely many subgroups. Theoretical results demonstrate that as the number of clusters and the cluster sizes tend to infinity, the true latent group structure is recovered with probability tending to one, and the post-classification estimators exhibit oracle efficiency. We support the performance and applicability of the proposed approach through extensive simulations and an analysis of basic consumption expenditure among urban households in China.



Acknowledgements

We are very grateful to the Editor, the Associate Editor, and the referees for their insightful comments and suggestions, which have significantly improved the manuscript, and to our financial sponsors for their support.

Funding

This work was supported by the National Natural Science Foundation of China [Grant numbers 11690012, 11631003, 12226003].

Author information

Authors and Affiliations

Authors

Contributions

M.W. (corresponding author) conceived the study and acquired the funding; L.C. developed the methodology and carried out the analysis. All authors jointly wrote and reviewed the manuscript.

Corresponding author

Correspondence to Wenqing Ma.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Appendix

Proof of Theorem 1

Define

$$\begin{aligned} {{\widetilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }})=\mathop {\arg \max }\limits _{{\varvec{\pi }}_{i}}\sum \limits _{j=1}^{n_{i}}\log {f_{i}(y_{ij}|{\varvec{x}}_{ij};{\varvec{\pi }}_{i},{\varvec{\theta }})}, \end{aligned}$$
(A.1)
$$\begin{aligned}{{\widetilde{{\varvec{\theta }}}}}&=\mathop {\arg \max }\limits _{{\varvec{\theta }}}\sum \limits _{i=1}^{m}\sum \limits _{j=1}^{n_{i}}\log {f_{i}(y_{ij}|{\varvec{x}}_{ij};{{\widetilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }}),{\varvec{\theta }})},\\ {{\widetilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }}^{0})&=\mathop {\arg \max }\limits _{{\varvec{\pi }}_{i}}\sum \limits _{j=1}^{n_{i}}\log {f_{i}(y_{ij}|{\varvec{x}}_{ij};{\varvec{\pi }}_{i},{\varvec{\theta }}^{0})},\\ {\varvec{\pi }}^{0}_{i}&=\mathop {\arg \max }\limits _{{\varvec{\pi }}_{i}}\frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}{\textrm{E}}\{\log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}_{i},{\varvec{\theta }}^{0})}\}.\end{aligned}$$

For \(i=1,\ldots ,m\), let

$$\begin{aligned} {\varvec{\pi }}_{i}({\varvec{\theta }})=\mathop {\arg \max }\limits _{{\varvec{\pi }}_{i}}\frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}{\textrm{E}}\{\log {f_{i}(y_{ij}|{\varvec{x}}_{ij};{\varvec{\pi }}_{i},{\varvec{\theta }})}\}, \end{aligned}$$

then we have \({\varvec{\pi }}^{0}_{i}={\varvec{\pi }}_{i}({\varvec{\theta }}^{0})\). Because the unknown parameters comprise \({\varvec{\theta }}\) and \({\varvec{\pi }}\), the consistency argument for the preliminary estimators splits into two parts. First, we show that the preliminary estimator \({{\widetilde{{\varvec{\theta }}}}}\) is consistent as the number of clusters m and the minimum cluster size \(n_{1}\) tend to infinity.
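
To fix ideas, the profile estimator \({{\widetilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }})\) in (A.1) can be computed cluster by cluster: with \({\varvec{\theta }}\) held fixed, maximizing the per-cluster log-likelihood over the mixing proportions is a weight-only EM iteration. Below is a minimal Python sketch, assuming Gaussian components with known parameters; the function name and toy data are ours, purely for illustration.

```python
import numpy as np
from scipy.stats import norm

def profile_weights(y, means, sds, n_iter=500, tol=1e-10):
    """Maximize sum_j log f(y_j | pi, theta) over the mixing proportions pi,
    holding the component parameters theta = (means, sds) fixed."""
    dens = np.column_stack([norm.pdf(y, m, s) for m, s in zip(means, sds)])
    pi = np.full(len(means), 1.0 / len(means))
    for _ in range(n_iter):
        resp = dens * pi                       # E-step: responsibilities
        resp /= resp.sum(axis=1, keepdims=True)
        pi_new = resp.mean(axis=0)             # M-step: update the weights only
        if np.max(np.abs(pi_new - pi)) < tol:
            return pi_new
        pi = pi_new
    return pi

# one cluster: n_i = 500 draws from 0.7 N(0,1) + 0.3 N(4,1)
rng = np.random.default_rng(0)
z = rng.random(500) < 0.7
y = np.where(z, rng.normal(0.0, 1.0, 500), rng.normal(4.0, 1.0, 500))
print(profile_weights(y, means=[0.0, 4.0], sds=[1.0, 1.0]))  # approx (0.7, 0.3)
```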

(1). Note that

$$\begin{aligned}Q({\varvec{\theta }})=\frac{1}{N}\sum \limits _{i=1}^{m}\sum \limits _{j=1}^{n_{i}}\log {f(y_{ij}|{\varvec{x}}_{ij},{\tilde{{\varvec{\pi }}}}_{i}({\varvec{\theta }}),{\varvec{\theta }})},\end{aligned}$$

and we take a Taylor expansion of \(Q({\tilde{{\varvec{\theta }}}})\) around \({\varvec{\theta }}^{0}\):

$$\begin{aligned} \begin{aligned} Q({\tilde{{\varvec{\theta }}}})-Q({\varvec{\theta }}^{0})&=({\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0})^{\textrm{T}}\frac{\partial Q({\varvec{\theta }})}{\partial {\varvec{\theta }}}\big |_{{\varvec{\theta }}={\varvec{\theta }}^{0}}\\&\quad +\frac{1}{2}({\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0})^{\textrm{T}}\frac{\partial ^{2} Q({\varvec{\theta }})}{\partial {\varvec{\theta }}\partial {\varvec{\theta }}^{\textrm{T}}}\big |_{{\varvec{\theta }}={\varvec{\theta }}^{0}}({\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0})\\&\quad +({\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0})^{\textrm{T}}o_{p}(1)({\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0}), \end{aligned} \end{aligned}$$

Since \({{\tilde{{\varvec{\theta }}}}}=\mathop {\arg \max }\limits _{{\varvec{\theta }}}Q({\varvec{\theta }})\), we have \(Q({\tilde{{\varvec{\theta }}}})-Q({\varvec{\theta }}^{0})\ge 0\). In addition,

$$\begin{aligned} \begin{aligned}&\frac{\partial Q({\varvec{\theta }})}{\partial {\varvec{\theta }}}|_{{\varvec{\theta }}={\varvec{\theta }}^{0}}\\&\quad =\frac{1}{N}\sum \limits _{i=1}^{m}\sum \limits _{j=1}^{n_{i}}\frac{\partial \log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})}}{\partial {\varvec{\theta }}}\\&\quad \quad + \frac{1}{N}\sum \limits _{i=1}^{m}\sum \limits _{j=1}^{n_{i}} \left\{ \frac{\partial \log {f(y_{ij}|{\varvec{x}}_{ij},{{\tilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})}}{\partial {\varvec{\theta }}} \right. \\&\left. \qquad - \frac{\partial \log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})}}{\partial {\varvec{\theta }}}\right\} \\&\quad = S_{1}+S_{2}, \end{aligned} \end{aligned}$$

Since \({\textrm{E}}(S_{1})=0\) and \(\textrm{tr}\{\textrm{var}(S_{1})\}=O(N^{-1})\), we know that \(S_{1}=O_{p}(N^{-1/2})\). For \(S_{2}\), by the Cauchy–Schwarz inequality, we have

$$\begin{aligned} \begin{aligned} \Vert S_{2}\Vert _{2}&\le \frac{1}{N}\sum \limits _{i=1}^{m}\sum \limits _{j=1}^{n_{i}}\Vert \frac{\partial ^{2}\log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})}}{\partial {\varvec{\theta }}\partial {\varvec{\pi }}_{i}}\Vert _{2}\\&\quad \Vert {{\tilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }}^{0})-{\varvec{\pi }}_{i}({\varvec{\theta }}^{0})\Vert _{2}\\&\le \left\{ \frac{1}{m}\sum \limits _{i=1}^{m}(\frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}M_{2}(y_{ij},{\varvec{x}}_{ij})^{2})\right\} ^{1/2}\\&\quad \left( \frac{1}{m}\sum \limits _{i=1}^{m}\Vert {{\tilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }}^{0})-{\varvec{\pi }}_{i}({\varvec{\theta }}^{0})\Vert ^{2}_{2}\right) ^{1/2}. \end{aligned} \end{aligned}$$

Following (A.1), we have

$$\begin{aligned} 0&=\frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}\frac{\partial \log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}_{i},{\varvec{\theta }})}}{\partial {\varvec{\pi }}_{i}}\big |_{{\varvec{\pi }}_{i} ={{\tilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }})}\\&=\frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}\frac{\partial \log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}_{i},{\varvec{\theta }})}}{\partial {\varvec{\pi }}_{i}}\big |_{{\varvec{\pi }}_{i}={\varvec{\pi }}_{i}({\varvec{\theta }})}\\&\quad +\frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}\frac{\partial ^{2} \log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}_{i},{\varvec{\theta }})}}{\partial {\varvec{\pi }}_{i}\partial {\varvec{\pi }}^{\textrm{T}}_{i}}\big |_{{\varvec{\pi }}_{i}={\check{{\varvec{\pi }}}}_{i}({\varvec{\theta }})}\\&\quad \quad ({{\tilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }})-{\varvec{\pi }}_{i}({\varvec{\theta }})), \end{aligned}$$

where \({\check{{\varvec{\pi }}}}_{i}({\varvec{\theta }})\) lies between \({{\tilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }})\) and \({\varvec{\pi }}_{i}({\varvec{\theta }})\). For \(i=1,\ldots ,m\), let

$$\begin{aligned}G_{1i}=\frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}\frac{\partial \log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}_{i},{\varvec{\theta }})}}{\partial {\varvec{\pi }}_{i}},\end{aligned}$$

and

$$\begin{aligned} G_{2i}=\frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}\frac{\partial ^{2} \log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}_{i},{\varvec{\theta }})}}{\partial {\varvec{\pi }}_{i}\partial {\varvec{\pi }}^{\textrm{T}}_{i}},\end{aligned}$$

thus, we obtain \({{\tilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }}^{0})-{\varvec{\pi }}_{i}({\varvec{\theta }}^{0})=-G^{-1}_{2i}({\check{{\varvec{\pi }}}}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})G_{1i}({\varvec{\pi }}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})\). Then,

$$\begin{aligned} \begin{aligned}&\frac{1}{m}\sum \limits _{i=1}^{m}\Vert {{\widetilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }}^{0})-{\varvec{\pi }}_{i}({\varvec{\theta }}^{0})\Vert ^{2}_{2}\\&\quad \le \max \limits _{i}\{\lambda ^{-2}_{min}(G_{2i}({\check{{\varvec{\pi }}}}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0}))\} \frac{1}{m}\sum \limits _{i=1}^{m}\Vert G_{1i}({\varvec{\pi }}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})\Vert ^{2}_{2}. \end{aligned} \end{aligned}$$

Following Assumptions (C1)–(C4), we know that

$$\begin{aligned}&\max \limits _{1\le i\le m}\Vert G_{2i}({\check{{\varvec{\pi }}}}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})-{\textrm{E}}\{G_{2i}({\varvec{\pi }}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})\}\Vert _{2}\\&\quad \le \max \limits _{1\le i\le m}\Vert G_{2i}({\check{{\varvec{\pi }}}}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})-G_{2i}({\varvec{\pi }}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})\Vert _{2}\\&\qquad +\max \limits _{1\le i\le m}\Vert G_{2i}({\varvec{\pi }}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})-{\textrm{E}}\{G_{2i}({\varvec{\pi }}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})\}\Vert _{2} \\&\quad =o_{p}(1). \end{aligned}$$

Further, for each i, we have that

$$\begin{aligned} \begin{aligned}&\lambda _{min}\{G_{2i}({\check{{\varvec{\pi }}}}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})\}\\&\quad =\lambda _{min}[{\textrm{E}}\{G_{2i}({\varvec{\pi }}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})\}]+o_{p}(1). \end{aligned} \end{aligned}$$

Thus, we have

$$\begin{aligned}&\frac{1}{m}\sum \limits _{i=1}^{m}\Vert {{\tilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }}^{0})-{\varvec{\pi }}_{i}({\varvec{\theta }}^{0})\Vert ^{2}_{2} \\&\quad \le (\lambda _{min}[{\textrm{E}}\{G_{2i}({\varvec{\pi }}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})\}]+o_{p}(1))^{-2} \\&\quad \quad \frac{1}{m}\sum \limits _{i=1}^{m}\Vert G_{1i}({\varvec{\pi }}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})\Vert ^{2}_{2} \\&\quad =O_{p}(1)O_{p}(n^{-1}_{1})=O_{p}(1/n_{1}), \end{aligned}$$

where the last equality holds since \({\textrm{E}}\{G_{1i}({\varvec{\pi }}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})\}={\textbf{0}}\), so that \(\frac{1}{m}\sum _{i=1}^{m}\Vert G_{1i}({\varvec{\pi }}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})\Vert ^{2}_{2}=O_{p}(n^{-1}_{1})\). Then, \(\Vert S_{2}\Vert _{2}=O_{p}(n^{-1/2}_{1})\). When \(m\rightarrow \infty \) and \(n_{1}\rightarrow \infty \), we have

$$\begin{aligned} \frac{\partial ^{2} Q({\varvec{\theta }})}{\partial {\varvec{\theta }}\partial {\varvec{\theta }}^{\textrm{T}}}\big |_{{\varvec{\theta }}={\varvec{\theta }}^{0}}\rightarrow {\textrm{E}}\left( \frac{\partial ^{2} Q({\varvec{\theta }}^{0})}{\partial {\varvec{\theta }}\partial {\varvec{\theta }}^{\textrm{T}}}\right) , \end{aligned}$$

with probability tending to one. Hence \(Q({{\tilde{{\varvec{\theta }}}}})-Q({\varvec{\theta }}^{0})\) is dominated by the quadratic term \(({\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0})^{\textrm{T}}(\partial ^{2} Q({\varvec{\theta }})/\partial {\varvec{\theta }}\partial {\varvec{\theta }}^{\textrm{T}})|_{{\varvec{\theta }}={\varvec{\theta }}^{0}}({\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0})\), which is negative definite in the limit. Combining \(Q({{\tilde{{\varvec{\theta }}}}})-Q({\varvec{\theta }}^{0})\ge 0\) with the order \(O_{p}(n^{-1/2}_{1})\) of the linear term yields \(\Vert {\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0}\Vert _{2}=O_{p}(1/\sqrt{n_{1}})\). Next, we prove that the preliminary estimator \({{\tilde{{\varvec{\pi }}}}}_{i}\) is consistent as the sample size m and the cluster size \(n_{1}\) tend to infinity, \(i=1,\ldots ,m\).

(2). Let \({{\tilde{{\varvec{\pi }}}}}_{i}\in \{{\varvec{\pi }}^{0}_{i}+{\varvec{v}}/\sqrt{n_{1}}:\Vert {\varvec{v}}\Vert \le C\}\), where C is a constant. We have the decomposition

$$\begin{aligned}&\frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}\{\log {f(y_{ij}|{\varvec{x}}_{ij},{{\tilde{{\varvec{\pi }}}}}_{i},{{\tilde{{\varvec{\theta }}}}})}-\log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}^{0}_{i},{{\tilde{{\varvec{\theta }}}}})}\}\\&\quad =\frac{1}{\sqrt{n_{i}}}{\varvec{v}}^{\textrm{T}}\frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}\frac{\partial \log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}^{0}_{i},{{\tilde{{\varvec{\theta }}}}})}}{\partial {\varvec{\pi }}_{i}}\\&\qquad +\frac{1}{2n_{i}}{\varvec{v}}^{\textrm{T}}\left\{ \frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}\frac{\partial ^{2}\log {f(y_{ij}|{\varvec{x}}_{ij},{{\tilde{{\varvec{\pi }}}}}_{i},{{\tilde{{\varvec{\theta }}}}})}}{\partial {\varvec{\pi }}_{i}\partial {\varvec{\pi }}^{\textrm{T}}_{i}}\right\} {\varvec{v}}\\&\quad = I_{1}+I_{2}. \end{aligned}$$

Under Assumptions (C1)–(C4) and \(\Vert {\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0}\Vert _{2}=O_{p}(1/\sqrt{n_{1}})\), we have

$$\begin{aligned}&\frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}\frac{\partial \log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}^{0}_{i},{{\tilde{{\varvec{\theta }}}}})}}{\partial {\varvec{\pi }}_{i}}\\&\quad =\frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}\frac{\partial \log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}^{0}_{i},{\varvec{\theta }}^{0})}}{\partial {\varvec{\pi }}_{i}} \\&\qquad +({{\tilde{{\varvec{\theta }}}}}-{\varvec{\theta }}^{0})^{\textrm{T}}\frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}\frac{\partial ^{2}\log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}^{0}_{i},{\check{{\varvec{\theta }}}})}}{\partial {\varvec{\pi }}_{i}\partial {\varvec{\theta }}}\\&\quad =\frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}\frac{\partial \log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}^{0}_{i},{\varvec{\theta }}^{0})}}{\partial {\varvec{\pi }}_{i}}+O_{p}(1/\sqrt{n_{1}})\\&\quad =O_{p}(1/\sqrt{n_{1}}), \end{aligned}$$

where \({\check{{\varvec{\theta }}}}\) lies between \({{\tilde{{\varvec{\theta }}}}}\) and \({\varvec{\theta }}^{0}\). Thus,

$$\begin{aligned} \begin{aligned} |I_{1}|&\le \frac{1}{\sqrt{n_{i}}}\Vert {\varvec{v}}\Vert _{2}\cdot \Vert \frac{1}{n_{i}}\sum \limits _{j=1}^{n_{i}}\frac{\partial \log {f(y_{ij}|{\varvec{x}}_{ij},{\varvec{\pi }}^{0}_{i},{{\tilde{{\varvec{\theta }}}}})}}{\partial {\varvec{\pi }}_{i}}\Vert _{2}\\&=\Vert {\varvec{v}}\Vert _{2}O_{p}(1/n_{i}). \end{aligned} \end{aligned}$$

Next, we consider \(I_{2}\). Using the same arguments as for \(G_{2i}({\check{{\varvec{\pi }}}}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})\), we can obtain that

$$\begin{aligned} \lambda _{min}\{-G_{2i}({\check{{\varvec{\pi }}}}_{i},{{\check{{\varvec{\theta }}}}})\}=\lambda _{min}\{-G_{2i}({\varvec{\pi }}^{0}_{i},{\varvec{\theta }}^{0})\}+o_{p}(1). \end{aligned}$$

Then,

$$\begin{aligned} \begin{aligned} I_{2}&\le -\frac{1}{2n_{i}}\Vert {\varvec{v}}\Vert ^{2}_{2}[\lambda _{min}\{-G_{2i}({\varvec{\pi }}^{0}_{i},{\varvec{\theta }}^{0})\}+o_{p}(1)]\\&\le -\frac{c}{4n_{i}}\Vert {\varvec{v}}\Vert ^{2}_{2} \end{aligned} \end{aligned}$$

for some constant \(c>0\), since \(-{\textrm{E}}\{G_{2i}({\varvec{\pi }}^{0}_{i},{\varvec{\theta }}^{0})\}\) is positive definite under Assumptions (C1)–(C4). Consequently, for a sufficiently large constant \(C=\Vert {\varvec{v}}\Vert _{2}\), \(I_{1}\) is dominated by \(I_{2}\), and

$$\begin{aligned} P(\sup \limits _{\Vert {\varvec{v}}\Vert _{2}=C}(I_{1}+I_{2})<0)\ge 1-\varepsilon . \end{aligned}$$

Thus, for each \(i=1,\ldots ,m\), we have \(\Vert {{\tilde{{\varvec{\pi }}}}}_{i}-{\varvec{\pi }}^{0}_{i}\Vert _{2}=O_{p}(1/\sqrt{n_{1}})\). \(\square \)
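
Theorem 1's \(1/\sqrt{n_{1}}\) rate is easy to check empirically: quadrupling the cluster size should roughly halve the error of the preliminary proportion estimates. A toy Monte Carlo under the same known-component Gaussian setup as in the sketch above (our own simulation settings, not the paper's):

```python
import numpy as np
from scipy.stats import norm

def profile_weights(y, means=(0.0, 4.0), sds=(1.0, 1.0), n_iter=200):
    # weight-only EM with the component parameters held fixed (see sketch above)
    dens = np.column_stack([norm.pdf(y, m, s) for m, s in zip(means, sds)])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        resp = dens * pi
        resp /= resp.sum(axis=1, keepdims=True)
        pi = resp.mean(axis=0)
    return pi

rng = np.random.default_rng(1)
for n in (250, 1000, 4000):
    errs = []
    for _ in range(200):
        z = rng.random(n) < 0.7
        y = np.where(z, rng.normal(0.0, 1.0, n), rng.normal(4.0, 1.0, n))
        errs.append(np.linalg.norm(profile_weights(y) - np.array([0.7, 0.3])))
    print(n, round(float(np.mean(errs)), 4))  # mean error roughly halves as n quadruples
```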

Proof of Theorem 2

Recall that \({\varvec{\gamma }}=({\varvec{\gamma }}^{\textrm{T}}_{1},\dots ,{\varvec{\gamma }}^{\textrm{T}}_{m})^{\textrm{T}}\) is the \(m\times K\) parameter matrix, and let \({\textbf{W}}=({\textbf{W}}^{\textrm{T}}_{1},\dots ,{\textbf{W}}^{\textrm{T}}_{m})^{\textrm{T}}\) be an \(m\times G\) group-membership matrix, where \({\textbf{W}}_{i}=(w_{i1},\dots ,w_{iG})^{\textrm{T}}\) has exactly one element equal to 1 and all others equal to 0, with \(w_{ig}=1\) if the ith cluster belongs to the gth subgroup. Write \({\varvec{\gamma }}={\textbf{W}}{\varvec{\alpha }}\), where \({\varvec{\alpha }}\in {\mathbb {R}}^{G\times K}\). When \({\textbf{W}}\) is known, then,

$$\begin{aligned} {\widehat{{\varvec{\alpha }}}}^{or}=\arg \min \limits _{{\varvec{\alpha }}\in {\mathbb {R}}^{G\times K}}\frac{1}{2}\Vert {{\widetilde{{\varvec{\pi }}}}}-{\textbf{W}}{\varvec{\alpha }}\Vert ^{2}_{F}. \end{aligned}$$

Obviously, \({\widehat{{\varvec{\alpha }}}}^{or}=({\textbf{W}}^{\textrm{T}}{\textbf{W}})^{-1}{\textbf{W}}^{\textrm{T}}{{\tilde{{\varvec{\pi }}}}}\). Since \({\varvec{\alpha }}^{0}=({\textbf{W}}^{\textrm{T}}{\textbf{W}})^{-1}{\textbf{W}}^{\textrm{T}}{\textbf{W}}{\varvec{\alpha }}^{0}\), we have

$$\begin{aligned} {\widehat{{\varvec{\alpha }}}}^{or}-{\varvec{\alpha }}^{0}=({\textbf{W}}^{\textrm{T}}{\textbf{W}})^{-1}{\textbf{W}}^{\textrm{T}}({{\tilde{{\varvec{\pi }}}}}-{\textbf{W}}{\varvec{\alpha }}^{0}). \end{aligned}$$

Following Theorem 1, we know that \(\Vert {{\tilde{{\varvec{\pi }}}}}_{i}-{\varvec{\pi }}^{0}_{i}\Vert _{2}=\Vert {{\tilde{{\varvec{\pi }}}}}_{i}-{\textbf{W}}_{i}{\varvec{\alpha }}^{0}\Vert _{2}=O_{p}(n^{-1/2}_{1})\), and

$$\begin{aligned} \begin{aligned}&\Vert {\widehat{{\varvec{\alpha }}}}^{or}-{\varvec{\alpha }}^{0}\Vert _{\infty }\\&\quad \le \Vert ({\textbf{W}}^{\textrm{T}}{\textbf{W}})^{-1}\Vert _{\infty }\Vert {\textbf{W}}^{\textrm{T}}({{\tilde{{\varvec{\pi }}}}}-{\textbf{W}}{\varvec{\alpha }}^{0})\Vert _{\infty }. \end{aligned} \end{aligned}$$

Additionally, we know that \(\Vert ({\textbf{W}}^{\textrm{T}}{\textbf{W}})^{-1}\Vert _{\infty }=|{\mathcal {S}}_{min}|^{-1}\) and \(\Vert {\textbf{W}}^{\textrm{T}}({{\tilde{{\varvec{\pi }}}}}-{\textbf{W}}{\varvec{\alpha }}^{0})\Vert _{\infty }=O_{p}(|{\mathcal {S}}_{max}|\sqrt{K}/\sqrt{n_{1}})\), where K is fixed. Thus, \(\Vert {\widehat{{\varvec{\gamma }}}}^{or}-{\varvec{\gamma }}^{0}\Vert _{\infty }=\Vert {\widehat{{\varvec{\alpha }}}}^{or}-{\varvec{\alpha }}^{0}\Vert _{\infty }=O_{p}(|{\mathcal {S}}_{max}|/(|{\mathcal {S}}_{min}|\sqrt{n_{1}}))\). Under the assumption \(|{\mathcal {S}}_{max}|=O(|{\mathcal {S}}_{min}|)\), we have \(\Vert {\widehat{{\varvec{\gamma }}}}^{or}-{\varvec{\gamma }}^{0}\Vert _{\infty }\le C_{1}/\sqrt{n_{1}}\) with probability tending to one. \(\square \)
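
In matrix form, the oracle estimator is nothing more than within-group averaging of the preliminary estimates. A small numerical illustration (toy numbers of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(2)
m, K, G = 9, 3, 2
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])      # true memberships S
W = np.eye(G)[labels]                                 # m x G indicator matrix
alpha0 = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.3, 0.5]])                  # group-level mixing proportions
pi_tilde = W @ alpha0 + rng.normal(0, 0.01, (m, K))   # noisy preliminary estimates

# oracle estimator: (W^T W)^{-1} W^T pi_tilde, i.e. group-wise means
alpha_or = np.linalg.solve(W.T @ W, W.T @ pi_tilde)
for g in range(G):
    assert np.allclose(alpha_or[g], pi_tilde[labels == g].mean(axis=0))
print(np.abs(alpha_or - alpha0).max())                # small; O_p(n_1^{-1/2}) in the theory
```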

Proof of Theorem 3

For any given \(\lambda \), note that

$$\begin{aligned} \begin{aligned} Q({\varvec{\gamma }})&=\frac{1}{2}\Vert {{\tilde{{\varvec{\pi }}}}}-{\varvec{\gamma }}\Vert ^{2}_{F}+\sum \limits _{1\le i<i^{\prime }\le m}p_{\tau }(\Vert {\varvec{\gamma }}_{i}-{\varvec{\gamma }}_{i^{\prime }}\Vert _{2},\lambda )\\&= L_{m}({\varvec{\gamma }})+P_{m}({\varvec{\gamma }}). \end{aligned} \end{aligned}$$

Let \(Q^{{\mathcal {S}}}({\varvec{\alpha }})=L^{{\mathcal {S}}}_{m}({\varvec{\alpha }})+P^{{\mathcal {S}}}_{m}({\varvec{\alpha }})\) be the objective function when the true group structure \({\mathcal {S}}\) is known, that is, \(L^{{\mathcal {S}}}_{m}({\varvec{\alpha }})=\frac{1}{2}\Vert {{\tilde{{\varvec{\pi }}}}}-{\textbf{W}}{\varvec{\alpha }}\Vert ^{2}_{F}\) and \(P^{{\mathcal {S}}}_{m}({\varvec{\alpha }})=\sum _{g<g^{\prime }}|{\mathcal {S}}_{g}||{\mathcal {S}}_{g^{\prime }}|p_{\tau }(\Vert {\varvec{\alpha }}_{g}-{\varvec{\alpha }}_{g^{\prime }}\Vert _{2},\lambda )\). Let \({\mathcal {T}}: \mathcal {M_{{\mathcal {S}}}}\rightarrow {\mathbb {R}}^{G\times K}\) be the mapping such that \({\mathcal {T}}({\varvec{\gamma }})\) is the \(G\times K\) matrix whose gth row is the common mixing-probability vector of the gth subgroup. Let \({\mathcal {T}}^{\star }:{\mathbb {R}}^{m\times K}\rightarrow {\mathbb {R}}^{G\times K}\) be the mapping \({\mathcal {T}}^{\star }({\varvec{\gamma }})=((\sum _{i\in {\mathcal {S}}_{1}}{\varvec{\gamma }}_{i}/|{\mathcal {S}}_{1}|)^{\textrm{T}},\dots ,(\sum _{i\in {\mathcal {S}}_{G}}{\varvec{\gamma }}_{i}/|{\mathcal {S}}_{G}|)^{\textrm{T}})^{\textrm{T}}\). Obviously, when \({\varvec{\gamma }}\in {\mathcal {M}}_{{\mathcal {S}}}\), \({\mathcal {T}}({\varvec{\gamma }})={\mathcal {T}}^{\star }({\varvec{\gamma }})\) and \(P_{m}({\varvec{\gamma }})=P^{{\mathcal {S}}}_{m}({\mathcal {T}}({\varvec{\gamma }}))\); for each \({\varvec{\alpha }}\in {\mathbb {R}}^{G\times K}\), \(P^{{\mathcal {S}}}_{m}({\varvec{\alpha }})=P_{m}({\mathcal {T}}^{-1}({\varvec{\alpha }}))\). Thus, we have

$$\begin{aligned} Q({\varvec{\gamma }})=Q^{{\mathcal {S}}}({\mathcal {T}}({\varvec{\gamma }})),\quad Q^{{\mathcal {S}}}({\varvec{\alpha }})=Q({\mathcal {T}}^{-1}({\varvec{\alpha }})). \end{aligned}$$
(A.2)

For every \({\varvec{\gamma }}\in {\mathbb {R}}^{m\times K}\), define \({\varvec{\gamma }}^{\star }={\mathcal {T}}^{-1}({\mathcal {T}}^{\star }({\varvec{\gamma }}))\). Define

$$\begin{aligned} \Gamma =\{{\varvec{\gamma }}\in {\mathbb {R}}^{m\times K}: \Vert {\varvec{\gamma }}-{\varvec{\gamma }}^{0}\Vert _{\infty }\le C_{1}/\sqrt{n_{1}}\} \end{aligned}$$

as a neighborhood of \({\varvec{\gamma }}^{0}\). Following Theorem 2, we know that \({\widehat{{\varvec{\gamma }}}}^{or}\in \Gamma \) with probability tending to one. Next, we show that \({\widehat{{\varvec{\gamma }}}}^{or}\) is a strict local minimizer of the objective function \(Q({\varvec{\gamma }})\) with probability tending to one.

Firstly, for each \({\varvec{\gamma }}\in \Gamma \), we prove that \(Q({\varvec{\gamma }}^{\star })>Q({\widehat{{\varvec{\gamma }}}}^{or})\) whenever \({\varvec{\gamma }}^{\star }\ne {\widehat{{\varvec{\gamma }}}}^{or}\). We know that \(\Vert {\varvec{\alpha }}_{g}-{\varvec{\alpha }}_{g^{\prime }}\Vert _{2}\ge \Vert {\varvec{\alpha }}^{0}_{g}-{\varvec{\alpha }}^{0}_{g^{\prime }}\Vert _{2}-2\sup \limits _{g}\Vert {\varvec{\alpha }}_{g}-{\varvec{\alpha }}^{0}_{g}\Vert _{2}\), and

$$\begin{aligned}&\sup \limits _{g}\Vert {\varvec{\alpha }}_{g}-{\varvec{\alpha }}^{0}_{g}\Vert _{2}\nonumber \\&\quad =\sup \limits _{g}\Vert \sum \limits _{i\in {\mathcal {S}}_{g}}{\varvec{\gamma }}_{i}/|{\mathcal {S}}_{g}|-{\varvec{\alpha }}^{0}_{g}\Vert _{2}\nonumber \\&\quad \le \sup \limits _{g}\sup \limits _{i\in {\mathcal {S}}_{g}}\Vert {\varvec{\gamma }}_{i}-{\varvec{\gamma }}^{0}_{i}\Vert _{2}\le C_{1}/\sqrt{n_{1}}. \end{aligned}$$
(A.3)

Then, for any g and \(g^{\prime }\), we have \(\Vert {\varvec{\alpha }}_{g}-{\varvec{\alpha }}_{g^{\prime }}\Vert _{2}\ge \Vert {\varvec{\alpha }}^{0}_{g}-{\varvec{\alpha }}^{0}_{g^{\prime }}\Vert _{2}-2\sup _{i}\Vert {\varvec{\gamma }}_{i}-{\varvec{\gamma }}^{0}_{i}\Vert _{2}\ge b_{m}-2C_{1}/\sqrt{n_{1}}>a\lambda \), where the last inequality holds because of the assumption \(b_{m}>a\lambda \gg C_{1}/\sqrt{n_{1}}\). Since the concave penalty \(p_{\tau }(t,\lambda )\) is constant for \(t>a\lambda \), we obtain \(P^{{\mathcal {S}}}_{m}({\mathcal {T}}^{\star }({\varvec{\gamma }}))=C_{m}\), where \(C_{m}\) is a constant, and hence \(Q^{{\mathcal {S}}}({\mathcal {T}}^{\star }({\varvec{\gamma }}))=L^{{\mathcal {S}}}_{m}({\mathcal {T}}^{\star }({\varvec{\gamma }}))+C_{m}\) for all \({\varvec{\gamma }}\in \Gamma \); the same argument gives \(P^{{\mathcal {S}}}_{m}({\widehat{{\varvec{\alpha }}}}^{or})=C_{m}\), since \({\widehat{{\varvec{\gamma }}}}^{or}\in \Gamma \). In addition, because \({\widehat{{\varvec{\alpha }}}}^{or}\) is the unique global minimizer of \(L^{{\mathcal {S}}}_{m}({\varvec{\alpha }})\), we have \(L^{{\mathcal {S}}}_{m}({\mathcal {T}}^{\star }({\varvec{\gamma }}))>L^{{\mathcal {S}}}_{m}({\widehat{{\varvec{\alpha }}}}^{or})\) for all \({\mathcal {T}}^{\star }({\varvec{\gamma }})\ne {\widehat{{\varvec{\alpha }}}}^{or}\), so that \(Q^{{\mathcal {S}}}({\mathcal {T}}^{\star }({\varvec{\gamma }}))>Q^{{\mathcal {S}}}({\widehat{{\varvec{\alpha }}}}^{or})\). Since \(Q({\varvec{\gamma }}^{\star })=Q({\mathcal {T}}^{-1}({\mathcal {T}}^{\star }({\varvec{\gamma }})))=Q^{{\mathcal {S}}}({\mathcal {T}}^{\star }({\varvec{\gamma }}))\) and \(Q({\widehat{{\varvec{\gamma }}}}^{or})=Q^{{\mathcal {S}}}({\widehat{{\varvec{\alpha }}}}^{or})\) by (A.2), it follows that \(Q({\varvec{\gamma }}^{\star })>Q({\widehat{{\varvec{\gamma }}}}^{or})\) for each \({\varvec{\gamma }}^{\star }\ne {\widehat{{\varvec{\gamma }}}}^{or}\).

Secondly, define a positive sequence \(t_{m}\) and let \(\Gamma _{m}=\{{\varvec{\gamma }}: \Vert {\varvec{\gamma }}-{\widehat{{\varvec{\gamma }}}}^{or}\Vert _{2}\le t_{m}\}\) be a neighborhood of \({\widehat{{\varvec{\gamma }}}}^{or}\). For any \({\varvec{\gamma }}\in \Gamma _{m}\cap \Gamma \), a Taylor expansion of \(Q({\varvec{\gamma }})\) around \({\varvec{\gamma }}^{\star }\) gives

$$\begin{aligned} \begin{aligned}&Q({\varvec{\gamma }})-Q({\varvec{\gamma }}^{\star })\\&\quad =-\textrm{tr}\{({{\widetilde{{\varvec{\pi }}}}}-{\check{{\varvec{\gamma }}}})^{\textrm{T}}({\varvec{\gamma }}-{\varvec{\gamma }}^{\star })\}+\sum \limits _{i=1}^{m}\frac{\partial P_{m}({\check{{\varvec{\gamma }}}})}{\partial {\varvec{\gamma }}^{\textrm{T}}_{i}}({\varvec{\gamma }}_{i}-{\varvec{\gamma }}^{\star }_{i})\\&\quad = R_{1}+R_{2}, \end{aligned} \end{aligned}$$

where \({\check{{\varvec{\gamma }}}}=\delta {\varvec{\gamma }}+(1-\delta ){\varvec{\gamma }}^{\star }\) for some \(\delta \in (0,1)\), and

$$\begin{aligned} \begin{aligned} R_{2}&=\lambda \sum \limits _{i^{\prime }>i}\rho ^{\prime }(\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2})(\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2})^{-1}\\&\quad \quad ({\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }})^{\textrm{T}}({\varvec{\gamma }}_{i}-{\varvec{\gamma }}^{\star }_{i})\\&\quad +\lambda \sum \limits _{i^{\prime }<i}\rho ^{\prime }(\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2}) (\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2})^{-1}\\&\quad \quad ({\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }})^{\textrm{T}}({\varvec{\gamma }}_{i}-{\varvec{\gamma }}^{\star }_{i})\\&=\lambda \sum \limits _{i^{\prime }>i}\rho ^{\prime }(\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2}) (\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2})^{-1}\\&\quad \quad ({\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }})^{\textrm{T}}({\varvec{\gamma }}_{i}-{\varvec{\gamma }}^{\star }_{i})\\&\quad +\lambda \sum \limits _{i^{\prime }<i}\rho ^{\prime }(\Vert {\check{{\varvec{\gamma }}}}_{i^{\prime }}-{\check{{\varvec{\gamma }}}}_{i}\Vert _{2}) (\Vert {\check{{\varvec{\gamma }}}}_{i^{\prime }}-{\check{{\varvec{\gamma }}}}_{i}\Vert _{2})^{-1}\\&\quad \quad ({\check{{\varvec{\gamma }}}}_{i^{\prime }}-{\check{{\varvec{\gamma }}}}_{i})^{\textrm{T}}({\varvec{\gamma }}_{i^{\prime }}-{\varvec{\gamma }}^{\star }_{i^{\prime }})\\&=\lambda \sum \limits _{i^{\prime }>i}\rho ^{\prime }(\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2}) (\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2})^{-1}\\&\quad \quad ({\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }})^{\textrm{T}}\{({\varvec{\gamma }}_{i}-{\varvec{\gamma }}^{\star }_{i})-({\varvec{\gamma }}_{i^{\prime }}-{\varvec{\gamma }}^{\star }_{i^{\prime }})\}. \end{aligned} \end{aligned}$$

When \(i,i^{\prime }\in {\mathcal {S}}_{g}\), we have \({\varvec{\gamma }}^{\star }_{i}={\varvec{\gamma }}^{\star }_{i^{\prime }}\) and \({\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}=\delta ({\varvec{\gamma }}_{i}-{\varvec{\gamma }}_{i^{\prime }})\). Thus, we have

$$\begin{aligned} R_{2}&=\lambda \sum \limits _{g=1}^{G}\sum \limits _{i,i^{\prime }\in {\mathcal {S}}_{g},i<i^{\prime }}\rho ^{\prime }(\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2})\\&\quad (\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2})^{-1} ({\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }})^{\textrm{T}}({\varvec{\gamma }}_{i}-{\varvec{\gamma }}_{i^{\prime }})\\&\quad +\lambda \sum \limits _{g<g^{\prime }}\sum \limits _{i\in {\mathcal {S}}_{g},i^{\prime }\in {\mathcal {S}}_{g^{\prime }}}\rho ^{\prime }(\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2})\\&\quad (\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2})^{-1}({\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }})^{\textrm{T}}\\&\quad \{({\varvec{\gamma }}_{i}-{\varvec{\gamma }}^{\star }_{i})-({\varvec{\gamma }}_{i^{\prime }}-{\varvec{\gamma }}^{\star }_{i^{\prime }})\}. \end{aligned}$$

Further, by the same argument as in (A.3), we have

$$\begin{aligned} \begin{aligned} \sup \limits _{i}\Vert {\varvec{\gamma }}^{\star }_{i}-{\widehat{{\varvec{\gamma }}}}^{or}_{i}\Vert _{2}&=\sup \limits _{g}\Vert {\varvec{\alpha }}_{g}-{\widehat{{\varvec{\alpha }}}}^{or}_{g}\Vert _{2}\\&\le \sup \limits _{i}\Vert {\varvec{\gamma }}_{i}-{\widehat{{\varvec{\gamma }}}}^{or}_{i}\Vert _{2}. \end{aligned} \end{aligned}$$

Then,

$$\begin{aligned} \begin{aligned}&\sup _{i}\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2}\\&\quad \le 2\sup \limits _{i}\Vert {\check{{\varvec{\gamma }}}}_{i}-{\varvec{\gamma }}^{\star }_{i}\Vert _{2} \le 2\sup \limits _{i}\Vert {\varvec{\gamma }}_{i}-{\varvec{\gamma }}^{\star }_{i}\Vert _{2}\\&\quad \le 2\{\sup \limits _{i}(\Vert {\varvec{\gamma }}_{i}-{\widehat{{\varvec{\gamma }}}}^{or}_{i}\Vert _{2}+\Vert {\widehat{{\varvec{\gamma }}}}^{or}_{i}-{\varvec{\gamma }}^{\star }_{i}\Vert _{2})\}\\&\quad \le 4 \sup \limits _{i}\Vert {\varvec{\gamma }}_{i}-{\widehat{{\varvec{\gamma }}}}^{or}_{i}\Vert _{2}\le 4 t_{m}. \end{aligned} \end{aligned}$$

Further, \(\rho ^{\prime }(\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2})\ge \rho ^{\prime }(4t_{m})\) by the concavity of \(\rho (\cdot )\). Thus, we have

$$\begin{aligned} R_{2}\ge \sum \limits _{g=1}^{G}\sum \limits _{i,i^{\prime }\in {\mathcal {S}}_{g},i<i^{\prime }}\lambda \rho ^{\prime }(4t_{m})\Vert {\varvec{\gamma }}_{i}-{\varvec{\gamma }}_{i^{\prime }}\Vert _{2}. \end{aligned}$$

When \(i\in {\mathcal {S}}_{g}\), we have \({\varvec{\gamma }}^{\star }_{i}=|{\mathcal {S}}_{g}|^{-1}\sum _{i^{\prime }\in {\mathcal {S}}_{g}}{\varvec{\gamma }}_{i^{\prime }}\). Since

$$\begin{aligned} \begin{aligned} R_{1}&=-\sum \limits _{i=1}^{m}({{\tilde{{\varvec{\pi }}}}}_{i}-{\check{{\varvec{\gamma }}}}_{i})^{\textrm{T}}({\varvec{\gamma }}_{i}-{\varvec{\gamma }}^{\star }_{i})\\&=-\sum \limits _{g=1}^{G}\left\{ \sum \limits _{i,i^{\prime }\in {\mathcal {S}}_{g}}({{\tilde{{\varvec{\pi }}}}}_{i}-{\check{{\varvec{\gamma }}}}_{i})^{\textrm{T}}({\varvec{\gamma }}_{i}-{\varvec{\gamma }}_{i^{\prime }})/|{\mathcal {S}}_{g}|\right\} \\&=-\frac{1}{2}\left\{ \sum \limits _{g=1}^{G}\frac{1}{|{\mathcal {S}}_{g}|}\left\{ \sum \limits _{i,i^{\prime }\in {\mathcal {S}}_{g}}({{\tilde{{\varvec{\pi }}}}}_{i}-{\check{{\varvec{\gamma }}}}_{i})^{\textrm{T}}({\varvec{\gamma }}_{i}-{\varvec{\gamma }}_{i^{\prime }})\right\} \right. \\&\left. \quad +\sum \limits _{g=1}^{G}\frac{1}{|{\mathcal {S}}_{g}|}\left\{ \sum \limits _{i,i^{\prime }\in {\mathcal {S}}_{g}}({{\tilde{{\varvec{\pi }}}}}_{i^{\prime }}-{\check{{\varvec{\gamma }}}}_{i^{\prime }})^{\textrm{T}}({\varvec{\gamma }}_{i^{\prime }}-{\varvec{\gamma }}_{i})\right\} \right\} , \end{aligned} \end{aligned}$$

we have \(R_{1}=-\sum _{g=1}^{G}\{\sum _{i<i^{\prime }}({{\tilde{{\varvec{\pi }}}}}_{i}-{\check{{\varvec{\gamma }}}}_{i}-{{\tilde{{\varvec{\pi }}}}}_{i^{\prime }} +{\check{{\varvec{\gamma }}}}_{i^{\prime }})^{\textrm{T}}({\varvec{\gamma }}_{i}-{\varvec{\gamma }}_{i^{\prime }})\}/|{\mathcal {S}}_{g}|\) for any \(i,i^{\prime }\in {\mathcal {S}}_{g}\), and \(\sup _{i}\Vert {{\tilde{{\varvec{\pi }}}}}_{i}-{\check{{\varvec{\gamma }}}}_{i}\Vert _{2}\le \sup _{i}(\Vert {{\tilde{{\varvec{\pi }}}}}_{i}-{\varvec{\pi }}^{0}_{i}\Vert _{2} +\Vert {\varvec{\pi }}^{0}_{i}-{\check{{\varvec{\gamma }}}}_{i}\Vert _{2})\). Since \(\sup \limits _{i}\Vert {\varvec{\pi }}^{0}_{i}-{\check{{\varvec{\gamma }}}}_{i}\Vert _{2}=\sup \limits _{i}\Vert {\varvec{\gamma }}^{0}_{i}-{\check{{\varvec{\gamma }}}}_{i}\Vert _{2}\le C_{1}/\sqrt{n_{1}}\), we obtain

$$\begin{aligned} |R_{1}|&\le \frac{2}{|{\mathcal {S}}_{min}|}\sum \limits _{g=1}^{G}\sum \limits _{\begin{array}{c} i,i^{\prime }\in {\mathcal {S}}_{g}\\ i<i^{\prime } \end{array}}\Vert {{\tilde{{\varvec{\pi }}}}}_{i}-{\check{{\varvec{\gamma }}}}_{i}\Vert _{2}\Vert {\varvec{\gamma }}_{i}-{\varvec{\gamma }}_{i^{\prime }}\Vert _{2}\\&\le \frac{2}{|{\mathcal {S}}_{min}|}(n_{1}^{-1/2}+C_{1}n_{1}^{-1/2})\sum \limits _{g=1}^{G}\sum \limits _{\begin{array}{c} i,i^{\prime }\in {\mathcal {S}}_{g}\\ i<i^{\prime } \end{array}}\Vert {\varvec{\gamma }}_{i}-{\varvec{\gamma }}_{i^{\prime }}\Vert _{2}. \end{aligned}$$

Consequently, we have

$$\begin{aligned} \begin{aligned} Q({\varvec{\gamma }})-Q({\varvec{\gamma }}^{\star })&\ge \sum \limits _{g=1}^{G}\sum \limits _{\begin{array}{c} i,i^{\prime }\in {\mathcal {S}}_{g}\\ i<i^{\prime } \end{array}}\Vert {\varvec{\gamma }}_{i}-{\varvec{\gamma }}_{i^{\prime }}\Vert _{2}\\&\quad \left\{ \lambda \rho ^{\prime }(4t_{m})-\frac{2}{|{\mathcal {S}}_{min}|} (n_{1}^{-1/2}+C_{1}n_{1}^{-1/2})\right\} . \end{aligned} \end{aligned}$$

Let \(t_{m}=o(1)\); then \(\rho ^{\prime }(4t_{m})\rightarrow 1\). Since \(\lambda \gg C_{1}n_{1}^{-1/2}\) and \(|{\mathcal {S}}_{min}|^{-1}=o(1)\), we have \(Q({\varvec{\gamma }})-Q({\varvec{\gamma }}^{\star })\ge 0\) for sufficiently large \(m, n_{1}\). \(\square \)
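
The objective \(Q({\varvec{\gamma }})\) analyzed above is the criterion the estimation step actually minimizes. Below is a minimal ADMM sketch using the MCP as the concave penalty \(p_{\tau }\); the update rules, tolerances, and the union-find grouping of the fused differences are our own implementation choices, offered as an illustration rather than as the paper's algorithm.

```python
import numpy as np
from itertools import combinations

def mcp_fusion(pi_tilde, lam, tau=3.0, rho=1.0, n_iter=500, tol=1e-8):
    """ADMM sketch for (1/2)||pi_tilde - gamma||_F^2
    + sum_{i<i'} MCP(||gamma_i - gamma_{i'}||_2; lam, tau), requiring tau*rho > 1."""
    m, K = pi_tilde.shape
    pairs = list(combinations(range(m), 2))
    D = np.zeros((len(pairs), m))
    for p, (i, j) in enumerate(pairs):
        D[p, i], D[p, j] = 1.0, -1.0            # pairwise-difference operator
    M = np.eye(m) + rho * (D.T @ D)             # gamma-update system matrix
    gamma, delta = pi_tilde.copy(), D @ pi_tilde
    u = np.zeros_like(delta)                    # scaled dual variable
    for _ in range(n_iter):
        gamma = np.linalg.solve(M, pi_tilde + rho * D.T @ (delta - u))
        z = D @ gamma + u
        t = np.linalg.norm(z, axis=1)
        delta_new = z.copy()                    # MCP is flat beyond tau*lam: no shrinkage
        small = t <= tau * lam
        shrink = np.maximum(0.0, 1.0 - lam / (rho * np.maximum(t, 1e-12)))
        delta_new[small] = z[small] * shrink[small][:, None] / (1.0 - 1.0 / (tau * rho))
        u += D @ gamma - delta_new
        if np.linalg.norm(delta_new - delta) < tol:
            delta = delta_new
            break
        delta = delta_new
    # fused (zero) differences link clusters into subgroups: union-find
    parent = list(range(m))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for p, (i, j) in enumerate(pairs):
        if np.linalg.norm(delta[p]) < 1e-6:
            parent[find(i)] = find(j)
    return gamma, np.array([find(i) for i in range(m)])

# toy run: ten clusters whose preliminary proportions sit near two group centers
rng = np.random.default_rng(3)
centers = np.array([[0.6, 0.4], [0.2, 0.8]])
pi_tilde = centers[np.repeat([0, 1], 5)] + rng.normal(0, 0.01, (10, 2))
gamma_hat, groups = mcp_fusion(pi_tilde, lam=0.1)
print(groups)                                    # two recovered subgroups
```

In practice \(\lambda \) would be tuned, for instance over a grid with a BIC-type criterion, and the recovered partition plays the role of \({\widehat{{\mathcal {S}}}}\) in the post-classification step.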

Proof of Theorem 4

Define

$$\begin{aligned} {\widehat{{\varvec{\Theta }}}}^{or}=\arg \max \limits _{{\varvec{\Theta }}}\sum \limits _{i=1}^{m}\sum \limits _{j=1}^{n_{i}}\log \{f({y_{ij}|{\varvec{x}}_{ij},{\varvec{\Theta }},{\mathcal {S}}^{0}})\} \end{aligned}$$

as the oracle estimator of \({\varvec{\Theta }}=({{\textrm{vec}}}({\varvec{\pi }})^{\textrm{T}},{\varvec{\theta }}^{\textrm{T}})^{\textrm{T}}\) when the true group structure is given, and

$$\begin{aligned} {\widehat{{\varvec{\Theta }}}}=\arg \max \limits _{{\varvec{\Theta }}}\sum \limits _{i=1}^{m}\sum \limits _{j=1}^{n_{i}}\log \{f({y_{ij}|{\varvec{x}}_{ij},{\varvec{\Theta }},{\widehat{{\mathcal {S}}}}})\}. \end{aligned}$$

Following Theorem 3, we know that \(P({\widehat{{\mathcal {S}}}}={\mathcal {S}}^{0})\rightarrow 1\) when both the sample size m and the cluster size \(n_{1}\) tend to infinity. Therefore, it is sufficient to consider the asymptotic distribution of the oracle estimators \({\widehat{{\varvec{\Theta }}}}^{or}\).

Let \({\varvec{\Theta }}_{{\varvec{\vartheta }}}={\varvec{\Theta }}^{0}+N^{-1/2}{\varvec{\vartheta }}\) with \(\Vert {\varvec{\vartheta }}\Vert _{2}\le M_{\varepsilon }\), and

$$\begin{aligned} L({\varvec{\Theta }})&=\sum \limits _{i=1}^{m}\sum \limits _{j=1}^{n_{i}}\log \{f({y_{ij}|{\varvec{x}}_{ij},{\varvec{\Theta }},{\mathcal {S}}^{0}})\}, \\ {\textbf{U}}({\varvec{\Theta }}^{0})&=\frac{\partial L({\varvec{\Theta }})}{\partial {\varvec{\Theta }}}\big |_{{\varvec{\Theta }}={\varvec{\Theta }}^{0}},\quad {\textbf{V}}({\varvec{\Theta }}^{0})=\frac{\partial ^{2} L({\varvec{\Theta }})}{\partial {\varvec{\Theta }}\partial {\varvec{\Theta }}^{\textrm{T}}}\big |_{{\varvec{\Theta }}={\varvec{\Theta }}^{0}}.\end{aligned}$$

By a Taylor expansion of \(L({\varvec{\Theta }}_{{\varvec{\vartheta }}})\) around \({\varvec{\Theta }}^{0}\),

$$\begin{aligned} L({\varvec{\Theta }}_{{\varvec{\vartheta }}})-L({\varvec{\Theta }}^{0})=\frac{1}{\sqrt{N}}{\textbf{U}}({\varvec{\Theta }}^{0})^{\textrm{T}}{\varvec{\vartheta }}+\frac{1}{2N}{\varvec{\vartheta }}^{\textrm{T}}{\textbf{V}}({\check{{\varvec{\Theta }}}}){\varvec{\vartheta }}, \end{aligned}$$

where \({\check{{\varvec{\Theta }}}}\) lies between \({\varvec{\Theta }}_{{\varvec{\vartheta }}}\) and \({\varvec{\Theta }}^{0}\). When \(m,n_{1}\rightarrow \infty \), following the assumption conditions, we have \({\textbf{U}}({\varvec{\Theta }}^{0})=O_{p}(\sqrt{N})\) for the score function and \(-{\textbf{V}}({\varvec{\Theta }}^{0})/N={\textbf{F}}({\varvec{\Theta }}^{0})/N+o_{P}(1)\), where \({\textbf{F}}({\varvec{\Theta }}^{0})=-{\textrm{E}}\{{\textbf{V}}({\varvec{\Theta }}^{0})\}=N{\bar{{\textbf{F}}}}({\varvec{\Theta }}^{0})\) is the Fisher information matrix and \({\textbf{F}}({\varvec{\Theta }}^{0})/N=O(1)\). Thus, for sufficiently large \(M_{\varepsilon }\), \(L({\varvec{\Theta }}_{{\varvec{\vartheta }}})-L({\varvec{\Theta }}^{0})\) is dominated by the quadratic term, which equals \(-\frac{1}{2}{\varvec{\vartheta }}^{\textrm{T}}(\frac{1}{N}{\textbf{F}}({\varvec{\Theta }}^{0})+o_{p}(1)){\varvec{\vartheta }}\). Hence, for any given \(\varepsilon >0\), there exists a large \(M_{\varepsilon }\) such that

$$\begin{aligned} P(\sup \limits _{\Vert {\varvec{\vartheta }}\Vert _{2}=M_{\varepsilon }}\{L({\varvec{\Theta }}_{{\varvec{\vartheta }}})-L({\varvec{\Theta }}^{0})\}<0)>1-\varepsilon . \end{aligned}$$

Thus, we have \(\Vert {\widehat{{\varvec{\Theta }}}}^{or}-{\varvec{\Theta }}^{0}\Vert _{2}=O_{p}(N^{-1/2})\).

Furthermore, expanding \({\textbf{U}}({\widehat{{\varvec{\Theta }}}}^{or})\) around \({\varvec{\Theta }}^{0}\) and using \({\textbf{U}}({\widehat{{\varvec{\Theta }}}}^{or})={\textbf{0}}\), we obtain

$$\begin{aligned} {\widehat{{\varvec{\Theta }}}}^{or}-{\varvec{\Theta }}^{0}=-{\textbf{V}}^{-1}({\check{{\varvec{\Theta }}}}){\textbf{U}}({\varvec{\Theta }}^{0}), \end{aligned}$$

where \({\check{{\varvec{\Theta }}}}\) lies between \({\widehat{{\varvec{\Theta }}}}^{or}\) and \({\varvec{\Theta }}^{0}\). When the sample sizes \(m,n_{1}\rightarrow \infty \), by the weak law of large numbers, we have

$$\begin{aligned} -\frac{1}{N}{\textbf{V}}({\check{{\varvec{\Theta }}}}){\mathop {\longrightarrow }\limits ^{P}}{\bar{{\textbf{F}}}}({\varvec{\Theta }}^{0}). \end{aligned}$$

Thus, \(\sqrt{N}{\bar{{\textbf{F}}}}^{1/2}({\varvec{\Theta }}^{0})({\widehat{{\varvec{\Theta }}}}^{or}-{\varvec{\Theta }}^{0})\longrightarrow N({\textbf{0}},{\textbf{I}})\) in distribution. \(\square \)
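
Operationally, the limiting covariance is estimated by plugging in the estimate: \({\bar{{\textbf{F}}}}\) is approximated by the average negative Hessian of the log-likelihood at \({\widehat{{\varvec{\Theta }}}}\), and Wald intervals follow. A one-parameter toy sketch (our own example, using a finite-difference Hessian rather than an analytic one):

```python
import numpy as np

def loglik(theta, y):
    """Toy stand-in for L(Theta): log-likelihood of an N(theta, 1) sample."""
    return -0.5 * np.sum((y - theta) ** 2) - 0.5 * len(y) * np.log(2 * np.pi)

rng = np.random.default_rng(4)
y = rng.normal(2.0, 1.0, 1000)
theta_hat = y.mean()                      # MLE of the location parameter

# average observed information: -(1/N) d^2 L / d theta^2, by central differences
h = 1e-4
d2 = (loglik(theta_hat + h, y) - 2.0 * loglik(theta_hat, y)
      + loglik(theta_hat - h, y)) / h**2
F_bar = -d2 / len(y)                      # estimates bar-F(Theta^0); exactly 1 here

se = 1.0 / np.sqrt(len(y) * F_bar)        # Wald standard error implied by the theorem
print(theta_hat, se, (theta_hat - 1.96 * se, theta_hat + 1.96 * se))
```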

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liang, C., Ma, W. Heterogeneous analysis for clustered data using grouped finite mixture models. Stat Comput 34, 40 (2024). https://doi.org/10.1007/s11222-023-10353-w
