Abstract
It is common to observe substantial heterogeneity in clustered data across scientific fields. Cluster-wise conditional distributions are widely used to explore variation and relationships within and among clusters. This paper aims to capture such heterogeneity by employing cluster-wise finite mixture models. To address heterogeneity among clusters, we introduce a latent group structure and allow the mixing proportions to differ across groups, accommodating the diverse characteristics observed in the data. Both the number of groups and their membership are unknown. To identify the latent group structure, we apply concave penalty functions to the pairwise differences of preliminary consistent estimators of the mixing proportions. This approach automatically divides the clusters into a finite number of subgroups. Theoretical results demonstrate that, as the number of clusters and the cluster sizes tend to infinity, the true latent group structure is recovered with probability approaching one, and the post-classification estimators attain oracle efficiency. We demonstrate the performance and applicability of the proposed approach through extensive simulations and an analysis of basic consumption expenditure among urban households in China.
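The penalized criterion sketched in the abstract, a least-squares fit of the preliminary estimators combined with a concave penalty on all pairwise differences, can be written in a few lines. The snippet below is an illustrative sketch only, not the authors' implementation: the function names are ours, and we take the minimax concave penalty (MCP) as one concrete choice of concave penalty p_tau(., lambda).

```python
import numpy as np

def mcp(t, lam, tau):
    # Minimax concave penalty evaluated at t >= 0:
    # lam*t - t^2/(2*tau) for t <= tau*lam, constant tau*lam^2/2 afterwards.
    t = np.asarray(t, dtype=float)
    return np.where(t <= tau * lam,
                    lam * t - t**2 / (2.0 * tau),
                    0.5 * tau * lam**2)

def fused_objective(pi_tilde, gamma, lam, tau):
    """Penalized criterion: 0.5 * ||pi_tilde - gamma||_F^2 plus the MCP
    applied to every pairwise row difference of gamma (one row per cluster)."""
    m = gamma.shape[0]
    loss = 0.5 * np.sum((pi_tilde - gamma) ** 2)
    pen = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            pen += mcp(np.linalg.norm(gamma[i] - gamma[j]), lam, tau)
    return loss + pen
```

Because the MCP is flat beyond tau*lam, well-separated groups incur a constant penalty, while nearby rows of gamma are shrunk together, which is what drives the automatic grouping.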
Acknowledgements
We are very grateful to the Editor, Associate Editor, and referees for their insightful comments and suggestions, which have significantly improved the manuscript, and to our financial sponsors for their support.
Funding
This work was supported by the National Natural Science Foundation of China [Grant numbers 11690012, 11631003, 12226003].
Author information
Contributions
M.W. (corresponding author) proposed the conceptualization and acquired the funding; L.C. developed the methodology and carried out the analysis. All authors jointly wrote and reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Appendix
Proof of Theorem 1
Define
For \(i=1,\ldots ,m\), let
then we have \({\varvec{\pi }}^{(0)}_{i}={\varvec{\pi }}_{i}({\varvec{\theta }}^{(0)})\). Because the unknown parameters consist of \({\varvec{\theta }}\) and \({\varvec{\pi }}\), the consistency of the preliminary estimators is established in two parts. First, we show that the preliminary estimator \({{\widetilde{{\varvec{\theta }}}}}\) is consistent as the number of clusters m and the minimum cluster size \(n_{1}\) tend to infinity.
(1). Note that
we perform a Taylor expansion of \(Q({\tilde{{\varvec{\theta }}}})\) around \({\varvec{\theta }}^{0}\),
Since \({{\tilde{{\varvec{\theta }}}}}=\mathop {\arg \max }\limits _{{\varvec{\theta }}}Q({\varvec{\theta }})\), we have \(Q({\tilde{{\varvec{\theta }}}})-Q({\varvec{\theta }}^{0})\ge 0\). In addition,
Since \({\textrm{E}}(S_{1})=0\) and \(\textrm{tr}\{\textrm{var}(S_{1})\}=O_{p}(1)\), it follows that \(S_{1}=O_{p}(N^{-1/2})\). For \(S_{2}\), by the Cauchy–Schwarz inequality, we have
Following (A.1), we have
where \({\check{{\varvec{\pi }}}}_{i}({\varvec{\theta }})\) lies between \({{\tilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }})\) and \({\varvec{\pi }}_{i}({\varvec{\theta }})\). For \(i=1,\ldots ,m\), let
and
thus, we can obtain that \({{\tilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }}^{0})-{\varvec{\pi }}_{i}({\varvec{\theta }}^{0})=-G^{-1}_{2i}({\check{{\varvec{\pi }}}}_{i}({\varvec{\theta }}),{\varvec{\theta }})\)\(G_{1i}({\varvec{\pi }}_{i}({\varvec{\theta }}),{\varvec{\theta }})\). Then,
Following Assumptions (C1)–(C4), we know that
Further, for each i, we have that
Thus, we have
since \({\textrm{E}}\{G_{1i}({\varvec{\pi }}_{i}({\varvec{\theta }}^{(0)}),{\varvec{\theta }}^{(0)})\}={\textbf{0}}\), which yields the last inequality. Then, \(\Vert S_{2}\Vert _{2}=O_{p}(n^{-1/2}_{1})\). When \(m\rightarrow \infty \) and \(n_{1}\rightarrow \infty \), we have
with probability 1, and \(Q({{\tilde{{\varvec{\theta }}}}})-Q({\varvec{\theta }}^{0})\) is dominated by the second term \(({\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0})^{\textrm{T}}(\partial ^{2} Q({\varvec{\theta }})/\partial {\varvec{\theta }}\partial {\varvec{\theta }}^{\textrm{T}})|_{{\varvec{\theta }}={\varvec{\theta }}^{0}}({\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0})\). Then, \(Q({{\tilde{{\varvec{\theta }}}}})-Q({\varvec{\theta }}^{0})\le 0\) with probability 1. Consequently, \(\Vert {\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0}\Vert _{2}=O_{p}(1/\sqrt{n_{1}})\). Next, we prove that the preliminary estimator \({{\tilde{{\varvec{\pi }}}}}_{i}\) is consistent as the sample size m and the cluster size \(n_{1}\) tend to infinity, \(i=1,\ldots ,m\).
(2). Let \({{\tilde{{\varvec{\pi }}}}}_{i}\in \{{\varvec{\pi }}^{0}_{i}+{\varvec{v}}/\sqrt{n_{1}}:\Vert {\varvec{v}}\Vert \le C\}\), where C is a constant. Following
Under Assumptions (C1)–(C4) and \(\Vert {\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0}\Vert _{2}=O_{p}(1/\sqrt{n_{1}})\), we have
where \({\check{{\varvec{\theta }}}}\) lies between \({{\tilde{{\varvec{\theta }}}}}\) and \({\varvec{\theta }}^{0}\). Thus,
Next, we consider \(I_{2}\). Applying the same technique as for \(G_{2i}({\check{{\varvec{\pi }}}}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})\), we can obtain that
Then,
holds for a sufficiently large constant c. Consequently, for a sufficiently large constant \(C=\Vert {\varvec{v}}\Vert _{2}\), \(I_{1}\) is dominated by \(I_{2}\),
Thus, for each \(i=1,\ldots ,m\), we have \(\Vert {{\tilde{{\varvec{\pi }}}}}_{i}-{\varvec{\pi }}^{0}_{i}\Vert _{2}=O_{p}(1/\sqrt{n_{1}})\). \(\square \)
Proof of Theorem 2
Recall that \({\varvec{\gamma }}=({\varvec{\gamma }}^{\textrm{T}}_{1},\dots ,{\varvec{\gamma }}^{\textrm{T}}_{m})^{\textrm{T}}\) is the \(m\times K\) parameter matrix, and let \({\textbf{W}}=({\textbf{W}}^{\textrm{T}}_{1},\dots ,{\textbf{W}}^{\textrm{T}}_{m})^{\textrm{T}}\) be an \(m\times G\) group membership matrix, where \({\textbf{W}}_{i}=(w_{i1},\dots ,w_{iG})^{\textrm{T}}\) has exactly one element equal to 1 and all others equal to 0; if the ith cluster belongs to the gth subgroup, then \(w_{ig}=1\). Define \({\varvec{\gamma }}={\textbf{W}}{\varvec{\alpha }}\), where \({\varvec{\alpha }}\in {\mathbb {R}}^{G\times K}\). When \({\textbf{W}}\) is known,
Obviously, \({\widehat{{\varvec{\alpha }}}}^{or}=({\textbf{W}}^{\textrm{T}}{\textbf{W}})^{-1}{\textbf{W}}^{\textrm{T}}{{\tilde{{\varvec{\pi }}}}}\). Writing \({\varvec{\alpha }}^{0}=({\textbf{W}}^{\textrm{T}}{\textbf{W}})^{-1}{\textbf{W}}^{\textrm{T}}{\textbf{W}}{\varvec{\alpha }}^{0}\), we have
Following Theorem 1, we know that \(\Vert {{\tilde{{\varvec{\pi }}}}}_{i}-{\varvec{\pi }}^{0}_{i}\Vert _{2}=\Vert {{\tilde{{\varvec{\pi }}}}}_{i}-{\textbf{W}}_{i}{\varvec{\alpha }}^{0}\Vert _{2}=O_{p}(n^{-1/2}_{1})\), and
Additionally, we know that \(\Vert ({\textbf{W}}^{\textrm{T}}{\textbf{W}})^{-1}\Vert _{\infty }=|{\mathcal {S}}_{min}|^{-1}\) and \(\Vert {\textbf{W}}^{\textrm{T}}({{\tilde{{\varvec{\pi }}}}}-{\textbf{W}}{\varvec{\alpha }}^{0})\Vert _{\infty }=O_{p}(|{\mathcal {S}}_{max}|\sqrt{K}/\sqrt{n_{1}})\), where K is fixed. Thus, \(\Vert {\widehat{{\varvec{\gamma }}}}^{or}-{\varvec{\gamma }}^{0}\Vert _{\infty }=\Vert {\widehat{{\varvec{\alpha }}}}^{or}-{\varvec{\alpha }}^{0}\Vert _{\infty }\le |{\mathcal {S}}_{max}|/(|{\mathcal {S}}_{min}|\sqrt{n_{1}})\). Under the assumption \(|{\mathcal {S}}_{max}|=O(|{\mathcal {S}}_{min}|)\), we have \(\Vert {\widehat{{\varvec{\gamma }}}}^{or}-{\varvec{\gamma }}^{0}\Vert _{\infty }\le C_{1}/\sqrt{n_{1}}\). \(\square \)
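As a numerical check of the closed form above, the following toy sketch (illustrative values only; `oracle_alpha` is our hypothetical name, not code from the paper) computes \({\widehat{{\varvec{\alpha }}}}^{or}=({\textbf{W}}^{\textrm{T}}{\textbf{W}})^{-1}{\textbf{W}}^{\textrm{T}}{{\tilde{{\varvec{\pi }}}}}\). Since \({\textbf{W}}^{\textrm{T}}{\textbf{W}}\) is diagonal with the group sizes on the diagonal, the oracle estimator is simply the group-wise average of the preliminary estimators.

```python
import numpy as np

def oracle_alpha(W, pi_tilde):
    # Oracle estimator with known membership matrix W:
    # alpha_hat = (W^T W)^{-1} W^T pi_tilde, i.e. group-wise means.
    return np.linalg.solve(W.T @ W, W.T @ pi_tilde)

# toy example: m = 4 clusters, K = 2 mixing proportions, G = 2 groups
W = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
pi_tilde = np.array([[0.3, 0.7], [0.5, 0.5], [0.1, 0.9], [0.3, 0.7]])
alpha_hat = oracle_alpha(W, pi_tilde)
# each row of alpha_hat is the mean of pi_tilde over the corresponding group
```

Averaging \(|{\mathcal {S}}_{g}|\) estimators, each with error \(O_{p}(n_{1}^{-1/2})\), is what delivers the rate stated in Theorem 2.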
Proof of Theorem 3
For any given \(\lambda \), note that
Let \(Q^{{\mathcal {S}}}({\varvec{\alpha }})=L^{{\mathcal {S}}}_{m}({\varvec{\alpha }})+P^{{\mathcal {S}}}_{m}({\varvec{\alpha }})\) be the objective function when the true group structure \({\mathcal {S}}\) is known, where \(L^{{\mathcal {S}}}_{m}({\varvec{\alpha }})=\frac{1}{2}\Vert {{\tilde{{\varvec{\pi }}}}}-{\textbf{W}}{\varvec{\alpha }}\Vert ^{2}_{F}\) and \(P^{{\mathcal {S}}}_{m}({\varvec{\alpha }})=\sum _{g<g^{\prime }}|{\mathcal {S}}_{g}||{\mathcal {S}}_{g^{\prime }}|p_{\tau }(\Vert {\varvec{\alpha }}_{g}-{\varvec{\alpha }}_{g^{\prime }}\Vert _{2},\lambda )\). Let \({\mathcal {T}}: \mathcal {M_{{\mathcal {S}}}}\rightarrow {\mathbb {R}}^{G\times K}\) be the mapping such that \({\mathcal {T}}({\varvec{\gamma }})\) is the \(G\times K\) matrix whose gth row is the mixing probability vector of the gth subgroup. Let \({\mathcal {T}}^{\star }:{\mathbb {R}}^{m\times K}\rightarrow {\mathbb {R}}^{G\times K}\) be the mapping \({\mathcal {T}}^{\star }({\varvec{\gamma }})=((\sum _{i\in {\mathcal {S}}_{1}}{\varvec{\gamma }}_{i}/|{\mathcal {S}}_{1}|)^{\textrm{T}},\dots ,(\sum _{i\in {\mathcal {S}}_{G}}{\varvec{\gamma }}_{i}/|{\mathcal {S}}_{G}|)^{\textrm{T}})^{\textrm{T}}\). Obviously, when \({\varvec{\gamma }}\in {\mathcal {M}}_{{\mathcal {S}}}\), \({\mathcal {T}}({\varvec{\gamma }})={\mathcal {T}}^{\star }({\varvec{\gamma }})\), and furthermore \(P_{m}({\varvec{\gamma }})=P^{{\mathcal {S}}}_{m}({\mathcal {T}}({\varvec{\gamma }}))\). For each \({\varvec{\alpha }}\in {\mathbb {R}}^{G\times K}\), \(P^{{\mathcal {S}}}_{m}({\varvec{\alpha }})=P_{m}({\mathcal {T}}^{-1}({\varvec{\alpha }}))\). Thus, we have
For every \({\varvec{\gamma }}\in {\mathbb {R}}^{m\times K}\), \({\varvec{\gamma }}^{\star }={\mathcal {T}}^{-1}({\mathcal {T}}^{\star }({\varvec{\gamma }}))\). Define
as the neighborhood of \({\varvec{\gamma }}^{0}\). Following Theorem 2, we know that \({\widehat{{\varvec{\gamma }}}}^{or}\in \Gamma \). Next, we show that \({\widehat{{\varvec{\gamma }}}}^{or}\) is a strict local minimizer of the objective function \(Q({\varvec{\gamma }})\) with probability 1.
Firstly, for each \({\varvec{\gamma }}\in \Gamma \), we prove that \(Q({\varvec{\gamma }}^{\star })>Q({\widehat{{\varvec{\gamma }}}}^{or})\) whenever \({\varvec{\gamma }}^{\star }\ne {\widehat{{\varvec{\gamma }}}}^{or}\). Note that \(\Vert {\varvec{\alpha }}_{g}-{\varvec{\alpha }}_{g^{\prime }}\Vert _{2}\ge \Vert {\varvec{\alpha }}^{0}_{g}-{\varvec{\alpha }}^{0}_{g^{\prime }}\Vert _{2}-2\sup \limits _{g}\Vert {\varvec{\alpha }}_{g}-{\varvec{\alpha }}^{0}_{g}\Vert _{2}\), and
Then, for any g and \(g^{\prime }\), we have \(\Vert {\varvec{\alpha }}_{g}-{\varvec{\alpha }}_{g^{\prime }}\Vert _{2}\ge \Vert {\varvec{\alpha }}^{0}_{g}-{\varvec{\alpha }}^{0}_{g^{\prime }}\Vert _{2}-2\sup _{i}\Vert {\varvec{\gamma }}_{i}-{\varvec{\gamma }}^{0}_{i}\Vert _{2}\ge b_{m}-2C_{1}/\sqrt{n_{1}}>a\lambda \), where the last inequality holds by the assumption \(b_{m}>a\lambda \gg C_{1}/\sqrt{n_{1}}\). Thus, \(P^{{\mathcal {S}}}_{m}({\mathcal {T}}^{\star }({\varvec{\gamma }}))=C_{m}\), where \(C_{m}\) is a constant, and hence \(Q^{{\mathcal {S}}}({\mathcal {T}}^{\star }({\varvec{\gamma }}))=L^{{\mathcal {S}}}_{m}({\mathcal {T}}^{\star }({\varvec{\gamma }}))+C_{m}\) for all \({\varvec{\gamma }}\in \Gamma \). In addition, because \({\widehat{{\varvec{\alpha }}}}^{or}\) is the unique global minimizer of \(L^{{\mathcal {S}}}_{m}({\varvec{\alpha }})\), we have \(L^{{\mathcal {S}}}_{m}({\mathcal {T}}^{\star }({\varvec{\gamma }}))>L^{{\mathcal {S}}}_{m}({\widehat{{\varvec{\alpha }}}}^{or})\) for all \({\mathcal {T}}^{\star }({\varvec{\gamma }})\ne {\widehat{{\varvec{\alpha }}}}^{or}\), and hence \(Q^{{\mathcal {S}}}({\mathcal {T}}^{\star }({\varvec{\gamma }}))>Q^{{\mathcal {S}}}({\widehat{{\varvec{\alpha }}}}^{or})\).
By equation (A.2), we know that \(Q^{{\mathcal {S}}}({\widehat{{\varvec{\alpha }}}}^{or})=Q({\widehat{{\varvec{\gamma }}}}^{or})\) and \(Q^{{\mathcal {S}}}({\mathcal {T}}^{\star }({\varvec{\gamma }}))=Q({\varvec{\gamma }}^{\star })\). Thus, for each \({\varvec{\gamma }}^{\star }\ne {\widehat{{\varvec{\gamma }}}}^{or}\), \(Q({\varvec{\gamma }}^{\star })>Q({\widehat{{\varvec{\gamma }}}}^{or})\).
Secondly, define a positive sequence \(t_{m}\) and let \(\Gamma _{m}=\{{\varvec{\gamma }}: \Vert {\varvec{\gamma }}-{\widehat{{\varvec{\gamma }}}}^{or}\Vert _{2}\le t_{m}\}\) be a neighborhood of \({\widehat{{\varvec{\gamma }}}}^{or}\). For any \({\varvec{\gamma }}\in \Gamma _{m}\cap \Gamma \), we perform a Taylor expansion of \(Q({\varvec{\gamma }})\), that is,
where \({\check{{\varvec{\gamma }}}}=\delta {\varvec{\gamma }}+(1-\delta ){\varvec{\gamma }}^{\star }\) for some \(\delta \in (0,1)\), and
When \(i,i^{\prime }\in {\mathcal {S}}_{g}\), we have \({\varvec{\gamma }}^{\star }_{i}={\varvec{\gamma }}^{\star }_{i^{\prime }}\) and \({\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}=\delta ({\varvec{\gamma }}_{i}-{\varvec{\gamma }}_{i^{\prime }})\). Thus, we have
Further, based on the same reasons as (A.3), we have
Then,
Further, \(\rho ^{\prime }(\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2})\ge \rho ^{\prime }(4t_{m})\) by the concavity of \(\rho (\cdot )\). Thus, we have
When \(i\in {\mathcal {S}}_{g}\), then, \({\varvec{\gamma }}^{\star }_{i}=|{\mathcal {S}}_{g}|^{-1}\sum _{i\in {\mathcal {S}}_{g}}{\varvec{\gamma }}_{i}\). Following
Thus, for any \(i,i^{\prime }\in {\mathcal {S}}_{g}\), we have \(R_{1}=-\sum _{g=1}^{G}\{\sum _{i<i^{\prime }}({{\tilde{{\varvec{\pi }}}}}_{i}-{\check{{\varvec{\gamma }}}}_{i}-{{\tilde{{\varvec{\pi }}}}}_{i^{\prime }} +{\check{{\varvec{\gamma }}}}_{i^{\prime }})^{\textrm{T}}({\varvec{\gamma }}_{i}-{\varvec{\gamma }}_{i^{\prime }})\}/|{\mathcal {S}}_{g}|\) and \(\sup _{i}\Vert ({{\tilde{{\varvec{\pi }}}}}_{i}-{\check{{\varvec{\gamma }}}}_{i})\Vert _{2}\le \sup _{i}(\Vert {{\tilde{{\varvec{\pi }}}}}_{i}-{\varvec{\pi }}^{0}_{i}\Vert _{2} +\Vert {\varvec{\pi }}^{0}_{i}-{\check{{\varvec{\gamma }}}}_{i}\Vert _{2})\). Since \(\sup \limits _{i}\Vert {\varvec{\pi }}^{0}_{i}-{\check{{\varvec{\gamma }}}}_{i}\Vert =\sup \limits _{i}\Vert {\varvec{\gamma }}^{0}_{i}-{\check{{\varvec{\gamma }}}}_{i}\Vert \le C_{1}/\sqrt{n_{1}}\), we obtain
Consequently, we have
Let \(t_{m}=o(1)\); then \(\rho ^{\prime }(4t_{m})\rightarrow 1\). Since \(\lambda \gg C_{1}n_{1}^{-1/2}\) and \(|{\mathcal {S}}_{min}|^{-1}=o(1)\), we have \(Q({\varvec{\gamma }})-Q({\varvec{\gamma }}^{\star })\ge 0\) for sufficiently large \(m\) and \(n_{1}\). \(\square \)
Proof of Theorem 4
Define
as the oracle estimator of \({\varvec{\Theta }}=({{\textrm{vec}}}({\varvec{\pi }})^{\textrm{T}},{\varvec{\theta }}^{\textrm{T}})^{\textrm{T}}\) when the true group structure is given, and
Following Theorem 3, we know that \(P({\widehat{{\mathcal {S}}}}={\mathcal {S}}^{0})\rightarrow 1\) when both the sample size m and the cluster size \(n_{1}\) tend to infinity. Therefore, it is sufficient to consider the asymptotic distribution of the oracle estimators \({\widehat{{\varvec{\Theta }}}}^{or}\).
Let \({\widehat{{\varvec{\Theta }}}}\in \{{\varvec{\Theta }}^{0}+N^{-1/2}{\varvec{\vartheta }}:\Vert {\varvec{\vartheta }}\Vert _{2}\le M_{\varepsilon }\}\), and
By a Taylor expansion of \(L({\widehat{{\varvec{\Theta }}}}^{or})\) around \({\varvec{\Theta }}^{0}\),
where \({\check{{\varvec{\Theta }}}}\) lies between \({\widehat{{\varvec{\Theta }}}}^{or}\) and \({\varvec{\Theta }}^{0}\). When \(m,n_{1}\rightarrow \infty \), under the assumption conditions, we have the score function \({\textbf{U}}({\varvec{\Theta }}^{0})=O_{p}(\sqrt{N})\) and \(-{\textbf{V}}({\varvec{\Theta }}^{0})/N={\textbf{F}}({\varvec{\Theta }}^{0})/N+o_{P}(1)\), where \({\textbf{F}}({\varvec{\Theta }}^{0})=-{\textrm{E}}\{{\textbf{V}}({\varvec{\Theta }}^{0})\}=N{\bar{{\textbf{F}}}}({\varvec{\Theta }}^{0})\) is the Fisher information matrix and \({\textbf{F}}({\varvec{\Theta }}^{0})/N=O_{p}(1)\). Thus, for sufficiently large \(M_{\varepsilon }\), \(L({\widehat{{\varvec{\Theta }}}}^{or})-L({\widehat{{\varvec{\Theta }}}})\) is dominated by the second term, which equals \(-\frac{1}{2}{\varvec{\vartheta }}^{\textrm{T}}(\frac{1}{N}{\textbf{F}}({\varvec{\Theta }}^{0})+o_{p}(1)){\varvec{\vartheta }}\). Thus, for any given \(\varepsilon >0\), there exists a large \(M_{\varepsilon }\) such that
Thus, we have \(\Vert {\widehat{{\varvec{\Theta }}}}^{or}-{\widehat{{\varvec{\Theta }}}}\Vert _{2}=O_{p}(N^{-1/2})\).
Furthermore, expanding \({\textbf{U}}({\widehat{{\varvec{\Theta }}}}^{or})\) in a Taylor series around \({\varvec{\Theta }}^{0}\) and using \({\textbf{U}}({\widehat{{\varvec{\Theta }}}}^{or})={\textbf{0}}\), we obtain
where \({\check{{\varvec{\Theta }}}}\) lies between \({\widehat{{\varvec{\Theta }}}}^{or}\) and \({\varvec{\Theta }}^{0}\). As \(m,n_{1}\rightarrow \infty \), by the weak law of large numbers, we have
Thus, \(\sqrt{N}{\bar{{\textbf{F}}}}^{1/2}({\varvec{\Theta }}^{0})({\widehat{{\varvec{\Theta }}}}^{or}-{\varvec{\Theta }}^{0})\longrightarrow N({\textbf{0}},{\textbf{I}})\). \(\square \)
About this article
Cite this article
Liang, C., Ma, W. Heterogeneous analysis for clustered data using grouped finite mixture models. Stat Comput 34, 40 (2024). https://doi.org/10.1007/s11222-023-10353-w