Abstract
It is common to observe substantial heterogeneity in clustered data across scientific fields. Cluster-wise conditional distributions are widely used to explore variation and relationships within and among clusters. This paper aims to capture such heterogeneity by employing cluster-wise finite mixture models. To address heterogeneity among clusters, we introduce a latent group structure and allow the mixing proportions to differ across groups, accommodating the diverse characteristics observed in the data. Both the number of groups and their membership are unknown. To identify the latent group structure, we apply concave penalty functions to the pairwise differences of preliminary consistent estimators of the mixing proportions. This approach automatically divides the clusters into a finite number of subgroups. Theoretical results demonstrate that, as the number of clusters and the cluster sizes tend to infinity, the true latent group structure is recovered with probability approaching one, and the post-classification estimators attain oracle efficiency. We demonstrate the performance and applicability of the proposed approach through extensive simulations and an analysis of basic consumption expenditure among urban households in China.
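The penalized criterion sketched in the abstract, a least-squares fit of the preliminary estimators combined with a concave penalty on all pairwise differences, can be written in a few lines. The snippet below is an illustrative sketch only, not the authors' implementation: the function names are ours, and we take the minimax concave penalty (MCP) as one concrete choice of concave penalty p_tau(., lambda).

```python
import numpy as np

def mcp(t, lam, tau):
    # Minimax concave penalty evaluated at t >= 0:
    # lam*t - t^2/(2*tau) for t <= tau*lam, constant tau*lam^2/2 afterwards.
    t = np.asarray(t, dtype=float)
    return np.where(t <= tau * lam,
                    lam * t - t**2 / (2.0 * tau),
                    0.5 * tau * lam**2)

def fused_objective(pi_tilde, gamma, lam, tau):
    """Penalized criterion: 0.5 * ||pi_tilde - gamma||_F^2 plus the MCP
    applied to every pairwise row difference of gamma (one row per cluster)."""
    m = gamma.shape[0]
    loss = 0.5 * np.sum((pi_tilde - gamma) ** 2)
    pen = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            pen += mcp(np.linalg.norm(gamma[i] - gamma[j]), lam, tau)
    return loss + pen
```

Because the MCP is flat beyond tau*lam, well-separated groups incur a constant penalty, while nearby rows of gamma are shrunk together, which is what drives the automatic grouping.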
Acknowledgements
We are very grateful to the Editor, Associate Editor, and referees for their insightful comments and suggestions, which have significantly improved the manuscript, and to our financial sponsors for their support.
Funding
This work was supported by the National Natural Science Foundation of China [Grant numbers 11690012, 11631003, 12226003].
Author information
Contributions
M.W. (corresponding author) proposed the conceptualization and acquired the funding; L.C. developed the methodology and carried out the analysis. All authors jointly wrote and reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Appendix
Proof of Theorem 1
Define
For \(i=1,\ldots ,m\), let
then we have \({\varvec{\pi }}^{(0)}_{i}={\varvec{\pi }}_{i}({\varvec{\theta }}^{(0)})\). Because the unknown parameters consist of \({\varvec{\theta }}\) and \({\varvec{\pi }}\), the consistency of the preliminary estimators is established in two parts. First, we show that the preliminary estimator \({{\widetilde{{\varvec{\theta }}}}}\) is consistent as the number of clusters m and the minimum cluster size \(n_{1}\) tend to infinity.
(1). Note that
we perform a Taylor expansion of \(Q({\tilde{{\varvec{\theta }}}})\) around \({\varvec{\theta }}^{0}\),
Since \({{\tilde{{\varvec{\theta }}}}}=\mathop {\arg \max }\limits _{{\varvec{\theta }}}Q({\varvec{\theta }})\), we have \(Q({\tilde{{\varvec{\theta }}}})-Q({\varvec{\theta }}^{0})\ge 0\). In addition,
Since \({\textrm{E}}(S_{1})=0\) and \(\textrm{tr}\{\textrm{var}(S_{1})\}=O_{p}(1)\), it follows that \(S_{1}=O_{p}(N^{-1/2})\). For \(S_{2}\), by the Cauchy–Schwarz inequality, we have
Following (A.1), we have
where \({\check{{\varvec{\pi }}}}_{i}({\varvec{\theta }})\) lies between \({{\tilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }})\) and \({\varvec{\pi }}_{i}({\varvec{\theta }})\). For \(i=1,\ldots ,m\), let
and
thus, we can obtain that \({{\tilde{{\varvec{\pi }}}}}_{i}({\varvec{\theta }}^{0})-{\varvec{\pi }}_{i}({\varvec{\theta }}^{0})=-G^{-1}_{2i}({\check{{\varvec{\pi }}}}_{i}({\varvec{\theta }}),{\varvec{\theta }})\)\(G_{1i}({\varvec{\pi }}_{i}({\varvec{\theta }}),{\varvec{\theta }})\). Then,
Following Assumptions (C1)–(C4), we know that
Further, for each i, we have that
Thus, we have
since \({\textrm{E}}\{G_{1i}({\varvec{\pi }}_{i}({\varvec{\theta }}^{(0)}),{\varvec{\theta }}^{(0)})\}={\textbf{0}}\), which yields the last inequality. Then, \(\Vert S_{2}\Vert _{2}=O_{p}(n^{-1/2}_{1})\). When \(m\rightarrow \infty \) and \(n_{1}\rightarrow \infty \), we have
with probability 1, and \(Q({{\tilde{{\varvec{\theta }}}}})-Q({\varvec{\theta }}^{0})\) is dominated by the second term \(({\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0})^{\textrm{T}}(\partial ^{2} Q({\varvec{\theta }})/\partial {\varvec{\theta }}\partial {\varvec{\theta }}^{\textrm{T}})|_{{\varvec{\theta }}={\varvec{\theta }}^{0}}({\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0})\). Then, \(Q({{\tilde{{\varvec{\theta }}}}})-Q({\varvec{\theta }}^{0})\le 0\) with probability 1. Consequently, \(\Vert {\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0}\Vert _{2}=O_{p}(1/\sqrt{n_{1}})\). Next, we prove that the preliminary estimator \({{\tilde{{\varvec{\pi }}}}}_{i}\) is consistent as the sample size m and the cluster size \(n_{1}\) tend to infinity, \(i=1,\ldots ,m\).
(2). Let \({{\tilde{{\varvec{\pi }}}}}_{i}\in \{{\varvec{\pi }}^{0}_{i}+{\varvec{v}}/\sqrt{n_{1}}:\Vert {\varvec{v}}\Vert \le C\}\), where C is a constant. Following
Under Assumptions (C1)–(C4) and \(\Vert {\tilde{{\varvec{\theta }}}}-{\varvec{\theta }}^{0}\Vert _{2}=O_{p}(1/\sqrt{n_{1}})\), we have
where \({\check{{\varvec{\theta }}}}\) lies between \({{\tilde{{\varvec{\theta }}}}}\) and \({\varvec{\theta }}^{0}\). Thus,
Next, we consider \(I_{2}\). Applying the same technique as for \(G_{2i}({\check{{\varvec{\pi }}}}_{i}({\varvec{\theta }}^{0}),{\varvec{\theta }}^{0})\), we can obtain that
Then,
holds for a sufficiently large constant c. Consequently, for a sufficiently large constant \(C=\Vert {\varvec{v}}\Vert _{2}\), \(I_{1}\) is dominated by \(I_{2}\),
Thus, for each \(i=1,\ldots ,m\), we have \(\Vert {{\tilde{{\varvec{\pi }}}}}_{i}-{\varvec{\pi }}^{0}_{i}\Vert _{2}=O_{p}(1/\sqrt{n_{1}})\). \(\square \)
Proof of Theorem 2
Recall that \({\varvec{\gamma }}=({\varvec{\gamma }}^{\textrm{T}}_{1},\dots ,{\varvec{\gamma }}^{\textrm{T}}_{m})^{\textrm{T}}\) is the \(m\times K\) parameter matrix, and let \({\textbf{W}}=({\textbf{W}}^{\textrm{T}}_{1},\dots ,{\textbf{W}}^{\textrm{T}}_{m})^{\textrm{T}}\) be an \(m\times G\) group membership matrix, where \({\textbf{W}}_{i}=(w_{i1},\dots ,w_{iG})^{\textrm{T}}\) has exactly one element equal to 1 and all others equal to 0; if the ith cluster belongs to the gth subgroup, then \(w_{ig}=1\). Define \({\varvec{\gamma }}={\textbf{W}}{\varvec{\alpha }}\), where \({\varvec{\alpha }}\in {\mathbb {R}}^{G\times K}\). When \({\textbf{W}}\) is known,
Obviously, \({\widehat{{\varvec{\alpha }}}}^{or}=({\textbf{W}}^{\textrm{T}}{\textbf{W}})^{-1}{\textbf{W}}^{\textrm{T}}{{\tilde{{\varvec{\pi }}}}}\). Writing \({\varvec{\alpha }}^{0}=({\textbf{W}}^{\textrm{T}}{\textbf{W}})^{-1}{\textbf{W}}^{\textrm{T}}{\textbf{W}}{\varvec{\alpha }}^{0}\), we have
Following Theorem 1, we know that \(\Vert {{\tilde{{\varvec{\pi }}}}}_{i}-{\varvec{\pi }}^{0}_{i}\Vert _{2}=\Vert {{\tilde{{\varvec{\pi }}}}}_{i}-{\textbf{W}}_{i}{\varvec{\alpha }}^{0}\Vert _{2}=O_{p}(n^{-1/2}_{1})\), and
Additionally, we know that \(\Vert ({\textbf{W}}^{\textrm{T}}{\textbf{W}})^{-1}\Vert _{\infty }=|{\mathcal {S}}_{min}|^{-1}\) and \(\Vert {\textbf{W}}^{\textrm{T}}({{\tilde{{\varvec{\pi }}}}}-{\textbf{W}}{\varvec{\alpha }}^{0})\Vert _{\infty }=O_{p}(|{\mathcal {S}}_{max}|\sqrt{K}/\sqrt{n_{1}})\), where K is fixed. Thus, \(\Vert {\widehat{{\varvec{\gamma }}}}^{or}-{\varvec{\gamma }}^{0}\Vert _{\infty }=\Vert {\widehat{{\varvec{\alpha }}}}^{or}-{\varvec{\alpha }}^{0}\Vert _{\infty }\le |{\mathcal {S}}_{max}|/(|{\mathcal {S}}_{min}|\sqrt{n_{1}})\). Under the assumption \(|{\mathcal {S}}_{max}|=O(|{\mathcal {S}}_{min}|)\), we have \(\Vert {\widehat{{\varvec{\gamma }}}}^{or}-{\varvec{\gamma }}^{0}\Vert _{\infty }\le C_{1}/\sqrt{n_{1}}\). \(\square \)
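As a numerical check of the closed form above, the following toy sketch (illustrative values only; `oracle_alpha` is our hypothetical name, not code from the paper) computes \({\widehat{{\varvec{\alpha }}}}^{or}=({\textbf{W}}^{\textrm{T}}{\textbf{W}})^{-1}{\textbf{W}}^{\textrm{T}}{{\tilde{{\varvec{\pi }}}}}\). Since \({\textbf{W}}^{\textrm{T}}{\textbf{W}}\) is diagonal with the group sizes on the diagonal, the oracle estimator is simply the group-wise average of the preliminary estimators.

```python
import numpy as np

def oracle_alpha(W, pi_tilde):
    # Oracle estimator with known membership matrix W:
    # alpha_hat = (W^T W)^{-1} W^T pi_tilde, i.e. group-wise means.
    return np.linalg.solve(W.T @ W, W.T @ pi_tilde)

# toy example: m = 4 clusters, K = 2 mixing proportions, G = 2 groups
W = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
pi_tilde = np.array([[0.3, 0.7], [0.5, 0.5], [0.1, 0.9], [0.3, 0.7]])
alpha_hat = oracle_alpha(W, pi_tilde)
# each row of alpha_hat is the mean of pi_tilde over the corresponding group
```

Averaging \(|{\mathcal {S}}_{g}|\) estimators, each with error \(O_{p}(n_{1}^{-1/2})\), is what delivers the rate stated in Theorem 2.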
Proof of Theorem 3
For any given \(\lambda \), note that
Let \(Q^{{\mathcal {S}}}({\varvec{\alpha }})=L^{{\mathcal {S}}}_{m}({\varvec{\alpha }})+P^{{\mathcal {S}}}_{m}({\varvec{\alpha }})\) be the objective function when the true group structure \({\mathcal {S}}\) is known, where \(L^{{\mathcal {S}}}_{m}({\varvec{\alpha }})=\frac{1}{2}\Vert {{\tilde{{\varvec{\pi }}}}}-{\textbf{W}}{\varvec{\alpha }}\Vert ^{2}_{F}\) and \(P^{{\mathcal {S}}}_{m}({\varvec{\alpha }})=\sum _{g<g^{\prime }}|{\mathcal {S}}_{g}||{\mathcal {S}}_{g^{\prime }}|p_{\tau }(\Vert {\varvec{\alpha }}_{g}-{\varvec{\alpha }}_{g^{\prime }}\Vert _{2},\lambda )\). Let \({\mathcal {T}}: \mathcal {M_{{\mathcal {S}}}}\rightarrow {\mathbb {R}}^{G\times K}\) be the mapping such that \({\mathcal {T}}({\varvec{\gamma }})\) is the \(G\times K\) matrix whose gth row is the mixing probability vector of the gth subgroup. Let \({\mathcal {T}}^{\star }:{\mathbb {R}}^{m\times K}\rightarrow {\mathbb {R}}^{G\times K}\) be the mapping \({\mathcal {T}}^{\star }({\varvec{\gamma }})=((\sum _{i\in {\mathcal {S}}_{1}}{\varvec{\gamma }}_{i}/|{\mathcal {S}}_{1}|)^{\textrm{T}},\dots ,(\sum _{i\in {\mathcal {S}}_{G}}{\varvec{\gamma }}_{i}/|{\mathcal {S}}_{G}|)^{\textrm{T}})^{\textrm{T}}\). Obviously, when \({\varvec{\gamma }}\in {\mathcal {M}}_{{\mathcal {S}}}\), \({\mathcal {T}}({\varvec{\gamma }})={\mathcal {T}}^{\star }({\varvec{\gamma }})\), and furthermore \(P_{m}({\varvec{\gamma }})=P^{{\mathcal {S}}}_{m}({\mathcal {T}}({\varvec{\gamma }}))\). For each \({\varvec{\alpha }}\in {\mathbb {R}}^{G\times K}\), \(P^{{\mathcal {S}}}_{m}({\varvec{\alpha }})=P_{m}({\mathcal {T}}^{-1}({\varvec{\alpha }}))\). Thus, we have
For every \({\varvec{\gamma }}\in {\mathbb {R}}^{m\times K}\), \({\varvec{\gamma }}^{\star }={\mathcal {T}}^{-1}({\mathcal {T}}^{\star }({\varvec{\gamma }}))\). Define
as the neighborhood of \({\varvec{\gamma }}^{0}\). Following Theorem 2, we know that \({\widehat{{\varvec{\gamma }}}}^{or}\in \Gamma \). Next, we show that \({\widehat{{\varvec{\gamma }}}}^{or}\) is a strict local minimizer of the objective function \(Q({\varvec{\gamma }})\) with probability 1.
Firstly, for each \({\varvec{\gamma }}\in \Gamma \), we prove that \(Q({\varvec{\gamma }}^{\star })>Q({\widehat{{\varvec{\gamma }}}}^{or})\) whenever \({\varvec{\gamma }}^{\star }\ne {\widehat{{\varvec{\gamma }}}}^{or}\). Note that \(\Vert {\varvec{\alpha }}_{g}-{\varvec{\alpha }}_{g^{\prime }}\Vert _{2}\ge \Vert {\varvec{\alpha }}^{0}_{g}-{\varvec{\alpha }}^{0}_{g^{\prime }}\Vert _{2}-2\sup \limits _{g}\Vert {\varvec{\alpha }}_{g}-{\varvec{\alpha }}^{0}_{g}\Vert _{2}\), and
Then, for any g and \(g^{\prime }\), we have \(\Vert {\varvec{\alpha }}_{g}-{\varvec{\alpha }}_{g^{\prime }}\Vert _{2}\ge \Vert {\varvec{\alpha }}^{0}_{g}-{\varvec{\alpha }}^{0}_{g^{\prime }}\Vert _{2}-2\sup _{i}\Vert {\varvec{\gamma }}_{i}-{\varvec{\gamma }}^{0}_{i}\Vert _{2}\ge b_{m}-2C_{1}/\sqrt{n_{1}}>a\lambda \), where the last inequality holds by the assumption \(b_{m}>a\lambda \gg C_{1}/\sqrt{n_{1}}\). Thus, \(P^{{\mathcal {S}}}_{m}({\mathcal {T}}^{\star }({\varvec{\gamma }}))=C_{m}\), where \(C_{m}\) is a constant, and hence \(Q^{{\mathcal {S}}}({\mathcal {T}}^{\star }({\varvec{\gamma }}))=L^{{\mathcal {S}}}_{m}({\mathcal {T}}^{\star }({\varvec{\gamma }}))+C_{m}\) for all \({\varvec{\gamma }}\in \Gamma \). In addition, because \({\widehat{{\varvec{\alpha }}}}^{or}\) is the unique global minimizer of \(L^{{\mathcal {S}}}_{m}({\varvec{\alpha }})\), we have \(L^{{\mathcal {S}}}_{m}({\mathcal {T}}^{\star }({\varvec{\gamma }}))>L^{{\mathcal {S}}}_{m}({\widehat{{\varvec{\alpha }}}}^{or})\) for all \({\mathcal {T}}^{\star }({\varvec{\gamma }})\ne {\widehat{{\varvec{\alpha }}}}^{or}\), and hence \(Q^{{\mathcal {S}}}({\mathcal {T}}^{\star }({\varvec{\gamma }}))>Q^{{\mathcal {S}}}({\widehat{{\varvec{\alpha }}}}^{or})\).
By equation (A.2), we know that \(Q^{{\mathcal {S}}}({\widehat{{\varvec{\alpha }}}}^{or})=Q({\widehat{{\varvec{\gamma }}}}^{or})\) and \(Q^{{\mathcal {S}}}({\mathcal {T}}^{\star }({\varvec{\gamma }}))=Q({\varvec{\gamma }}^{\star })\). Thus, for each \({\varvec{\gamma }}^{\star }\ne {\widehat{{\varvec{\gamma }}}}^{or}\), \(Q({\varvec{\gamma }}^{\star })>Q({\widehat{{\varvec{\gamma }}}}^{or})\).
Secondly, define a positive sequence \(t_{m}\) and let \(\Gamma _{m}=\{{\varvec{\gamma }}: \Vert {\varvec{\gamma }}-{\widehat{{\varvec{\gamma }}}}^{or}\Vert _{2}\le t_{m}\}\) be a neighborhood of \({\widehat{{\varvec{\gamma }}}}^{or}\). For any \({\varvec{\gamma }}\in \Gamma _{m}\cap \Gamma \), we perform a Taylor expansion of \(Q({\varvec{\gamma }})\), that is,
where \({\check{{\varvec{\gamma }}}}=\delta {\varvec{\gamma }}+(1-\delta ){\varvec{\gamma }}^{\star }\) for some \(\delta \in (0,1)\), and
When \(i,i^{\prime }\in {\mathcal {S}}_{g}\), we have \({\varvec{\gamma }}^{\star }_{i}={\varvec{\gamma }}^{\star }_{i^{\prime }}\) and \({\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}=\delta ({\varvec{\gamma }}_{i}-{\varvec{\gamma }}_{i^{\prime }})\). Thus, we have
Further, based on the same reasons as (A.3), we have
Then,
Further, \(\rho ^{\prime }(\Vert {\check{{\varvec{\gamma }}}}_{i}-{\check{{\varvec{\gamma }}}}_{i^{\prime }}\Vert _{2})\ge \rho ^{\prime }(4t_{m})\) by the concavity of \(\rho (\cdot )\). Thus, we have
When \(i\in {\mathcal {S}}_{g}\), then, \({\varvec{\gamma }}^{\star }_{i}=|{\mathcal {S}}_{g}|^{-1}\sum _{i\in {\mathcal {S}}_{g}}{\varvec{\gamma }}_{i}\). Following
Thus, for any \(i,i^{\prime }\in {\mathcal {S}}_{g}\), we have \(R_{1}=-\sum _{g=1}^{G}\{\sum _{i<i^{\prime }}({{\tilde{{\varvec{\pi }}}}}_{i}-{\check{{\varvec{\gamma }}}}_{i}-{{\tilde{{\varvec{\pi }}}}}_{i^{\prime }} +{\check{{\varvec{\gamma }}}}_{i^{\prime }})^{\textrm{T}}({\varvec{\gamma }}_{i}-{\varvec{\gamma }}_{i^{\prime }})\}/|{\mathcal {S}}_{g}|\) and \(\sup _{i}\Vert ({{\tilde{{\varvec{\pi }}}}}_{i}-{\check{{\varvec{\gamma }}}}_{i})\Vert _{2}\le \sup _{i}(\Vert {{\tilde{{\varvec{\pi }}}}}_{i}-{\varvec{\pi }}^{0}_{i}\Vert _{2} +\Vert {\varvec{\pi }}^{0}_{i}-{\check{{\varvec{\gamma }}}}_{i}\Vert _{2})\). Since \(\sup \limits _{i}\Vert {\varvec{\pi }}^{0}_{i}-{\check{{\varvec{\gamma }}}}_{i}\Vert =\sup \limits _{i}\Vert {\varvec{\gamma }}^{0}_{i}-{\check{{\varvec{\gamma }}}}_{i}\Vert \le C_{1}/\sqrt{n_{1}}\), we obtain
Consequently, we have
Let \(t_{m}=o(1)\); then \(\rho ^{\prime }(4t_{m})\rightarrow 1\). Since \(\lambda \gg C_{1}n_{1}^{-1/2}\) and \(|{\mathcal {S}}_{min}|^{-1}=o(1)\), we have \(Q({\varvec{\gamma }})-Q({\varvec{\gamma }}^{\star })\ge 0\) for sufficiently large \(m\) and \(n_{1}\). \(\square \)
Proof of Theorem 4
Define
as the oracle estimator of \({\varvec{\Theta }}=({{\textrm{vec}}}({\varvec{\pi }})^{\textrm{T}},{\varvec{\theta }}^{\textrm{T}})^{\textrm{T}}\) when the true group structure is given, and
Following Theorem 3, we know that \(P({\widehat{{\mathcal {S}}}}={\mathcal {S}}^{0})\rightarrow 1\) when both the sample size m and the cluster size \(n_{1}\) tend to infinity. Therefore, it is sufficient to consider the asymptotic distribution of the oracle estimators \({\widehat{{\varvec{\Theta }}}}^{or}\).
Let \({\widehat{{\varvec{\Theta }}}}\in \{{\varvec{\Theta }}^{0}+N^{-1/2}{\varvec{\vartheta }}:\Vert {\varvec{\vartheta }}\Vert _{2}\le M_{\varepsilon }\}\), and
By a Taylor expansion of \(L({\widehat{{\varvec{\Theta }}}}^{or})\) around \({\varvec{\Theta }}^{0}\),
where \({\check{{\varvec{\Theta }}}}\) lies between \({\widehat{{\varvec{\Theta }}}}^{or}\) and \({\varvec{\Theta }}^{0}\). When \(m,n_{1}\rightarrow \infty \), under the assumption conditions, we have the score function \({\textbf{U}}({\varvec{\Theta }}^{0})=O_{p}(\sqrt{N})\) and \(-{\textbf{V}}({\varvec{\Theta }}^{0})/N={\textbf{F}}({\varvec{\Theta }}^{0})/N+o_{P}(1)\), where \({\textbf{F}}({\varvec{\Theta }}^{0})=-{\textrm{E}}\{{\textbf{V}}({\varvec{\Theta }}^{0})\}=N{\bar{{\textbf{F}}}}({\varvec{\Theta }}^{0})\) is the Fisher information matrix and \({\textbf{F}}({\varvec{\Theta }}^{0})/N=O_{p}(1)\). Thus, for sufficiently large \(M_{\varepsilon }\), \(L({\widehat{{\varvec{\Theta }}}}^{or})-L({\widehat{{\varvec{\Theta }}}})\) is dominated by the second term, which equals \(-\frac{1}{2}{\varvec{\vartheta }}^{\textrm{T}}(\frac{1}{N}{\textbf{F}}({\varvec{\Theta }}^{0})+o_{p}(1)){\varvec{\vartheta }}\). Thus, for any given \(\varepsilon >0\), there exists a large \(M_{\varepsilon }\) such that
Thus, we have \(\Vert {\widehat{{\varvec{\Theta }}}}^{or}-{\widehat{{\varvec{\Theta }}}}\Vert _{2}=O_{p}(N^{-1/2})\).
Furthermore, expanding \({\textbf{U}}({\widehat{{\varvec{\Theta }}}}^{or})\) in a Taylor series around \({\varvec{\Theta }}^{0}\) and using \({\textbf{U}}({\widehat{{\varvec{\Theta }}}}^{or})={\textbf{0}}\), we obtain
where \({\check{{\varvec{\Theta }}}}\) lies between \({\widehat{{\varvec{\Theta }}}}^{or}\) and \({\varvec{\Theta }}^{0}\). As \(m,n_{1}\rightarrow \infty \), by the weak law of large numbers, we have
Thus, \(\sqrt{N}{\bar{{\textbf{F}}}}^{1/2}({\varvec{\Theta }}^{0})({\widehat{{\varvec{\Theta }}}}^{or}-{\varvec{\Theta }}^{0})\longrightarrow N({\textbf{0}},{\textbf{I}})\). \(\square \)
About this article
Cite this article
Liang, C., Ma, W. Heterogeneous analysis for clustered data using grouped finite mixture models. Stat Comput 34, 40 (2024). https://doi.org/10.1007/s11222-023-10353-w