CLT for spiked eigenvalues of a sample covariance matrix from high-dimensional Gaussian mean mixtures
Introduction
The spiked population model, originally introduced in [16], assumes that the covariance matrix $\Sigma$ of a population $x \in \mathbb{R}^p$ is a finite-rank perturbation of a base matrix. In the simplest case, where the base matrix is the identity and the perturbation matrix is nonnegative definite, the eigenvalues of $\Sigma$ can be grouped into two separated classes,
$$\alpha_1 \ge \cdots \ge \alpha_M > 1 = \cdots = 1. \tag{1}$$
Here the top $M$ eigenvalues $(\alpha_i)$ are called the spiked eigenvalues of $\Sigma$. These spikes often carry a wealth of information about the dependence among the components of $x$ and can be inferred from their empirical counterparts, namely the eigenvalues of a sample covariance matrix $S_n$. However, in high-dimensional situations where the dimension $p$ is non-negligible with respect to the sample size $n$, it is well known that the eigenvalues of $S_n$ deviate from their population counterparts in a subtle manner; see [3], [19], [22], among others.
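As a small illustration of the two eigenvalue classes described above, the following sketch builds an identity base matrix plus a nonnegative finite-rank perturbation and inspects its spectrum. The function name `spiked_covariance`, the spike values and the random directions are assumptions chosen for illustration only; they do not come from the paper.

```python
import numpy as np

def spiked_covariance(p, spikes, seed=0):
    """Identity base matrix plus a nonnegative rank-len(spikes) perturbation.

    `spikes` are the desired spiked eigenvalues (each assumed > 1).
    """
    rng = np.random.default_rng(seed)
    m = len(spikes)
    # Orthonormal perturbation directions from the QR factor of a random p x m matrix.
    q, _ = np.linalg.qr(rng.standard_normal((p, m)))
    # sum_i (alpha_i - 1) v_i v_i^T: column i of q is scaled by (alpha_i - 1).
    perturbation = (q * (np.asarray(spikes, dtype=float) - 1.0)) @ q.T
    return np.eye(p) + perturbation

sigma = spiked_covariance(p=50, spikes=[6.0, 4.0, 2.5])
eigs = np.sort(np.linalg.eigvalsh(sigma))[::-1]  # descending order
print("top eigenvalues:", eigs[:5])
```

The three largest eigenvalues recover the chosen spikes, while the remaining ones equal one up to rounding error.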
Let us consider the widely used independent components (IC) model [3] for the population $x$, which admits the stochastic representation
$$x = \mu + \Sigma^{1/2} z, \tag{2}$$
where $\mu$ denotes the population mean and $z$ is a vector of independent and standardized random variables. Let $x_1, \ldots, x_n$ be independent and identically distributed (i.i.d.) observations from this population. The sample covariance matrix is
$$S_n = \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})^\top, \qquad \bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j,$$
the eigenvalues of which are denoted by $\lambda_{n,1} \ge \cdots \ge \lambda_{n,p}$, arranged in descending order and referred to as sample eigenvalues. These eigenvalues have been well studied in the so-called Marčenko–Pastur (MP) asymptotic regime [19], where
$$n \to \infty, \qquad p = p_n \to \infty, \qquad c_n = p/n \to c \in (0, \infty). \tag{3}$$
In particular, under the spiked population model (1) with spiked eigenvalues $\alpha_1, \ldots, \alpha_M$ pairwise different, the $i$-th ($1 \le i \le M$) largest eigenvalue of $S_n$ converges to $\alpha_i + c\alpha_i/(\alpha_i - 1)$ with a Gaussian fluctuation if $\alpha_i$ is larger than the critical value $1 + \sqrt{c}$, and otherwise to the right edge point $(1 + \sqrt{c})^2$ of the MP law with a Tracy–Widom fluctuation; see [5], [7], [16], [21]. Some extensions of these results can be found in [6], [14], where the base matrix can be general, with a non-degenerate limiting spectral distribution. Beyond this literature on the sample covariance matrix, the spiked population model has also been adopted in the study of the sample canonical correlation matrix, the Fisher matrix, and the separable covariance matrix; see [8], [13], [15], [23].
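The phase transition described above can be checked numerically. The sketch below assumes an identity base matrix, Gaussian data and the standard first-order limits for this setting: a spike $\alpha > 1 + \sqrt{c}$ gives the limit $\alpha + c\alpha/(\alpha - 1)$, and otherwise the limit is the MP right edge $(1 + \sqrt{c})^2$. The names `spike_limit` and `top_sample_eigenvalue` and the parameter values are hypothetical choices for illustration.

```python
import numpy as np

def spike_limit(alpha, c):
    """First-order limit of the sample eigenvalue paired with a population spike alpha."""
    if alpha > 1.0 + np.sqrt(c):
        return alpha + c * alpha / (alpha - 1.0)   # spike separates from the bulk
    return (1.0 + np.sqrt(c)) ** 2                 # spike sticks to the MP right edge

def top_sample_eigenvalue(n, p, alpha, seed=0):
    """Largest eigenvalue of the sample covariance when Sigma = diag(alpha, 1, ..., 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, p))
    x[:, 0] *= np.sqrt(alpha)                      # first coordinate gets variance alpha
    s = np.cov(x, rowvar=False)                    # centered, normalized by n - 1
    return np.linalg.eigvalsh(s)[-1]               # eigvalsh returns ascending order

n, p = 800, 400                                    # dimension-to-sample-size ratio 0.5
for alpha in (4.0, 1.2):
    print(alpha, top_sample_eigenvalue(n, p, alpha), spike_limit(alpha, p / n))
```

For finite $n$ the simulated value only approximates the limit; the size of the gap is exactly what a second-order (CLT or Tracy–Widom) result quantifies.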
Most recently, [18] considered a mean mixture (MM) model when investigating the problem of high-dimensional clustering. An interesting phenomenon is that a sample covariance matrix from such a mixture inherently carries some spiked eigenvalues. To understand this phenomenon, let us consider an MM of $k$ subpopulations with mean vectors $\mu_1, \ldots, \mu_k$ and a common covariance matrix $\Sigma$, that is,
$$x = \mu_i + \Sigma^{1/2} z \quad \text{for } x \in G_i, \qquad i = 1, \ldots, k, \tag{4}$$
where $G_1, \ldots, G_k$ denote the subpopulations and $z$ is a random vector defined as in the IC model (2). Let $x_1, \ldots, x_n$ be observations from this mixture and denote by $n_1, \ldots, n_k$ the (unknown) sizes of the samples from each of the subpopulations. Then, given these sizes, the expectation of the sample covariance matrix $S_n$ (normalized by $n - 1$) is
$$\mathbb{E}(S_n) = \Sigma + \frac{1}{n-1} \sum_{i=1}^{k} n_i (\mu_i - \bar{\mu})(\mu_i - \bar{\mu})^\top, \qquad \bar{\mu} = \frac{1}{n} \sum_{i=1}^{k} n_i \mu_i.$$
Clearly, this matrix is the sum of the base matrix $\Sigma$ and a finite-rank perturbation matrix, which forms a spiked population model whose spikes are generated by the differences among the subpopulation means. The main results in [18] imply that, in the MP asymptotic regime (3), the spiked eigenvalues of $S_n$ from the MM model converge almost surely to limits of the same form as those already established under the IC model (2) with this expected matrix as the population covariance matrix.
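The finite-rank structure of the expected sample covariance can be verified directly. The sketch below assumes the sample covariance is normalized by $n - 1$ and writes the mean-induced perturbation as $\frac{1}{n-1}\sum_i n_i(\mu_i - \bar{\mu})(\mu_i - \bar{\mu})^\top$ with $\bar{\mu}$ the size-weighted grand mean; the function name, the group sizes and the random means are hypothetical.

```python
import numpy as np

def expected_sample_covariance(means, sizes, sigma):
    """Return (sigma + perturbation, perturbation) for given group means and sizes."""
    means = np.asarray(means, dtype=float)          # shape (k, p)
    weights = np.asarray(sizes, dtype=float)        # shape (k,)
    n = weights.sum()
    mu_bar = weights @ means / n                    # size-weighted grand mean
    d = means - mu_bar                              # centered group means
    perturbation = (d.T * weights) @ d / (n - 1.0)  # sum_i n_i d_i d_i^T / (n - 1)
    return sigma + perturbation, perturbation

rng = np.random.default_rng(0)
means = rng.standard_normal((3, 6))                 # k = 3 groups in dimension p = 6
sizes = np.array([30, 50, 20])
sigma = np.eye(6)
expected, perturbation = expected_sample_covariance(means, sizes, sigma)

# With the noise switched off, every observation equals its group mean, so the
# sample covariance of this "noise-free" sample is exactly the perturbation term.
noise_free = np.repeat(means, sizes, axis=0)
print(np.allclose(np.cov(noise_free, rowvar=False), perturbation))
print("perturbation rank:", np.linalg.matrix_rank(perturbation))
```

Because the weighted centered means sum to zero, the perturbation has rank at most $k - 1$, so a mixture of $k$ groups contributes at most $k - 1$ spikes.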
This paper takes a step further and investigates the second-order limits of these spiked sample eigenvalues. We consider a Gaussian mean mixture (GMM) population, i.e., the MM model (4) with a standard Gaussian vector $z$. This mixture is one of the fundamental probabilistic models in statistical inference and has a wide range of applications, such as pattern recognition, image processing and unsupervised machine learning. Compared with the IC model (2), the main difference of the GMM is that the observations are heterogeneous in conditional mean, given the group information. Hence our main task here is to quantify the effect of such heterogeneity on the fluctuation of spiked sample eigenvalues.
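To make the setting concrete, the following sketch draws a two-group Gaussian mean mixture with identity common covariance and checks that a sample spike separates from the bulk. The sampler `sample_gmm`, the equal group sizes and the mean vectors $\pm 2 e_1$ are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def sample_gmm(means, sizes, seed=0):
    """Draw sum(sizes) observations x = mu_i + z with z ~ N(0, I_p), grouped by label."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    labels = np.repeat(np.arange(len(sizes)), sizes)   # group label of each observation
    noise = rng.standard_normal((len(labels), means.shape[1]))
    return means[labels] + noise, labels

n, p = 1000, 500
means = np.zeros((2, p))
means[0, 0], means[1, 0] = 2.0, -2.0                   # means differ along one coordinate
x, labels = sample_gmm(means, [n // 2, n // 2])

eigs = np.linalg.eigvalsh(np.cov(x, rowvar=False))[::-1]  # descending order
edge = (1.0 + np.sqrt(p / n)) ** 2                     # MP right edge for identity covariance
print("largest two eigenvalues:", eigs[:2], "bulk edge:", edge)
```

Only the first eigenvalue leaves the bulk, in line with the rank-one mean perturbation in this two-group example; how that eigenvalue fluctuates around its limit is the question the paper addresses.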
The main contribution of this paper is a new central limit theorem (CLT) for sample spikes under the GMM model. Our results show that these sample eigenvalues asymptotically behave as the eigenvalues of a certain Gaussian random matrix, whose distribution is jointly determined by the mixing proportions, the mean differences and the common covariance matrix $\Sigma$ of the subpopulations. In particular, the inner products between the means and the eigenvectors of $\Sigma$ play important roles in the fluctuation of the spikes. As these determining factors are not functions of , our CLT is essentially different from that under the IC model, as illustrated in [5], [6].
The rest of the paper is organized as follows. Section 2 details our model and assumptions. Section 3 presents our new CLT for spiked sample eigenvalues. Technical proofs are presented in Section 4.
Section snippets
Preliminary
In this section, we briefly review some definitions from random matrix theory that will be used frequently in the remainder of the paper. We denote by the constant $c$ the limit of the dimension-to-sample-size ratios $c_n = p/n$.
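For a $p \times p$ real symmetric matrix $A$ with eigenvalues $\lambda_1, \ldots, \lambda_p$, its empirical spectral distribution (ESD) is the probability measure $F^{A} = \frac{1}{p} \sum_{i=1}^{p} \delta_{\lambda_i}$, where $\delta_a$ denotes the Dirac measure at $a$. If, as $p \to \infty$, the ESD sequence has a limit $F$, it is called the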
First-order convergence of the eigenvalues of the sample covariance matrix
In this section, we review some results on the convergence of bulk eigenvalues and spiked eigenvalues of the sample covariance matrix from the GMM model. These results will be used when investigating the fluctuations of spiked sample eigenvalues.
With the decomposition of the sample covariance matrix in (8), it follows from standard random matrix theory for Wishart matrices [4], [19], [22] and the rank inequality (Theorem A.44 in [4]) that, almost surely, its ESD converges weakly to a generalized Marčenko–Pastur (MP) law, as
Outline of the proof of Theorem 1
We list here some notation that will be used throughout this section. Let be the solution to the equation Denote We denote by some constant which may take different values at different appearances. Orders “” and “” of vectors are in terms of
CRediT authorship contribution statement
Weiming Li: Conceptualization, Methodology, Writing – review & editing. Junpeng Zhu: Methodology, Software, Writing – original draft.
Acknowledgments
The authors would like to thank the Editors and one anonymous reviewer for their thoughtful comments and suggestions. Weiming Li’s research is partially supported by NSFC, China (Nos. 11971293 and 12141107).
References (23)
- et al. On sample eigenvalues in a generalized spiked population model. J. Multivariate Anal. (2012)
- et al. Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal. (2006)
- et al. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Adv. Math. (2011)
- et al. The limits of the sample spiked eigenvalues for a high-dimensional generalized Fisher matrix and its applications. J. Statist. Plann. Inference (2021)
- Strong convergence of the empirical distribution of eigenvalues of large-dimensional random matrices. J. Multivariate Anal. (1995)
- et al. On asymptotics of eigenvectors of large sample covariance matrix. Ann. Probab. (2007)
- et al. No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices. Ann. Probab. (1998)
- et al. CLT for linear spectral statistics of large-dimensional sample covariance matrices. Ann. Probab. (2004)
- et al. Central limit theorems for eigenvalues in a spiked population model. Ann. Inst. Henri Poincaré Probab. Stat. (2008)
- Canonical correlation coefficients of high-dimensional Gaussian vectors: Finite rank case. Ann. Statist.