CLT for spiked eigenvalues of a sample covariance matrix from high-dimensional Gaussian mean mixtures

https://doi.org/10.1016/j.jmva.2022.105127

Abstract

Sample covariance matrices from a finite mean mixture model naturally carry certain spiked eigenvalues, which are generated by the differences among the mean vectors. However, their asymptotic behavior remains largely unknown when the population dimension $p$ grows proportionally to the sample size $n$. In this paper, a new CLT is established for the spiked eigenvalues by considering a Gaussian mean mixture in such a high-dimensional asymptotic framework. It shows that the convergence rate of these eigenvalues is $O(1/\sqrt{n})$ and that their fluctuations can be characterized by the mixing proportions, the eigenvalues of the common covariance matrix, and the inner products between the mean vectors and the eigenvectors of the covariance matrix.

Introduction

The spiked population model, originally introduced in [16], assumes that the covariance matrix $\Sigma$ of a population $x \in \mathbb{R}^p$ is a finite-rank perturbation of a base matrix. For the simplest case where the base matrix is the identity and the perturbation matrix is nonnegative, the eigenvalues of $\Sigma$ can be grouped into two separated classes
$$\operatorname{Spec}(\Sigma) = (\underbrace{\alpha_1, \ldots, \alpha_M}_{M}, \underbrace{1, \ldots, 1}_{p-M}). \tag{1}$$
Here the top $M$ eigenvalues $\alpha_1 \geq \cdots \geq \alpha_M > 1$ are called the spiked eigenvalues of $\Sigma$. These spikes often carry a wealth of information about the dependence among the components of $x$ and can be inferred from their empirical counterparts, namely the eigenvalues of a sample covariance matrix $S_n$. However, in high-dimensional situations where the dimension $p$ is non-negligible with respect to the sample size $n$, it is well known that the eigenvalues of $S_n$ deviate from their population counterparts in a subtle manner. See [3], [19], [22], etc.

Let us consider the widely used independent components (IC) model [3] for the population $x$, admitting the stochastic representation
$$x = \mu + \Sigma^{1/2} z, \tag{2}$$
where $\mu \in \mathbb{R}^p$ denotes the population mean and $z \in \mathbb{R}^p$ is a vector of independent and standardized random variables. Let $x_1, \ldots, x_n$ be $n$ independent and identically distributed (i.i.d.) observations from this population. The sample covariance matrix is
$$S_n = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,$$
the eigenvalues of which are denoted by $\{\lambda_\ell\}$, arranged in descending order and referred to as sample eigenvalues. These eigenvalues have been well studied in the so-called Marčenko–Pastur (MP) asymptotic regime [19], where
$$n \to \infty, \quad p = p_n \to \infty, \quad c_n \triangleq p/n \to c \in (0, \infty). \tag{3}$$
In particular, under the spiked population model (1) with the $(\alpha_\ell)$ pairwise distinct, the $\ell$-th ($1 \leq \ell \leq M$) largest eigenvalue $\lambda_\ell$ of $S_n$ converges to $\alpha_\ell + c\alpha_\ell/(\alpha_\ell - 1)$ with a Gaussian fluctuation if $\alpha_\ell$ is larger than the critical value $1 + \sqrt{c}$; otherwise it converges to $(1 + \sqrt{c})^2$, the right edge point of the MP law, with a Tracy–Widom fluctuation; see [5], [7], [16], [21]. Some extensions of these results can be found in [6], [14], where the base matrix can be general with a non-degenerate limiting spectral distribution. In addition to this literature on the sample covariance matrix, the spiked population model has also been adopted in the study of the sample canonical correlation matrix, the Fisher matrix, and the separable covariance matrix. See [8], [13], [15], [23].
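The first-order phase transition recalled above can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it assumes Gaussian data with $\mu = 0$, a single spike, and an identity base matrix, and the helper names `top_sample_eigenvalue` and `spike_limit` are hypothetical. It compares the largest eigenvalue of $S_n$ with the limit $\alpha + c\alpha/(\alpha - 1)$ when $\alpha > 1 + \sqrt{c}$, and with the MP right edge $(1 + \sqrt{c})^2$ otherwise.

```python
import numpy as np


def top_sample_eigenvalue(alpha, p, n, rng):
    """Largest eigenvalue of S_n when Sigma = diag(alpha, 1, ..., 1).

    Assumption for illustration: Gaussian entries and mu = 0.
    """
    x = rng.standard_normal((n, p))
    x[:, 0] *= np.sqrt(alpha)           # x = Sigma^{1/2} z with one spike
    xc = x - x.mean(axis=0)             # centre by the sample mean
    s_n = xc.T @ xc / (n - 1)           # sample covariance matrix
    return np.linalg.eigvalsh(s_n)[-1]  # eigvalsh sorts in ascending order


def spike_limit(alpha, c):
    """First-order limit of the top sample eigenvalue (identity base)."""
    if alpha > 1 + np.sqrt(c):
        return alpha + c * alpha / (alpha - 1)
    return (1 + np.sqrt(c)) ** 2        # right edge of the MP law


rng = np.random.default_rng(0)
p, n = 500, 1000
c = p / n

above = top_sample_eigenvalue(4.0, p, n, rng)  # spike above 1 + sqrt(c)
below = top_sample_eigenvalue(1.2, p, n, rng)  # spike below 1 + sqrt(c)

print("supercritical:", above, "limit:", spike_limit(4.0, c))
print("subcritical:  ", below, "limit:", spike_limit(1.2, c))
```

For finite $n$ the simulated values only approximate the limits; the size of the gap in the supercritical case is what the second-order (CLT) results describe.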

Most recently, [18] considered a mean mixture (MM) model when investigating the problem of high-dimensional clustering. An interesting phenomenon is that a sample covariance matrix from such a mixture inherently carries some spiked eigenvalues. To fully understand this phenomenon, let us consider an MM population $x$ of $\tau$ subpopulations $\{G_i, i \in \{1, \ldots, \tau\}\}$ with mean vectors $\{\mu_i, i \in \{1, \ldots, \tau\}\}$ and a common covariance matrix $\Sigma_0$, that is,
$$x \mid G_i = \mu_i + \Sigma_0^{1/2} z, \quad i \in \{1, \ldots, \tau\}, \tag{4}$$
where $\{x \mid G_i\}$ denote the $\tau$ subpopulations and $z \in \mathbb{R}^p$ is a random vector defined as in the IC model (2). Let $\{x_j\}$ be $n$ observations from this mixture and denote by $\{n_i, i \in \{1, \ldots, \tau\}\}$ the (unknown) sizes of the samples from each of the subpopulations. Then, given these sizes, the expectation of the sample covariance matrix is
$$\Sigma_n \triangleq E(S_n) = \Sigma_0 + \frac{1}{n(n-1)} \sum_{1 \leq i < j \leq \tau} n_i n_j (\mu_i - \mu_j)(\mu_i - \mu_j)^\top.$$
Clearly, this covariance matrix $\Sigma_n$ is the sum of the base matrix $\Sigma_0$ and a finite-rank perturbation matrix, which forms a spiked population model in which the spikes are generated by the differences among the $\tau$ subpopulation means. The main results in [18] imply that, in the MP asymptotic regime (3), the spiked eigenvalues of $S_n$ from the MM model converge almost surely to some limits in the same way as already established under the IC model (2) with covariance matrix $\Sigma_n$.
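The mean-induced part of $E(S_n)$ can be checked numerically. The sketch below is an illustration rather than part of the paper: it switches the noise off ($z = 0$), so that every observation equals its group mean and $S_n$ reduces exactly to the perturbation term. The group sizes, the dimension, and the helper names `perturbation_pairwise` and `sample_covariance` are arbitrary assumptions.

```python
import numpy as np


def perturbation_pairwise(means, sizes):
    """(1 / (n (n - 1))) * sum_{i<j} n_i n_j (mu_i - mu_j)(mu_i - mu_j)^T."""
    n = sum(sizes)
    tau, p = means.shape
    out = np.zeros((p, p))
    for i in range(tau):
        for j in range(i + 1, tau):
            d = means[i] - means[j]
            out += sizes[i] * sizes[j] * np.outer(d, d)
    return out / (n * (n - 1))


def sample_covariance(x):
    """S_n with the 1 / (n - 1) normalisation; rows of x are observations."""
    xc = x - x.mean(axis=0)
    return xc.T @ xc / (x.shape[0] - 1)


rng = np.random.default_rng(1)
sizes = [30, 50, 20]                   # hypothetical group sizes n_1, n_2, n_3
means = rng.standard_normal((3, 6))    # hypothetical means mu_i in R^6

# Noise-free sample: each group mean repeated n_i times (z = 0 in the MM model).
noise_free = np.repeat(means, sizes, axis=0)

direct = sample_covariance(noise_free)
pairwise = perturbation_pairwise(means, sizes)
rank = np.linalg.matrix_rank(pairwise)

print("max abs difference:", np.abs(direct - pairwise).max())
print("rank of the perturbation:", rank)
```

Because the perturbation is built from differences of $\tau$ mean vectors, its rank is at most $\tau - 1$, which bounds the number of spikes generated by the means.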

This paper takes a step further to investigate the second-order limits of these spiked sample eigenvalues. A Gaussian mean mixture (GMM) population is considered in our study, i.e., the MM model (4) with a standard Gaussian vector $z \sim N(0, I_p)$. This mixture is one of the fundamental probabilistic models in statistical inference and has a wide range of applications, such as pattern recognition, image processing, and unsupervised machine learning. Compared with the IC model (2), the main difference of the GMM is that the observations are heterogeneous in their conditional means, given the group information. Hence our main task here is to quantify the effect of such heterogeneity on the fluctuation of the spiked sample eigenvalues.

The main contribution of this paper is a new central limit theorem (CLT) for sample spikes under the GMM model. Our results show that these sample eigenvalues asymptotically behave as the eigenvalues of a certain Gaussian random matrix, the distribution of which is jointly determined by the mixing proportions $\{n_i/n\}$, the mean differences $\{\mu_i - \mu_j\}$, and the common covariance matrix $\Sigma_0$ of the subpopulations. In particular, the inner products between the means $\{\mu_j\}$ and the eigenvectors of $\Sigma_0$ play important roles in the fluctuation of the spikes. As these determining factors are not functions of $\Sigma_n$, our CLT is essentially different from the case under the IC model as illustrated in [5], [6].

The rest of the paper is organized as follows. Section 2 details our model and assumptions. Section 3 presents our new CLT for spiked sample eigenvalues. Technical proofs are presented in Section 4.

Preliminary

In this section, we briefly review some definitions from random matrix theory, which will be frequently used in the remainder of the paper. We denote by the constant $c \in (0, \infty)$ the limit of the dimension-to-sample-size ratios $\{c_n\}$.

For a $p \times p$ real symmetric matrix $M_p$ with eigenvalues $(\lambda_j)_{1 \leq j \leq p}$, its empirical spectral distribution (ESD) is the probability measure $F^{M_p} = \frac{1}{p} \sum_{j=1}^{p} \delta_{\lambda_j}$, where $\delta_b$ denotes the Dirac measure at $b$. If, as $p \to \infty$, the ESD sequence $\{F^{M_p}\}$ has a limit $G$, it is called the

First-order convergence of the eigenvalues of Sn

In this section, we review some results on the convergence of bulk eigenvalues and spiked eigenvalues of the sample covariance matrix Sn from the GMM model. These results will be used when investigating the fluctuations of spiked sample eigenvalues.

With the decomposition of $S_n$ in (8), from standard random matrix theory for Wishart matrices [4], [19], [22] and the rank inequality (Theorem A.44 in [4]), almost surely, the ESD $F^{S_n}$ converges weakly to a generalized Marčenko–Pastur (MP) law, as

Outline of the proof of Theorem 1

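We list here some notation that will be used throughout this section. Let $\underline{m}_0 = \underline{m}_0(z)$ be the solution to the equation
$$z = -\frac{1}{\underline{m}_0} + c_n \int \frac{t}{1 + t\,\underline{m}_0}\, dH_p(t).$$
Denote
$$N = n - \tau, \qquad \delta_{nj} = \sqrt{n}\,\bigl(\lambda_j^{S_n} - \lambda_{nk}\bigr),\ j \in J_k, \qquad R_n = R_n^{(1)} + R_n^{(2)}, \qquad B_n = \frac{n-1}{N}\,\tilde{S}_n,$$
$$U_n = (u_1, \ldots, u_\tau) = \bigl(\sqrt{k_{n1}}\,\mu_1, \ldots, \sqrt{k_{n\tau}}\,\mu_\tau\bigr), \qquad \hat{U}_n = (\bar{y}_1, \ldots, \bar{y}_\tau) = \bigl(\sqrt{k_{n1}}\,\bar{x}_1, \ldots, \sqrt{k_{n\tau}}\,\bar{x}_\tau\bigr),$$
$$R_n^{(1)} = \sqrt{n}\,\Bigl[\hat{U}_n^\top (B_n - \lambda_{nk} I)^{-1} \hat{U}_n - U_n^\top (B_n - \lambda_{nk} I)^{-1} U_n - \frac{c_n}{p} \operatorname{tr}\bigl\{(B_n - \lambda_{nk} I)^{-1} \Sigma_0\bigr\}\, I\Bigr],$$
$$R_n^{(2)} = \sqrt{n}\,\Bigl[U_n^\top (B_n - \lambda_{nk} I)^{-1} U_n - U_n^\top \bigl(-\lambda_{nk} I - \lambda_{nk}\,\underline{m}_0(\lambda_{nk})\,\Sigma_0\bigr)^{-1} U_n\Bigr].$$
We denote by $M$ some constant which may take different values at different appearances. Orders "$o(\cdot)$" and "$O(\cdot)$" of vectors are in terms of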

CRediT authorship contribution statement

Weiming Li: Conceptualization, Methodology, Writing – review & editing. Junpeng Zhu: Methodology, Software, Writing – original draft.

Acknowledgments

The authors would like to thank the Editors and one anonymous reviewer for their thoughtful comments and suggestions. Weiming Li’s research is partially supported by NSFC, China (Nos. 11971293 and 12141107).

References (23)

  • Bao, Z.G. et al. Canonical correlation coefficients of high-dimensional Gaussian vectors: Finite rank case. Ann. Statist. (2019)