CLT for spiked eigenvalues of a sample covariance matrix from high-dimensional Gaussian mean mixtures

doi:10.1016/j.jmva.2022.105127

Journal of Multivariate Analysis

Volume 193, January 2023, 105127

https://doi.org/10.1016/j.jmva.2022.105127

Abstract

Sample covariance matrices from a finite mean mixture model naturally carry certain spiked eigenvalues, which are generated by the differences among the mean vectors. However, their asymptotic behaviors remain largely unknown when the population dimension $p$ grows proportionally to the sample size $n$. In this paper, a new CLT is established for the spiked eigenvalues by considering a Gaussian mean mixture in such high-dimensional asymptotic frameworks. It shows that the convergence rate of these eigenvalues is $O (1 / \sqrt{n})$ and their fluctuations can be characterized by the mixing proportions, the eigenvalues of the common covariance matrix, and the inner products between the mean vectors and the eigenvectors of the covariance matrix.

Introduction

The spiked population model, originally introduced in [16], assumes that the covariance matrix $Σ$ of a population $x \in R^{p}$ is a finite-rank perturbation of a base matrix. In the simplest case, where the base matrix is the identity and the perturbation matrix is nonnegative definite, the eigenvalues of $Σ$ can be grouped into two separated classes: $Spec (Σ) = (\underbrace{α_{1}, \dots, α_{M}}_{M}, \underbrace{1, \dots, 1}_{p - M}) . \qquad (1)$ Here the top $M$ eigenvalues $α_{1} \geq \dots \geq α_{M} > 1$ are called spiked eigenvalues of $Σ$. These spikes often carry a wealth of information about the dependence among the components of $x$ and can be inferred from their empirical counterparts, namely the eigenvalues of a sample covariance matrix $S_{n}$. However, in high-dimensional situations where the dimension $p$ is non-negligible with respect to the sample size $n$, it is well known that the eigenvalues of $S_{n}$ deviate from their population counterparts in a subtle manner; see [3], [19], [22], etc.

Let us consider the widely used independent components (IC) model [3] for the population $x$, which admits the stochastic representation $x = μ + Σ^{\frac{1}{2}} z, \qquad (2)$ where $μ \in R^{p}$ denotes the population mean and $z \in R^{p}$ is a vector of independent and standardized random variables. Let $x_{1}, \dots, x_{n}$ be $n$ independent and identically distributed (i.i.d.) observations from this population. The sample covariance matrix is $S_{n} = \frac{1}{n - 1} \sum_{i = 1}^{n} (x_{i} - \bar{x}) {(x_{i} - \bar{x})}^{⊤}, \quad \bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i},$ whose eigenvalues are denoted by $\{λ_{ℓ}\}$, arranged in descending order and referred to as sample eigenvalues. These eigenvalues have been well studied in the so-called Marčenko–Pastur (MP) asymptotic regime [19], where $n \to \infty, \quad p = p_{n} \to \infty, \quad c_{n} ≜ p / n \to c \in (0, \infty) . \qquad (3)$ In particular, under the spiked population model (1) with pairwise distinct $(α_{ℓ})$, the $ℓ$-th ($1 \leq ℓ \leq M$) largest eigenvalue $λ_{ℓ}$ of $S_{n}$ converges to $α_{ℓ} + c α_{ℓ} / (α_{ℓ} - 1)$ with a Gaussian fluctuation if $α_{ℓ}$ is larger than the critical value $1 + \sqrt{c}$; otherwise it converges to ${(1 + \sqrt{c})}^{2}$, the right edge point of the MP law, with a Tracy–Widom fluctuation; see [5], [7], [16], [21]. Some extensions of these results can be found in [6], [14], where the base matrix can be general, with a non-degenerate limiting spectral distribution. In addition to this literature focusing on the sample covariance matrix, the spiked population model has also been adopted in the study of the sample canonical correlation matrix, the Fisher matrix, and the separable covariance matrix; see [8], [13], [15], [23].
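As a rough numerical illustration of the first-order limits quoted above, the following Python sketch evaluates the limit of a spiked sample eigenvalue and compares it with one simulated draw. The function names `spike_limit` and `largest_sample_eigenvalue`, the Gaussian choice for $z$, the single spike with identity base matrix, and the values of $n$, $p$ and the seed are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def spike_limit(alpha: float, c: float) -> float:
    """First-order limit of a spiked sample eigenvalue (identity base matrix).

    Above the critical value 1 + sqrt(c) the limit is alpha + c*alpha/(alpha - 1);
    otherwise the eigenvalue sticks to the right edge (1 + sqrt(c))**2 of the MP law.
    """
    threshold = 1.0 + np.sqrt(c)
    if alpha > threshold:
        return float(alpha + c * alpha / (alpha - 1.0))
    return float((1.0 + np.sqrt(c)) ** 2)


def largest_sample_eigenvalue(alpha: float, n: int, p: int, seed: int = 0) -> float:
    """Largest eigenvalue of S_n for Gaussian data with Sigma = diag(alpha, 1, ..., 1)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, p))
    scale = np.ones(p)
    scale[0] = np.sqrt(alpha)           # Sigma^{1/2} is diagonal in this toy setup
    x = z * scale                       # population mean taken to be zero
    xc = x - x.mean(axis=0)             # center by the sample mean
    s_n = xc.T @ xc / (n - 1)           # sample covariance matrix
    return float(np.linalg.eigvalsh(s_n)[-1])


if __name__ == "__main__":
    n, p, alpha = 800, 400, 3.0
    print("simulated :", largest_sample_eigenvalue(alpha, n, p))
    print("limit     :", spike_limit(alpha, p / n))
```

With $c = 0.5$ and $α = 3$ the limit is $3 + 0.5 \cdot 3 / 2 = 3.75$; a single simulated eigenvalue should lie near this value, up to a fluctuation of order $1 / \sqrt{n}$.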

Most recently, [18] considered a mean mixture (MM) model when investigating the problem of high-dimensional clustering. An interesting phenomenon is that a sample covariance matrix from such a mixture inherently carries some spiked eigenvalues. To fully understand this phenomenon, let us consider an MM $x$ of $τ$ subpopulations $\{G_{i}, i \in \{1, \dots, τ\}\}$ with mean vectors $\{μ_{i}, i \in \{1, \dots, τ\}\}$ and a common covariance matrix $Σ_{0}$, that is, $x | G_{i} = μ_{i} + Σ_{0}^{\frac{1}{2}} z, \quad i \in \{1, \dots, τ\}, \qquad (4)$ where $\{x ∣ G_{i}\}$ denote the $τ$ subpopulations and $z \in R^{p}$ is a random vector defined as in the IC model (2). Let $\{x_{j}\}$ be $n$ observations from this mixture and denote by $\{n_{i}, i \in \{1, \dots, τ\}\}$ the (unknown) sizes of the samples from each of the subpopulations. Then, given these sizes, the expectation of the sample covariance matrix is $Σ_{n} ≜ E (S_{n}) = Σ_{0} + \frac{1}{n (n - 1)} \sum_{1 \leq i < j \leq τ} n_{i} n_{j} (μ_{i} - μ_{j}) {(μ_{i} - μ_{j})}^{⊤} .$ Clearly, this covariance matrix $Σ_{n}$ is the sum of the base matrix $Σ_{0}$ and a finite-rank perturbation matrix, which forms a spiked population model whose spikes are generated by the differences among the $τ$ subpopulation means. The main results in [18] imply that, in the MP asymptotic regime (3), the spiked eigenvalues of $S_{n}$ from the MM model converge almost surely to some limits in the same way as already established under the IC model (2) with covariance matrix $Σ_{n}$.
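The expression for $Σ_{n}$ can be checked numerically. The sketch below, under assumed toy inputs, builds $Σ_{n}$ from the pairwise formula above and compares it with the equivalent "between-group scatter" form $Σ_{0} + \frac{1}{n - 1} \sum_{i} n_{i} (μ_{i} - \bar{μ}) {(μ_{i} - \bar{μ})}^{⊤}$, where $\bar{μ} = \sum_{i} n_{i} μ_{i} / n$. The function names and the example means and sizes are hypothetical and serve only as an illustration.

```python
import numpy as np


def sigma_n_pairwise(sigma0, means, sizes):
    """Sigma_n = Sigma_0 + 1/(n(n-1)) * sum_{i<j} n_i n_j (mu_i - mu_j)(mu_i - mu_j)^T."""
    means = np.asarray(means, dtype=float)    # shape (tau, p)
    sizes = np.asarray(sizes, dtype=float)    # shape (tau,)
    n = sizes.sum()
    out = np.array(sigma0, dtype=float)       # copy of the base matrix
    tau = len(sizes)
    for i in range(tau):
        for j in range(i + 1, tau):
            d = means[i] - means[j]
            out += sizes[i] * sizes[j] / (n * (n - 1)) * np.outer(d, d)
    return out


def sigma_n_between_group(sigma0, means, sizes):
    """Same matrix written via deviations of each mean from the weighted grand mean."""
    means = np.asarray(means, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    n = sizes.sum()
    grand_mean = sizes @ means / n
    centered = means - grand_mean
    return np.asarray(sigma0, dtype=float) + (centered.T * sizes) @ centered / (n - 1)


if __name__ == "__main__":
    # Toy example (assumed values): two groups in R^3 with Sigma_0 = I.
    sigma0 = np.eye(3)
    means = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
    sizes = [50, 50]
    sigma_n = sigma_n_pairwise(sigma0, means, sizes)
    print(np.allclose(sigma_n, sigma_n_between_group(sigma0, means, sizes)))
    print(np.linalg.eigvalsh(sigma_n))
```

In this two-group example the perturbation has rank one, so $Σ_{n}$ has a single eigenvalue above 1, equal to $1 + n_{1} n_{2} \|μ_{1} - μ_{2}\|^{2} / \{n (n - 1)\}$. In general the perturbation has rank at most $τ - 1$.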

This paper takes a step further to investigate the second-order limits of these spiked sample eigenvalues. A Gaussian mean mixture (GMM) population is considered in our study, i.e., the MM model (4) with a standard Gaussian vector $z \sim N (0, I_{p})$. This mixture is one of the fundamental probabilistic models in statistical inference and has a wide range of applications, such as pattern recognition, image processing and unsupervised machine learning. Compared with the IC model (2), the main difference of the GMM is that the observations are heterogeneous in conditional mean, given the group information. Hence our main task here is to quantify the effect of such heterogeneity on the fluctuation of spiked sample eigenvalues.
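To make the GMM setting concrete, the following Python sketch draws one sample from a two-component GMM and compares the largest eigenvalue of $S_{n}$ with the first-order limit obtained from the IC-model formula applied to the spike of $Σ_{n}$, as the results of [18] quoted above suggest. It does not reproduce the CLT of this paper. The choices $Σ_{0} = I_{p}$, $μ_{1} = - μ_{2}$, equal group sizes, and all function names and numerical values are assumptions made for illustration only.

```python
import numpy as np


def sample_gmm(means, sizes, rng):
    """Draw n_i rows from N(mu_i, I_p) for each group i (Sigma_0 = I_p is assumed)."""
    blocks = [
        mu + rng.standard_normal((n_i, mu.shape[0]))
        for mu, n_i in zip(means, sizes)
    ]
    return np.vstack(blocks)


def top_eigenvalue_of_sample_covariance(x):
    """Largest eigenvalue of S_n computed from the rows of x."""
    xc = x - x.mean(axis=0)
    s_n = xc.T @ xc / (x.shape[0] - 1)
    return float(np.linalg.eigvalsh(s_n)[-1])


def gmm_top_eigenvalue_vs_limit(n=800, p=400, a2=2.0, seed=1):
    """Return (simulated top eigenvalue, first-order limit) for mu_1 = -mu_2, |mu_1|^2 = a2."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)
    mu[0] = np.sqrt(a2)
    n1 = n // 2
    n2 = n - n1
    x = sample_gmm([mu, -mu], (n1, n2), rng)
    lam = top_eigenvalue_of_sample_covariance(x)
    # Spike of Sigma_n: 1 + n1*n2/(n(n-1)) * |mu_1 - mu_2|^2, with |mu_1 - mu_2|^2 = 4*a2.
    alpha = 1.0 + n1 * n2 / (n * (n - 1)) * 4.0 * a2
    c = p / n
    limit = alpha + c * alpha / (alpha - 1.0)   # valid here since alpha > 1 + sqrt(c)
    return lam, limit


if __name__ == "__main__":
    lam, limit = gmm_top_eigenvalue_vs_limit()
    print("simulated top eigenvalue:", lam)
    print("first-order limit       :", limit)
```

Repeating the simulation over many seeds and rescaling the differences by $\sqrt{n}$ would give an empirical picture of the fluctuations studied in this paper; the sketch above only addresses the first-order location.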

The main contribution of this paper is a new central limit theorem (CLT) for sample spikes under the GMM model. Our results show that these sample eigenvalues asymptotically behave as the eigenvalues of a certain Gaussian random matrix, the distribution of which is jointly determined by the mixing proportions $\{n_{i} / n\}$, the mean differences $\{μ_{i} - μ_{j}\}$ and the common covariance matrix $Σ_{0}$ of the subpopulations. In particular, the inner products between the means $\{μ_{j}\}$ and the eigenvectors of $Σ_{0}$ play important roles in the fluctuation of the spikes. As these determining factors are not functions of $Σ_{n}$, our CLT is essentially different from the case under the IC model as illustrated in [5], [6].

The rest of the paper is organized as follows. Section 2 details our model and assumptions. Section 3 presents our new CLT for spiked sample eigenvalues. Technical proofs are presented in Section 4.

Section snippets

Preliminary

In this section, we briefly review some definitions from random matrix theory, which will be frequently used in the remaining part of the paper. We denote by the constant $c \in (0, \infty)$ the limit of the dimension-to-sample-size ratios $\{c_{n}\}$.

For a $p \times p$ real symmetric matrix $M_{p}$ with eigenvalues ${(λ_{j})}_{1 \leq j \leq p}$, its empirical spectral distribution (ESD) is the probability measure $F^{M_{p}} = \frac{1}{p} \sum_{j = 1}^{p} δ_{λ_{j}},$ where $δ_{b}$ denotes the Dirac measure at $b$. If, as $p \to \infty$, the ESD sequence $\{F^{M_{p}}\}$ has a limit $G$, it is called the

First-order convergence of the eigenvalues of $S_{n}$

In this section, we review some results on the convergence of bulk eigenvalues and spiked eigenvalues of the sample covariance matrix $S_{n}$ from the GMM model. These results will be used when investigating the fluctuations of spiked sample eigenvalues.

With the decomposition of $S_{n}$ in (8), from standard random matrix theory for Wishart matrices [4], [19], [22] and the rank inequality (Theorem A.44 in [4]), almost surely, the ESD $F^{S_{n}}$ converges weakly to a generalized Marčenko–Pastur (MP) law, as

Outline of the proof of Theorem 1

We list here some notation that will be used throughout this section. Let $\underline{m}_{0} = \underline{m}_{0} (z)$ be the solution to the equation $z = - \frac{1}{\underline{m}_{0}} + c_{n} \int \frac{t}{1 + t \underline{m}_{0}} d H_{p} (t) .$ Denote $N = n - τ, \quad δ_{n j} = \sqrt{n} (λ_{j}^{S_{n}} - λ_{n k}), \ j \in J_{k}, \quad R_{n} = R_{n}^{(1)} + R_{n}^{(2)}, \quad B_{n} = \frac{n - 1}{N} \tilde{S}_{n},$ $U_{n} = (u_{1}, \dots, u_{τ}) = (\sqrt{k_{n 1}} μ_{1}, \dots, \sqrt{k_{n τ}} μ_{τ}), \quad \hat{U}_{n} = (\bar{y}_{1}, \dots, \bar{y}_{τ}) = (\sqrt{k_{n 1}} \bar{x}_{1}, \dots, \sqrt{k_{n τ}} \bar{x}_{τ}),$ $R_{n}^{(1)} = \sqrt{n} [\hat{U}_{n}^{⊤} {(B_{n} - λ_{n k} I)}^{- 1} \hat{U}_{n} - U_{n}^{⊤} {(B_{n} - λ_{n k} I)}^{- 1} U_{n} - \frac{c_{n}}{p} tr \{{(B_{n} - λ_{n k} I)}^{- 1} Σ_{0}\} I],$ $R_{n}^{(2)} = \sqrt{n} [U_{n}^{⊤} {(B_{n} - λ_{n k} I)}^{- 1} U_{n} - U_{n}^{⊤} {\{- λ_{n k} I - λ_{n k} \underline{m}_{0} (λ_{n k}) Σ_{0}\}}^{- 1} U_{n}] .$ We denote by $M$ some constant which may take different values at different appearances. Orders “$o (\cdot)$” and “$O (\cdot)$” of vectors are in terms of

CRediT authorship contribution statement

Weiming Li: Conceptualization, Methodology, Writing – review & editing. Junpeng Zhu: Methodology, Software, Writing – original draft.

Acknowledgments

The authors would like to thank the Editors and one anonymous reviewer for their thoughtful comments and suggestions. Weiming Li’s research is partially supported by NSFC, China (Nos. 11971293 and 12141107).

References (23)

Bai Z.D. et al., On sample eigenvalues in a generalized spiked population model, J. Multivariate Anal. (2012)

Baik J. et al., Eigenvalues of large sample covariance matrices of spiked population models, J. Multivariate Anal. (2006)

Benaych-Georges F. et al., The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices, Adv. Math. (2011)

Jiang D.D. et al., The limits of the sample spiked eigenvalues for a high-dimensional generalized Fisher matrix and its applications, J. Statist. Plann. Inference (2021)

Silverstein J.W., Strong convergence of the empirical distribution of eigenvalues of large-dimensional random matrices, J. Multivariate Anal. (1995)

Bai Z.D. et al., On asymptotics of eigenvectors of large sample covariance matrix, Ann. Probab. (2007)

Bai Z.D. et al., No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices, Ann. Probab. (1998)

Bai Z.D. et al., CLT for linear spectral statistics of large-dimensional sample covariance matrices, Ann. Probab. (2004)

Bai Z.D. et al., Central limit theorems for eigenvalues in a spiked population model, Ann. Inst. Henri Poincaré Probab. Stat. (2008)

Bao Z.G. et al., Canonical correlation coefficients of high-dimensional Gaussian vectors: Finite rank case, Ann. Statist. (2019)
