
Finite mixtures, projection pursuit and tensor rank: a triangulation


Abstract

Finite mixtures of multivariate distributions play a fundamental role in model-based clustering. However, they pose several problems, especially in the presence of many irrelevant variables. Dimension reduction methods, such as projection pursuit, are commonly used to address these problems. In this paper, we use skewness-maximizing projections to recover the subspace which optimally separates the cluster means. Skewness might then be removed in order to search for other potentially interesting data structures or to perform skewness-sensitive statistical analyses, such as Hotelling’s \(T^{2}\) test. Our approach is algebraic in nature and deals with the symmetric tensor rank of the third multivariate cumulant. We also derive closed-form expressions for the symmetric tensor rank of the third cumulants of several multivariate mixture models, including mixtures of skew-normal distributions and mixtures of two symmetric components with proportional covariance matrices. Theoretical results in this paper shed some light on the connection between the estimated number of mixture components and their skewness.


References

  • Adcock C, Eling M, Loperfido N (2015) Skewed distributions in finance and actuarial science: a review. Eur J Finance 21:1253–1281


  • Ambagaspitiya RS (1999) On the distributions of two classes of correlated aggregate claims. Insur Math Econ 24:301–308


  • Arellano-Valle RB, Genton MG, Loschi RH (2009) Shape mixtures of multivariate skew-normal distributions. J Multivar Anal 100:91–101


  • Azzalini A, Capitanio A (2003) Distributions generated by perturbation of symmetry with emphasis on a multivariate skew \(t\) distribution. J R Stat Soc B 65:367–389


  • Azzalini A, Genton MG (2008) Robust likelihood methods based on the skew-t and related distributions. Int Stat Rev 76:106–129


  • Bartoletti S, Loperfido N (2010) Modelling air pollution data by the skew-normal distribution. Stoch Environ Res Risk Assess 24:513–517


  • Blough DK (1989) Multivariate symmetry and asymmetry. Inst Stat Math 24:513–517


  • Bolton RJ, Krzanowski WJ (2003) Projection pursuit clustering for exploratory data analysis. J Comput Graph Stat 12:121–142


  • Bouveyron C, Brunet-Saumard C (2014) Model-based clustering of high-dimensional data: a review. Comput Stat Data Anal 71:52–78


  • Branco MD, Dey DK (2001) A general class of skew-elliptical distributions. J Multivar Anal 79:99–113


  • Comon P (2014) Tensors: a brief introduction. IEEE Signal Process Mag 31:44–53


  • Comon P, Golub G, Lim L-H, Mourrain B (2008) Symmetric tensors and symmetric tensor rank. SIAM J Matrix Anal Appl 30:1254–1279


  • Fraley C, Raftery AE, Scrucca L (2017) mclust: Gaussian mixture modelling for model-based clustering, classification, and density estimation. https://CRAN.R-project.org/package=mclust. R package version 5.3

  • Franceschini C, Loperfido N (2017a) MaxSkew: skewness-based projection pursuit. https://CRAN.R-project.org/package=MaxSkew. R package version 1.1

  • Franceschini C, Loperfido N (2017b) MultiSkew: measures, tests and removes multivariate skewness. https://CRAN.R-project.org/package=MultiSkew. R package version 1.1.1

  • Friedman J (1987) Exploratory projection pursuit. J Am Stat Assoc 82:249–266


  • Friedman JH, Tukey JW (1974) A projection pursuit algorithm for exploratory data analysis. IEEE Trans Comput Ser C 23:881–890


  • Frühwirth-Schnatter S, Pyne S (2010) Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-\(t\) distributions. Biostatistics 11:317–336


  • Grasman RPPP, Huizenga HM, Geurts HM (2010) Departure from normality in multivariate normative comparison: the Cramér alternative for Hotelling’s \(T^{2}\). Neuropsychologia 48:1510–1516


  • Hennig C (2004) Asymmetric linear dimension reduction for classification. J Comput Graph Stat 13:930–945


  • Hennig C (2005) A method for visual cluster validation. In: Weihs C, Gaul W (eds) Classification—the ubiquitous challenge. Springer, Heidelberg, pp 153–160


  • Hui G, Lindsay BG (2010) Projection pursuit via white noise matrices. Sankhya B 72:123–153


  • Jondeau E, Rockinger M (2006) Optimal portfolio allocation under higher moments. Eur Financ Manag 12:29–55


  • Karlis D, Xekalaki E (2003) Choosing initial values for the EM algorithm for finite mixtures. Comput Stat Data Anal 41:577–590


  • Kim H-M, Mallick BK (2003) Moments of random vectors with skew \(t\) distribution and their quadratic forms. Stat Probab Lett 63:417–423


  • Landsberg JM, Michalek M (2017) On the geometry of border rank decompositions for matrix multiplication and other tensors with symmetry. SIAM J Appl Algebra Geom 1:2–19


  • Lee S, McLachlan GJ (2013) Model-based clustering and classification with non-normal mixture distributions. Stat Methods Appl 22:427–454 (with discussion)


  • Lin XS (2004) Compound distributions. In: Encyclopedia of actuarial science, vol 1. Wiley, pp 314–317

  • Lindsay BG, Yao W (2012) Fisher information matrix: a tool for dimension reduction, projection pursuit, independent component analysis, and more. Can J Stat 40:712–730


  • Loperfido N (2004) Generalized skew-normal distributions. In: Genton MG (ed) Skew-elliptical distributions and their applications: a journey beyond normality. CRC, Boca Raton, pp 65–80


  • Loperfido N (2013) Skewness and the linear discriminant function. Stat Probab Lett 83:93–99


  • Loperfido N (2014) Linear transformations to symmetry. J Multivar Anal 129:186–192


  • Loperfido N (2015a) Vector-valued skewness for model-based clustering. Stat Probab Lett 99:230–237


  • Loperfido N (2015b) Singular value decomposition of the third multivariate moment. Linear Algebra Appl 473:202–216


  • Loperfido N (2018) Skewness-based projection pursuit: a computational approach. Comput Stat Data Anal 120:42–57


  • Loperfido N, Mazur S, Podgorski K (2018) Third cumulant for multivariate aggregate claims models. Scand Actuar J 2018:109–128


  • Mardia K (1970) Measures of multivariate skewness and kurtosis with applications. Biometrika 57:519–530


  • McNicholas PD (2016) Model-based clustering. J Classif 33:331–373


  • Melnykov V, Maitra R (2010) Finite mixture models and model-based clustering. Stat Surv 4:80–116


  • Miettinen J, Taskinen S, Nordhausen K, Oja H (2015) Fourth moments and independent component analysis. Stat Sci 30:372–390


  • Móri T, Rohatgi V, Székely G (1993) On multivariate skewness and kurtosis. Theory Probab Appl 38:547–551


  • Morris K, McNicholas PD, Scrucca L (2013) Dimension reduction for model-based clustering via mixtures of multivariate t-distributions. Adv Data Anal Classif 7:321–338


  • Oeding L, Ottaviani G (2013) Eigenvectors of tensors and algorithms for Waring decomposition. J Symb Comput 54:9–35


  • Paajarvi P, Leblanc J (2004) Skewness maximization for impulsive sources in blind deconvolution. In: Proceedings of the 6th Nordic signal processing symposium—NORSIG, Espoo, Finland

  • Peña D, Prieto FJ (2001) Cluster identification using projections. J Am Stat Assoc 96:1433–1445


  • Rao CR, Rao MB (1998) Matrix algebra and its applications to statistics and econometrics. World Scientific Co. Pte. Ltd, Singapore


  • Sakata T, Sumi T, Miyazaki M (2016) Algebraic and computational aspects of real tensor ranks. Springer, Tokyo


  • Scrucca L (2010) Dimension reduction for model-based clustering. Stat Comput 20:471–484


  • Scrucca L (2014) Graphical tools for model-based mixture discriminant analysis. Adv Data Anal Classif 8:147–165


  • Tarpey T, Yun D, Petkova E (2009) Model misspecification: Finite mixture or homogeneous? Stat Model 8:199–218


  • Tyler DE, Critchley F, Dümbgen L, Oja H (2009) Invariant co-ordinate selection. J R Stat Soc B 71:1–27 (with discussion)



Acknowledgements

The author would like to thank an anonymous Associate Editor and two anonymous Reviewers for their care in handling this paper and for their valuable comments, which greatly helped to improve its quality.

Author information


Correspondence to Nicola Loperfido.

Appendix

The Kronecker product and the vectorization operator The Kronecker (or tensor) product acts on matrices \(A=\left\{ a_{ij}\right\} \in {\mathbb {R}}^{p}\times {\mathbb {R}}^{q}\) and \(B=\left\{ b_{ij}\right\} \in {\mathbb {R}}^{m}\times {\mathbb {R}}^{n}\) and yields the block matrix \(A\otimes B\in {\mathbb {R}}^{pm}\times {\mathbb {R}}^{qn}\) whose ij-th block is the matrix \(a_{ij}B\) (see, for example, Rao and Rao 1998, p 193). As an example, consider the matrices

$$\begin{aligned} A=\left( \begin{array}{cc} 2 & 1 \\ 4 & 3 \end{array} \right) \quad \text { and }\quad B=\left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 2 \end{array} \right) . \end{aligned}$$

The Kronecker product \(A\otimes B\) is

$$\begin{aligned} \left( \begin{array}{rrrrrr} 2 & 0 & 0 & 1 & 0 & 0 \\ 0 & 2 & 4 & 0 & 1 & 2 \\ 4 & 0 & 0 & 3 & 0 & 0 \\ 0 & 4 & 8 & 0 & 3 & 6 \end{array} \right) . \end{aligned}$$

We shall recall some fundamental properties of the Kronecker product (see, for example, Rao and Rao 1998, pp 194–201).

  1.

    The Kronecker product is associative: \(\left( A\otimes B\right) \otimes C=A\otimes \left( B\otimes C\right) =A\otimes B\otimes C\).

  2.

    If matrices A, B, C and D are of appropriate size, then \( \left( A\otimes B\right) \left( C\otimes D\right) =AC\otimes BD\).

  3.

    If the inverses \(A^{-1}\) and \(B^{-1}\) of matrices A and B exist then \(\left( A\otimes B\right) ^{-1}=A^{-1}\otimes B^{-1}\).

  4.

    If a and b are two vectors, then \(ab^{\top }\), \(a\otimes b^{\top }\) and \(b^{\top }\otimes a\) denote the same matrix.
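
As a quick numerical illustration (not part of the original appendix), the example above and the first three properties can be checked with a few lines of numpy; the matrices P, Q, R, S below are arbitrary illustrative choices.

```python
import numpy as np

A = np.array([[2., 1.], [4., 3.]])
B = np.array([[1., 0., 0.], [0., 1., 2.]])
print(np.kron(A, B))          # reproduces the 4 x 6 block matrix displayed above

rng = np.random.default_rng(0)
P, Q = rng.standard_normal((2, 2)), rng.standard_normal((3, 3))
R, S = rng.standard_normal((2, 2)), rng.standard_normal((3, 3))

# Property 1 (associativity) and Property 2 (mixed product)
print(np.allclose(np.kron(np.kron(P, Q), R), np.kron(P, np.kron(Q, R))))
print(np.allclose(np.kron(P, Q) @ np.kron(R, S), np.kron(P @ R, Q @ S)))

# Property 3: (P x Q)^{-1} = P^{-1} x Q^{-1} (P and Q are invertible almost surely)
print(np.allclose(np.linalg.inv(np.kron(P, Q)),
                  np.kron(np.linalg.inv(P), np.linalg.inv(Q))))
```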

The matrix vectorization (also known as the vec operator) converts a matrix \(A=\left\{ a_{ij}\right\} \in {\mathbb {R}}^{p}\times {\mathbb {R}}^{q}\) into a pq-dimensional vector \(A^{V}=\text{ vec }\left( A\right) =\left( \alpha _{1}, \ldots ,\alpha _{pq}\right) ^{\top }\), where \(a_{ij}=\alpha _{\left( j-1\right) p+i}\), by stacking its columns on top of each other; for example,

$$\begin{aligned} A=\left( \begin{array}{cc} 2 & 1 \\ 4 & 3 \end{array} \right) \quad \text { and }\quad A^{V}=\left( \begin{array}{c} 2 \\ 4 \\ 1 \\ 3 \end{array} \right) . \end{aligned}$$

We shall now recall some fundamental properties of the matrix vectorization (see, for example, Rao and Rao 1998, pp 194–201).

  1.

    For any two \(m\times n\) matrices A and B it holds true that \( \text{ tr }\left( A^{\top }B\right) =\text{ vec }^{\top }(B)\text{ vec }(A)\).

  2.

    If \(A\in {\mathbb {R}}^{p}\times {\mathbb {R}}^{q},\;B\in {\mathbb {R}} ^{q}\times {\mathbb {R}}^{r}\) and \(C\in {\mathbb {R}}^{r}\times {\mathbb {R}}^{s}\) then \(\text{ vec }\left( ABC\right) =\left( C^{\top }\otimes A\right) \text{ vec }\left( B\right) \).

  3.

    For any \(m\times n\) matrix A it holds true that \(\text{ vec }\left( A^{\top }\right) =A^{\top V}\) and \(\left[ \text{ vec }\left( A\right) \right] ^{\top }=A^{V\top }\).

  4.

    If A is an invertible matrix, then \(\text{ vec }\left( A^{-1}\right) =A^{-V}\) and \(\left[ \text{ vec }\left( A^{-1}\right) \right] ^{\top }=A^{-V\top }\).
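
Analogously, a minimal numpy sketch (illustrative only) confirms the vectorization example and the first two properties; the matrix sizes below are arbitrary.

```python
import numpy as np

A = np.array([[2., 1.], [4., 3.]])
vecA = A.reshape(-1, order='F')              # column stacking gives (2, 4, 1, 3)
print(vecA)

rng = np.random.default_rng(0)
B = rng.standard_normal((2, 2))
# Property 1: tr(A^T B) = vec(B)^T vec(A)
print(np.isclose(np.trace(A.T @ B), B.reshape(-1, order='F') @ vecA))

# Property 2: vec(ABC) = (C^T (x) A) vec(B), with conformable random factors
A2, B2, C2 = rng.standard_normal((2, 3)), rng.standard_normal((3, 4)), rng.standard_normal((4, 2))
lhs = (A2 @ B2 @ C2).reshape(-1, order='F')
rhs = np.kron(C2.T, A2) @ B2.reshape(-1, order='F')
print(np.allclose(lhs, rhs))
```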

Rank of \({\mathcal {A}}\) We shall first prove by contradiction that the tensor rank of \({\mathcal {A}}\) is three. If the tensor rank of \({\mathcal {A}}\) were two, its unfolding might be represented as

$$\begin{aligned} {\mathcal {A}}_{\left( 1\right) }=u_{1}^{\top }\otimes v_{1}\otimes w_{1}^{\top }+u_{2}^{\top }\otimes v_{2}\otimes w_{2}^{\top } \end{aligned}$$

for some 3-dimensional real vectors \(u_{1}\), \(v_{1}\), \(w_{1}\), \(u_{2}\), \( v_{2}\), \(w_{2}\). As a direct consequence, we would have

$$\begin{aligned} {\mathcal {A}}_{\left( 1\right) }{\mathcal {A}}_{\left( 1\right) }^{\top }= & {} \left( u_{1}^{\top }\otimes v_{1}\otimes w_{1}^{\top }+u_{2}^{\top }\otimes v_{2}\otimes w_{2}^{\top }\right) \left( u_{1}\otimes v_{1}^{\top }\otimes w_{1}+u_{2}\otimes v_{2}^{\top }\otimes w_{2}\right) \\= & {} \left( u_{1}^{\top }u_{1}\right) \left( w_{1}^{\top }w_{1}\right) v_{1}v_{1}^{\top }+\left( u_{1}^{\top }u_{2}\right) \left( w_{1}^{\top }w_{2}\right) \left[ v_{1}v_{2}^{\top }+v_{2}v_{1}^{\top }\right] \\&\quad +\,\left( u_{2}^{\top }u_{2}\right) \left( w_{2}^{\top }w_{2}\right) v_{2}v_{2}^{\top }. \end{aligned}$$

The rank of \({\mathcal {A}}_{\left( 1\right) }{\mathcal {A}}_{\left( 1\right) }^{\top }\) would then be at most two, since the quadratic form \(a^{\top }{\mathcal {A}}_{\left( 1\right) }{\mathcal {A}}_{\left( 1\right) }^{\top }a\) would be zero for any 3-dimensional vector a orthogonal to both \(v_{1}\) and \(v_{2}\). This would lead to a contradiction, since \({\mathcal {A}}_{\left( 1\right) }{\mathcal {A}}_{\left( 1\right) }^{\top }=2I_{3}\), where \(I_{3}\) is the \(3\times 3\) identity matrix, and hence \({\mathcal {A}}_{\left( 1\right) }{\mathcal {A}}_{\left( 1\right) }^{\top }\) is of full rank. In a similar way we can prove that the tensor rank of \({\mathcal {A}}\) is not one. Nor is it zero, since \({\mathcal {A}}\) is not a null tensor. Hence it must be greater than or equal to three. We shall prove that it is three using the vectors \(e_{1}=\left( 1,0,0\right) ^{\top }\), \(e_{2}=\left( 0,1,0\right) ^{\top }\) and \(e_{3}=\left( 0,0,1\right) ^{\top }\): elementary matrix algebra shows that

$$\begin{aligned} {\mathcal {A}}_{\left( 1\right) }=e_{1}^{\top }\otimes e_{2}\otimes e_{3}^{\top }+e_{2}^{\top }\otimes e_{1}\otimes e_{3}^{\top }+e_{3}^{\top }\otimes e_{2}\otimes e_{1}^{\top }. \end{aligned}$$

We shall now prove by contradiction that the symmetric tensor rank of \({\mathcal {A}}\) is four. If the symmetric tensor rank of \({\mathcal {A}}\) were three, its unfolding might be represented as

$$\begin{aligned} {\mathcal {A}}_{\left( 1\right) }=\underset{i=1}{\overset{3}{\sum }}b_{i}^{\top }\otimes b_{i}^{\top }\otimes b_{i} \end{aligned}$$

for some 3-dimensional real vectors \(b_{1}\), \(b_{2}\) and \(b_{3}\). These vectors are linearly independent, otherwise there would exist a 3-dimensional real vector v orthogonal to all of them, making the quadratic form

$$\begin{aligned} v^{\top }{\mathcal {A}}_{\left( 1\right) }{\mathcal {A}}_{\left( 1\right) }^{\top }v=\underset{i=1}{\overset{3}{\sum }}\underset{j=1}{\overset{3}{\sum }} \left( b_{i}^{\top }b_{j}\right) ^{2}\left( v^{\top }b_{j}\right) \left( b_{i}^{\top }v\right) \end{aligned}$$

equal to zero. This would lead to a contradiction, since \({\mathcal {A}}_{\left( 1\right) }{\mathcal {A}}_{\left( 1\right) }^{\top }\) is proportional to the \(3\times 3\) identity matrix, and therefore positive definite. Let u and w be a 9-dimensional and a 3-dimensional real vector, respectively, whose first components equal one while the other components equal zero, so that \({\mathcal {A}}_{\left( 1\right) }u=\left( 0,0,0\right) ^{\top }\) and \(u=w\otimes w\otimes 1\). Now apply standard properties of the Kronecker product to obtain

$$\begin{aligned} \left( \underset{i=1}{\overset{3}{\sum }}b_{i}^{\top }\otimes b_{i}^{\top }\otimes b_{i}\right) \left( w\otimes w\otimes 1\right) =\underset{i=1}{\overset{3}{\sum }}b_{i}^{\top }w\otimes b_{i}^{\top }w\otimes b_{i}=\underset{i=1}{\overset{3}{\sum }}\left( b_{i}^{\top }w\right) ^{2}b_{i}=\left( \begin{array}{c} 0 \\ 0 \\ 0 \end{array} \right) . \end{aligned}$$

Since \(b_{1}\), \(b_{2}\) and \(b_{3}\) are linearly independent, at least one of the scalar products \(b_{1}^{\top }w\), \(b_{2}^{\top }w\) and \(b_{3}^{\top }w\) is different from zero. This would lead to a contradiction, since the null vector \(\left( 0,0,0\right) ^{\top }\) may be obtained as a linear combination of the linearly independent vectors \(b_{1}\), \(b_{2}\) and \(b_{3}\) only if all the coefficients \(b_{1}^{\top }w\), \(b_{2}^{\top }w\) and \(b_{3}^{\top }w\) of the linear combination are zero. We conclude that the symmetric tensor rank of \({\mathcal {A}}\) cannot be three. In a similar way we can prove that the symmetric tensor rank of \({\mathcal {A}}\) is not smaller than three. We shall prove that it is four using the vectors \(a_{1}=\left( 1,1,1\right) ^{\top }\), \(a_{2}=\left( 1,-1,-1\right) ^{\top }\), \(a_{3}=\left( -1,1,-1\right) ^{\top }\) and \(a_{4}=\left( -1,-1,1\right) ^{\top }\): elementary matrix algebra shows that

$$\begin{aligned} {\mathcal {A}}_{\left( 1\right) }=\overset{4}{\underset{i=1}{\sum }}\left( 2^{-2/3}a_{i}^{\top }\right) \otimes \left( 2^{-2/3}a_{i}\right) \otimes \left( 2^{-2/3}a_{i}^{\top }\right) . \end{aligned}$$
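
As a sanity check (an illustrative numpy sketch, assuming the unfolding convention used above, in which a term \(x^{\top }\otimes y\otimes z^{\top }\) is a \(3\times 9\) matrix), the four-term symmetric decomposition can be evaluated numerically; it yields a tensor whose entries equal one exactly when the three indices are all distinct, and it satisfies \({\mathcal {A}}_{\left( 1\right) }{\mathcal {A}}_{\left( 1\right) }^{\top }=2I_{3}\).

```python
import numpy as np

a = [np.array(v, dtype=float) for v in
     [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]]

# Each term is (2^{-2/3} a_i)^T (x) (2^{-2/3} a_i) (x) (2^{-2/3} a_i)^T, a 3 x 9
# matrix; the three scalar factors multiply to 1/4.
A1 = sum(np.kron(np.kron(ai[None, :], ai[:, None]), ai[None, :]) for ai in a) / 4

print(A1)            # entries equal 1 exactly when the three indices are all distinct
print(A1 @ A1.T)     # equals 2 * I_3, so the unfolding has full row rank
```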

Proof of Theorem 1

Let \(\mu _{i,j}\) (\(\overline{\mu }_{i,j}\)) be the j-th moment (centered moment) matrix of the i-th mixture component, with cdf \(F_{i}\) and weight \(\pi _{i}\), for \(j=1,2,3\) and \(i=1, \ldots ,g\). Also, let \(\mu \) be the expected value of M: \(\mu =\mu _{1,1}\pi _{1}+ \cdots +\mu _{g,1}\pi _{g}\). Finally, let \(\lambda _{i}=\mu _{i,1}-\mu \), for \(i=1, \ldots ,g\). By definition, the expected value of \(\varXi \) and the third cumulant matrix of M are

$$\begin{aligned} E\left( \varXi \right) =K_{3,1}\pi _{1}+ \cdots +K_{3,g}\pi _{g}\text { and } K_{3,M}=\lambda _{1}\otimes \lambda _{1}^{\top }\otimes \lambda _{1}\pi _{1}+ \cdots +\lambda _{g}\otimes \lambda _{g}^{\top }\otimes \lambda _{g}\pi _{g}. \end{aligned}$$

The tensor product \(\left( y+c\right) \otimes \left( y+c\right) ^{\top }\otimes \left( y+c\right) \) might be decomposed into

$$\begin{aligned} yy^{\top }\otimes y+yy^{\top }\otimes c+c\otimes yy^{\top }+c\otimes c\otimes y^{\top }+y\otimes y\otimes c^{\top }+y\otimes cc^{\top }+cc^{\top }\otimes y+cc^{\top }\otimes c \end{aligned}$$
(3)

(Loperfido 2013). The third moment matrix about \(\mu \) of a random vector with cdf \(F_{i}\) is \(\mu _{i,3}\left( X-\mu \right) =\mu _{i,3}\left[ \left( X-\mu _{i,1}\right) +\lambda _{i}\right] \). By taking expectations with respect to the cdf \(F_{i}\), after letting \(y=X-\mu _{i,1}\) and \(c=\lambda _{i}\), we obtain another expression for \(\mu _{i,3}\left( X-\mu \right) \):

$$\begin{aligned}&\overline{\mu }_{i,3}+\overline{\mu }_{i,2}\otimes \lambda _{i}+\lambda _{i}\otimes \overline{\mu }_{i,2}+\lambda _{i}\otimes \lambda _{i}\otimes \overline{\mu }_{i,1}^{\top }+\overline{\mu }_{i,2}^{V}\otimes \lambda _{i}^{\top }+\,\overline{\mu }_{i,1}\otimes \lambda _{i}\lambda _{i}^{\top }\nonumber \\&\quad +\lambda _{i}\lambda _{i}^{\top }\otimes \overline{\mu }_{i,1}+\lambda _{i}\lambda _{i}^{\top }\otimes \lambda _{i}, \end{aligned}$$
(4)

where \(A^{V}\) denotes the vectorization of the matrix A. By assumption, \( \overline{\mu }_{i,3}\) and \(\overline{\mu }_{i,2}\) equal \(K_{i,3}\) and \( \varOmega \), thus leading to the simplified expression

$$\begin{aligned} \mu _{i,3}\left( X-\mu \right) =K_{i,3}+\varOmega \otimes \lambda _{i}+\lambda _{i}\otimes \varOmega +\varOmega ^{V}\otimes \lambda _{i}^{\top }+\lambda _{i}\lambda _{i}^{\top }\otimes \lambda _{i}. \end{aligned}$$
(5)

By definition, the cdf of X is the mixture of the distribution functions \( F_{1}\), ..., \(F_{g}\), with weights \(\pi _{1}\), ..., \(\pi _{g}\). Hence \( K_{3,X}\), i.e. the third cumulant matrix of X, is

$$\begin{aligned} \sum \limits _{i=1}^{g}K_{i,3}\pi _{i}+\varOmega \otimes \sum \limits _{i=1}^{g}\lambda _{i}\pi _{i}+\sum \limits _{i=1}^{g}\lambda _{i}\pi _{i}\otimes \varOmega +\varOmega ^{V}\otimes \sum \limits _{i=1}^{g}\lambda _{i}^{\top }\pi _{i}+\sum \limits _{i=1}^{g}\lambda _{i}\lambda _{i}^{\top }\otimes \lambda _{i}\pi _{i}. \end{aligned}$$
(6)

It might be simplified into \(K_{3,X}=E\left( \varXi \right) +K_{3,M}\) by noticing that \(E\left( M-\mu \right) =\lambda _{1}\pi _{1}+ \cdots +\lambda _{g}\pi _{g}\) is a null vector and by recalling the definitions of \(E\left( \varXi \right) \) and \(K_{3,M}\).
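
The key algebraic step of the proof is the eight-term expansion (3). Since every summand can be rewritten in the form \(a\otimes b^{\top }\otimes c\) by the matrix identities recalled in the previous section, identity (3) can be checked numerically as follows (an illustrative numpy sketch using random vectors):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
d = 3
y, c = rng.standard_normal(d), rng.standard_normal(d)

def term(a, b, e):
    # a (x) b^T (x) e, a d^2 x d matrix in the layout used throughout the appendix
    return np.kron(np.kron(a[:, None], b[None, :]), e[:, None])

lhs = term(y + c, y + c, y + c)
# the eight summands of (3) correspond to the 2^3 ways of placing y or c in the three slots
rhs = sum(term(u, v, w) for u, v, w in product([y, c], repeat=3))
print(np.allclose(lhs, rhs))
```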

Proof of Theorem 2

Without loss of generality we can assume that the location vector is a null vector: \(\xi =0_{d}\). Let \(X_{i}\sim SN_{d}\left( 0_{d},\varOmega ,\alpha _{i}\right) \) be a d-dimensional skew-normal random vector with null location vector, scale matrix \(\varOmega \) and shape parameter \(\alpha _{i}\), for \(i=1\), \(\ldots \), g. The first and second moments of \(X_{i}\) are \(E\left( X_{i}\right) =\sqrt{2/\pi }\delta _{i}\) and \(E\left( X_{i}X_{i}^{\top }\right) =\varOmega \), for \(i=1\), \(\ldots \), g. The third moment of \(X_{i}\) is

$$\begin{aligned} \mu _{3,i}=\sqrt{2/\pi }\left[ \delta _{i}\otimes \varOmega +\varOmega ^{V}\delta _{i}^{\top }+\left( I_{d}\otimes \delta _{i}\right) \varOmega -\left( I_{d}\otimes \delta _{i}\right) \left( \delta _{i}\otimes \delta _{i}^{\top }\right) \right] , \end{aligned}$$
(7)

where \(\varOmega ^{V}\) denotes the vectorization of the matrix \(\varOmega \). It might be simplified into

$$\begin{aligned} \mu _{3,i}=\sqrt{2/\pi }\left( \delta _{i}\otimes \varOmega +\varOmega ^{V}\delta _{i}^{\top }+\varOmega \otimes \delta _{i}-\delta _{i}\otimes \delta _{i}^{\top }\otimes \delta _{i}\right) , \end{aligned}$$
(8)

by recalling that \(\left( A\otimes B\right) \left( C\otimes D\right) =AC\otimes BD\), if matrices A, B, C and D are of appropriate size. The third moment matrix of X is a weighted average of \(\mu _{3,1}\),..., \( \mu _{3,g}\) with weights \(\pi _{1}\),..., \(\pi _{g}\):

$$\begin{aligned} \mu _{3}=\sum \limits _{i=1}^{g}\pi _{i}\mu _{3,i}=\varOmega \otimes \eta +\eta \otimes \varOmega +\varOmega ^{V}\eta ^{\top }-\sqrt{\frac{2}{\pi }} \sum \limits _{i=1}^{g}\pi _{i}\delta _{i}\otimes \delta _{i}^{\top }\otimes \delta _{i}, \end{aligned}$$
(9)

where \(\eta =E\left( X\right) =\sqrt{2/\pi }\left( \pi _{1}\delta _{1}+ \cdots +\pi _{g}\delta _{g}\right) \) is the mean of X. A similar argument shows that the second moment of X is just the common scale matrix \(\varOmega \) : \(E\left( XX^{\top }\right) =\pi _{1}E\left( X_{1}X_{1}^{\top }\right) + \cdots +\pi _{g}E\left( X_{g}X_{g}^{\top }\right) =\pi _{1}\varOmega + \cdots +\pi _{g}\varOmega =\varOmega \). The third cumulant of X is the difference of \( E\left( X\otimes X^{\top }\otimes X\right) +2E\left( X\right) \otimes E^{\top }\left( X\right) \otimes E\left( X\right) \) and \(E\left( XX^{\top }\right) \otimes E\left( X\right) +E\left( X\right) \otimes E\left( XX^{\top }\right) +E^{V}\left( XX^{\top }\right) E^{\top }\left( X\right) \). Hence the third cumulant matrix of X might be represented as

$$\begin{aligned} \mu _{3}-\varOmega \otimes \eta -\eta \otimes \varOmega -\varOmega ^{V}\eta ^{\top }+2\eta \otimes \eta ^{\top }\otimes \eta =2\eta \otimes \eta ^{\top }\otimes \eta -\sqrt{\frac{2}{\pi }}\sum \limits _{i=1}^{g}\pi _{i}\delta _{i}\otimes \delta _{i}^{\top }\otimes \delta _{i}. \end{aligned}$$
(10)

The proof is completed by recalling the definition of \(\eta \).
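
A rough Monte Carlo sanity check of the \(g=1\) special case of (10) is sketched below, assuming \(\varOmega =I_{d}\) and the usual convolution-type stochastic representation of the multivariate skew-normal; the vector \(\delta \) and the sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 400_000
delta = np.array([0.7, -0.3, 0.2])          # any vector with norm smaller than one

# X = delta * |Z0| + L * Z with L L^T = I_d - delta delta^T, so that E(X X^T) = I_d
L = np.linalg.cholesky(np.eye(d) - np.outer(delta, delta))
Z0 = np.abs(rng.standard_normal(n))
X = Z0[:, None] * delta + rng.standard_normal((n, d)) @ L.T

C = X - X.mean(axis=0)
T = np.einsum('ni,nj,nk->ijk', C, C, C) / n      # sample third central moments
K3_hat = T.transpose(0, 2, 1).reshape(d * d, d)  # d^2 x d layout used in the appendix

# g = 1 case of (10): K_3 = sqrt(2/pi) * (4/pi - 1) * delta (x) delta^T (x) delta
K3 = np.sqrt(2 / np.pi) * (4 / np.pi - 1) * np.kron(np.outer(delta, delta), delta[:, None])
print(np.abs(K3_hat - K3).max())                 # small (Monte Carlo error only)
```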

Proof of Theorem 3

We shall first recall two well-known properties of the Kronecker product (see, for example, Rao and Rao 1998, p 197). If A , B and C are three matrices with B and C being of the same size, then \(A\otimes \left( B+C\right) =A\otimes B+A\otimes C\). The Kronecker product is associative, too: \(\left( A\otimes B\right) \otimes C=A\otimes \left( B\otimes C\right) =A\otimes B\otimes C\). The two properties lead to

$$\begin{aligned}&\left( x+y\right) \otimes \left( x+y\right) ^{\top }\otimes \left( x+y\right) =x\otimes x^{\top }\otimes x+x\otimes x^{\top }\otimes y+x\otimes y^{\top }\otimes x \\&\quad +\,x\otimes y^{\top }\otimes y+y\otimes x^{\top }\otimes x+y\otimes x^{\top }\otimes y+y\otimes y^{\top }\otimes x+y\otimes y^{\top }\otimes y, \end{aligned}$$

where x and y are two d-dimensional real vectors. The two properties also lead to

$$\begin{aligned}&\left( x-y\right) \otimes \left( x-y\right) ^{\top }\otimes \left( x-y\right) =x\otimes x^{\top }\otimes x-x\otimes x^{\top }\otimes y-x\otimes y^{\top }\otimes x \\&\quad +\,x\otimes y^{\top }\otimes y-y\otimes x^{\top }\otimes x+y\otimes x^{\top }\otimes y+y\otimes y^{\top }\otimes x-y\otimes y^{\top }\otimes y. \end{aligned}$$

Both identities lead to the following one:

$$\begin{aligned}&\left( x+y\right) \otimes \left( x+y\right) ^{\top }\otimes \left( x+y\right) +\left( x-y\right) \otimes \left( x-y\right) ^{\top }\otimes \left( x-y\right) \\&\quad =2x\otimes x^{\top }\otimes x+2x\otimes y^{\top }\otimes y+2y\otimes x^{\top }\otimes y+2y\otimes y^{\top }\otimes x. \end{aligned}$$

The above identity, together with the definitions of the vectors \(\alpha _{i}=\lambda +\gamma _{i}\) and \(\beta _{i}=\lambda -\gamma _{i}\), implies

$$\begin{aligned} \frac{1}{2}\left( \alpha _{i}\otimes \alpha _{i}^{\top }\otimes \alpha _{i}+\beta _{i}\otimes \beta _{i}^{\top }\otimes \beta _{i}\right) =\lambda \otimes \lambda ^{\top }\otimes \lambda +\lambda \otimes \gamma _{i}^{\top }\otimes \gamma _{i}+\gamma _{i}\otimes \lambda ^{\top }\otimes \gamma _{i}+\gamma _{i}\otimes \gamma _{i}^{\top }\otimes \lambda . \end{aligned}$$

We shall now recall another property of the Kronecker product: if a and b are two vectors, then \(ab^{\top }\), \(a\otimes b^{\top }\) and \(b^{\top }\otimes a\) denote the same matrix (see, for example, Rao and Rao 1998, p 199). This property, together with the definitions of \(\varGamma \) and \(\lambda \), leads to

$$\begin{aligned} \varGamma \otimes \lambda&= \left( \overset{d}{\underset{i=1}{\sum }}\gamma _{i}\gamma _{i}^{\top }\right) \otimes \lambda =\left( \overset{d}{\underset{i=1}{\sum }}\gamma _{i}\otimes \gamma _{i}^{\top }\right) \otimes \lambda =\overset{d}{\underset{i=1}{\sum }}\gamma _{i}\otimes \gamma _{i}^{\top }\otimes \lambda , \\ \lambda \otimes \varGamma&= \lambda \otimes \overset{d}{\underset{i=1}{\sum }}\gamma _{i}\gamma _{i}^{\top }=\overset{d}{\underset{i=1}{\sum }}\lambda \otimes \gamma _{i}\otimes \gamma _{i}^{\top }=\overset{d}{\underset{i=1}{\sum }}\lambda \otimes \gamma _{i}^{\top }\otimes \gamma _{i}\text { and} \\ \varGamma ^{V}\lambda ^{\top }&= \left( \overset{d}{\underset{i=1}{\sum }}\gamma _{i}\otimes \gamma _{i}\right) \lambda ^{\top }=\overset{d}{\underset{i=1}{\sum }}\gamma _{i}\otimes \gamma _{i}\otimes \lambda ^{\top }=\overset{d}{\underset{i=1}{\sum }}\gamma _{i}\otimes \lambda ^{\top }\otimes \gamma _{i}. \end{aligned}$$

As a direct consequence, the following matrix decomposition holds true:

$$\begin{aligned} \varGamma \otimes \lambda +\lambda \otimes \varGamma +\varGamma ^{V}\lambda ^{\top }=-d\,\lambda \otimes \lambda ^{\top }\otimes \lambda +\frac{1}{2}\overset{d}{\underset{i=1}{\sum }}\left( \alpha _{i}\otimes \alpha _{i}^{\top }\otimes \alpha _{i}+\beta _{i}\otimes \beta _{i}^{\top }\otimes \beta _{i}\right) . \end{aligned}$$

Finally, we shall recall a third property of the Kronecker product: if A and B are any two matrices then \(\left( A\otimes B\right) ^{\top }=A^{\top }\otimes B^{\top }\) (see, for example, Rao and Rao 1998, page 194). Therefore we have

$$\begin{aligned} \varGamma \otimes \lambda ^{\top }+\lambda ^{\top }\otimes \varGamma +\lambda \varGamma ^{V\top }=-d\,\lambda ^{\top }\otimes \lambda \otimes \lambda ^{\top }+\frac{1}{2}\overset{d}{\underset{i=1}{\sum }}\left( \alpha _{i}^{\top }\otimes \alpha _{i}\otimes \alpha _{i}^{\top }+\beta _{i}^{\top }\otimes \beta _{i}\otimes \beta _{i}^{\top }\right) . \end{aligned}$$

By assumption, the left-hand side of the above identity equals the matrix unfolding of \({\mathcal {T}}\), and this completes the proof.
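
The matrix decomposition stated before the transposition step can also be verified numerically. The sketch below is illustrative only: it uses random \(\lambda \) and \(\gamma _{1},\ldots ,\gamma _{d}\), with \(\alpha _{i}=\lambda +\gamma _{i}\), \(\beta _{i}=\lambda -\gamma _{i}\) and \(\varGamma =\sum _{i}\gamma _{i}\gamma _{i}^{\top }\) as defined above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
lam = rng.standard_normal(d)
gammas = rng.standard_normal((d, d))          # rows are gamma_1, ..., gamma_d

def term(a, b, e):
    # a (x) b^T (x) e in the d^2 x d layout used in the proofs
    return np.kron(np.kron(a[:, None], b[None, :]), e[:, None])

Gamma = gammas.T @ gammas                     # sum_i gamma_i gamma_i^T
GammaV = Gamma.reshape(-1, order='F')         # vec(Gamma) = sum_i gamma_i (x) gamma_i

lhs = np.kron(Gamma, lam[:, None]) + np.kron(lam[:, None], Gamma) + np.outer(GammaV, lam)
rhs = sum(0.5 * (term(lam + g, lam + g, lam + g) + term(lam - g, lam - g, lam - g))
          for g in gammas) - d * term(lam, lam, lam)
print(np.allclose(lhs, rhs))
```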

Proof of Theorem 4

We shall first prove the theorem for a location mixture of \(g\le d\) weakly symmetric components. By Theorem 1, the third cumulant matrix of X is

$$\begin{aligned} K_{3,X}=\sum \limits _{i=1}^{g}\pi _{i}\left( \mu _{i}-\mu \right) \otimes \left( \mu _{i}-\mu \right) ^{\top }\otimes \left( \mu _{i}-\mu \right) , \end{aligned}$$
(11)

where \(\mu _{i}\) and \(\pi _{i}>0\) are the mean and the weight of the i-th mixture component, for \(i=1, \ldots ,g\), while \(\mu =\pi _{1}\mu _{1}+ \cdots +\pi _{g}\mu _{g}\) is the mean of X. The best linear discriminant subspace \(\left\{ \mu _{1}-\mu , \ldots ,\mu _{g}-\mu \right\} \) is spanned by the columns of the matrix \(H=\left( \eta _{1}, \ldots ,\eta _{g}\right) \), where \(\eta _{i}=\pi _{i}^{1/3}\left( \mu _{i}-\mu \right) \), for \(i=1, \ldots ,g\). Also, let \(Z=\left( Z_{1}, \ldots ,Z_{d}\right) ^{\top }=\varSigma ^{-1/2}\left( X-\mu \right) \) be the standardized version of X, where \(\varSigma ^{-1/2}\) is the symmetric, positive definite square root of the concentration matrix \(\varSigma ^{-1}\), that is the inverse of the covariance matrix \(\varSigma \) of X. The third cumulant of Z is

$$\begin{aligned} K_{3,Z}=\gamma _{1}\otimes \gamma _{1}^{\top }\otimes \gamma _{1}+ \cdots +\gamma _{g}\otimes \gamma _{g}^{\top }\otimes \gamma _{g}, \end{aligned}$$

where \(\gamma _{i}=\varSigma ^{-1/2}\eta _{i}\), for \(i=1, \ldots ,g\). Since \(\varSigma ^{-1/2}\) is a full-rank matrix, the columns of the matrix \(\varGamma =\left( \gamma _{1}, \ldots ,\gamma _{g}\right) \) span the best linear discriminant subspace, too. We shall denote its rank by \(h\le g-1\). The skewness of the linear combination \(c^{\top }Z\) is \(\beta _{1}\left( c^{\top }Z\right) =c^{\top }K_{3,Z}^{\top }\left( c\otimes c\right) /\left\| c\right\| ^{3}\), where \(\left\| c\right\| \) is the Euclidean norm of the d-dimensional, real vector c. Hence the vector c which maximizes the skewness of \(c^{\top }Z\) is proportional to the dominant tensor eigenvector of \({\mathcal {K}}_{3,Z}\), that is the unit-norm vector \(v_{1}\) which satisfies \(K_{3,Z}^{\top }\left( v_{1}\otimes v_{1}\right) =\lambda _{1}v_{1}\) for the highest possible value of the scalar \(\lambda _{1}\). We shall now recall two fundamental properties of the Kronecker product. It is associative: \(\left( A\otimes B\right) \otimes C=A\otimes \left( B\otimes C\right) =A\otimes B\otimes C\); if matrices A, B, C and D are of appropriate size, then \(\left( A\otimes B\right) \left( C\otimes D\right) =AC\otimes BD\). These properties, together with the identities

$$\begin{aligned} v_{1}\otimes v_{1}=\left( v_{1}\otimes v_{1}\otimes 1\right) \text { and } \gamma _{i}\otimes \gamma _{i}^{\top }\otimes \gamma _{i}=\gamma _{i}\otimes \gamma _{i}\otimes \gamma _{i}^{\top } \end{aligned}$$

lead to a convenient representation for \(v_{1}\):

$$\begin{aligned} v_{1}=\frac{\left( \gamma _{1}^{\top }v_{1}\right) ^{2}\gamma _{1}+ \cdots +\left( \gamma _{g}^{\top }v_{1}\right) ^{2}\gamma _{g}}{\lambda _{1} }. \end{aligned}$$

It follows that \(v_{1}\) is a linear combination of \(\gamma _{1}, \ldots , \gamma _{g}\) and hence belongs to the best linear discriminant subspace. It also follows that \(Y_{1}=v_{1}^{\top }\varSigma ^{-1/2}X\) is the linear projection of X with maximal skewness, and therefore \(v_{1}^{\top }\varSigma ^{-1/2}\) might be taken as the first row of A. The linear projections \( Y_{2}\), ..., \(Y_{h}\) might be found using similar arguments.
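
The fixed-point representation of \(v_{1}\) suggests a simple higher-order power iteration for approximating the dominant tensor eigenvector. The sketch below is only an illustration of this fixed-point map, not the estimation procedure of the paper; the \(\gamma _{i}\) are random illustrative choices. Note that, after the first step, every iterate is a linear combination of \(\gamma _{1}, \ldots ,\gamma _{g}\), which is precisely the property exploited in the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
d, g = 5, 3
gammas = rng.standard_normal((g, d))     # illustrative standardized mean contrasts gamma_i

# d^2 x d unfolding K_{3,Z} = sum_i gamma_i (x) gamma_i^T (x) gamma_i
K3 = sum(np.kron(np.outer(gm, gm), gm[:, None]) for gm in gammas)

# Fixed-point (higher-order power) iteration: v <- K_{3,Z}^T (v (x) v), then normalise
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
for _ in range(200):
    v_new = K3.T @ np.kron(v, v)         # equals sum_i (gamma_i^T v)^2 gamma_i
    v = v_new / np.linalg.norm(v_new)

# v lies in span{gamma_1, ..., gamma_g}: the least-squares residual is numerically zero
coef, *_ = np.linalg.lstsq(gammas.T, v, rcond=None)
print(np.linalg.norm(gammas.T @ coef - v))
```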

We shall now consider shape mixtures with skew-normal components, and use the same notation as in Theorem 2, which allows us to represent the third cumulant matrix of X as

$$\begin{aligned} 2\left( \frac{2}{\pi }\right) ^{3/2}\sum \limits _{i=1}^{g}\pi _{i}\delta _{i}\otimes \sum \limits _{i=1}^{g}\pi _{i}\delta _{i}^{\top }\otimes \sum \limits _{i=1}^{g}\pi _{i}\delta _{i}-\sqrt{\frac{2}{\pi }} \sum \limits _{i=1}^{g}\pi _{i}\delta _{i}\otimes \delta _{i}^{\top }\otimes \delta _{i}. \end{aligned}$$
(12)

We have \(\mu _{i}=\xi +\sqrt{2/\pi }\delta _{i}\) and \(\mu _{i}-\mu =\sqrt{ 2/\pi }\left( \delta _{i}-\pi _{1}\delta _{1}- \cdots -\pi _{g}\delta _{g}\right) \), for \(i=1, \ldots ,g\). Hence the best linear discriminant subspace \(\left\{ \mu _{1}-\mu , \ldots ,\mu _{g}-\mu \right\} \) is spanned by the columns of the matrix \(H=\left( \eta _{1}, \ldots ,\eta _{g+1}\right) \), where

$$\begin{aligned} \eta _{1}=\root 3 \of {2}\sqrt{\frac{2}{\pi }}\left( \pi _{1}\delta _{1}+ \cdots +\pi _{g}\delta _{g}\right) \text { and }\eta _{i}=-\root 3 \of {\pi _{i-1}\sqrt{\frac{2}{\pi }}}\,\delta _{i-1}, \end{aligned}$$

for \(i=2, \ldots ,g+1\). An argument similar to the one used for location mixtures completes the proof.

Proof of Theorem 5

Without loss of generality we shall assume that the random vector X is centred: \(E\left( X\right) =0_{d}\). Then its third cumulant matrix coincides with its third moment matrix \(M_{3,X}\), and might be represented as

$$\begin{aligned} \eta _{1}\otimes \eta _{1}\otimes \eta _{1}^{\top }+ \cdots +\eta _{d}\otimes \eta _{d}\otimes \eta _{d}^{\top }, \end{aligned}$$

where \(\eta _{1}, \ldots , \eta _{d}\) are d-dimensional, real vectors. Clearly, if the tensor rank of \(M_{3,X}\) is \(k\le d\), the last \(d-k\) vectors \(\eta _{k+1}, \ldots , \eta _{d}\) may be taken to be null vectors. The third moment matrix of \(Y=\left( Y_{1},\ldots ,Y_{d}\right) ^{\top }=BX\), where B is a \(d\times d\) real matrix of full rank, is

$$\begin{aligned} M_{3,Y}=B\eta _{1}\otimes B\eta _{1}\otimes \eta _{1}^{\top }B^{\top }+ \cdots +B\eta _{d}\otimes B\eta _{d}\otimes \eta _{d}^{\top }B^{\top }. \end{aligned}$$

The matrix B might be chosen to make the product BH a diagonal matrix: \(BH=diag\left( \psi _{1}^{1/3},\ldots ,\psi _{d}^{1/3}\right) \), where \(H=\left( \eta _{1},\ldots ,\eta _{d}\right) \) is the \(d\times d\) matrix whose columns are \(\eta _{1}\), ..., \(\eta _{d}\). The third moment of Y would then be a third-order, symmetric and diagonal tensor:

$$\begin{aligned} M_{3,Y}=\psi _{1}e_{1}\otimes e_{1}\otimes e_{1}^{\top }+ \cdots +\psi _{d}e_{d}\otimes e_{d}\otimes e_{d}^{\top }, \end{aligned}$$

where \(e_{i}\) is the i-th column of the d-dimensional identity matrix and \(\psi _{i}=E\left( Y_{i}^{3}\right) \), for \(i=1,\ldots ,d\). Let \(W=\left( W_{1},\ldots ,W_{d}\right) ^{\top }=CY\), where C is again a \(d\times d\) real matrix of full rank. The third moment of the i-th component of W, denoted by \(\omega _{i}=E\left( W_{i}^{3}\right) \), is just the sum of all products \(c_{ij}c_{ih}c_{il}E\left( Y_{j}Y_{h}Y_{l}\right) \). The expectation \(E\left( Y_{j}Y_{h}Y_{l}\right) \) is zero whenever at least one index differs from the remaining ones, so that \(\omega _{i}=c_{i1}^{3}\psi _{1}+ \cdots +c_{id}^{3}\psi _{d}\), for \(i=1, \ldots ,d\). In matrix notation, we might write \(\omega =\left( C\circ C\circ C\right) \psi \), where \(\omega =\left( \omega _{1}, \ldots ,\omega _{d}\right) ^{\top }\), \(\psi =\left( \psi _{1}, \ldots ,\psi _{d}\right) ^{\top }\) and “\(\circ \)” denotes the Hadamard, or elementwise, product (see, for example, Rao and Rao 1998, p 203). The theorem is trivially true when \(\psi \) is a null vector, which happens only when the third moment of X is a null matrix. Otherwise, the entries of C might be chosen to make \(\omega \) the d-dimensional real vector whose first entry is one while all others are zero: \(\omega =\left( 1,0, \ldots ,0\right) ^{\top }\in {\mathbb {R}}^{d}\). Then the matrix A might be obtained by removing the first row of the matrix product CB.
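
The bookkeeping behind \(\omega =\left( C\circ C\circ C\right) \psi \) can be checked directly by transforming a diagonal third-moment tensor; the numpy sketch below uses arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
psi = rng.standard_normal(d)        # third moments of Y_1, ..., Y_d
C = rng.standard_normal((d, d))     # full rank with probability one

# Diagonal third-moment tensor of Y (the form reached after the change of basis B)
T_Y = np.zeros((d, d, d))
T_Y[np.arange(d), np.arange(d), np.arange(d)] = psi

# Third-moment tensor of W = C Y is the multilinear transform of T_Y by C
T_W = np.einsum('ia,jb,kc,abc->ijk', C, C, C, T_Y)

omega = T_W[np.arange(d), np.arange(d), np.arange(d)]     # omega_i = E(W_i^3)
print(np.allclose(omega, (C**3) @ psi))                   # omega = (C o C o C) psi
```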


Cite this article

Loperfido, N. Finite mixtures, projection pursuit and tensor rank: a triangulation. Adv Data Anal Classif 13, 145–173 (2019). https://doi.org/10.1007/s11634-018-0336-z

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11634-018-0336-z

Keywords

Mathematics Subject Classification

Navigation