
Neural Networks

Volume 23, Issue 2, March 2010, Pages 226-238

Robust extraction of local structures by the minimum β-divergence method

https://doi.org/10.1016/j.neunet.2009.11.011

Abstract

This paper discusses a new highly robust learning algorithm for exploring local principal component analysis (PCA) structures in which the observed data follow one of several heterogeneous PCA models. The proposed method is formulated by minimizing β-divergence. It searches for a local PCA structure based on an initial location of the shifting parameter and a value of the tuning parameter β. If the initial choice of the shifting parameter belongs to a data cluster, then the proposed method detects the local PCA structure of that data cluster, ignoring data in other clusters as outliers. We discuss the selection procedures for the tuning parameter β and the initial value of the shifting parameter μ in this article. We demonstrate the performance of the proposed method by simulation. Finally, we compare the proposed method with a method based on a finite mixture model.

Introduction

Principal component analysis (PCA) is one of the most popular techniques for processing, compressing and visualizing multivariate data. It is widely used for reducing the dimensionality of multivariate data (Jolliffe, 2002). In general, PCA aims to extract the most informative q-dimensional output vector y(t) from an input vector x(t) of dimension m, which is achieved by obtaining the m×q orthogonal matrix Γ (i.e. $\Gamma^{\mathrm{T}}\Gamma = I_q$, the identity matrix). Thus Γ linearly relates x(t) to y(t) by $y(t) = \Gamma^{\mathrm{T}}(x(t)-\mu),\ t=1,2,\ldots,n$, such that the components of y(t) are mutually uncorrelated, with variances ordered according to the component number of y(t). In the context of off-line learning, Γ and μ are directly obtained as the q dominant eigenvectors of the sample covariance matrix and the sample mean vector, respectively. Classical PCA is characterized by minimizing the empirical loss function $\frac{1}{n}\sum_{t=1}^{n} z(x(t),\mu,\Gamma)$ with respect to μ and Γ, where $z(x,\mu,\Gamma) = \frac{1}{2}\{\|x-\mu\|^{2} - \|\Gamma^{\mathrm{T}}(x-\mu)\|^{2}\}$, that is, half the squared residual distance of x−μ from its projection onto the subspace spanned by the columns of Γ (Hotelling, 1933). Higuchi and Eguchi (2004) proposed a variant of this classical procedure for robust PCA by minimizing the empirical loss function $L_{\Psi}(\mu,\Gamma) = \frac{1}{n}\sum_{t=1}^{n}\Psi(z(x(t),\mu,\Gamma))$, where Ψ(z) is assumed to be monotonically increasing. Various choices of Ψ yield various procedures for PCA, including the identity function $\Psi_{0}(z)=z$ for classical PCA and the sigmoid function for the self-organizing rule, cf. Xu and Yuille (1995). In general, Ψ is interpreted as a generic function that generates the loss function $L_{\Psi}$. The minimization of $L_{\Psi}$ in Eq. (4) is referred to as the minimum psi principle generated by Ψ. Based on an argument similar to that of classical PCA, Higuchi and Eguchi (2004) showed that the minimizer of $L_{\Psi}(\mu,\Gamma)$ satisfies a stationary equation system for μ and Γ. In neural networks, Γ is interpreted as the coefficient matrix connecting m neurons to q neurons, where a learning process works by off-line renewal of Γ based on a batch of input vectors or by on-line renewal of Γ based on sequential input vectors (Amari, 1977, Haykin, 1999, Oja, 1989). See also Croux and Haesbroeck (2000) and Campbell (1980) for robust PCA methods. All the PCA algorithms mentioned above are well discussed and established in a context in which the data distribution is uni-modal, that is, there is only one data center in the entire data space.
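For concreteness, in the classical case $\Psi_{0}(z)=z$ the minimizers reduce to the sample mean and the q dominant eigenvectors of the sample covariance matrix. The following is a minimal NumPy sketch of that special case (function and variable names are ours, not the paper's):

```python
import numpy as np

def classical_pca(X, q):
    """Classical PCA: minimizer of (1/n) * sum_t z(x(t), mu, Gamma) with Psi_0(z) = z.

    X is an (n, m) data matrix; returns the sample mean mu, the m x q orthonormal
    matrix Gamma of dominant eigenvectors of the sample covariance, and the outputs y(t).
    """
    mu = X.mean(axis=0)                      # minimizer over mu is the sample mean
    S = np.cov(X, rowvar=False)              # sample covariance matrix (m x m)
    eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # sort descending by variance
    Gamma = eigvecs[:, order[:q]]            # columns are the q dominant eigenvectors
    Y = (X - mu) @ Gamma                     # outputs y(t) = Gamma^T (x(t) - mu)
    return mu, Gamma, Y
```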

In the case of a multi-modal distribution, the PCA algorithms discussed above do not perform well. In this respect, several interesting algorithms for local dimensionality reduction have been proposed: for example, the mixtures of PCA and mixtures of factor analysers proposed by Hinton, Dayan and Revow (1997), the VQPCA (Vector-Quantization PCA) algorithm of Kambhatla and Leen (1997), the mixtures of PPCA algorithm of Tipping and Bishop (1999), the nonlinear neural network model of mixtures of local PCA proposed by Zhang, Fu and Yan (2001), the resolution-based complexity control for Gaussian mixture models of Meinicke and Ritter (2001), the automated hierarchical mixtures of PPCA algorithm of Su and Dy (2004), and the extension of neural gas to local PCA proposed by Möller and Hoffmann (2004). However, when applying any one of these algorithms, one encounters the difficulty that the number of data clusters in the entire data space must be known in advance. To overcome this problem for local dimensionality reduction, there exist some alternative ideas, which include variational inference for Bayesian mixtures of factor analysers proposed by Ghahramani and Beal (2000), unsupervised learning of finite mixture models suggested by Figueiredo and Jain (2002), and accelerated variational Dirichlet mixture models of Kurihara, Welling and Vlassis (2006). However, these types of algorithms may give misleading results in the presence of outliers (Hampel, Ronchetti, Rousseeuw & Stahel, 1986). In this respect, Ridder and France (2003) offered a robust algorithm based on mixtures of PPCA using t-distributions, but one major problem with this algorithm is that it requires the number of data clusters in advance. Therefore, a highly robust algorithm against outliers that does not require the number of data clusters in advance is desirable.

In this paper we propose a new highly robust algorithm for exploring local PCA structures by minimizing β-divergence in a situation where we do not know whether the data distribution is uni-modal or multi-modal. The key idea is the use of a super-robust PCA algorithm with a volume adjustment based on β-divergence, which properly detects one data cluster while ignoring all data in other clusters as outliers. See Higuchi and Eguchi (1998, 2004), Kamiya and Eguchi (2001) and Mollah, Minami and Eguchi (2007) for the robust procedures. The proposed method has a close link with the mixture ICA method proposed by Mollah, Minami and Eguchi (2006). See also Lee, Lewicki and Sejnowski (2000) for model-based mixtures of ICA models. We introduce the β-divergence satisfying a condition of volume matching, which naturally defines the learning algorithm for both uni-modal and multi-modal distributions. We also investigate the behavior of the expected loss function based on the β-divergence in the context of multi-modal distributions; this reveals a property of the proposed learning algorithm that goes beyond robustness. The proposed learning algorithm is based on the empirical loss function $L_{\beta}(\mu,V) = \frac{1}{n}\sum_{t=1}^{n}\frac{1}{\beta}\left[1 - \det(V)^{-\frac{\beta}{2(\beta+1)}}\exp\{-\beta\, w(x(t),\mu,V)\}\right]$, where V is a variance matrix and $w(x,\mu,V) = \frac{1}{2}(x-\mu)^{\mathrm{T}}V^{-1}(x-\mu)$. Thus the shifting parameter μ is defined by the minimizer of the loss function (5) with respect to μ; the connection matrix Γ is defined by the eigen-decomposition of the minimizer of the loss function (5) with respect to V. The loss function $L_{\beta}(\mu,V)$ is closely connected with the minimum psi-principle if we choose $\Psi_{\beta}(z) = \{1-\exp(-\beta z)\}/\beta$. The main difference from the minimum Ψ-principle is that the loss function is defined as a function of V and w in place of Γ and z in Eq. (4). This suggests a robust procedure for PCA by direct application of the discussion in Higuchi and Eguchi (2004). Furthermore, we will show that the loss function $L_{\beta}(\mu,V)$ satisfies a remarkable property beyond robustness, as follows. Consider a probabilistic situation in which the data distribution is J-modal. Then the dataset $D=\{x(t): t=1,\ldots,n\}$ in $\mathbb{R}^{m}$ can be decomposed into J mutually disjoint subsets $\{D_{j}: j=1,\ldots,J\}$ such that the jth local mode, with mean vector $\mu_{j}$ and variance matrix $V_{j}$, occurs in $D_{j}=\{x(t): t\in T_{j}\}$ with partitioned index sets $\{T_{j}: j=1,\ldots,J\}$. We will show that the proposed learning algorithm based on minimizing $L_{\beta}(\mu,V)$ can extract $\mu_{j}$ and $V_{j}$ if it starts from an initial point $\mu\in D_{j}$ and an appropriately chosen V. Under a Gaussian mixture density $p(x)=\sum_{j=1}^{J}\pi_{j}\varphi(x;\mu_{j},V_{j})$, we will observe that $(\mu_{j},V_{j}) = \arg\min_{(\mu,V)\in D_{j}\times S_{m}} L_{\beta}(\mu,V;p) = \arg\min_{(\mu,V)\in D_{j}\times S_{m}} L_{\beta}(\mu,V)$, $(j=1,2,\ldots,J)$, where $S_{m}$ denotes the space of all symmetric, positive-definite matrices of order m. Thus minimization of $L_{\beta}(\mu,V)$ with respect to $(\mu,V)\in (D_{j}\times S_{m})$ offers J local minima $\{(\mu_{j},V_{j}); j=1,2,\ldots,J\}$ for a J-modal data distribution.
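As an illustration of Eq. (5), the following NumPy sketch evaluates $w(x,\mu,V)$ and the empirical β-loss as reconstructed above; the helper names are ours, and the volume-adjustment factor follows our reading of the equation:

```python
import numpy as np

def w_quad(X, mu, V):
    """w(x, mu, V) = (1/2) (x - mu)^T V^{-1} (x - mu), evaluated for every row x of X."""
    diff = X - mu                                    # (n, m)
    sol = np.linalg.solve(V, diff.T).T               # rows of V^{-1} (x - mu)
    return 0.5 * np.einsum('ij,ij->i', diff, sol)    # (n,)

def beta_loss(X, mu, V, beta):
    """Empirical beta-loss L_beta(mu, V) of Eq. (5), averaged over the sample."""
    w = w_quad(X, mu, V)
    c = np.linalg.det(V) ** (-beta / (2.0 * (beta + 1.0)))   # volume-adjustment factor
    return np.mean((1.0 - c * np.exp(-beta * w)) / beta)
```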

Section 2 describes the new proposal for local PCA by minimizing β-divergence. In Section 3, we discuss the consistency of the proposed method for local PCA. Section 4 describes the proposed learning algorithm. In Section 5, we discuss the adaptive selection procedure for the tuning parameter. Simulation results and discussion are given in Section 6. Finally, Section 7 presents the conclusions of this study.

Section snippets

Local principal component analysis

Let p(x) and q(x) be probability density functions on a data space in $\mathbb{R}^{m}$. The β-divergence of p(x) with q(x) is defined as $D_{\beta}(p,q) = \int\left[\frac{1}{\beta}\{p^{\beta}(x)-q^{\beta}(x)\}p(x) - \frac{1}{\beta+1}\{p^{\beta+1}(x)-q^{\beta+1}(x)\}\right]dx$, for $\beta>0$, which is non-negative, that is, $D_{\beta}(p,q)\ge 0$, with equality if and only if p(x) = q(x) for almost all x in $\mathbb{R}^{m}$; see Basu, Harris, Hjort and Jones (1998) and Minami and Eguchi (2002). We note that the β-divergence reduces to the Kullback–Leibler (KL) divergence when we take the limit of the tuning parameter β to 0, that is, $\lim_{\beta\to 0} D_{\beta}(p,q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx$.
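The definition and its KL limit can be checked numerically. The sketch below integrates the β-divergence between two univariate Gaussian densities on a grid; the parameter values are illustrative and not from the paper:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density, used as a simple test case."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def beta_divergence_1d(p, q, dx, beta):
    """D_beta(p, q) on a uniform 1-D grid with spacing dx (simple Riemann sum)."""
    integrand = (p ** beta - q ** beta) * p / beta \
                - (p ** (beta + 1) - q ** (beta + 1)) / (beta + 1)
    return np.sum(integrand) * dx

x = np.linspace(-12.0, 12.0, 20001)
dx = x[1] - x[0]
p = gauss_pdf(x, 0.0, 1.0)
q = gauss_pdf(x, 1.0, 1.5)
kl = np.sum(p * np.log(p / q)) * dx                      # Kullback-Leibler divergence
for beta in (0.5, 0.1, 0.01, 0.001):
    print(beta, beta_divergence_1d(p, q, dx, beta), kl)  # approaches kl as beta -> 0
```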

Consistency for local PCA

Let us examine the behavior of the expected β-loss function defined by $L_{\beta}(\mu,V;p) = \frac{1}{\beta}\left[1 - \det(V)^{-\frac{\beta}{2(\beta+1)}}\,E_{p}\exp\{-\beta\, w(x,\mu,V)\}\right]$ for a vector μ in $\mathbb{R}^{m}$ and a symmetric positive-definite matrix V, where $E_{p}$ denotes the expectation with respect to an underlying distribution with density function p. We envisage that this distribution describes a local structure of the data. For simplicity we assume a Gaussian mixture distribution with density function $p(x)=\sum_{j=1}^{J}\pi_{j}\varphi_{\mu_{j},V_{j}}(x)$, where $\varphi_{\mu,V}(x)$ is a Gaussian density with mean vector μ and variance matrix V.
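The expected loss $L_{\beta}(\mu,V;p)$ can be approximated by applying the empirical loss to a large sample drawn from p. A short sketch, reusing the `beta_loss` helper from the sketch in the Introduction above and hypothetical mixture parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical well-separated two-component mixture in R^2 (illustrative values only).
mu1, mu2 = np.array([0.0, 0.0]), np.array([8.0, 0.0])
V1, V2 = np.diag([3.0, 1.4]), np.diag([2.6, 0.8])
X = np.vstack([rng.multivariate_normal(mu1, V1, 10000),
               rng.multivariate_normal(mu2, V2, 10000)])  # large sample approximates E_p

beta = 0.3
# beta_loss is the empirical-loss sketch given after Eq. (5) above.
print(beta_loss(X, mu1, V1, beta))   # loss at the first component's parameters
print(beta_loss(X, mu2, V2, beta))   # loss at the second component's parameters
print(beta_loss(X, X.mean(axis=0), np.cov(X, rowvar=False), beta))  # single global Gaussian fit
```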

Learning algorithm for local PCA

The proposed learning algorithm explores a local PCA structure based on the initial condition of the shifting parameter μ. If the initial value of μ belongs to the jth data cluster $D_{j}$, then all input vectors belonging to $D_{j}$ are transferred into the jth local PCA structure $(j=1,2,\ldots,J)$, while the data in the other clusters are treated as outliers. Thus, we can extract all local PCA structures by changing the shifting parameter μ sequentially. The rule of changing the initial value for the shifting parameter
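The update rule itself is truncated in this snippet. The sketch below shows a fixed-point iteration consistent with the stationary equations of $L_{\beta}(\mu,V)$, in which each point is weighted by $\exp\{-\beta\, w(x(t),\mu,V)\}$ so that points far from the current μ are effectively ignored; it is an illustration of the idea, not the paper's exact listing:

```python
import numpy as np

def local_pca_beta(X, mu0, V0, beta, n_iter=100, tol=1e-8):
    """Sketch of a minimum beta-divergence iteration for one local PCA structure.

    Starting from an initial shifting parameter mu0 inside a data cluster, points
    from other clusters receive nearly zero weight exp(-beta * w) and are treated
    as outliers.  The update form is inferred from the stationary equations of the
    loss L_beta(mu, V) as reconstructed above.
    """
    mu, V = np.asarray(mu0, float), np.asarray(V0, float)
    for _ in range(n_iter):
        diff = X - mu
        w = 0.5 * np.einsum('ij,ij->i', diff, np.linalg.solve(V, diff.T).T)
        wt = np.exp(-beta * w)                        # small for points far from mu
        mu_new = wt @ X / wt.sum()                    # weighted mean
        diff = X - mu_new
        V_new = (1.0 + beta) * (wt[:, None] * diff).T @ diff / wt.sum()  # weighted covariance
        converged = np.linalg.norm(mu_new - mu) < tol
        mu, V = mu_new, V_new
        if converged:
            break
    # Local PCA structure: eigen-decomposition of the robustly estimated V.
    eigvals, eigvecs = np.linalg.eigh(V)
    order = np.argsort(eigvals)[::-1]
    return mu, V, eigvecs[:, order], eigvals[order]
```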

Selection procedure for the tuning parameter β

Let us discuss a selection procedure for the tuning parameter based on a given data set. We observe that the performance of the proposed method for local PCA depends on the value of the tuning parameter β. To obtain better performance with this method, we propose an adaptive selection procedure for β. To find an appropriate β, we evaluate the estimates obtained for various values of β. Minami and Eguchi (2003) and Mollah et al. (2006) used the β-divergence with a fixed value of β as a measure
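The exact selection criterion is truncated in this snippet. As an illustration of the general idea only (fitting with several candidate values of β and scoring the resulting estimates with the loss at a fixed reference value β₀), one might proceed as follows; the function name and candidate grid are ours:

```python
def select_beta(X, mu0, V0, candidates=(0.1, 0.2, 0.3, 0.5, 1.0), beta0=0.1):
    """Fit the local PCA structure with each candidate beta and keep the value whose
    estimates score best under the empirical beta-loss at a fixed reference beta0.
    Uses beta_loss and local_pca_beta from the sketches in the sections above."""
    scores = {}
    for beta in candidates:
        mu, V, _, _ = local_pca_beta(X, mu0, V0, beta)   # estimates for this beta
        scores[beta] = beta_loss(X, mu, V, beta0)        # fixed-beta0 evaluation measure
    best = min(scores, key=scores.get)
    return best, scores
```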

Simulation and discussion

To demonstrate the performance of the proposed algorithm, we generated the following data sets from J-component Gaussian mixture distributions with different mean vectors $\mu_{j}$ and covariance matrices $V_{j}$ $(j=1,2,\ldots,J)$.

    Dataset 1:

    1000 samples were generated from a two-component Gaussian mixture distribution (Fig. 2(a)) with mean vectors $\mu_{1}=(0,0)^{\mathrm{T}}$ and $\mu_{2}=(8,0)^{\mathrm{T}}$, and variance matrices $V_{1}=\begin{pmatrix}3.0 & 2.1\\ 2.1 & 1.4\end{pmatrix}$ and $V_{2}=\begin{pmatrix}2.6 & 1.2\\ 1.2 & 0.8\end{pmatrix}$, respectively, where the first 500 samples were drawn from $N(\mu_{1},V_{1})$ and the
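A minimal sketch of how such a dataset can be generated and both local PCA structures extracted. The mean vectors and V2 follow the text; V1 below is an illustrative positive-definite placeholder (substitute the V1 specified for Dataset 1), and the initial points and β value are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Dataset 1 (sketch): 1000 samples from a two-component Gaussian mixture.
mu1, mu2 = np.array([0.0, 0.0]), np.array([8.0, 0.0])
V1 = np.array([[3.0, 1.5], [1.5, 1.4]])   # placeholder positive-definite covariance
V2 = np.array([[2.6, 1.2], [1.2, 0.8]])   # V2 as given in the text
X = np.vstack([rng.multivariate_normal(mu1, V1, 500),   # first 500 samples ~ N(mu1, V1)
               rng.multivariate_normal(mu2, V2, 500)])  # remaining 500 ~ N(mu2, V2)

# Extract both local PCA structures by restarting from initial points in each cluster
# (local_pca_beta is the sketch from the learning-algorithm section above).
mu_a, V_a, Gamma_a, _ = local_pca_beta(X, mu0=X[0],   V0=np.eye(2), beta=0.3)
mu_b, V_b, Gamma_b, _ = local_pca_beta(X, mu0=X[700], V0=np.eye(2), beta=0.3)
```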

Conclusions

In this paper, we proposed a new highly robust learning algorithm for exploring local PCA structures by minimizing β-divergence. During iteration, if the initial choice of μ belongs to a data cluster, then all input vectors belonging to that cluster are classified into a local PCA structure, and the data in the other clusters are ignored as outliers. In order to extract the local PCA structures corresponding to the other data clusters, the initial value of the shifting vector μ is changed by data vector having

References (30)

  • H. Kamiya et al.

    A class of robust principal component vectors

    Journal of Multivariate Analysis

    (2001)
  • R. Möller et al.

    An extension of neural gas to local PCA

    Neurocomputing

    (2004)
  • B. Zhang et al.

    A nonlinear neural network model of mixture of local principal component analysis: Application to handwritten digits recognition

    Pattern Recognition

    (2001)
  • S.-I. Amari

    Neural theory of association and concept formation

    Biological Cybernetics

    (1977)
  • A. Basu et al.

    Robust and efficient estimation by minimising a density power divergence

    Biometrika

    (1998)
  • N.A. Campbell

    Robust procedures in multivariate analysis 1: Robust co-variance estimation

    Applied Statistics

    (1980)
  • C. Croux et al.

    Principal component analysis based on robust estimators of the covariance or correlation matrix: Influence functions and efficiencies

    Biometrika

    (2000)
  • M. Figueiredo et al.

    Unsupervised learning of finite mixture models

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2002)
  • Z. Ghahramani et al.

    Variational inference for Bayesian mixture of factor analysers

  • F.R. Hampel et al.

    Robust statistics: The approach based on influence functions

    (1986)
  • S. Haykin

    Neural networks

    (1999)
  • T. Hastie et al.

    The elements of statistical learning

    (2001)
  • H. Hotelling

    Analysis of a complex of statistical variables into principal components

    Journal of Educational Psychology

    (1933)
  • I. Higuchi et al.

    The influence function of principal component analysis by self-organizing rule

    Neural Computation

    (1998)
  • I. Higuchi et al.

    Robust principal component analysis with adaptive selection for tuning parameters

    Journal of Machine Learning Research

    (2004)