Robust extraction of local structures by the minimum β-divergence method
Introduction
Principal component analysis (PCA) is one of the most popular techniques for processing, compressing and visualizing multivariate data. It is widely used for reducing the dimensionality of multivariate data (Jolliffe, 2002). In general, PCA aims to extract the most informative k-dimensional output vector y from an input vector x of dimension d (k < d), which is achieved by obtaining a column-orthogonal d × k matrix Γ (i.e. Γ′Γ = I_k, the identity matrix of order k). Thus y relates linearly to x by y = Γ′(x − μ) such that the components of y are mutually uncorrelated, with variances decreasing in the component number of y. In the context of off-line learning, Γ and μ are directly obtained as the dominant eigenvectors of the sample covariance matrix and the sample mean vector, respectively. Classical PCA is characterized by minimizing the empirical loss function (1/n) Σ_{t=1}^{n} r(x_t, μ, Γ) with respect to μ and Γ, where r(x, μ, Γ) = ½{‖x − μ‖² − ‖Γ′(x − μ)‖²} is half the squared residual distance of x projected onto the subspace spanned by the columns of Γ (Hotelling, 1933). Higuchi and Eguchi (2004) proposed a variant of this classical procedure for robust PCA by minimizing the empirical loss function L_ψ(μ, Γ) = (1/n) Σ_{t=1}^{n} ψ(r(x_t, μ, Γ)), where ψ is assumed to be monotonically increasing. Various choices of ψ yield various procedures for PCA, including the identity function for classical PCA and the sigmoid function for the self-organizing rule, cf. Xu and Yuille (1995). The minimization of L_ψ is referred to as the minimum ψ-principle generated by ψ. Based on an argument similar to that for classical PCA, Higuchi and Eguchi (2004) showed that the minimizer of L_ψ satisfies a stationary equation system for μ and Γ. In neural networks, Γ is interpreted as the coefficient matrix connecting d input neurons to k output neurons, where a learning process works by off-line renewal of Γ based on a batch of input vectors or on-line renewal of Γ based on sequential input vectors (Amari, 1977; Haykin, 1999; Oja, 1989). See also Croux and Haesbroeck (2000) and Campbell (1980) for robust PCA methods.
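The classical off-line procedure described above can be sketched directly; the data and dimensions below are arbitrary illustrations, not from the paper. The sketch obtains μ and Γ from the sample mean and the dominant eigenvectors of the sample covariance matrix, and checks that the components of y = Γ′(x − μ) are uncorrelated with decreasing variances:

```python
import numpy as np

rng = np.random.default_rng(7)
# Correlated 3-D data (arbitrary mixing matrix, for illustration only)
X = rng.normal(size=(2000, 3)) @ np.array([[3.0, 1.0, 0.0],
                                           [0.0, 2.0, 1.0],
                                           [0.0, 0.0, 0.5]])

mu = X.mean(axis=0)                    # sample mean vector
S = np.cov(X, rowvar=False)            # sample covariance matrix

evals, evecs = np.linalg.eigh(S)       # eigenvalues in ascending order
Gamma = evecs[:, ::-1][:, :2]          # top-k dominant eigenvectors (k = 2)
Y = (X - mu) @ Gamma                   # y = Gamma'(x - mu)

C = np.cov(Y, rowvar=False)
assert abs(C[0, 1]) < 1e-8             # components are mutually uncorrelated
assert C[0, 0] >= C[1, 1]              # variances in decreasing component order
```

Since Γ collects eigenvectors of S, the covariance Γ′SΓ of the projected data is exactly diagonal, which is what the two assertions verify.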
All the PCA algorithms mentioned above are well discussed and established in the context where the data distribution is uni-modal, that is, where there is only one data center in the entire data space.
In the case of multi-modal distributions, the performance of the PCA algorithms discussed above degrades considerably. In this respect, several interesting algorithms for local dimensionality reduction have been proposed: for example, mixtures of PCA and mixtures of factor analysers proposed by Hinton, Dayan and Revow (1997), the VQPCA (vector-quantization PCA) algorithm of Kambhatla and Leen (1997), the mixtures of PPCA algorithm of Tipping and Bishop (1999), a nonlinear neural network model of mixtures of local PCA proposed by Zhang, Fu and Yan (2001), resolution-based complexity control for Gaussian mixture models of Meinicke and Ritter (2001), the automated hierarchical mixtures of PPCA algorithm of Su and Dy (2004), and an extension of neural gas to local PCA proposed by Möller and Hoffmann (2004). However, when applying any one of these algorithms, one encounters a difficult requirement: the number of data clusters in the entire data space must be known in advance. To overcome this problem, there exist some alternative approaches, including variational inference for Bayesian mixtures of factor analysers proposed by Ghahramani and Beal (2000), unsupervised learning of finite mixture models suggested by Figueiredo and Jain (2002), and accelerated variational Dirichlet mixture models of Kurihara, Welling and Vlassis (2006). However, algorithms of this type may give misleading results in the presence of outliers (Hampel, Ronchetti, Rousseeuw & Stahel, 1986). In this respect, de Ridder and Franc (2003) offered a robust algorithm based on mixtures of PPCA using t-distributions; one major problem with this algorithm, however, is that it also needs the number of data clusters in advance. Therefore one may expect a highly robust algorithm against outliers that does not require the number of data clusters in advance.
In this paper we propose a new highly robust algorithm for exploring local PCA structures by minimizing the β-divergence in a situation where we do not know whether the data distribution is uni-modal or multi-modal. The key idea is the use of a super-robust PCA algorithm with a volume adjustment based on the β-divergence, which properly detects one data cluster while ignoring all data in other clusters as outliers. See Higuchi and Eguchi (1998, 2004), Kamiya and Eguchi (2001) and Mollah, Minami and Eguchi (2007) for the robust procedures. The proposed method has a close link with the mixture ICA method proposed by Mollah, Minami and Eguchi (2006); see also Lee, Lewicki and Sejnowski (2000) for model-based mixtures of ICA models. We introduce the β-divergence satisfying a volume-matching condition, which naturally defines the learning algorithm for both uni-modal and multi-modal distributions. We also investigate the behavior of the expected loss function based on the β-divergence in the context of multi-modal distributions, which exhibits a performance of the proposed learning algorithm beyond robustness. The proposed learning algorithm is based on an empirical loss function L_β(μ, V), where μ is a shifting parameter and V is a variance matrix. The shifting parameter μ is defined as the minimizer of the loss function with respect to μ; the connection matrix Γ is defined by the eigen-decomposition of the minimizing V. The loss function is closely connected with the minimum ψ-principle for an appropriate choice of ψ. The main difference from the minimum ψ-principle is that the loss function is a function of μ and V in place of μ and Γ, which suggests a robust procedure for PCA by direct application of the discussion in Higuchi and Eguchi (2004). Furthermore, we will show that the loss function satisfies a remarkable property beyond robustness, as follows. Let us consider a probabilistic situation in which the data distribution is m-modal.
Then the dataset in R^d can be decomposed into m mutually disjoint subsets such that the jth local mode, with mean vector μ_j and variance matrix V_j, occurs in the jth subset (j = 1, …, m). We will show that the proposed learning algorithm minimizing L_β can extract μ_j and V_j if it starts from an initial point in the jth cluster with an appropriately chosen β. In this situation, under a Gaussian mixture density, we will observe that the expected β-loss function has m local minimizers over R^d × S_d, where S_d denotes the space of all symmetric, positive-definite matrices of order d. Thus minimization of L_β with respect to (μ, V) offers m local minima for an m-modal data distribution.
Section 2 describes the new proposal for local PCA by minimizing the β-divergence. In Section 3, we discuss the consistency of the proposed method for local PCA. Section 4 presents the proposed learning algorithm. In Section 5, we discuss the adaptive selection procedure for the tuning parameter. Simulation results and discussion are given in Section 6. Finally, Section 7 presents the conclusions of this study.
Local principal component analysis
Let p(x) and q(x) be probability density functions on a data space in R^d. The β-divergence of p from q is defined as D_β(p, q) = ∫ [ {p^β(x) − q^β(x)}/β · p(x) − {p^{β+1}(x) − q^{β+1}(x)}/(β + 1) ] dx, which is non-negative, that is D_β(p, q) ≥ 0, with equality if and only if p(x) = q(x) for almost all x in R^d; see Basu, Harris, Hjort and Jones (1998) and Minami and Eguchi (2002). We note that the β-divergence reduces to the Kullback–Leibler (KL) divergence when we take the limit of the tuning parameter β to 0, that is, lim_{β→0} D_β(p, q) = ∫ p(x) log{p(x)/q(x)} dx.
Consistency for local PCA
Let us look at the behavior of the expected β-loss function L_β(μ, V), defined as the expectation of the β-loss for a vector μ in R^d and a symmetric positive-definite matrix V, where the expectation is taken with respect to an underlying distribution with density function p(x). We envisage that this distribution describes a local structure of the data. For simplicity we assume a Gaussian mixture distribution with density function p(x) = Σ_{j=1}^{m} π_j φ(x; μ_j, V_j), where φ(x; μ, V) is a Gaussian density with mean vector μ and variance matrix V.
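The multi-modality of the β-loss under such a mixture can be illustrated empirically. The sketch below uses a hypothetical two-component mixture and evaluates only the μ-dependent part of the empirical Gaussian β-loss, −(1/β) · mean of φ(x; μ, V)^β (the remaining normalizing term does not depend on μ); scanning μ along the line joining the two component means exposes a local minimum near each mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-component Gaussian mixture in 2-D, unit variances
mu1, mu2 = np.array([0.0, 0.0]), np.array([6.0, 0.0])
X = np.vstack([rng.normal(mu1, 1.0, size=(500, 2)),
               rng.normal(mu2, 1.0, size=(500, 2))])

beta = 0.5
V_inv, det_V = np.eye(2), 1.0

def beta_loss(mu):
    # mu-dependent part of the empirical beta-loss for a Gaussian model
    d2 = np.einsum('ij,jk,ik->i', X - mu, V_inv, X - mu)
    phi = np.exp(-0.5 * d2) / (2 * np.pi * np.sqrt(det_V))
    return -np.mean(phi**beta) / beta

# Scan mu along the line joining the two cluster centers
ts = np.linspace(-2.0, 8.0, 101)
losses = np.array([beta_loss(np.array([t, 0.0])) for t in ts])

# A local minimum appears near each cluster center: the loss is multi-modal in mu
i1 = losses[:41].argmin()        # region t in [-2, 2], around mu1
i2 = losses[60:].argmin() + 60   # region t in [4, 8], around mu2
assert abs(ts[i1] - 0.0) < 0.5 and abs(ts[i2] - 6.0) < 0.5
```

Each cluster contributes a well in the loss surface because φ^β decays quickly away from μ, so the points of the other cluster contribute almost nothing near either center.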
Learning algorithm for local PCA
The proposed learning algorithm explores a local PCA structure based on the initial condition of the shifting parameter μ. If the initial value of μ belongs to the jth data cluster, then all input vectors belonging to that cluster are transferred into the jth local PCA structure (j = 1, …, m), the data in the other clusters being treated as outliers. Thus we can extract all local PCA structures by changing the shifting parameter μ sequentially. The rule of changing the initial value for the shifting parameter…
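The cluster-by-cluster behavior described above can be sketched as a small fixed-point iteration. The updates below (a β-weighted mean, a (1 + β)-scaled β-weighted covariance, then an eigen-decomposition of V for the principal axes) are one plausible form of a minimum β-divergence estimator for a single local Gaussian structure; the data, parameter values and fixed iteration count are illustrative assumptions, not the authors' exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated clusters (hypothetical parameters, for illustration only)
X = np.vstack([rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [0.0, 0.5]]),
               np.array([8.0, 8.0]) + 0.7 * rng.normal(size=(500, 2))])

def local_pca(X, mu0, beta=0.3, n_iter=100):
    """Fixed-point sketch of a minimum beta-divergence estimate of one
    local (mu, V); data far from mu receive weights near 0 (outliers)."""
    mu, V = mu0.astype(float), np.eye(X.shape[1])
    for _ in range(n_iter):
        D = X - mu
        d2 = np.einsum('ij,jk,ik->i', D, np.linalg.inv(V), D)
        w = np.exp(-0.5 * beta * d2)            # beta-weights downweight outliers
        mu = (w[:, None] * X).sum(0) / w.sum()  # weighted mean update
        D = X - mu
        V = (1 + beta) * (w[:, None] * D).T @ D / w.sum()  # weighted covariance
    evals, evecs = np.linalg.eigh(V)
    return mu, V, evecs[:, ::-1]                # Gamma: columns = principal axes

# Starting near the second cluster recovers only that cluster's local structure,
# the first cluster being ignored as outliers
mu_hat, V_hat, Gamma = local_pca(X, mu0=np.array([7.0, 7.0]))
assert np.linalg.norm(mu_hat - np.array([8.0, 8.0])) < 0.3
```

Re-running with an initial μ inside the first cluster would, by the same mechanism, recover that cluster's mean and principal axes instead, which is the sequential extraction rule the section describes.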
Selection procedure for the tuning parameter
Let us discuss a selection procedure for the tuning parameter β based on a given data set. We observe that the performance of the proposed method for local PCA depends on the value of the tuning parameter β. To obtain better performance by this method, we propose an adaptive selection procedure for β. To find an appropriate β, we evaluate the estimates obtained with various values of β. Minami and Eguchi (2003) and Mollah et al. (2006) used the β-divergence with a fixed value of β as a measure…
Simulation and discussion
To demonstrate the performance of the proposed algorithm, we generated the following data sets from m-component Gaussian mixture distributions with different mean vectors μ_j and covariance matrices V_j (j = 1, …, m).
- Dataset 1:
1000 samples were generated from a two-component Gaussian mixture distribution (Fig. 2(a)) with mean vectors μ_1 and μ_2 and variance matrices V_1 and V_2, respectively, where the first 500 samples were drawn from the first component and the…
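Dataset 1 can be reproduced in outline as follows; since the actual values of μ_1, μ_2, V_1 and V_2 are truncated in this excerpt, the sketch substitutes hypothetical parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical parameters (the excerpt truncates the actual mu_1, mu_2, V_1, V_2)
mu = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
V = [np.array([[1.0, 0.5], [0.5, 1.0]]),
     np.array([[1.0, -0.3], [-0.3, 0.5]])]

# First 500 samples from component 1, last 500 from component 2
X = np.vstack([rng.multivariate_normal(mu[0], V[0], size=500),
               rng.multivariate_normal(mu[1], V[1], size=500)])

assert X.shape == (1000, 2)
assert np.linalg.norm(X[:500].mean(0) - mu[0]) < 0.2
assert np.linalg.norm(X[500:].mean(0) - mu[1]) < 0.2
```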
Conclusions
In this paper, we proposed a new highly robust learning algorithm for exploring local PCA structures by minimizing the β-divergence. During the iterations, if the initial choice of μ belongs to a data cluster, then all input vectors belonging to that cluster are classified into a local PCA structure, the data in the other clusters being ignored as outliers. In order to extract the local PCA structures corresponding to other data clusters, the initial value of the shifting vector μ is changed to a data vector having…
References (30)
- Amari, S. (1977). Neural theory of association and concept formation. Biological Cybernetics.
- Basu, A., Harris, I. R., Hjort, N. L., & Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika.
- Campbell, N. A. (1980). Robust procedures in multivariate analysis I: Robust covariance estimation. Applied Statistics.
- Croux, C., & Haesbroeck, G. (2000). Principal component analysis based on robust estimators of the covariance or correlation matrix: Influence functions and efficiencies. Biometrika.
- Figueiredo, M. A. T., & Jain, A. K. (2002). Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Ghahramani, Z., & Beal, M. J. (2000). Variational inference for Bayesian mixtures of factor analysers.
- Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics: The approach based on influence functions.
- Hastie, T., Tibshirani, R., & Friedman, J. The elements of statistical learning.
- Haykin, S. (1999). Neural networks.
- Higuchi, I., & Eguchi, S. (1998). The influence function of principal component analysis by self-organizing rule. Neural Computation.
- Higuchi, I., & Eguchi, S. (2004). Robust principal component analysis with adaptive selection for tuning parameters. Journal of Machine Learning Research.
- Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology.
- Kamiya, H., & Eguchi, S. (2001). A class of robust principal component vectors. Journal of Multivariate Analysis.
- Möller, R., & Hoffmann, H. (2004). An extension of neural gas to local PCA. Neurocomputing.
- Zhang, B., Fu, M., & Yan, H. (2001). A nonlinear neural network model of mixture of local principal component analysis: Application to handwritten digits recognition. Pattern Recognition.