Conditionally independent component analysis for supervised feature extraction
Introduction
Independent component analysis (ICA) has been successfully applied to data analysis, signal processing, and the modeling of coding in the brain, especially in the early visual system [6], [13]. So far, however, ICA has been restricted to unsupervised settings. In the present paper, we extend the framework of ICA to supervised learning.
As a direct application of this extension, we consider naive Bayes learning [10]. Despite its simplicity, the naive Bayes learning scheme performs well on many classification tasks, and is often significantly more accurate than more sophisticated methods. For the prediction of real values, however, the assumption of conditional independence degrades the performance. The present paper proposes a method to find conditionally independent components.
The features extracted by conditionally independent component analysis (CICA) can be regarded as independent components of the observations from which the parts explained by a target variable have been removed. This extension is therefore expected to enable feature extraction in supervised data analysis and the modeling of higher-level coding in the brain.
We begin by reviewing naive Bayes learning in Section 2. The main algorithm for finding the conditionally independent components is derived in Section 3. In Section 4, we summarize kernel density estimation, which is used both in CICA and in naive Bayes learning, and we give simple simulation results in Section 5. In Section 6, we present a dimension-reduction method, based on canonical correlation analysis, for dealing with high-dimensional observations. Finally, we conclude the paper with some discussion and suggestions for future work in Section 7.
Section snippets
Naive Bayes learning
Let us consider the problem of predicting a target value y for an input vector with d elements $x_1,\ldots,x_d$. If the joint probability density function $p(x_1,\ldots,x_d,y)$ is known, we can choose y so as to minimize the expected prediction error. However, this density is not usually known, and it must be estimated from a set of training samples. If the dimensionality d is large, the learning is considered to be difficult, which is often called the 'curse of dimensionality' in learning theory. Naive Bayes learning solves this
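To make the factorization concrete, here is a minimal sketch (not the paper's code; the function names, bandwidths, and the grid search over y are our own assumptions) of naive Bayes prediction of a real-valued target, where every factor is a one-dimensional kernel density so that no d-dimensional density ever has to be estimated:

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def naive_bayes_predict(x_new, X, y, y_grid, h_x=0.3, h_y=0.3):
    """Return the y on y_grid maximizing p(y) * prod_i p(x_i | y)."""
    # p(y) on the grid, estimated by a 1-D kernel density over the samples
    k_y = gauss((y_grid[:, None] - y[None, :]) / h_y) / h_y   # (grid, n)
    p_y = k_y.mean(axis=1)                                    # (grid,)
    log_post = np.log(p_y + 1e-300)
    for i in range(X.shape[1]):
        # joint p(x_i, y) via a product kernel, then p(x_i | y) = p(x_i, y) / p(y)
        k_x = gauss((x_new[i] - X[:, i]) / h_x) / h_x         # (n,)
        p_xy = (k_y * k_x[None, :]).mean(axis=1)              # (grid,)
        log_post += np.log(p_xy / (p_y + 1e-300) + 1e-300)
    return y_grid[np.argmax(log_post)]
```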
Problem
We assume that there is an unknown vector of source signals $s = (s_1,\ldots,s_d)^\top$ whose elements are mutually independent conditionally on a given y, $p(s \mid y) = \prod_{i=1}^{d} p(s_i \mid y)$. The validity of the assumption is discussed in Section 3.4. The observation x is considered to be a linear mixture of s. For the sake of simplicity, we assume $\dim x = \dim s = d$. The case $\dim x > \dim s$ is dealt with as a dimension reduction problem in Section 6. If $\dim x = \dim s$, the observation is generated by $x = As$, where A is an unknown nonsingular matrix. Without
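A hypothetical generator matching this model may make it concrete; the particular y-dependent source distributions and the mixing matrix below are our own choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(n=1000):
    """Sources independent given y, observed through an unknown mixing A."""
    y = rng.uniform(-1.0, 1.0, size=n)
    # given y, the two sources are independent (y-dependent means plus
    # independent noise); marginally they are correlated through y
    s1 = y + 0.3 * rng.standard_normal(n)
    s2 = np.sin(np.pi * y) + 0.3 * rng.standard_normal(n)
    s = np.stack([s1, s2], axis=1)
    A = np.array([[1.0, 0.6], [-0.4, 1.0]])  # nonsingular mixing matrix
    x = s @ A.T                              # observation x = A s
    return x, y
```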
Kernel density estimation
Suppose we have n training samples $\{(x^{(t)}, y^{(t)})\}_{t=1}^{n}$. Let $\hat{W}$ denote an estimate of the demixing matrix W, and let $z^{(t)}$ denote the value of $x^{(t)}$ transformed by $\hat{W}$, i.e. $z^{(t)} = \hat{W} x^{(t)}$.
Kernel density estimation is a traditional nonparametric method [14], which is written for the joint distribution of $z_i$ and y in the form
$$\hat{p}(z_i, y) = \frac{1}{n\, h_i h_y} \sum_{t=1}^{n} K\!\left(\frac{z_i - z_i^{(t)}}{h_i}\right) K\!\left(\frac{y - y^{(t)}}{h_y}\right),$$
where K is a kernel function, and $h_i$ and $h_y$ are parameters defining the kernel widths for $z_i$ and y.
A typical choice of K is the Gaussian kernel $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$.
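A direct transcription of this estimator (a sketch; the function names and bandwidth defaults are assumptions):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde_joint(z_i, y, z_train, y_train, h_i=0.3, h_y=0.3):
    """Product-kernel estimate of p(z_i, y) from n pairs (z_i^(t), y^(t))."""
    u = gaussian_kernel((z_i - z_train) / h_i)
    v = gaussian_kernel((y - y_train) / h_y)
    return np.mean(u * v) / (h_i * h_y)
```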
Simulations
We show simple simulation results to confirm that the proposed algorithm works. In this experiment, we only investigate the case of $\dim x = \dim s$ in order to concentrate on the performance of CICA.
Dimension reduction
In this section, we study the case where the dimensionality of the observations is larger than that of the sources, i.e. $\dim x > \dim s$. In order to apply the CICA algorithm, we reduce the dimensionality of x by a linear transformation $\tilde{x} = Bx$, where B is a rectangular matrix and $\tilde{x}$ is the transformed vector. There is no problem if the rank of B is equal to $\dim x$; otherwise, how should we choose B? There are two requirements for B: one is that $\tilde{x}$ should preserve as much information on y as possible; the other is that applying
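Since the outline states that B is chosen via canonical correlation analysis, a rough sketch may look as follows. The nonlinear feature map of y is our own assumption (with a scalar y alone, plain CCA would give only a single direction), so this illustrates the idea rather than the paper's construction:

```python
import numpy as np
from scipy.linalg import eigh

def cca_reduce(X, y, d):
    """Pick d directions of x most correlated with features of y (sketch)."""
    # assumed feature map of y; only Phi.shape[1] directions are informative
    Phi = np.stack([y, y**2, np.sin(np.pi * y)], axis=1)
    Xc = X - X.mean(axis=0)
    Pc = Phi - Phi.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + 1e-6 * np.eye(Xc.shape[1])  # regularized covariances
    Cpp = Pc.T @ Pc / n + 1e-6 * np.eye(Pc.shape[1])
    Cxp = Xc.T @ Pc / n
    # canonical directions in x-space: generalized symmetric eigenproblem
    M = Cxp @ np.linalg.solve(Cpp, Cxp.T)
    _, V = eigh(M, Cxx)          # eigenvalues in ascending order
    B = V[:, ::-1][:, :d].T      # top-d directions as rows, shape (d, dim x)
    return Xc @ B.T, B
```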
Concluding remarks
In the present paper, we have proposed an extended framework of ICA for supervised learning and applied it to naive Bayes learning. It has been shown that minimizing the cost function reduces to maximizing the independence of the extracted features together with the sum of the mutual information between each extracted feature and the target variable. The framework provides a novel method of extracting independent factors embedded in the input variables by removing the parts explained by the target.
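To spell out the decomposition referred to here (our reconstruction from the statement above, writing z = Wx and measuring the conditional dependence of z given y by its total correlation):

```latex
\begin{align*}
C(W) &= \sum_{i} H(z_i \mid y) \;-\; H(z \mid y) \\
     &= \sum_{i} H(z_i) \;-\; \log\lvert\det W\rvert
        \;-\; \sum_{i} I(z_i; y) \;-\; H(x) \;+\; I(x; y),
\end{align*}
```

using $H(z_i \mid y) = H(z_i) - I(z_i; y)$, $H(z \mid y) = H(z) - I(z; y)$, $H(z) = H(x) + \log\lvert\det W\rvert$, and the invariance $I(z; y) = I(x; y)$ under the invertible map W. Since H(x) and I(x; y) do not depend on W, minimizing C(W) amounts to the standard ICA criterion (minimizing $\sum_i H(z_i) - \log\lvert\det W\rvert$) together with maximizing $\sum_i I(z_i; y)$, consistent with the statement above.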
The extension to
Acknowledgements
The author wishes to acknowledge two anonymous reviewers for their valuable suggestions.
References (16)
- A.J. Bell et al., The 'independent components' of natural scenes are edge filters, Vision Res. (1997)
- S. Akaho, Y. Kiuchi, S. Umeyama, MICA: multimodal independent component analysis, Proceedings of the IJCNN, 1999, pp. ...
- T.W. Anderson, An Introduction to Multivariate Statistical Analysis (1984)
- S. Amari, Natural gradient works efficiently in learning, Neural Comput. (1998)
- S. Becker, Mutual information maximization: models of cortical self-organization, Network: Comput. Neural Systems (1996)
- S. Becker et al., Learning mixture models of spatial coherence, Neural Comput. (1993)
- J.M. Chambers, T.J. Hastie (Eds.), Statistical Models in S, Wadsworth and Brooks, Pacific Grove, CA, ...
- R.G. Cowell et al., Probabilistic Networks and Expert Systems (1999)
Cited by (17)
- SKICA: A feature extraction algorithm based on supervised ICA with kernel for anomaly detection, 2019, Journal of Intelligent and Fuzzy Systems
- Training data reduction in deep neural networks with partial mutual information based feature selection and correlation matching based active learning, 2017, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
- Independent component analysis of edge information for face recognition, 2014, SpringerBriefs in Applied Sciences and Technology
- Separation of instantaneous mixtures of a particular set of dependent sources using classical ICA methods, 2013, Eurasip Journal on Advances in Signal Processing
- Extraction of independent discriminant features for data with asymmetric distribution, 2012, Knowledge and Information Systems
Shotaro Akaho received the B.Eng., M.Eng. and Ph.D. degree in Mathematical Engineering from the University of Tokyo in 1988, 1990 and 2001 respectively. From 1990 to 2001, he worked as a researcher in Electrotechnical Laboratory (MITI Japan). Since April 2001, he has been working as a group leader of Mathematical Neuroinformatics Group of Neuroscience Research Institute of AIST (The National Institute of Advanced Industrial Science and Technology). His research interests include statistical learning theory and its applications.