Independent component analysis by general nonlinear Hebbian-like learning rules
Introduction
Independent component analysis (ICA) [7,17] is a recently developed signal processing technique whose goal is to express a set of random variables as linear combinations of statistically independent component variables. The main applications of ICA are in blind source separation [17], feature extraction [2,18], and, in a slightly modified form, in blind deconvolution [9]. In the basic form of ICA [7], one observes m scalar random variables x1, x2, …, xm which are assumed to be linear combinations of n unknown independent components, or ICs, denoted by s1, s2, …, sn. The ICs are, by definition, mutually statistically independent and zero-mean. Let us arrange the observed variables xi into a vector x = (x1, x2, …, xm)^T and the IC variables si into a vector s = (s1, s2, …, sn)^T; then the linear relationship is given by

x = As.    (1)

Here, A is an unknown m×n matrix of full rank, called the mixing matrix. The basic problem of ICA is then to estimate the realizations of the original ICs si using only observations of the mixtures xj. This is roughly equivalent to estimating the mixing matrix A.

Two fundamental restrictions of the model are that, firstly, we can only estimate non-Gaussian ICs (except if just one of the ICs is Gaussian), and secondly, we must have at least as many observed linear mixtures as ICs, i.e. m ⩾ n. Note that the assumption of zero mean of the ICs is in fact no restriction, as this can always be accomplished by subtracting the mean from the random vector x. A basic, but rather insignificant indeterminacy in the model is that the ICs and the columns of A can only be estimated up to a multiplicative constant, because any constant multiplying an IC in Eq. (1) could be cancelled by dividing the corresponding column of the mixing matrix by the same constant. For mathematical convenience, one usually defines that the ICs si have unit variance. This makes the (non-Gaussian) ICs unique, up to a multiplicative sign [7]. Note that this definition of ICA implies no ordering of the ICs.
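To make the model concrete, the following minimal NumPy sketch generates two zero-mean, unit-variance non-Gaussian ICs and mixes them according to Eq. (1); the names (S, A, X) and the particular source distributions are illustrative assumptions, not part of the original text.

import numpy as np

rng = np.random.default_rng(0)

# Two non-Gaussian, zero-mean, unit-variance independent components
# (a uniform and a Laplacian source), 10000 samples each.
n_samples = 10000
s1 = rng.uniform(-np.sqrt(3), np.sqrt(3), n_samples)   # sub-Gaussian source
s2 = rng.laplace(0.0, 1.0 / np.sqrt(2), n_samples)     # super-Gaussian source
S = np.vstack([s1, s2])                                 # shape (n, T)

# Unknown full-rank mixing matrix A (here m = n = 2)
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])

# Observed mixtures x = A s, one column per observation
X = A @ S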
The classical application of the ICA model is blind source separation [17], in which the observed values of x correspond to a realization of an m-dimensional discrete-time signal x(t), t = 1, 2, … . Then the components si(t) are called source signals, which are usually original, uncorrupted signals or noise sources.
Another application of ICA is feature extraction [2,18]. Then the columns of A represent features, and si signals the presence and the coefficient of the ith feature in an observed data vector x.
In blind deconvolution, a convolved version x(t) of a scalar i.i.d. signal s(t) is observed, again without knowing the signal s(t) or the convolution kernel [9,27]. The problem is then to find a separating filter f(t) so that the convolution f(t)∗x(t) recovers s(t). The equalizer f(t) is assumed to be an FIR filter of sufficient length, so that the truncation effects can be ignored. Due to the assumption that the values of the original signal s(t) are independent for different t, this problem can be solved using essentially the same formalism as used in ICA [7,28,29]. Indeed, this problem can also be represented (though only approximately) by Eq. (1); then the realizations of x and s are vectors containing n = m subsequent observations of the signals x(t) and s(t), beginning at different points of time. In other words, the tth observed vector consists of the m subsequent values x(t), x(t+1), …, x(t+m−1), for t = 1, 2, …, and similarly for s. The square matrix A is determined by the convolving filter. Though this formulation is only approximate, the exact formulation using linear filters would lead to essentially the same algorithms and convergence proofs. Blind separation of several convolved signals can also be represented by combining these two approaches.
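The sketch below illustrates, under the assumptions stated in the comments, how the deconvolution problem can be cast approximately in the form of Eq. (1) by stacking m subsequent observations of x(t) into vectors; the kernel h, the lag direction and the dimension m are arbitrary choices made for illustration only.

import numpy as np

rng = np.random.default_rng(1)

# i.i.d. non-Gaussian source signal s(t) and an (unknown) FIR convolution kernel
T = 5000
s = rng.laplace(0.0, 1.0 / np.sqrt(2), T)
h = np.array([1.0, 0.5, 0.25])          # assumed convolving filter
x = np.convolve(s, h, mode="full")[:T]  # observed convolved signal x(t)

# Stack m subsequent observations into vectors, as described in the text:
# each column of X is (x(t), x(t+1), ..., x(t+m-1)); the same construction
# would be applied to s to obtain the corresponding IC vectors.
m = 8
X = np.array([x[t:t + m] for t in range(T - m + 1)]).T  # shape (m, T-m+1)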
As a preprocessing step we assume here that the dimension of the data is reduced, e.g., by PCA, so that it equals the number of ICs. In other words, we assume m = n. We also assume that the data is prewhitened (or sphered), i.e., the xi are decorrelated and their variances are equalized by a linear transformation [7]. After this preprocessing, the model of Eq. (1) still holds, and the mixing matrix A becomes orthogonal.
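One possible implementation of this preprocessing is sketched below: centering followed by whitening via the eigen-decomposition of the covariance matrix. The function name whiten and the decision of where to truncate small eigenvalues (the PCA dimension reduction) are assumptions of this sketch, not specifications from the paper.

import numpy as np

def whiten(X):
    """Center and whiten the rows of X (shape: variables x samples).

    After this transform the components are decorrelated with unit
    variance, so the effective mixing matrix becomes orthogonal.
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    C = np.cov(Xc)                              # sample covariance matrix
    d, E = np.linalg.eigh(C)                    # C = E diag(d) E^T
    # Small eigenvalues could be dropped here to reduce the dimension to n.
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T     # whitening matrix
    return V @ Xc, V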
Several neural algorithms for estimating the ICA model have been proposed recently, e.g., in [1,3,6,15,16,20,23]. Usually these algorithms use Hebbian or anti-Hebbian learning. Hebbian learning has proved to be a powerful paradigm for neural learning [22]. In the following, we call both Hebbian and anti-Hebbian learning rules ‘Hebbian-like’. We use this general expression because the difference between Hebbian and anti-Hebbian learning is sometimes quite vague. Typically, one uses the expression ‘Hebbian’ when the learning function is increasing and ‘anti-Hebbian’ when the learning function is decreasing. In the general case, however, the learning function need not be increasing or decreasing, and thus a more general concept is needed.
Hebbian-like learning thus means that the weight vector w of a neuron, whose input is denoted by x, adapts according to a rule that is roughly of the form

Δw ∝ x f(w^T x),    (2)

where f is a certain scalar function, called the learning function. Thus the change in w is proportional both to the input x and to a nonlinear function of the output w^T x. Some kind of normalization and feedback terms must also be added. Several different learning functions f have been proposed in the context of ICA, e.g., the cubic function, the tanh function, or more complicated polynomials. Some of these, e.g., the cubic function, have been motivated by an exact convergence analysis. Others have only been motivated using approximations whose validity may not be evident.
In this paper, we show that as far as exact (local) convergence is concerned, the choice of the learning function f in Eq. (2) is not critical. In fact, practically any nonlinear learning function may be used to perform ICA. More precisely, any function f divides the space of probability distributions into two half-spaces. Independent components whose distribution is in one of the half-spaces can be estimated using a Hebbian-like learning rule as in Eq. (2) with a positive sign before the learning term, and with f as the learning function. ICs whose distribution is in the other half-space can be estimated using the same learning rule, this time with a negative sign before the learning term. (The boundary between the two half-spaces contains distributions such that the corresponding ICs cannot be estimated using f. This boundary is, however, of vanishing volume.) In addition to the Hebbian-like mechanism, two assumptions are necessary here. First, the data must be preprocessed by whitening. Second, the Hebbian-like learning rule must be constrained so that the weight vector has a constant norm.
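As a rough illustration of such a constrained rule, the sketch below implements a stochastic update of the form of Eq. (2) with an explicit ± sign and renormalization of the weight vector after each step. The function name hebbian_ica_unit, the learning rate mu and the fixed sign argument are assumptions made for illustration; the exact learning rules and the criterion for choosing the sign are given in Section 3.

import numpy as np

def hebbian_ica_unit(Z, f, sign=+1.0, mu=0.01, n_epochs=5, seed=0):
    """One-unit Hebbian-like rule: w <- w + sign * mu * z * f(w^T z),
    followed by renormalization of w to unit length.

    Z    : whitened data, shape (n, T)
    f    : scalar learning function, e.g. np.tanh
    sign : +1 or -1, chosen according to the distribution of the IC
    """
    rng = np.random.default_rng(seed)
    n, T = Z.shape
    w = rng.standard_normal(n)
    w /= np.linalg.norm(w)
    for _ in range(n_epochs):
        for t in rng.permutation(T):
            z = Z[:, t]
            y = w @ z
            w = w + sign * mu * z * f(y)   # Hebbian-like update, cf. Eq. (2)
            w /= np.linalg.norm(w)         # keep the norm of w constant
    return w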
Though in principle any function can be used in the Hebbian-like learning rule, practical considerations lead us to prefer certain learning functions over others. In particular, one can choose the nonlinearity so that the estimator has desirable statistical properties such as small variance and robustness against outliers. Computational aspects may also be taken into account.
This paper is organized as follows: in Section 2, a general motivation for our work is described. Our learning rules are described in Section 3. Section 4 contains a discussion, and simulation results are presented in Section 5. Finally, some conclusions are drawn in Section 6.
Cumulants versus arbitrary nonlinearities
Generally, statistics of order higher than two have to be used for source separation and ICA. Such higher-order statistics can be incorporated into the computations either explicitly, using higher-order cumulants, or implicitly, by using suitable nonlinearities. Indeed, one might distinguish between two approaches to ICA, which we call the ‘top-down’ approach and the ‘bottom-up’ approach.
In the top-down, or cumulant approach, one typically starts from the independence requirement. Mutual
General one-unit contrast functions
Contrast functions [7] provide a useful framework to describe ICA estimation. Usually they are based on a measure of the independence of the solutions. Denoting by w and x the weight vector and the input of a neuron, and slightly modifying the terminology in [7], one might also describe a contrast function as a measure of how far the distribution of the output of a neuron is from a Gaussian distribution. The basic idea is then to find weight vectors (under a suitable constraint) that
Which learning function to choose?
The theorem of the preceding section shows that we have an infinite number of different Hebbian-like learning rules to choose from. This freedom is the very strength of our approach to ICA. Instead of being limited to a single non-linearity, our framework gives the user the opportunity to choose the non-linearity so as to optimize some criteria. These criteria may be either task-dependent, or follow some general optimality criteria.
Using standard optimality criteria of statistical estimators,
Simulation results
We applied the general Hebbian-like learning rule in Eq. (4) using two different learning functions, f1(y) = tanh(2y) and f2(y) = y exp(−y²/2). These learning functions were chosen according to the recommendations in [14]. The simulations consisted of blind source separation of four time signals that were linearly mixed to give rise to four mixture signals.
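A rough reconstruction of this setup is sketched below, reusing the whiten() and hebbian_ica_unit() sketches given earlier; the particular source waveforms, the random mixing matrix and the sign chosen for the update are illustrative assumptions, not the exact simulation settings of the paper.

import numpy as np

# Learning functions used in the simulations of this section
def f1(y): return np.tanh(2.0 * y)
def f2(y): return y * np.exp(-y**2 / 2.0)

# Four simple source signals, mixed by a random 4x4 matrix
rng = np.random.default_rng(2)
T = 10000
t = np.arange(T)
S = np.vstack([np.sign(np.sin(0.07 * t)),          # square-like wave
               np.sin(0.03 * t),                    # sinusoid
               ((0.01 * t) % 1.0) * 2.0 - 1.0,      # sawtooth
               rng.uniform(-1, 1, T)])              # uniform noise
S = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)
A = rng.standard_normal((4, 4))
X = A @ S

# Reusing the whiten() and hebbian_ica_unit() sketches given earlier;
# the negative sign is chosen by trial here for these sub-Gaussian sources.
Z, V = whiten(X)
w = hebbian_ica_unit(Z, f1, sign=-1.0)
print("correlations of the recovered output with the sources:",
      np.corrcoef(w @ Z, S)[0, 1:])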
First we applied the learning rules introduced above on signals that have visually simple forms. This facilitates checking the results and
Conclusion
It was shown how a large class of Hebbian-like learning rules can be used for ICA estimation. Indeed, almost any nonlinear function can be used in the learning rule. The critical part is choosing the multiplicative sign in the learning rule correctly, as a function of the shapes of the learning function and the distributions of the independent components. It was also shown how the correct sign can be estimated on-line, which leads to a universal learning rule that estimates an IC of practically any non-Gaussian distribution.
References (29)
P. Comon, Independent component analysis – a new concept?, Signal Processing (1994).
N. Delfosse, P. Loubaton, Adaptive blind separation of independent sources: a deflation approach, Signal Processing (1995).
C. Jutten, J. Hérault, Blind separation of sources, Part I – An adaptive algorithm based on neuromimetic architecture, Signal Processing (1991).
E. Oja, Principal components, minor components, and linear neural networks, Neural Networks (1992).
E. Oja, J. Karhunen, On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix, Journal of Mathematical Analysis and Applications (1985).
Robust fitting by nonlinear neural units, Neural Networks (1996).
T.D. Sanger, Optimal unsupervised learning in a single-layered linear feedforward network, Neural Networks (1989).
S. Amari, A. Cichocki, H.H. Yang, A new learning algorithm for blind source separation, in: D.S. Touretzky, M.C. Mozer, …
A. Bell, T.J. Sejnowski, Edges are the independent components of natural scenes, in: Advances in Neural Information …
A.J. Bell, T.J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution, Neural Computation (1995).