Signal Processing

Volume 64, Issue 3, 26 February 1998, Pages 301-313

Independent component analysis by general nonlinear Hebbian-like learning rules

https://doi.org/10.1016/S0165-1684(97)00197-7

Abstract

A number of neural learning rules have been recently proposed for independent component analysis (ICA). The rules are usually derived from information-theoretic criteria such as maximum entropy or minimum mutual information. In this paper, we show that in fact, ICA can be performed by very simple Hebbian or anti-Hebbian learning rules, which may have only weak relations to such information-theoretical quantities. Rather surprisingly, practically any nonlinear function can be used in the learning rule, provided only that the sign of the Hebbian/anti-Hebbian term is chosen correctly. In addition to the Hebbian-like mechanism, the weight vector is here constrained to have unit norm, and the data is preprocessed by prewhitening, or sphering. These results imply that one can choose the non-linearity so as to optimize desired statistical or numerical criteria. © 1998 Elsevier Science B.V. All rights reserved.

Introduction

Independent component analysis (ICA) [7,17] is a recently developed signal processing technique whose goal is to express a set of random variables as linear combinations of statistically independent component variables. The main applications of ICA are in blind source separation [17], feature extraction [2,18], and, in a slightly modified form, in blind deconvolution [9]. In the basic form of ICA [7], we observe m scalar random variables x1,x2,…,xm which are assumed to be linear combinations of n unknown independent components, or ICs, denoted by s1,s2,…,sn. The ICs are, by definition, mutually statistically independent and zero-mean. Let us arrange the observed variables xi into a vector x=(x1,x2,…,xm)T and the IC variables si into a vector s, respectively; then the linear relationship is given by

x = As.    (1)

Here, A is an unknown m×n matrix of full rank, called the mixing matrix. The basic problem of ICA is then to estimate the realizations of the original ICs si using only observations of the mixtures xj. This is roughly equivalent to estimating the mixing matrix A. Two fundamental restrictions of the model are that, firstly, we can only estimate non-Gaussian ICs (except if just one of the ICs is Gaussian), and secondly, we must have at least as many observed linear mixtures as ICs, i.e. m ≥ n. Note that the assumption of zero mean of the ICs is in fact no restriction, as this can always be accomplished by subtracting the mean from the random vector x. A basic, but rather insignificant indeterminacy in the model is that the ICs and the columns of A can only be estimated up to a multiplicative constant, because any constant multiplying an IC in Eq. (1) could be cancelled by dividing the corresponding column of the mixing matrix A by the same constant. For mathematical convenience, one usually defines that the ICs si have unit variance. This makes the (non-Gaussian) ICs unique, up to a multiplicative sign [7]. Note that this definition of ICA implies no ordering of the ICs.
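As a concrete illustration of the generative model in Eq. (1), the following NumPy sketch (not taken from the paper; the source distributions, dimensions and mixing matrix are arbitrary illustrative choices) draws zero-mean, unit-variance, non-Gaussian independent components and mixes them linearly.

```python
import numpy as np

rng = np.random.default_rng(0)

n, T = 3, 10000                       # number of ICs and number of observations

# Independent, zero-mean, unit-variance sources (illustrative non-Gaussian choices)
s = np.vstack([
    rng.uniform(-np.sqrt(3), np.sqrt(3), T),   # sub-Gaussian (uniform)
    rng.laplace(0.0, 1.0 / np.sqrt(2), T),     # super-Gaussian (Laplacian)
    np.sign(rng.standard_normal(T)),           # binary source
])

A = rng.standard_normal((n, n))       # unknown full-rank mixing matrix
x = A @ s                             # observed mixtures, as in Eq. (1)
```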

The classical application of the ICA model is blind source separation [17], in which the observed values of x correspond to a realization of an m-dimensional discrete-time signal x(t), t=1,2,… . Then the components si(t) are called source signals, which are usually original, uncorrupted signals or noise sources.

Another application of ICA is feature extraction 2, 18. Then the columns of A represent features, and si signals the presence and the coefficient of the ith feature in an observed data vector x.

In blind deconvolution, a convolved version x(t) of a scalar i.i.d. signal s(t) is observed, again without knowing the signal s(t) or the convolution kernel [9,27]. The problem is then to find a separating filter f so that s(t)=f(t)∗x(t). The equalizer f(t) is assumed to be an FIR filter of sufficient length, so that the truncation effects can be ignored. Due to the assumption that the values of the original signal s(t) are independent for different t, this problem can be solved using essentially the same formalism as used in ICA [7,28,29]. Indeed, this problem can also be represented (though only approximately) by Eq. (1); then the realizations of x and s are vectors containing n=m subsequent observations of the signals x(t) and s(t), beginning at different points of time. In other words, a sequence of observations x(t) is such that x(t)=(x(t+n−1),x(t+n−2),…,x(t))T for t=1,2,… . The square matrix A is determined by the convolving filter. Though this formulation is only approximate, the exact formulation using linear filters would lead to essentially the same algorithms and convergence proofs. Blind separation of several convolved signals can also be represented by combining these two approaches.
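The construction of the lagged observation vectors x(t) can be sketched as follows; the source signal, the kernel and the filter length below are hypothetical choices used only to illustrate how the deconvolution problem is cast into the ICA formalism.

```python
import numpy as np

rng = np.random.default_rng(1)

T, n = 5000, 10
s = np.sign(rng.standard_normal(T))         # i.i.d. non-Gaussian signal s(t)
h = np.array([1.0, 0.6, -0.3, 0.1])         # hypothetical convolution kernel
x = np.convolve(s, h, mode="full")[:T]      # observed convolved signal x(t)

# Stack n subsequent observations x(t) = (x(t+n-1), x(t+n-2), ..., x(t))^T,
# so that Eq. (1) holds approximately with a square matrix A determined by h.
X = np.array([x[t:t + n][::-1] for t in range(T - n + 1)])   # shape (T-n+1, n)
```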

As a preprocessing step we assume here that the dimension of the data x is reduced, e.g., by PCA, so that it equals the number of ICs. In other words, we assume m=n. We also assume that the data is prewhitened (or sphered), i.e., the xi are decorrelated and their variances are equalized by a linear transformation [7]. After this preprocessing, model Eq. (1) still holds, and the matrix A becomes orthogonal.
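A minimal sketch of such prewhitening, based on the eigenvalue decomposition of the sample covariance matrix, is given below; this is one standard way of sphering, and the paper does not prescribe a particular method.

```python
import numpy as np

def whiten(x):
    """Prewhiten (sphere) data so that the covariance of the output is the identity.

    x : array of shape (m, T), one column per observation.
    Returns the whitened data z and the whitening matrix V, with z = V x.
    """
    x = x - x.mean(axis=1, keepdims=True)     # remove the mean
    C = np.cov(x)                             # sample covariance matrix
    d, E = np.linalg.eigh(C)                  # C = E diag(d) E^T
    # To also reduce the dimension to the number n of ICs, one would keep only
    # the n largest eigenvalues and eigenvectors here (PCA).
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T   # whitening matrix C^{-1/2}
    return V @ x, V
```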

Several neural algorithms for estimating the ICA model have been proposed recently, e.g., in [1,3,6,15,16,20,23]. Usually these algorithms use Hebbian or anti-Hebbian learning. Hebbian learning has proved to be a powerful paradigm for neural learning [22]. In the following, we call both Hebbian and anti-Hebbian learning rules ‘Hebbian-like’. We use this general expression because the difference between Hebbian and anti-Hebbian learning is sometimes quite vague. Typically, one uses the expression ‘Hebbian’ when the learning function is increasing and ‘anti-Hebbian’ when the learning function is decreasing. In the general case, however, the learning function need not be increasing or decreasing, and thus a more general concept is needed.

Hebbian-like learning thus means that the weight vector w of a neuron, whose input is denoted by x, adapts according to a rule that is roughly of the form

Δw ∝ ±x f(wTx) + … ,    (2)

where f is a certain scalar function, called the learning function. Thus the change in w is proportional both to the input x and to a nonlinear function of wTx. Some kind of normalization and feedback terms must also be added. Several different learning functions f have been proposed in the context of ICA, e.g., the cubic function, the tanh function, or more complicated polynomials. Some of these, e.g., the cubic function, have been motivated by an exact convergence analysis. Others have only been motivated using some approximations whose validity may not be evident.
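A stochastic one-unit implementation of a rule of the form of Eq. (2), with the unit-norm constraint enforced by explicit normalization, might look as follows; the step size, sampling scheme and stopping rule are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def hebbian_ica(z, f, sign=+1, mu=0.02, n_iter=20000, seed=0):
    """One-unit Hebbian-like rule of the form of Eq. (2) on prewhitened data.

    z    : whitened data, shape (n, T)
    f    : scalar learning function applied elementwise (e.g. np.tanh)
    sign : +1 or -1, chosen according to the distribution of the sought IC
    """
    rng = np.random.default_rng(seed)
    n, T = z.shape
    w = rng.standard_normal(n)
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        x = z[:, rng.integers(T)]          # draw one whitened observation
        y = w @ x                          # neuron output w^T x
        w = w + sign * mu * x * f(y)       # Hebbian/anti-Hebbian update
        w /= np.linalg.norm(w)             # keep the weight vector at unit norm
    return w
```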

In this paper, we show that as far as exact (local) convergence is concerned, the choice of the learning function f in Eq. (2) is not critical. In fact, practically any non-linear learning function may be used to perform ICA. More precisely, any function f divides the space of probability distributions into two half-spaces. Independent components whose distribution is in one of the half-spaces can be estimated using a Hebbian-like learning rule as in Eq. (2) with a positive sign before the learning term, and with f as the learning function. ICs whose distribution is in the other half-space can be estimated using the same learning rule, this time with a negative sign before the learning term. (The boundary between the two half-spaces contains distributions such that the corresponding ICs cannot be estimated using f. This boundary is, however, of vanishing volume.) In addition to the Hebbian-like mechanism, two assumptions are necessary here. First, the data must be preprocessed by whitening. Second, the Hebbian-like learning rule must be constrained so that the weight vector has constant norm.

Though in principle any function can be used in the Hebbian-like learning rule, practical considerations lead us to prefer certain learning functions to others. In particular, one can choose the non-linearity so that the estimator has desirable statistical properties such as small variance and robustness against outliers. Computational aspects may also be taken into account.

This paper is organized as follows: in Section 2, a general motivation for our work is given. Our learning rules are described in Section 3. Section 4 contains a discussion, and simulation results are presented in Section 5. Finally, some conclusions are drawn in Section 6.

Section snippets

Cumulants versus arbitrary nonlinearities

Generally, for source separation and ICA, higher than second-order statistics have to be used. Such higher-order statistics can be incorporated into the computations either explicitly using higher-order cumulants, or implicitly, by using suitable nonlinearities. Indeed, one might distinguish between two approaches to ICA which we call the ‘top-down’ approach and the ‘bottom-up’ approach.

In the top-down, or cumulant approach, one typically starts from the independence requirement. Mutual

General one-unit contrast functions

Contrast functions [7] provide a useful framework to describe ICA estimation. Usually they are based on a measure of the independence of the solutions. Denoting by w and x the weight vector and the input of a neuron, and slightly modifying the terminology in [7], one might also describe a contrast function as a measure of how far the distribution of the output wTx of a neuron is from a Gaussian distribution. The basic idea is then to find weight vectors (under a suitable constraint) that
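Although this section is only excerpted here, one classical instance of such a one-unit measure of non-Gaussianity is the absolute kurtosis of the output wTx, which vanishes for Gaussian variables; the sketch below (an illustrative choice, not necessarily the contrast family analyzed in the paper) evaluates it on whitened data.

```python
import numpy as np

def kurtosis_contrast(w, z):
    """|kurt(w^T z)| as a simple measure of non-Gaussianity of the neuron output.

    For whitened data z and a unit-norm weight vector w, E{(w^T z)^2} = 1,
    so kurt(y) = E{y^4} - 3, which is zero when the output y is Gaussian.
    """
    y = w @ z
    return np.abs(np.mean(y ** 4) - 3.0)
```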

Which learning function to choose?

The theorem of the preceding section shows that we have an infinite number of different Hebbian-like learning rules to choose from. This freedom is the very strength of our approach to ICA. Instead of being limited to a single non-linearity, our framework gives the user the opportunity to choose the non-linearity so as to optimize some criteria. These criteria may be task-dependent, or they may be general optimality criteria.

Using standard optimality criteria of statistical estimators,

Simulation results

We applied the general Hebbian-like learning rule in Eq. (4) using two different learning functions, f1(y)=tanh(2y) and f2(y)=y exp(−y2/2). These learning functions were chosen according to the recommendations in [14]. The simulations consisted of blind source separation of four time signals that were linearly mixed to give rise to four mixture signals.

First we applied the learning rules introduced above on signals that have visually simple forms. This facilitates checking the results and
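Purely as an illustration of this kind of experiment, the sketch below reuses the hypothetical whiten and hebbian_ica helpers given earlier, replaces the paper's signals with synthetic stand-ins of visually simple form, and runs the rule with the learning function f1(y)=tanh(2y); it then checks which source the recovered output correlates with.

```python
import numpy as np

# Hypothetical stand-ins for four "visually simple" source signals; the actual
# signals used in the paper are not reproduced here.
T = 5000
t = np.arange(T)
s = np.vstack([
    np.sin(2 * np.pi * t / 100),                     # sinusoid
    np.sign(np.sin(2 * np.pi * t / 73)),             # square wave
    ((t % 50) / 50.0) * 2 - 1,                       # sawtooth
    np.random.default_rng(2).uniform(-1, 1, T),      # noise source
])
s = (s - s.mean(axis=1, keepdims=True)) / s.std(axis=1, keepdims=True)

A = np.random.default_rng(3).standard_normal((4, 4))  # random mixing matrix
x = A @ s                                             # four observed mixtures

z, V = whiten(x)                                      # sketch given earlier
f1 = lambda y: np.tanh(2 * y)                         # first learning function
# The appropriate sign depends on the source distributions, as discussed above;
# it is fixed here only for illustration, and would be flipped (or estimated
# on-line) if the rule failed to converge for these sources.
w = hebbian_ica(z, f1, sign=+1)

# The recovered output should be strongly correlated with one of the sources.
y = w @ z
print(np.round([abs(np.corrcoef(y, si)[0, 1]) for si in s], 2))
```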

Conclusion

It was shown how a large class of Hebbian-like learning rules can be used for ICA estimation. Indeed, almost any nonlinear function can be used in the learning rule. The critical part is choosing correctly the multiplicative sign in the learning rule as a function of the shapes of the learning function and the distributions of the independent components. It was also shown how the correct sign can be estimated on-line, which leads to a universal learning rule that estimates an IC of practically

References (29)

  • J.-F. Cardoso, Eigen-structure of the fourth-order cumulant tensor with application to the blind source separation...
  • J.-F. Cardoso, Iterative techniques for blind source separation using only fourth-order cumulants, in: Proc. EUSIPCO,...
  • A. Cichocki, S.I. Amari, R. Thawonmas, Blind signal extraction using self-adaptive non-linear Hebbian learning rule,...
  • D. Donoho, On minimum entropy deconvolution, in: Applied Time Series Analysis II, Academic Press, 1981, pp...