Independent component analysis by general nonlinear Hebbian-like learning rules
Introduction
Independent component analysis (ICA) [7,17] is a recently developed signal processing technique whose goal is to express a set of random variables as linear combinations of statistically independent component variables. The main applications of ICA are in blind source separation [17], feature extraction [2,18], and, in a slightly modified form, in blind deconvolution [9]. In the basic form of ICA [7], one observes m scalar random variables x1, x2, …, xm which are assumed to be linear combinations of n unknown independent components, or ICs, denoted by s1, s2, …, sn. The ICs are, by definition, mutually statistically independent and zero-mean. Let us arrange the observed variables xi into a vector x = (x1, x2, …, xm)^T and the IC variables si into a vector s = (s1, s2, …, sn)^T; then the linear relationship is given by

x = As.    (1)

Here, A is an unknown m×n matrix of full rank, called the mixing matrix. The basic problem of ICA is then to estimate the realizations of the original ICs si using only observations of the mixtures xj. This is roughly equivalent to estimating the mixing matrix A.

Two fundamental restrictions of the model are that, firstly, we can only estimate non-Gaussian ICs (except if just one of the ICs is Gaussian), and secondly, we must have at least as many observed linear mixtures as ICs, i.e. m ⩾ n. Note that the assumption of zero mean of the ICs is in fact no restriction, as this can always be accomplished by subtracting the mean from the random vector x. A basic, but rather insignificant indeterminacy in the model is that the ICs and the columns of A can only be estimated up to a multiplicative constant, because any constant multiplying an IC in Eq. (1) could be cancelled by dividing the corresponding column of the mixing matrix by the same constant. For mathematical convenience, one usually defines that the ICs si have unit variance. This makes the (non-Gaussian) ICs unique, up to a multiplicative sign [7]. Note that this definition of ICA implies no ordering of the ICs.
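To make the model concrete, the following minimal NumPy sketch generates two zero-mean, unit-variance non-Gaussian ICs and mixes them according to Eq. (1); the names (S, A, X) and the particular source distributions are illustrative assumptions, not part of the original text.

import numpy as np

rng = np.random.default_rng(0)

# Two non-Gaussian, zero-mean, unit-variance independent components
# (a uniform and a Laplacian source), 10000 samples each.
n_samples = 10000
s1 = rng.uniform(-np.sqrt(3), np.sqrt(3), n_samples)   # sub-Gaussian source
s2 = rng.laplace(0.0, 1.0 / np.sqrt(2), n_samples)     # super-Gaussian source
S = np.vstack([s1, s2])                                 # shape (n, T)

# Unknown full-rank mixing matrix A (here m = n = 2)
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])

# Observed mixtures x = A s, one column per observation
X = A @ S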
The classical application of the ICA model is blind source separation [17], in which the observed values of x correspond to a realization of an m-dimensional discrete-time signal x(t), t = 1, 2, … . Then the components si(t) are called source signals, which are usually original, uncorrupted signals or noise sources.
Another application of ICA is feature extraction [2,18]. Then the columns of A represent features, and si signals the presence and the coefficient of the ith feature in an observed data vector x.
In blind deconvolution, a convolved version x(t) of a scalar i.i.d. signal s(t) is observed, again without knowing the signal s(t) or the convolution kernel [9,27]. The problem is then to find a separating filter f(t) so that the convolution f(t)∗x(t) recovers s(t). The equalizer f(t) is assumed to be an FIR filter of sufficient length, so that the truncation effects can be ignored. Due to the assumption that the values of the original signal s(t) are independent for different t, this problem can be solved using essentially the same formalism as used in ICA [7,28,29]. Indeed, this problem can also be represented (though only approximately) by Eq. (1); then the realizations of x and s are vectors containing n = m subsequent observations of the signals x(t) and s(t), beginning at different points of time. In other words, the tth observed vector consists of the m subsequent values x(t), x(t+1), …, x(t+m−1), for t = 1, 2, …, and similarly for s. The square matrix A is determined by the convolving filter. Though this formulation is only approximate, the exact formulation using linear filters would lead to essentially the same algorithms and convergence proofs. Blind separation of several convolved signals can also be represented by combining these two approaches.
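The sketch below illustrates, under the assumptions stated in the comments, how the deconvolution problem can be cast approximately in the form of Eq. (1) by stacking m subsequent observations of x(t) into vectors; the kernel h, the lag direction and the dimension m are arbitrary choices made for illustration only.

import numpy as np

rng = np.random.default_rng(1)

# i.i.d. non-Gaussian source signal s(t) and an (unknown) FIR convolution kernel
T = 5000
s = rng.laplace(0.0, 1.0 / np.sqrt(2), T)
h = np.array([1.0, 0.5, 0.25])          # assumed convolving filter
x = np.convolve(s, h, mode="full")[:T]  # observed convolved signal x(t)

# Stack m subsequent observations into vectors, as described in the text:
# each column of X is (x(t), x(t+1), ..., x(t+m-1)); the same construction
# would be applied to s to obtain the corresponding IC vectors.
m = 8
X = np.array([x[t:t + m] for t in range(T - m + 1)]).T  # shape (m, T-m+1)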
As a preprocessing step we assume here that the dimension of the data is reduced, e.g., by PCA, so that it equals the number of ICs. In other words, we assume m = n. We also assume that the data is prewhitened (or sphered), i.e., the xi are decorrelated and their variances are equalized by a linear transformation [7]. After this preprocessing, the model of Eq. (1) still holds, and the mixing matrix A becomes orthogonal.
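One possible implementation of this preprocessing is sketched below: centering followed by whitening via the eigen-decomposition of the covariance matrix. The function name whiten and the decision of where to truncate small eigenvalues (the PCA dimension reduction) are assumptions of this sketch, not specifications from the paper.

import numpy as np

def whiten(X):
    """Center and whiten the rows of X (shape: variables x samples).

    After this transform the components are decorrelated with unit
    variance, so the effective mixing matrix becomes orthogonal.
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    C = np.cov(Xc)                              # sample covariance matrix
    d, E = np.linalg.eigh(C)                    # C = E diag(d) E^T
    # Small eigenvalues could be dropped here to reduce the dimension to n.
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T     # whitening matrix
    return V @ Xc, V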
Several neural algorithms for estimating the ICA model have been proposed recently, e.g., in [1,3,6,15,16,20,23]. Usually these algorithms use Hebbian or anti-Hebbian learning. Hebbian learning has proved to be a powerful paradigm for neural learning [22]. In the following, we call both Hebbian and anti-Hebbian learning rules ‘Hebbian-like’. We use this general expression because the difference between Hebbian and anti-Hebbian learning is sometimes quite vague. Typically, one uses the expression ‘Hebbian’ when the learning function is increasing and ‘anti-Hebbian’ when the learning function is decreasing. In the general case, however, the learning function need not be increasing or decreasing, and thus a more general concept is needed.
Hebbian-like learning thus means that the weight vector w of a neuron, whose input is denoted by x, adapts according to a rule that is roughly of the form

Δw ∝ x f(w^T x),    (2)

where f is a certain scalar function, called the learning function. Thus the change in w is proportional both to the input x and to a nonlinear function of the output w^T x. Some kind of normalization and feedback terms must also be added. Several different learning functions f have been proposed in the context of ICA, e.g., the cubic function, the tanh function, or more complicated polynomials. Some of these, e.g., the cubic function, have been motivated by an exact convergence analysis. Others have only been motivated using approximations whose validity may not be evident.
In this paper, we show that as far as exact (local) convergence is concerned, the choice of the learning function f in Eq. (2) is not critical. In fact, practically any nonlinear learning function may be used to perform ICA. More precisely, any function f divides the space of probability distributions into two half-spaces. Independent components whose distribution is in one of the half-spaces can be estimated using a Hebbian-like learning rule as in Eq. (2) with a positive sign before the learning term, and with f as the learning function. ICs whose distribution is in the other half-space can be estimated using the same learning rule, this time with a negative sign before the learning term. (The boundary between the two half-spaces contains distributions such that the corresponding ICs cannot be estimated using f. This boundary is, however, of vanishing volume.) In addition to the Hebbian-like mechanism, two assumptions are necessary here. First, the data must be preprocessed by whitening. Second, the Hebbian-like learning rule must be constrained so that the weight vector has a constant norm.
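As a rough illustration of such a constrained rule, the sketch below implements a stochastic update of the form of Eq. (2) with an explicit ± sign and renormalization of the weight vector after each step. The function name hebbian_ica_unit, the learning rate mu and the fixed sign argument are assumptions made for illustration; the exact learning rules and the criterion for choosing the sign are given in Section 3.

import numpy as np

def hebbian_ica_unit(Z, f, sign=+1.0, mu=0.01, n_epochs=5, seed=0):
    """One-unit Hebbian-like rule: w <- w + sign * mu * z * f(w^T z),
    followed by renormalization of w to unit length.

    Z    : whitened data, shape (n, T)
    f    : scalar learning function, e.g. np.tanh
    sign : +1 or -1, chosen according to the distribution of the IC
    """
    rng = np.random.default_rng(seed)
    n, T = Z.shape
    w = rng.standard_normal(n)
    w /= np.linalg.norm(w)
    for _ in range(n_epochs):
        for t in rng.permutation(T):
            z = Z[:, t]
            y = w @ z
            w = w + sign * mu * z * f(y)   # Hebbian-like update, cf. Eq. (2)
            w /= np.linalg.norm(w)         # keep the norm of w constant
    return w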
Though in principle any function can be used in the Hebbian-like learning rule, practical considerations lead us to prefer certain learning functions over others. In particular, one can choose the nonlinearity so that the estimator has desirable statistical properties such as small variance and robustness against outliers. Computational aspects may also be taken into account.
This paper is organized as follows: in Section 2, a general motivation for our work is described. Our learning rules are described in Section 3. Section 4 contains a discussion, and simulation results are presented in Section 5. Finally, some conclusions are drawn in Section 6.
Cumulants versus arbitrary nonlinearities
Generally, statistics of order higher than two have to be used for source separation and ICA. Such higher-order statistics can be incorporated into the computations either explicitly, using higher-order cumulants, or implicitly, by using suitable nonlinearities. Indeed, one might distinguish between two approaches to ICA, which we call the ‘top-down’ approach and the ‘bottom-up’ approach.
In the top-down, or cumulant approach, one typically starts from the independence requirement. Mutual
General one-unit contrast functions
Contrast functions [7] provide a useful framework to describe ICA estimation. Usually they are based on a measure of the independence of the solutions. Denoting by w and x the weight vector and the input of a neuron, and slightly modifying the terminology in [7], one might also describe a contrast function as a measure of how far the distribution of the output of a neuron is from a Gaussian distribution. The basic idea is then to find weight vectors (under a suitable constraint) that
Which learning function to choose?
The theorem of the preceding section shows that we have an infinite number of different Hebbian-like learning rules to choose from. This freedom is the very strength of our approach to ICA. Instead of being limited to a single non-linearity, our framework gives the user the opportunity to choose the non-linearity so as to optimize some criteria. These criteria may be either task-dependent, or follow some general optimality criteria.
Using standard optimality criteria of statistical estimators,
Simulation results
We applied the general Hebbian-like learning rule in Eq. (4) using two different learning functions, f1(y) = tanh(2y) and f2(y) = y exp(−y²/2). These learning functions were chosen according to the recommendations in [14]. The simulations consisted of blind source separation of four time signals that were linearly mixed to give rise to four mixture signals.
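A rough reconstruction of this setup is sketched below, reusing the whiten() and hebbian_ica_unit() sketches given earlier; the particular source waveforms, the random mixing matrix and the sign chosen for the update are illustrative assumptions, not the exact simulation settings of the paper.

import numpy as np

# Learning functions used in the simulations of this section
def f1(y): return np.tanh(2.0 * y)
def f2(y): return y * np.exp(-y**2 / 2.0)

# Four simple source signals, mixed by a random 4x4 matrix
rng = np.random.default_rng(2)
T = 10000
t = np.arange(T)
S = np.vstack([np.sign(np.sin(0.07 * t)),          # square-like wave
               np.sin(0.03 * t),                    # sinusoid
               ((0.01 * t) % 1.0) * 2.0 - 1.0,      # sawtooth
               rng.uniform(-1, 1, T)])              # uniform noise
S = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)
A = rng.standard_normal((4, 4))
X = A @ S

# Reusing the whiten() and hebbian_ica_unit() sketches given earlier;
# the negative sign is chosen by trial here for these sub-Gaussian sources.
Z, V = whiten(X)
w = hebbian_ica_unit(Z, f1, sign=-1.0)
print("correlations of the recovered output with the sources:",
      np.corrcoef(w @ Z, S)[0, 1:])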
First we applied the learning rules introduced above on signals that have visually simple forms. This facilitates checking the results and
Conclusion
It was shown how a large class of Hebbian-like learning rules can be used for ICA estimation. Indeed, almost any nonlinear function can be used in the learning rule. The critical part is choosing the multiplicative sign in the learning rule correctly, as a function of the shapes of the learning function and the distributions of the independent components. It was also shown how the correct sign can be estimated on-line, which leads to a universal learning rule that estimates an IC of practically any non-Gaussian distribution.
References (29)
P. Comon, Independent component analysis – a new concept?, Signal Processing (1994).
N. Delfosse, P. Loubaton, Adaptive blind separation of independent sources: a deflation approach, Signal Processing (1995).
C. Jutten, J. Hérault, Blind separation of sources, Part I – An adaptive algorithm based on neuromimetic architecture, Signal Processing (1991).
E. Oja, Principal components, minor components, and linear neural networks, Neural Networks (1992).
E. Oja, J. Karhunen, On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix, Journal of Mathematical Analysis and Applications (1985).
Robust fitting by nonlinear neural units, Neural Networks (1996).
T.D. Sanger, Optimal unsupervised learning in a single-layered linear feedforward network, Neural Networks (1989).
S. Amari, A. Cichocki, H.H. Yang, A new learning algorithm for blind source separation, in: D.S. Touretzky, M.C. Mozer, …
A. Bell, T.J. Sejnowski, Edges are the independent components of natural scenes, in: Advances in Neural Information …
A.J. Bell, T.J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution, Neural Computation (1995).