Natural learning in NLDA networks
Introduction
Natural gradient descent (Amari, 1998, Rattray et al., 1998) probably gives the fastest convergence speed in multilayer perceptron (MLP) on-line training. It does so by replacing standard gradient descent by
$$w_{t+1} = w_t - \eta_t \, G^{-1}(w_t) \, \nabla e(x_t, y_t; w_t), \qquad (1)$$
where $e(x, y; w) = \frac{1}{2}(y - f(x; w))^2$ is the local square error and $G(w)$ is the metric tensor when the MLP weight space is viewed as an appropriate Riemannian manifold. The most natural way to arrive at $G$ is to recast MLP training as a log-likelihood maximization problem (Amari, 1998). To do so, one considers an underlying probability model $p(x, y; w) = p(y|x; w)\,p(x)$ where $y = f(x; w) + n$, with $n$ a Gaussian with density $\propto e^{-n^2/2}$; therefore, we have
$$\log p(x, y; w) = -\tfrac{1}{2}(y - f(x; w))^2 + \log p(x) + \mathrm{const}.$$
It follows that $\nabla_w \log p = (y - f(x; w))\,\nabla_w f = -\nabla_w e$, and the natural metric is then given by the Fisher information matrix
$$G(w) = E\!\left[\nabla_w \log p \,(\nabla_w \log p)^T\right]. $$
Observe that $G$ also coincides (Heskes, 2000) with Levenberg’s approximation to the Hessian of a square error function. In the above setting, and at least formally, $G$ can be introduced in terms of a loss function: MLP training seeks to minimize the loss $E(w) = E[e(x, y; w)]$, and $G$ can then be written as
$$G(w) = E\!\left[\nabla e \,(\nabla e)^T\right]. \qquad (2)$$
This suggests that (1) can be used to speed up training convergence in other settings where the global loss is an average of individual pattern losses. In fact, it is not necessary that the global loss $J(w)$ be the expectation of a local loss, but only that its gradient can be written as the expectation of a vector random variable $z(x; w)$, i.e.,
$$\nabla_w J(w) = E[z(x; w)]. \qquad (3)$$
Then, analogously to what has been done before, we can formally introduce a “natural” metric
$$G(w) = E\!\left[z \, z^T\right] \qquad (4)$$
for such more general losses.
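As a toy illustration of update (1) — hypothetical code, not from the paper — the sketch below runs natural gradient descent for a single-output model $f(x; w) = \tanh(w \cdot x)$, using the Levenberg form of the Fisher metric, $G \approx E[\nabla f \, \nabla f^T]$, estimated from the sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-output "MLP": f(x; w) = tanh(w . x); local square error e = (y - f)^2 / 2.
w_true = np.array([1.0, -0.5])
X = rng.normal(size=(500, 2))
Y = np.tanh(X @ w_true) + 0.1 * rng.normal(size=500)

def natural_gradient_step(w, X, Y, eta=0.5, eps=1e-8):
    fw = np.tanh(X @ w)
    Jf = (1.0 - fw ** 2)[:, None] * X                # per-pattern gradients of f
    G = Jf.T @ Jf / len(X) + eps * np.eye(len(w))    # Fisher metric ~ E[grad f grad f^T]
    grad_E = -((Y - fw)[:, None] * Jf).mean(axis=0)  # gradient of the mean square error
    return w - eta * np.linalg.solve(G, grad_E)      # w <- w - eta G^{-1} grad E

w = np.zeros(2)
for _ in range(100):
    w = natural_gradient_step(w, X, Y)
```

The small ridge `eps` only guards invertibility; near a minimum the update behaves like a Gauss–Newton step.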
A drawback of on-line natural gradient descent is its computational cost: even if the number of natural weight updates is much lower than that of standard descent, each on-line iteration needs to compute $G$ and invert it. There are several ways to avoid this in on-line descent (Amari et al., 2000, Yang and Amari, 1998), and the cost can also be alleviated by applying natural gradients in batch learning. But in MLP training, (2) shows that batch natural gradient descent essentially coincides with the Gauss–Newton minimization method, which gives another reason to expect fast convergence. However, it is not clear whether natural gradient descent, as defined in (4), can speed up the batch gradient descent minimization of a general criterion function $J(w)$.
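The coincidence between the information matrix and the Hessian at a zero-error minimum can be checked numerically. The sketch below (an illustration on the same toy tanh model as above, not code from the paper) compares $G = E[\nabla f \, \nabla f^T]$ with a finite-difference Hessian of the batch square error at noise-free teacher weights, where the residual term of the Hessian vanishes:

```python
import numpy as np

rng = np.random.default_rng(1)

def grad(w, X, Y):
    # Gradient of the batch square error E(w) = mean (y - f)^2 / 2, f = tanh(w . x).
    fw = np.tanh(X @ w)
    Jf = (1.0 - fw ** 2)[:, None] * X
    return -((Y - fw)[:, None] * Jf).mean(axis=0)

X = rng.normal(size=(300, 3))
w_true = np.array([0.8, -1.2, 0.3])
Y = np.tanh(X @ w_true)                  # noise-free targets: residuals vanish at w_true

# Fisher / Levenberg matrix at w_true: G = E[grad f grad f^T].
fw = np.tanh(X @ w_true)
Jf = (1.0 - fw ** 2)[:, None] * X
G = Jf.T @ Jf / len(X)

# Exact Hessian of the batch loss at w_true via central finite differences of the gradient.
h = 1e-5
H = np.zeros((3, 3))
for i in range(3):
    e = np.zeros(3)
    e[i] = h
    H[:, i] = (grad(w_true + e, X, Y) - grad(w_true - e, X, Y)) / (2 * h)
```

With zero residuals the second-derivative (residual) term of the Hessian drops out and the two matrices agree to finite-difference accuracy.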
The main point of this work is to show that this is indeed the case for cost functions quite different from square errors, namely those used in what we will call Non-Linear Discriminant Analysis (NLDA), a nonlinear extension of Fisher’s well known Linear Discriminant Analysis introduced in Santa Cruz and Dorronsoro (1998). NLDA networks have the same architecture as standard MLPs, but after an MLP-like nonlinear mapping of the input vectors, the eigenvector-based linear map of Fisher’s analysis is applied to the last hidden layer outputs. When compared with standard MLPs, NLDA networks can provide better results in imbalanced class problems, where the number of samples of one class is much smaller than that of the others. On these problems, MLPs tend to underemphasize the small class samples, while the target-free training of NLDA networks gives more balanced classifiers. There are several ways to define Fisher criterion functions for multiclass problems and, accordingly, to define NLDA cost functions. We shall briefly review in the next section the most widely used one, the determinant-based criterion, while in the third section we will define NLDA training and introduce a natural-like gradient for these networks. In Section 4 the natural gradient’s better convergence will be demonstrated on some numerical examples, not only in terms of lower final criterion values but also when their higher computational cost is taken into account (each natural gradient iteration being several times more expensive than a standard gradient one for a $C$-class problem with $D$-dimensional inputs and $H$ hidden units). Finally, we shall compare the NLDA information matrix with the Hessian of NLDA’s criterion function and shall see, analytically and numerically, that they are different; we can thus conclude that, in the NLDA case, the natural gradient’s speed-up cannot be attributed to a Gauss–Newton-like approximation.
Section snippets
Multiclass Fisher discriminant analysis
For a $C$ class problem, the objective of Fisher’s Discriminant Analysis is to linearly transform an original feature vector $x$ into a new $C-1$ dimensional vector $y$ so that the new features concentrate each class $c$, $c = 1, \ldots, C$, around its mean while keeping these class means apart. When $C = 2$, the output features are 1-dimensional and the above objective can be achieved by minimizing the criterion function $J = s_W / s_T$, where $s_T$ is the total output covariance and
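For the two-class case, a minimal NumPy sketch (assumed variable names; it takes the criterion to be the ratio of pooled within-class to total output variance, which is one standard choice) computes the criterion for a 1-D projection and exposes Fisher’s closed-form direction $S_W^{-1}(m_1 - m_2)$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two Gaussian classes in 2-D (means and sample sizes are arbitrary choices).
X1 = rng.normal([0.0, 0.0], 1.0, size=(200, 2))
X2 = rng.normal([3.0, 1.0], 1.0, size=(200, 2))

def criterion(w):
    # J(w) = s_W / s_T for the 1-D projections y = w . x, with s_T the total
    # output variance and s_W the pooled within-class output variance.
    y1, y2 = X1 @ w, X2 @ w
    s_T = np.concatenate([y1, y2]).var()
    s_W = 0.5 * (y1.var() + y2.var())
    return s_W / s_T

# Fisher's closed-form direction: w* proportional to S_W^{-1} (m1 - m2).
S_W = 0.5 * (np.cov(X1.T) + np.cov(X2.T))
w_star = np.linalg.solve(S_W, X1.mean(axis=0) - X2.mean(axis=0))
```

Since $s_T = s_W + s_B$, minimizing $s_W / s_T$ is equivalent to maximizing the usual between-to-within variance ratio, which `w_star` does in closed form.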
Natural gradient in multiclass NLDA networks
NLDA networks extract from $D$-dimensional patterns coming from $C$ classes a new $C-1$ dimensional (as customary in Fisher’s analysis) feature set using also a multilayer perceptron architecture, whose optimal weights minimize a Fisher discriminant criterion function instead of the usual MLP square error. More precisely, assume the simplest possible NLDA architecture, with $D$ input units, a single hidden layer with $H$ units and $C-1$ linear outputs. We denote network inputs as $x$ (we may
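The forward mapping just described can be sketched as follows (a schematic with arbitrary, untrained first-layer weights and assumed names): a tanh hidden layer, followed by projection of the hidden outputs onto the $C-1$ leading eigenvectors of $S_T^{-1} S_B$:

```python
import numpy as np

rng = np.random.default_rng(3)

D, H, C = 4, 6, 3                         # inputs, hidden units, classes (arbitrary)
n = 150
X = rng.normal(size=(n, D))
labels = np.arange(n) % C                 # balanced synthetic labels

# MLP-like first layer (weights are arbitrary here, not trained).
W1 = rng.normal(scale=0.5, size=(D, H))
Z = np.tanh(X @ W1)                       # nonlinear hidden-layer outputs

# Scatter matrices of the hidden activations.
m = Z.mean(axis=0)
S_T = (Z - m).T @ (Z - m) / n             # total covariance
S_B = np.zeros((H, H))
for c in range(C):
    d = Z[labels == c].mean(axis=0) - m
    S_B += np.mean(labels == c) * np.outer(d, d)   # between-class covariance

# Fisher's linear map: the C-1 leading eigenvectors of S_T^{-1} S_B.
evals, evecs = np.linalg.eig(np.linalg.solve(S_T, S_B))
order = np.argsort(evals.real)[::-1]
A = evecs[:, order[: C - 1]].real         # H x (C-1) projection matrix

Y = Z @ A                                 # final C-1 dimensional NLDA features
```

NLDA training adjusts `W1` by gradient descent on the Fisher criterion; only the hidden-to-output map is given in closed form by the eigenproblem.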
Numerical examples
In this section we shall illustrate natural gradient NLDA learning on 8 problems. Seven of them, the well known Wisconsin breast cancer, glass, heart disease, ionosphere, iris, Pima Indians diabetes and thyroid disease datasets, are taken from the UCI database (Murphy & Aha, 1994). The eighth dataset, XOR4, is a 4-class synthetic problem, an extension of the bidimensional XOR problem to 3 dimensions, where eight Gaussian distributions centered at the corners of the unit cube are considered and four
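A sketch of an XOR4-style data generator follows. The class assignment is cut off in this excerpt, so pairing each corner with its opposite corner (the natural XOR-like generalization) is an assumption, as are the per-corner sample size and spread:

```python
import numpy as np

rng = np.random.default_rng(4)

def make_xor4(n_per_corner=50, sigma=0.15):
    """XOR4-style synthetic data: eight Gaussians at the corners of the unit
    cube in 3-D, grouped into four classes. ASSUMPTION: opposite corners
    share a class; the paper's exact assignment is not shown in this excerpt."""
    corners = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)])
    X, y = [], []
    for c in corners:
        X.append(c + sigma * rng.normal(size=(n_per_corner, 3)))
        idx = int(c @ [4, 2, 1])          # binary corner index 0..7
        y.append(np.full(n_per_corner, min(idx, 7 - idx)))  # pair corner with opposite
    return np.vstack(X), np.concatenate(y)

X, y = make_xor4()
```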
The relationship between the information matrix and hessian
As mentioned in the introduction, there is a basic coincidence for MLPs between batch natural gradient descent and the Gauss–Newton method. In particular, the information and Hessian matrices would agree in the ideal case of a zero square error. This may be used to explain the good convergence properties of MLP batch natural gradients, and a similar explanation could also hold for NLDA network training. In this section we shall compare, first analytically and then numerically, the relationship
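As a generic illustration of why a metric of the form $E[z\,z^T]$ need not match the Hessian of the loss outside the zero-error square-error setting (this is not the NLDA computation itself), consider logistic loss on data whose labels are independent of the inputs, so that residual terms do not vanish:

```python
import numpy as np

rng = np.random.default_rng(5)

# Deliberately misspecified setup: labels are independent of the inputs.
X = rng.normal(size=(500, 2))
y = (rng.random(500) < 0.5).astype(float)
w = np.array([0.7, -0.4])

p = 1.0 / (1.0 + np.exp(-(X @ w)))
Z = (p - y)[:, None] * X                  # per-pattern loss gradients z

G = Z.T @ Z / len(X)                      # information-matrix analogue E[z z^T]

# Exact Hessian of the mean logistic loss: E[p (1 - p) x x^T].
H = ((p * (1 - p))[:, None] * X).T @ X / len(X)
```

Here $E[(p-y)^2 \mid x] = \tfrac12 - p(1-p)$ while the Hessian weight is $p(1-p)$, so the two matrices differ whenever $p \neq \tfrac12$, and a Gauss–Newton identification fails.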
Conclusions
In this work we have defined natural-like gradients for NLDA network batch training, where the criterion function is neither a sum of squares nor, more generally, an average of local errors. Instead of a more principled approach, which would require the definition of an appropriate Riemannian structure on the NLDA weight space, we have followed a simpler, more heuristic procedure, based on the observation that the definition of the natural gradient for MLPs just requires writing the gradient of the
Acknowledgements
This research was carried out with partial support from Spain’s CICyT, under grants TIC 01-572 and TIN 2004-07676.
References (12)
- Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation.
- Amari, S., Park, H., & Fukumizu, K. (2000). Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation.
- Devijver, P. A., & Kittler, J. (1982). Pattern Recognition: A Statistical Approach.
- Fukunaga, K. (1972). Introduction to Statistical Pattern Recognition.
- Heskes, T. (2000). On natural learning and pruning in multilayered perceptrons. Neural Computation.
- Murata, N., Müller, K.-R., Ziehe, A., & Amari, S. (1996). Adaptive on-line learning in changing environments. In: ...