
Neural Networks

Volume 20, Issue 5, July 2007, Pages 610-620

Natural learning in NLDA networks

https://doi.org/10.1016/j.neunet.2006.09.014

Abstract

Non-Linear Discriminant Analysis (NLDA) networks combine a standard Multilayer Perceptron (MLP) transfer function with the minimization of a Fisher analysis criterion. In this work we will define natural-like gradients for NLDA network training. Instead of a more principled approach, which would require the definition of an appropriate Riemannian structure on the NLDA weight space, we will follow a simpler procedure, based on the observation that the gradient of the NLDA criterion function $J$ can be written as the expectation $\nabla J(W) = E[Z(X,W)]$ of a certain random vector $Z$, and then defining $I = E[Z(X,W)\,Z(X,W)^t]$ as the Fisher information matrix in this case. This definition of $I$ formally coincides with that of the information matrix for the MLP or other square error functions; the NLDA criterion $J$, however, does not have this structure. Although very simple, the proposed approach shows much faster convergence than standard gradient descent, even when its higher computational cost is taken into account. While the faster convergence of natural MLP batch training can also be explained in terms of its relationship with the Gauss–Newton minimization method, this is not the case for NLDA training, as we will see analytically and numerically that the Hessian and information matrices are different.

Introduction

Natural gradient descent (Amari, 1998, Rattray et al., 1998) probably gives the fastest convergence speed in multilayer perceptron (MLP) on-line training. It does so by replacing standard gradient descent with the update

$$W_{t+1} = W_t - \eta_t\, G(W_t)^{-1} \nabla_W e(X_t, y_t; W_t), \tag{1}$$

where $e(X,y;W) = (f(X;W) - y)^2/2$ is the local square error and $G(W)$ is the metric tensor when the MLP weight space is viewed as an appropriate Riemannian manifold. The most natural way to arrive at $G$ is to recast MLP training as a log-likelihood maximization problem (Amari, 1998). To do so, one considers an underlying probability model $y = f(X;W) + Z$, with $Z$ a Gaussian with density $\frac{1}{\sqrt{2\pi}\,\sigma}\exp\bigl(-\frac{(y - f(X;W))^2}{2\sigma^2}\bigr)$; therefore, we have

$$p(X,y;W) = c\, q(X)\, p(y \mid X;W) = c\, q(X)\, \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Bigl(-\frac{(y - f(X;W))^2}{2\sigma^2}\Bigr).$$

It follows that $\log p(X,y;W) = -(f(X;W) - y)^2/(2\sigma^2) + \log c\, q(X)$ up to an additive constant, and the natural metric is then given by the Fisher information matrix

$$G(W) = E_{X,y}\bigl[\nabla_W \log p(X,y;W)\, \nabla_W \log p(X,y;W)^t\bigr] = \frac{1}{\sigma^4}\, E_{X,y}\bigl[(f(X;W) - y)^2\, \nabla_W f(X;W)\, \nabla_W f(X;W)^t\bigr] = \frac{1}{\sigma^2}\, E_X\bigl[\nabla_W f(X;W)\, \nabla_W f(X;W)^t\bigr]. \tag{2}$$

Observe that $G(W)$ also coincides (Heskes, 2000) with Levenberg's approximation to the Hessian of a square error function. In the above setting, and at least formally, $G$ can be introduced in terms of a loss function: MLP training seeks to minimize the loss $L(W) = E_{X,y}[e(X,y;W)]$, and $G(W)$ can then be written as

$$G(W) = E_{X,y}\bigl[\nabla_W e(X,y;W)\, \nabla_W e(X,y;W)^t\bigr]. \tag{3}$$

This suggests that (1) can be used to speed up training convergence in other settings where the global loss is an average of individual pattern losses. In fact, it is not necessary that $L(W)$ be the expectation of a local loss, but only that its gradient can be written as the expectation of a vector random variable $Z = Z(X;W)$, i.e., $\nabla L(W) = E_X[Z(X;W)]$. Then, analogously to what has been done before, we can formally introduce a "natural" metric

$$G(W) = E\bigl[Z(X;W)\, Z(X;W)^t\bigr] \tag{4}$$

for such more general losses.
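The key step in (2) is that conditioning on $X$ turns $E_{X,y}[(f(X;W)-y)^2\,\nabla_W f\,\nabla_W f^t]$ into $\sigma^2 E_X[\nabla_W f\,\nabla_W f^t]$. The following minimal sketch (not from the paper; the linear model, dimensions and noise level are illustrative assumptions) checks this identity numerically for $f(X;W) = W \cdot X$, where $\nabla_W f(X;W) = X$.

```python
# Minimal numerical check of E[(f(X;W)-y)^2 grad_f grad_f^t] = sigma^2 E[grad_f grad_f^t]
# for the toy linear model f(X;W) = W.X, so that grad_W f(X;W) = X.
# Dimensions, sigma and the model itself are illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
D, N, sigma = 3, 200_000, 0.5
W = rng.normal(size=D)

X = rng.normal(size=(N, D))
y = X @ W + sigma * rng.normal(size=N)        # y = f(X;W) + Gaussian noise

residual = X @ W - y                          # f(X;W) - y
grads = X                                     # grad_W f(X;W) = X for a linear model

lhs = (grads * residual[:, None] ** 2).T @ grads / N   # empirical E[(f-y)^2 grad grad^t]
rhs = sigma ** 2 * grads.T @ grads / N                  # sigma^2 E[grad grad^t]
print(np.max(np.abs(lhs - rhs)))              # small, shrinking as N grows
```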

A drawback of on-line natural gradient descent is its computational cost: even if the number of natural weight updates is much lower than that of standard descent, each on-line iteration needs to compute $G$ and invert it. There are several ways to avoid this in on-line descent (Amari et al., 2000, Yang and Amari, 1998), and the cost can also be alleviated by applying natural gradients in batch learning. Moreover, in MLP training, (2) shows that batch natural gradient descent essentially coincides with the Gauss–Newton minimization method, which gives another reason to expect fast convergence. However, it is not clear whether natural gradient descent, as defined in (4), can speed up the batch gradient descent minimization of a general $L(W)$.
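As a concrete illustration of the batch case, the following hedged sketch (the function name and the damping term lam are our own, not the paper's) computes a single batch natural-gradient direction for a square error loss. Since $G$ in (2) is proportional to the Gauss–Newton matrix built from the model Jacobian, the same routine can be read as a Gauss–Newton step with the $1/\sigma^2$ factor absorbed into the learning rate.

```python
# A minimal sketch of a damped batch natural-gradient / Gauss-Newton direction for square
# error. 'jacobian' holds the rows grad_W f(X_n;W)^t and 'residuals' the values
# f(X_n;W) - y_n; both are assumed to come from whatever model is being trained.
import numpy as np

def natural_step(jacobian, residuals, lam=1e-6):
    """One batch natural-gradient (= Gauss-Newton) direction for a square error loss."""
    N = jacobian.shape[0]
    grad_L = jacobian.T @ residuals / N            # grad L(W) = E[(f - y) grad_W f]
    G = jacobian.T @ jacobian / N                  # E[grad_W f grad_W f^t] (sigma^2 in the step size)
    # Solve (G + lam*I) d = grad_L instead of forming G^{-1} explicitly.
    return np.linalg.solve(G + lam * np.eye(G.shape[1]), grad_L)

# Usage: W_new = W - eta * natural_step(jacobian, residuals)
```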

The main point of this work is to show that this is indeed the case for cost functions quite different from square errors, namely those used in what we will call Non-Linear Discriminant Analysis (NLDA), a non-linear extension of Fisher's well known Linear Discriminant Analysis introduced in Santa Cruz and Dorronsoro (1998). NLDA networks have the same architecture as standard MLPs but, after an MLP-like non-linear mapping of the input vectors, the eigenvector-based linear map of Fisher's analysis is applied to the last hidden layer outputs. When compared with standard MLPs, NLDA networks can provide better results in imbalanced class problems, where the number of samples of one class is much smaller than that of the others. On these problems, MLPs tend to underemphasize the small class samples, while the target-free training of NLDA networks gives more balanced classifiers. There are several ways to define Fisher criterion functions for multiclass problems and, accordingly, to define NLDA cost functions. In the next section we shall briefly review the most widely used one, the determinant-based criterion, while in the third section we will define NLDA training and introduce a natural-like gradient for these networks. In Section 4 the better convergence of the natural gradient will be demonstrated on some numerical examples, not only in terms of lower final criterion values but also when its higher computational cost (about $DH/(2(C-1))$ times that of the standard gradient for a $C$-class problem with $D$-dimensional inputs and $H$ hidden units) is taken into account. Finally, we shall compare the NLDA information matrix with the Hessian of the NLDA criterion function and shall see, analytically and numerically, that they are different; we can thus conclude that, in the NLDA case, the speed-up of the natural gradient cannot be attributed to a Gauss–Newton-like approximation.
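To make the architecture description above concrete, here is a rough sketch of an NLDA-style forward pass: a tanh hidden layer followed by Fisher's eigenvector-based linear map computed on the hidden outputs. The determinant ratio returned as the criterion is a plausible stand-in for the determinant-based criterion reviewed in Section 2, not a verbatim reproduction of the paper's formulas, and all names (W1, b1, the regularization constant) are illustrative assumptions.

```python
# Sketch of an NLDA-style forward pass: tanh hidden layer, then Fisher's linear map
# computed from the scatter matrices of the hidden outputs. Illustrative only.
import numpy as np
from scipy.linalg import eigh

def scatter_matrices(Y, labels):
    """Total and prior-weighted between-class scatter matrices of the rows of Y."""
    mean = Y.mean(axis=0)
    centered = Y - mean
    S_T = centered.T @ centered / len(Y)
    S_B = np.zeros_like(S_T)
    for c in np.unique(labels):
        diff = Y[labels == c].mean(axis=0) - mean
        S_B += np.mean(labels == c) * np.outer(diff, diff)
    return S_T, S_B

def nlda_forward(X, labels, W1, b1, C):
    """Hidden tanh layer followed by Fisher's eigenvector-based linear map."""
    hidden = np.tanh(X @ W1 + b1)                         # MLP-like non-linear mapping
    S_T, S_B = scatter_matrices(hidden, labels)
    # Fisher's linear map: the C-1 leading generalized eigenvectors of (S_B, S_T).
    _, eigvecs = eigh(S_B, S_T + 1e-8 * np.eye(S_T.shape[0]))
    A = eigvecs[:, -(C - 1):]
    Y = hidden @ A                                        # (C-1)-dimensional NLDA features
    s_T, s_B = scatter_matrices(Y, labels)
    # Determinant ratio as a stand-in for the determinant-based criterion (to minimize).
    return Y, np.linalg.det(s_T) / np.linalg.det(s_B)
```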

Section snippets

Multiclass Fisher discriminant analysis

For a $C$-class problem, the objective of Fisher's Discriminant Analysis is to linearly transform an original feature vector $X$ into a new $(C-1)$-dimensional vector $Y = W^t X$ so that the new features concentrate each class $c$, $c = 1, \ldots, C$, around its mean $\bar{Y}_c$ while keeping these class means apart. When $C = 2$, the output features are one-dimensional and the above objective can be achieved by minimizing the criterion function $J(W) = s_T / s_B$, where $s_T = E[(Y - \bar{Y})^2]$ is the total output variance and $s_B = \pi_1 (\bar{Y}_1 - \bar{Y})^2 + \pi_2 (\bar{Y}_2 - \bar{Y})^2$ …
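For reference, a small sketch (purely illustrative, not the paper's code) that evaluates the two-class criterion above from one-dimensional projections $y_n$ and labels in {1, 2}:

```python
# Two-class Fisher criterion J = s_T / s_B on 1-D projections, with class priors
# estimated from the sample. Illustrative only.
import numpy as np

def fisher_criterion_2class(y, labels):
    y_bar = y.mean()
    s_T = np.mean((y - y_bar) ** 2)                      # total variance s_T
    s_B = sum(np.mean(labels == c) * (y[labels == c].mean() - y_bar) ** 2
              for c in (1, 2))                           # pi_1(Y1-Y)^2 + pi_2(Y2-Y)^2
    return s_T / s_B
```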

Natural gradient in multiclass NLDA networks

NLDA networks extract from $D$-dimensional patterns $X$ coming from $C$ classes a new $(C-1)$-dimensional feature set $Y$ (as is customary in Fisher's analysis), also using a Multilayer Perceptron architecture, whose optimal weights minimize a Fisher discriminant criterion function instead of the usual MLP square error. More precisely, assume the simplest possible NLDA architecture, with $D$ input units, a single hidden layer with $H$ units and $C-1$ linear outputs. We denote network inputs as $X = (x_1, \ldots, x_D)^t$ (we may …
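Assuming the per-pattern vectors $Z(X_n;W)$ of this section have already been computed (their exact form comes from the paper's derivation, which the snippet above truncates), a natural-like weight update can be sketched as follows; the damping term and the function name are our own illustrative choices.

```python
# Natural-like step for NLDA training: the information matrix is the empirical average
# of the outer products of the per-pattern vectors Z(X_n; W), as stated in the abstract.
import numpy as np

def natural_like_step(Z, grad_J, lam=1e-6):
    """Z: N x P matrix whose rows are Z(X_n; W); grad_J: gradient of the criterion (P,)."""
    I = Z.T @ Z / Z.shape[0]                      # I = E[Z Z^t]
    return np.linalg.solve(I + lam * np.eye(I.shape[0]), grad_J)

# One weight update: W_new = W - eta * natural_like_step(Z, Z.mean(axis=0))
# (since grad J(W) = E[Z(X;W)], the batch gradient is the mean of the rows of Z).
```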

Numerical examples

In this section we shall illustrate natural gradient NLDA learning on 8 problems. Seven of them, the well known Wisconsin breast cancer, glass, heart disease, ionosphere, iris, Pima Indians diabetes and thyroid disease sets, are taken from the UCI database (Murphy & Aha, 1994). The eighth dataset, XOR4, is a 4-class synthetic problem, an extension of the two-dimensional XOR to 3 dimensions, where eight Gaussian distributions centered at the opposite corners of the unit cube are considered and four …
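Since the description of XOR4 is truncated above, the following sketch generates a dataset in its spirit: eight Gaussians at the corners of the unit cube, with opposite corners sharing a class so that eight corners yield four classes. The pairing rule and the standard deviation are assumptions, not the paper's exact construction.

```python
# Hedged sketch of an XOR4-style synthetic set: eight Gaussians at the unit-cube corners,
# opposite corners paired into the same class (4 classes in total).
import itertools
import numpy as np

def make_xor4(n_per_corner=100, std=0.15, seed=0):
    rng = np.random.default_rng(seed)
    X, y = [], []
    for corner in itertools.product((0.0, 1.0), repeat=3):
        corner = np.array(corner)
        opposite = 1.0 - corner
        # Opposite corners share a class label: codes 0..7 collapse to 4 labels.
        label = min(int(corner @ [4, 2, 1]), int(opposite @ [4, 2, 1]))
        X.append(corner + std * rng.normal(size=(n_per_corner, 3)))
        y.append(np.full(n_per_corner, label))
    return np.vstack(X), np.concatenate(y)
```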

The relationship between the information matrix and hessian

As mentioned in the introduction, there is a basic coincidence for MLPs between batch natural gradient descent and the Gauss–Newton method. In particular, the information and Hessian matrices would agree in the ideal case of a zero square error. This may be used to explain the good convergence properties of MLP batch natural gradients, and a similar explanation could also hold for NLDA network training. In this section we shall compare, first analytically and then numerically, the relationship …
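A generic numerical diagnostic in the spirit of this section's comparison: given a criterion $J(W)$ and the per-pattern vectors $Z(X_n;W)$, compare the empirical information matrix $E[ZZ^t]$ with a finite-difference Hessian of $J$. This is an illustrative tool, not the paper's analytical comparison; the function names and step size are assumptions.

```python
# Compare the empirical information matrix E[Z Z^t] with a central finite-difference
# Hessian of the criterion J at the weight vector W. Illustrative diagnostic only.
import numpy as np

def finite_difference_hessian(J, W, eps=1e-4):
    P = W.size
    H = np.zeros((P, P))
    for i in range(P):
        for j in range(P):
            Wpp, Wpm, Wmp, Wmm = (W.copy() for _ in range(4))
            Wpp[i] += eps; Wpp[j] += eps
            Wmm[i] -= eps; Wmm[j] -= eps
            Wpm[i] += eps; Wpm[j] -= eps
            Wmp[i] -= eps; Wmp[j] += eps
            H[i, j] = (J(Wpp) - J(Wpm) - J(Wmp) + J(Wmm)) / (4 * eps ** 2)
    return H

def information_vs_hessian(J, Z, W):
    I = Z.T @ Z / Z.shape[0]                      # empirical information matrix
    H = finite_difference_hessian(J, W)
    return np.linalg.norm(I - H) / np.linalg.norm(H)   # relative difference
```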

Conclusions

In this work we have defined natural-like gradients for NLDA network batch training, where the criterion function is neither a sum of squares nor, more generally, an average of local errors. Instead of a more principled approach, which would require the definition of an appropriate Riemannian structure on the NLDA weight space, we have followed a simpler, more heuristic procedure, based on the observation that the definition of the natural gradient for MLPs just requires writing the gradient of the …

Acknowledgements

This research was done with the partial support of Spain's CICyT, grants TIC 01-572 and TIN 2004-07676.

References (12)

  • S. Amari, Natural gradient works efficiently in learning, Neural Computation (1998).
  • S. Amari et al., Adaptive method of realizing natural gradient learning for multilayer perceptrons, Neural Computation (2000).
  • P. Devijver et al., Pattern Recognition: A Statistical Approach (1982).
  • K. Fukunaga, Introduction to Statistical Pattern Recognition (1972).
  • T. Heskes, On natural learning and pruning in multilayered perceptrons, Neural Computation (2000).
  • N. Murata, K.-R. Müller, A. Ziehe, & S. Amari, Adaptive on-line learning in changing environments. In: ... (1996).
There are more references available in the full text version of this article.
