Neural Networks

Volume 11, Issue 1, January 1998, Pages 89-116

Neural Networks for Predicting Conditional Probability Densities: Improved Training Scheme Combining EM and RVFL

https://doi.org/10.1016/S0893-6080(97)00089-0

Abstract

Predicting conditional probability densities with neural networks requires complex (at least two-hidden-layer) architectures, which normally leads to rather long training times. By adopting the RVFL concept and constraining a subset of the parameters to randomly chosen initial values (such that the EM algorithm can be applied), the training process can be accelerated by about two orders of magnitude. This allows a whole ensemble of networks to be trained at the same computational cost as would otherwise be required for a single model. The simulations performed suggest that in this way a significant improvement of the generalization performance can be achieved. Copyright © 1997 Elsevier Science Ltd.

Introduction

In several applications the objective of a prediction problem is the approximation of the probability distribution of some target variable y conditional on a set of explanatory variables x. Consider for example the case of time series prediction. The conventional approach of neuro-forecasting is to present m previous time series values x, collected in a so-called lag vector $\mathbf{x}(t) := (x(t), x(t-1), \ldots, x(t-m+1)) \in \mathbb{R}^m$, to the input layer of a standard feed-forward network, and to train the latter to predict the future value $x(t+1) = f(\mathbf{x}(t))$. The theoretical justification for this approach is based on Takens' embedding theorem (Takens, 1981), according to which, for sufficiently large m, the dynamics in the reconstructed space of lag vectors $\mathbf{x}(t)$ (embedding space) is equivalent to the original (unknown) dynamics in the true state space. A crucial condition for this theorem, though, is that the observations or measurements x(t) are noise-free. Obviously, this requirement can hardly be met in most practical applications. If the distribution of the future value x(t+1) conditional on the past values $\mathbf{x}(t)$ were gaussian with constant standard deviation σ, the aforementioned approach would still be appropriate, since training based on a sum-squared error function can make the output node approximate the conditional mean ⟨y|x⟩, and σ, treated as a so-called hyperparameter, can be estimated by following the Bayesian evidence approach (MacKay, 1992a; MacKay, 1992b). However, Casdagli et al. (1991) showed that even for additive gaussian noise on the observations x(t), the distribution of x(t+1) conditional on $\mathbf{x}(t)$, $P_{y|\mathbf{x}}(x(t+1)=y\,|\,\mathbf{x}(t))$, is in general not gaussian, but can be of a complicated multimodal form. This suggests that, rather than predicting a single value y for x(t+1), the whole conditional density $P_{y|\mathbf{x}}(x(t+1)=y\,|\,\mathbf{x}(t))$ should be approximated.
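To make the lag-vector construction concrete, the following sketch (plain NumPy; the function name and the toy series are illustrative, not taken from the paper) builds the lag vectors $\mathbf{x}(t)$ and the corresponding targets x(t+1) on which a feed-forward predictor would be trained.

```python
import numpy as np

def make_lag_vectors(series, m):
    """Build lag vectors x(t) = (x(t), x(t-1), ..., x(t-m+1)) and targets x(t+1).

    series : 1-D array of observations x(0), ..., x(T-1)
    m      : embedding dimension
    Returns (X, y), where X[i] is the lag vector ending at time t = m-1+i
    and y[i] = x(t+1).
    """
    T = len(series)
    X = np.array([series[t - m + 1:t + 1][::-1] for t in range(m - 1, T - 1)])
    y = series[m:]
    return X, y

# Example: 1000 samples of a noisy AR(1) process, embedding dimension m = 3
rng = np.random.default_rng(0)
s = np.zeros(1000)
for t in range(1, 1000):
    s[t] = 0.9 * s[t - 1] + 0.1 * rng.normal()
X, y = make_lag_vectors(s, m=3)
print(X.shape, y.shape)   # (997, 3) (997,)
```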

Several approaches to this problem have been independently developed in the last few years (Ormoneit, 1993; Neuneier et al., 1994; Allen and Taylor, 1994; Srivastava and Weigend, 1994; Bishop, 1994; Bishop, 1995; Weigend and Srivastava, 1995; Husmeier and Taylor, 1997), all of which can be interpreted as two-hidden-layer networks that approximate $P_{y|\mathbf{x}}(x(t+1)=y\,|\,\mathbf{x}(t))$ by a mixture model:
$$P_{y|\mathbf{x},\mathbf{q}}(y|\mathbf{x},\mathbf{q}) \;=\; \sum_{k=1}^{K} P_{k|\mathbf{x},\mathbf{q}}(k|\mathbf{x},\mathbf{q})\, P_{y|\mathbf{x},\mathbf{q},k}(y|\mathbf{x},\mathbf{q},k).$$
Here k=1,…,K labels the different classes or subprocesses of the mixture, $\mathbf{q}$ is a vector of network parameters, $P_{k|\mathbf{x},\mathbf{q}}(k|\mathbf{x},\mathbf{q})$ is a discrete prior probability for a data point having been generated from the kth subprocess, and $P_{y|\mathbf{x},\mathbf{q},k}(y|\mathbf{x},\mathbf{q},k)$ is a conditional probability density chosen to be of a simple unimodal form (referred to as the kernel function henceforth).
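As a minimal illustration of the mixture model above, the sketch below evaluates such a density for the gaussian kernels introduced in the next paragraph; the function and argument names are mine, and the numerical values are purely illustrative.

```python
import numpy as np

def mixture_density(y, priors, means, betas):
    """Conditional mixture density P(y|x,q) = sum_k a_k * P(y|x,q,k),
    here with gaussian kernels of precision beta_k centred at mu_k(x).

    y      : scalar target value
    priors : (K,) mixing coefficients a_k, non-negative and summing to one
    means  : (K,) kernel centres mu_k(x; w), already evaluated at the input x
    betas  : (K,) kernel precisions beta_k = 1 / sigma_k**2
    """
    kernels = np.sqrt(betas / (2 * np.pi)) * np.exp(-0.5 * betas * (y - means) ** 2)
    return np.sum(priors * kernels)

# Example: a bimodal conditional density built from two kernels
print(mixture_density(0.8,
                      priors=np.array([0.3, 0.7]),
                      means=np.array([-1.0, 1.0]),
                      betas=np.array([4.0, 4.0])))
```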

In this text we shall follow an approach similar to the one suggested in Husmeier and Taylor (1997) and choose the prior probabilities as x-independent (but adaptable) parameters,
$$P_{k|\mathbf{x},\mathbf{q}}(k|\mathbf{x},\mathbf{q}) := a_k, \qquad a_k \ge 0.$$
The kernel functions are taken as gaussians,
$$P_{y|\mathbf{x},\mathbf{q},k}(y|\mathbf{x},\mathbf{q},k) := \sqrt{\frac{\beta_k}{2\pi}}\,\exp\!\left(-\frac{\beta_k}{2}\bigl[y - \mu_k(\mathbf{x};\mathbf{w})\bigr]^2\right), \qquad \beta_k \ge 0,$$
whose precisions $\beta_k \equiv \sigma_k^{-2}$ (inverse variances) are given as adaptable parameters, but whose centres are modelled as non-linear functions of x, given for example by the outputs of a one-hidden-layer network with weights $\mathbf{w}$, $\mu_k = \mu_k(\mathbf{x};\mathbf{w})$. This gives rise to a two-hidden-layer network, henceforth referred to as the GM ("gaussian mixture") network, which has H sigmoidal units in the first hidden layer ("S-layer"), K gaussian units ("RBF nodes") in the second hidden layer ("G-layer"), and output weights $a_k$ that are positive and normalized (see Fig. 1). The total vector of network parameters, $\mathbf{q}$, thus includes the prior probabilities $\{a_k\}$, the kernel widths $\{\sigma_k\}$, and the remaining weights $\mathbf{w}$. Given a time series segment $D = \{x(t)\}_{t=-m+2}^{N+1}$ as training set, the objective of the training process is to adapt these parameters $\mathbf{q}$ so as to maximize the likelihood
$$P_{D|\mathbf{q}}(D|\mathbf{q}) = P_{D|\mathbf{q}}\bigl(x(N+1),\ldots,x(2-m)\,\big|\,\mathbf{q}\bigr) = \prod_{t=1}^{N} P_{y|\mathbf{x},\mathbf{q}}\bigl(x(t+1)\,\big|\,\mathbf{x}(t),\mathbf{q}\bigr)\, P_{\mathbf{x}}\bigl(\mathbf{x}(1)\bigr)$$
or, equivalently, minimize the "error" function
$$E(\mathbf{q}) := -\frac{1}{N}\ln P_{D|\mathbf{q}}(D|\mathbf{q}) \simeq -\frac{1}{N}\sum_{t=1}^{N}\ln P_{y|\mathbf{x},\mathbf{q}}\bigl(x(t+1)\,\big|\,\mathbf{x}(t),\mathbf{q}\bigr)$$
(where we have assumed that the time series can be modelled by an mth-order Markov process, and have dropped the term $P_{\mathbf{x}}(\mathbf{x}(1))$ in Eq. (4) since it does not depend on the network parameters). For an interpolation problem with a set of N input vectors $\mathbf{x}_i$ and (for simplicity scalar) target variables $y_i$, $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, we obtain a similar "error" function if the samples $(\mathbf{x}_i, y_i)$ are independent:
$$E(\mathbf{q}) := -\frac{1}{N}\ln P_{D|\mathbf{q}}(D|\mathbf{q}) \simeq -\frac{1}{N}\sum_{i=1}^{N}\ln P_{y|\mathbf{x},\mathbf{q}}(y_i|\mathbf{x}_i,\mathbf{q}).$$
(Again, the term $\prod_i P(\mathbf{x}_i)$ has been neglected as it is independent of the network parameters.) A standard training method is to use backpropagation and adapt the network parameters $\mathbf{q}$ according to a steepest-descent scheme, $\Delta\mathbf{q} \propto -\nabla E(\mathbf{q})$. However, since the network has a two-hidden-layer structure, the convergence of such an approach is rather slow. This article will therefore focus on an alternative method.
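The following sketch spells out the "error" function E(q) as the average negative log of the mixture density over the training pairs, assuming the kernel centres $\mu_k(\mathbf{x}_t;\mathbf{w})$ have already been computed by the network; names and toy values are illustrative, not taken from the paper.

```python
import numpy as np

def gm_error(y, priors, means, betas):
    """Maximum-likelihood "error" E(q) = -(1/N) * sum_t ln P(y_t | x_t, q)
    for a gaussian mixture network, evaluated from the network outputs.

    y      : (N,)   targets y_t
    priors : (K,)   mixing coefficients a_k
    means  : (N, K) kernel centres mu_k(x_t; w) for every training input
    betas  : (K,)   kernel precisions beta_k
    """
    kernels = np.sqrt(betas / (2 * np.pi)) * np.exp(-0.5 * betas * (y[:, None] - means) ** 2)
    p = kernels @ priors                      # P(y_t | x_t, q), shape (N,)
    return -np.mean(np.log(p + 1e-300))       # guard against log(0)

# Toy check: N = 5 targets, K = 2 kernels with input-dependent centres
rng = np.random.default_rng(1)
y = rng.normal(size=5)
means = rng.normal(size=(5, 2))
print(gm_error(y, priors=np.array([0.5, 0.5]), means=means, betas=np.array([1.0, 2.0])))
```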

Section 2 will review the EM algorithm (Dempster et al., 1977), which has been proved to lead to faster convergence than a gradient descent scheme (Jordan and Xu, 1995). A similar improvement was reported by Ormoneit (1993), where the algorithm was applied to a gaussian mixture network for learning unconditional probability densities. However, in Section 3 it will be shown that the straightforward application of this scheme to the GM model for learning conditional probability densities is not immediately possible, since one class of network parameters prevents a complete "M-step" of the algorithm from being carried out. Section 4 shows how this "bottleneck effect" can be avoided by combining the GM model with the random vector functional link (RVFL) concept (Pao et al., 1994; Igelnik and Pao, 1995). Section 5 reports on an empirical study, which suggests that a considerable acceleration of the training process can be obtained. Section 6 compares the performance of the combined "GM-RVFL" scheme with several alternative approaches on a benchmark time series. It is demonstrated that, as a consequence of the speed-up of the training process, a whole ensemble of predictors can be obtained, by which a significant improvement of the generalization performance is eventually achieved. In Sections 7 (Committee performance and diversity), 8 (A weighting scheme for predictors) and 9 (Automatic relevance determination (ARD)) we will discuss several aspects of optimizing a network committee, namely diversity, weighting schemes and automatic relevance determination. This is applied in a final study, Section 10, to a well-known real-world benchmark problem: the Boston house-price data of Harrison and Rubinfeld (1978).

Finally, a word concerning notation. So far we have used subscripts to distinguish between different probability (density) functions P…(·). In order to simplify the notation, these subscripts will be omitted henceforth. This implies that different arguments of P(·) indicate that different functions are being considered. Moreover, depending on the argument, P(·) can denote either a discrete probability or a continuous probability density. Since this convention is widely used in the statistical literature, confusion is unlikely to arise.

Section snippets

Review of the EM algorithm

The basic idea of the EM (expectation maximization) algorithm is that the original optimization problem could be alleviated considerably if a set of further so-called hidden variables Λ were known. Let $D = \{y(t), \mathbf{x}(t)\}_{t=1}^{N}$ be a given training set of input vectors $\mathbf{x}(t)$ and targets y(t) (= x(t+1)), and let us define
$$\Psi(\mathbf{q},\Lambda) := -\ln P(D,\Lambda|\mathbf{q}), \qquad U(\mathbf{q}|\mathbf{q}') := \langle \Psi(\mathbf{q},\Lambda)\rangle_{\Lambda|D,\mathbf{q}'}
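The details are not reproduced in this excerpt, but the E-step for a gaussian mixture of the kind considered here amounts to computing posterior "responsibilities" of the kernels for each training pair; the sketch below shows that standard computation under the notation assumed above (function and variable names are mine).

```python
import numpy as np

def e_step(y, priors, means, betas):
    """E-step for a gaussian mixture: posterior probability (responsibility)
    of kernel k for each training pair, P(k | x_t, y_t, q'), proportional to
    a_k * sqrt(beta_k / 2pi) * exp(-beta_k/2 * (y_t - mu_k(x_t))**2).

    y      : (N,)   targets
    priors : (K,)   mixing coefficients a_k
    means  : (N, K) kernel centres mu_k(x_t; w) under the current parameters q'
    betas  : (K,)   kernel precisions
    Returns an (N, K) array of responsibilities, each row summing to one.
    The M-step then re-estimates a_k and beta_k from these responsibilities.
    """
    joint = priors * np.sqrt(betas / (2 * np.pi)) * np.exp(-0.5 * betas * (y[:, None] - means) ** 2)
    return joint / joint.sum(axis=1, keepdims=True)
```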

Stochastic dynamical system: prediction with the GM network

In order to compare the EM training scheme with standard backpropagation on the maximum likelihood error function, we applied the network to the following prediction problem, studied earlier by Husmeier and Taylor (1997). Consider the discrete dynamical systems
$$x(t+1) = \alpha\, x(t)\,[1 - x(t)], \qquad \alpha \in [0,1],\; x(t) \in [0,1]$$
and
$$x(t+1) = 1 - x(t)^{\kappa}, \qquad \kappa > 0,\; x(t) \in [0,1].$$

The first system is the well-known logistic map, and the second system will be referred to as the “kappa map”. Adding noise to the parameters α and κ, and
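The noise model and parameter settings are not given in this excerpt, so the generator below is only a hedged sketch: the map parameter is redrawn at every time step from an illustrative distribution, which is one simple way of "adding noise to the parameters α and κ".

```python
import numpy as np

def noisy_map(n, mode="logistic", rng=None):
    """Generate a stochastic time series by perturbing the map parameter at
    every step (illustrative only; the excerpt does not specify the exact
    noise distribution or parameter values used in the paper).

    mode "logistic": x(t+1) = alpha_t * x(t) * (1 - x(t))
    mode "kappa":    x(t+1) = 1 - x(t) ** kappa_t
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.empty(n)
    x[0] = 0.5
    for t in range(n - 1):
        if mode == "logistic":
            alpha_t = rng.uniform(0.0, 1.0)          # illustrative noise on alpha
            x[t + 1] = alpha_t * x[t] * (1.0 - x[t])
        else:
            kappa_t = rng.uniform(0.5, 2.0)          # illustrative noise on kappa
            x[t + 1] = 1.0 - x[t] ** kappa_t
    return x

series = noisy_map(1000, mode="kappa")
```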

Combining GM and RVFL

In order to carry out the adaptation of the weights w in a single step, the network architecture needs to be modified so that U(w) in Eq. (19) becomes quadratic in w. Jordan and Jacobs (1994) introduced a hierarchical structure of linear networks ("hierarchical mixture of experts"), with x-dependent priors a_k = a_k(x) ("gating network") softly switching between different
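The point of the RVFL construction is that, once the first-layer weights are drawn at random and frozen, the kernel centres become linear in the remaining weights w, so the corresponding M-step reduces to a linear least-squares problem. The sketch below illustrates this idea; the architecture details (sigmoidal activations, direct input connections, a bias column) follow the RVFL concept in general and are not claimed to match the paper's exact setup.

```python
import numpy as np

def rvfl_design_matrix(X, W_rand, b_rand):
    """Hidden-layer outputs of an RVFL-style network: the input-to-hidden
    weights are drawn at random and then frozen, so the remaining mapping
    is linear in the trainable weights.

    X      : (N, d) inputs
    W_rand : (d, H) fixed random input-to-hidden weights
    b_rand : (H,)   fixed random hidden biases
    Returns the (N, H + d + 1) design matrix [sigmoid(X W + b), X, 1],
    with direct input columns and a bias column (RVFL-style, illustrative).
    """
    S = 1.0 / (1.0 + np.exp(-(X @ W_rand + b_rand)))      # sigmoidal S-layer
    return np.hstack([S, X, np.ones((len(X), 1))])

# With the design matrix Phi fixed, each kernel centre mu_k(x) = Phi @ w_k is
# linear in w_k, so the M-step for w_k becomes a weighted least-squares fit.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Phi = rvfl_design_matrix(X, rng.normal(scale=1.0, size=(3, 10)), rng.normal(size=10))
```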

Training a single model

We tested the prediction performance of the GM-RVFL network on the time series prediction problem of Eq. (22). We chose a GM-RVFL architecture that contained the same number of nodes as the GM network studied before, though with additional direct connections between the input and the S-layer. The initialization of the adaptable parameters was as described in Section 3, and the learning rule for the weights w, Eq. (33), was solved using singular value decomposition.
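Eq. (33) itself is not reproduced in this excerpt, but a responsibility-weighted linear least-squares problem of this kind can be solved robustly via singular value decomposition, as in the generic sketch below (an illustration under that assumption, not the paper's exact update).

```python
import numpy as np

def weighted_lstsq_svd(Phi, y, r, rcond=1e-10):
    """Solve min_w sum_t r_t * (y_t - Phi[t] @ w)**2 via the SVD pseudo-inverse.

    Phi : (N, M) design matrix (e.g. fixed random basis outputs)
    y   : (N,)   targets
    r   : (N,)   non-negative weights, e.g. EM responsibilities of one kernel
    """
    sr = np.sqrt(r)
    A = Phi * sr[:, None]
    b = y * sr
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > rcond * s[0], 1.0 / s, 0.0)   # discard tiny singular values
    return Vt.T @ (s_inv * (U.T @ b))

# Example with toy data: fit the centre weights of one kernel
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 5)); y = rng.normal(size=50); r = rng.uniform(size=50)
w = weighted_lstsq_svd(Phi, y, r)
```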

Synthetic stochastic time series: the double-well problem

As a further empirical test, we applied the model to a benchmark time series studied earlier by Ormoneit (1993), Neuneier et al. (1994) and Husmeier and Taylor (1997). A particle of mass m moves in a double-well potential $V(x) = 0.5x^4 - x^2 + 1$ subject to the Brownian dynamics
$$\frac{d^2x}{dt^2} = -\frac{1}{m}\frac{dV}{dx} - \alpha\frac{dx}{dt} + R(t)$$
where R(t) is a gaussian stochastic variable with zero mean and intensity $\langle R(t)R(t')\rangle = 2kT\,\delta(t-t')$, which can be interpreted as a coupling between the system and a heat bath of temperature T (k = the
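For illustration, these dynamics can be simulated with a simple Euler discretization; the step size, friction, mass and temperature below are arbitrary choices for the sketch, not the settings used in the paper.

```python
import numpy as np

def double_well_series(n, dt=0.01, m=1.0, alpha=1.0, kT=0.5, sample_every=10, rng=None):
    """Euler discretization of the Langevin dynamics in the double-well
    potential V(x) = 0.5*x**4 - x**2 + 1 (all numerical values illustrative)."""
    if rng is None:
        rng = np.random.default_rng(0)
    x, v = 0.0, 0.0
    out = []
    for step in range(n * sample_every):
        dVdx = 2.0 * x ** 3 - 2.0 * x
        # white noise R(t) with <R(t)R(t')> = 2kT delta(t-t'): std sqrt(2kT/dt) per step
        R = np.sqrt(2.0 * kT / dt) * rng.normal()
        a = -dVdx / m - alpha * v + R
        v += a * dt
        x += v * dt
        if step % sample_every == 0:
            out.append(x)
    return np.array(out)

series = double_well_series(1000)
```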

Committee performance and diversity

The improvement of the generalization performance obtained by employing a network committee is a consequence of Jensen's inequality,
$$c_i \ge 0,\quad \sum_i c_i = 1,\quad \bar z \equiv \sum_i c_i z_i \;\Longrightarrow\; \ln(\bar z) \ge \sum_i c_i \ln(z_i) \;\Longleftrightarrow\; \sum_i c_i \ln\frac{\bar z}{z_i} \ge 0,$$
which follows from the concavity of ln(·). Given that we have a set of weight factors $\{c_i\}$ satisfying the above condition, the committee prediction $\bar P(y_t|\mathbf{x}_t)$ is given by
$$\bar P(y_t|\mathbf{x}_t) = \sum_{i=1}^{N_{Com}} c_i\, P(y_t|\mathbf{x}_t, M_i)$$
where $M_i$ symbolizes the ith model in the committee. With
$$E_{Com} = -\frac{1}{N}\ln \bar P(D) = -\frac{1}{N}\sum_{t=1}^{N}\ln \bar P(y_t|\mathbf{x}_t), \qquad E_{single}(i) = -\frac{1}{N}\ln P(D|M_i) = -\frac{1}{N}\sum_{t=1}^{N}\ln
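A small numerical sketch of the committee prediction and of the Jensen argument (toy densities and equal weights, all values illustrative): the committee "error" never exceeds the weighted average of the single-model "errors".

```python
import numpy as np

# p[i, t] = P(y_t | x_t, M_i): per-model predictive densities on N test points
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 1.0, size=(4, 100))        # 4 models, 100 points (toy values)
c = np.full(4, 0.25)                              # committee weights, c_i >= 0, sum = 1

p_committee = c @ p                               # P_bar(y_t|x_t) = sum_i c_i P(y_t|x_t,M_i)
E_com = -np.mean(np.log(p_committee))             # committee "error"
E_single = -np.mean(np.log(p), axis=1)            # single-model "errors"

# Jensen's inequality (concavity of ln): E_com <= sum_i c_i * E_single(i)
assert E_com <= c @ E_single + 1e-12
print(E_com, c @ E_single)
```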

A weighting scheme for predictors

The simplest form of the weight factors $c_i$ is to choose them all equal in size, $c_i := 1/N_{Com}\;\forall i$. However, one would expect to be able to reduce the generalization "error" still further if their setting could be optimized in some way. A straightforward method, reviewed in Bishop (1995), Chapter 9, would be to consider $E_{Com}^{train} = -(1/N)\ln \bar P(D_{train})$ as a function of the $\{c_i\}$, and then adapt the latter by following the gradient, $\Delta c_i \propto -\partial E/\partial c_i$. The disadvantage of such an approach is that the adaptation
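A hedged sketch of such a gradient-based adaptation of the weight factors is given below; the softmax reparameterization used to keep the $c_i$ non-negative and normalized is my own choice for the illustration, not necessarily the constraint handling used in the paper.

```python
import numpy as np

def optimize_committee_weights(p_train, steps=500, lr=0.5):
    """Minimize E_Com^train = -(1/N) * sum_t ln sum_i c_i * p[i, t] over the
    committee weights. The parameterization c = softmax(z) enforces
    c_i >= 0 and sum_i c_i = 1 (illustrative choice).

    p_train : (N_Com, N) array with p_train[i, t] = P(y_t | x_t, M_i)
    """
    z = np.zeros(p_train.shape[0])
    for _ in range(steps):
        c = np.exp(z - z.max()); c /= c.sum()
        p_bar = c @ p_train                               # (N,)
        grad_c = -np.mean(p_train / p_bar, axis=1)        # dE/dc_i
        grad_z = c * (grad_c - c @ grad_c)                # chain rule through softmax
        z -= lr * grad_z
    c = np.exp(z - z.max()); c /= c.sum()
    return c

# Example with toy predictive densities for four models
p = np.random.default_rng(0).uniform(0.05, 1.0, size=(4, 200))
print(optimize_committee_weights(p))
```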

Automatic relevance determination (ARD)

In Section 5 it was demonstrated that the choice of an appropriate value for σrand, the standard deviation of the random-weight distribution, is crucial for a good generalization performance of the model. It was suggested to carry out several simulations with different values of σrand, and then select those models that show the best performance on the cross-validation set. The objective of this section is to improve this scheme in two respects. Firstly, rather than drawing all the weights
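The refinement hinted at here can be sketched as drawing the fixed random input-to-hidden weights with one width per input dimension instead of a single σrand; the function below is only an illustration of that idea, since the excerpt does not show the paper's actual scheme.

```python
import numpy as np

def draw_random_weights(d_in, n_hidden, sigma_per_input, rng=None):
    """Draw the fixed random input-to-hidden weights with one standard
    deviation per input dimension (an ARD-style refinement of a single
    sigma_rand; illustrative, not the paper's exact prescription).

    sigma_per_input : (d_in,) standard deviations; small values effectively
                      switch off the corresponding (irrelevant) input
    """
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.normal(size=(d_in, n_hidden)) * np.asarray(sigma_per_input)[:, None]
```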

A real-world regression problem: the Boston house-price data

We applied the scheme to the Boston house-price data of Harrison and Rubinfeld (1978). This is a real-world regression problem that has been studied by several authors before and can be used as a benchmark set. For each of the 506 census tracts within the Boston metropolitan area the data give 13 socioeconomic explanatory variables, as

Summary and discussion

This study has tested the generalization performance of the GM-RVFL model on three different benchmark problems. The results on the first two problems, the synthetic stochastic time series, were similar. They both suggest that the prediction performance of a single GM-RVFL model is typically as good as that of a fully adaptable GM network, provided a reasonable choice for the distribution width σrand has been made. The considerable acceleration of the training process easily allows an

Conclusions and future work

Predicting conditional probability densities with neural networks requires architectures with (at least) two hidden layers (like the GM). This larger model complexity renders gradient-based training schemes on the standard maximum-likelihood error surface rather slow, and suggests the application of the faster EM algorithm. However, using a GM as network model, the application of the latter is not immediately possible. The minimization step ("M-step") can only be carried out for two of the

Acknowledgements

Dirk Husmeier is supported by a Postgraduate Trust Studentship from the University of London. We would like to thank Dr A. C. C. Coolen for helpful comments on the manuscript.

References (33)

  • A.P. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society (1977)
  • B. Igelnik et al., Stochastic choice of basis functions in adaptive functional approximation and the functional-link net, IEEE Transactions on Neural Networks (1995)
  • M.I. Jordan et al., Hierarchical mixtures of experts and the EM algorithm, Neural Computation (1994)
  • A. Krogh et al., Statistical mechanics of ensemble learning, Physical Review (1997)
  • LeBlanc, M. & Tibshirani, R. (1993). Combining estimates in regression and classification. Technical Report, Department...
  • MacKay, D.J.C. (1992a). Bayesian interpolation. Neural Computation, 4,...