
Neural Networks

Volume 11, Issue 9, December 1998, Pages 1685-1694

Contributed article
An empirical comparison of back propagation and the RDSE algorithm on continuously valued real world data

https://doi.org/10.1016/S0893-6080(98)00090-2

Abstract

The ability of a neural network to generalise is dependent on how representative the training patterns were of the whole data domain, and how smoothly the network has fitted to these patterns [Sethi, I.K. (1990). IEEE International Joint Conference on Neural Networks, Seattle, WA, Vol. 2, pp. 219–224]. In non-scaled continuous data domains, training examples will lie at differing distances from each other, making the fitting problem more difficult and varied. This paper introduces a new neuron with an adaptive steepness parameter, implemented as an extra internal connection, which is altered to better interpolate between the data points that its hyperplane divides. Networks of the new neuronal model are trained using a new paradigm entitled the random directed search by entropy algorithm (RDSE). This involves constructing a network by training one neuron at a time and freezing the weights. Each neuron is trained using directed random search [Baba (1989). Neural Networks, 2, 367–373] to find a hyperplane that separates examples by minimising an entropy measure [Quinlan (1986). Induction of Decision Trees, Machine Learning, Vol. 1, pp. 81–106]. This training paradigm solves the problem of pre-defining a network topology, has few problems with local minima, can handle unscaled continuous input data and can be fully trained in a relatively short time scale when compared with other methods, e.g. back propagation (BP).

An example benchmark problem is used to illustrate the effects of the new neuronal model, and results for two real world data domains are given which display an improved classification rate when compared against networks with a constant steepness value for every neuron. An empirical comparison between BP and RDSE for the two data sets is also given. These results display improved training times, robustness and classification rates for RDSE when compared against BP.

Introduction

Continuous input patterns to a neural network can be considered to lie in an n-dimensional input space, where the n attribute values define the pattern's position. When training a neural network to discriminate between two classes of examples, this space is partitioned, using hyperplanes, into volumes, each of which should contain examples of only one of the two classes. Each hyperplane is defined by a single neuron and, when an example is presented, the neuron's activation x is proportional to the distance of that data point from the hyperplane. The neuron output y is calculated by applying a transfer function to its activation, the most common of which is the sigmoid (Rumelhart et al., 1986a):

y = 1 / (1 + e^(−sx))

where s is the steepness parameter. If s is very large then this approximates a step function, but for smaller values we get a smooth, continuously changing neuronal output (Hertz et al., 1991).
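The role of the steepness parameter can be illustrated with a short sketch. The code below is not from the paper; the weight and bias values are arbitrary placeholders, and NumPy is used only for convenience.

```python
import numpy as np

def neuron_output(x, w, b, s=1.0):
    """Sigmoid output of a single neuron; the activation w.x + b is
    proportional to the signed distance of x from the hyperplane w.x + b = 0."""
    activation = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-s * activation))

w = np.array([0.8, -0.3])                  # illustrative weights, not from the paper
b = 0.1                                    # illustrative bias
point = np.array([2.0, 1.0])
print(neuron_output(point, w, b, s=1.0))   # smooth, graded response
print(neuron_output(point, w, b, s=50.0))  # approximates a step function
```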

A 2-D partitioned input space can be visualised as a regularly shaped landscape where the third dimension is defined by the trained network output for any point in the input space (as shown in Fig. 5, Fig. 6, Fig. 7). Raised areas (white in the figures) correspond to the positive class of input data and low areas correspond to the negative (or zero) class. This landscape is the network's interpolation of the training data, and for a new pattern at any position in the input space the network output can be considered as the probability of that pattern's class being positive. Usually a standard steepness parameter is used for every neuron throughout a neural network, giving a curvature which is locally dictated by the learnt weights and thus a sub-optimal interpolation of the training data. If, instead, the curvature were steep for training examples that lie close to the dividing hyperplane and shallow for those that lie further apart (as shown in Fig. 1), the interpolation of the input space would fit the domain knowledge much more closely.

This can be achieved by using an adaptive steepness parameter for each neuron, which can be altered to interpolate between the training examples that each hyperplane divides. A trained network of these neurons will create a probability map (McLean et al., 1994) of the data domain based on the training examples, where areas of the map which contain examples will have a correspondingly conclusive output, and parts which do not will have an interpolated output based on the surrounding known areas. The resulting network would generalise more validly on test data, as the outputs would be extrapolated from this domain map.
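One plausible way to realise such an adaptive steepness is sketched below. The rule used here, choosing s so that the training example nearest the hyperplane already receives a near-conclusive output, is an illustrative assumption rather than the paper's exact update; the function name `adaptive_steepness` and the `target` parameter are hypothetical.

```python
import numpy as np

def adaptive_steepness(X_pos, X_neg, w, b, target=0.9):
    """Choose s so that the training example nearest the hyperplane, on either
    side, already yields an output of about `target` (or 1 - target)."""
    a_pos = np.min(np.abs(X_pos @ w + b))   # smallest positive-class activation magnitude
    a_neg = np.min(np.abs(X_neg @ w + b))   # smallest negative-class activation magnitude
    gap = min(a_pos, a_neg) + 1e-12         # guard against examples lying on the plane
    # Solve 1 / (1 + exp(-s * gap)) = target  =>  s = ln(target / (1 - target)) / gap
    return np.log(target / (1.0 - target)) / gap
```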

Section snippets

Discussion

Each hyperplane is trained so as to separate positive training examples from negative examples. In order to achieve good generalisation (Sankar and Mammone, 1991; Kruschke and Movellan, 1991) and to represent the domain knowledge as cost-effectively as possible, the fewer hyperplanes used, the better. The partitioned input space can be considered as a probability map of the data domain where the network output, for any given example position in the input space, reflects the probability of its
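As background for the snippets that follow, a simple version of the entropy measure used to score a candidate hyperplane might look as follows. This is a sketch in the spirit of Quinlan's (1986) measure cited in the abstract, not necessarily the exact weighting used by RDSE.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a set of binary (0/1) labels."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def split_entropy(X, y, w, b):
    """Example-weighted entropy of the two half-spaces created by w.x + b = 0."""
    side = (X @ w + b) >= 0
    n = len(y)
    return (side.sum() / n) * entropy(y[side]) + ((~side).sum() / n) * entropy(y[~side])
```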

The RDSE algorithm

When experimenting with other neural network paradigms, the most persistent and frustrating problem was getting the algorithm to converge to a solution reliably. All the gradient-descent-based algorithms have difficulties with local minima, even when simulated annealing techniques are introduced. Annealing techniques involve a large number of extra parameters which require domain-specific settings, and they do not entirely alleviate the problem. This means that a large number of runs need to be

Description

Paving algorithms construct a network topology to fit the training problem by adding and training a single neuron at a time in the current neuronal layer. Once the current layer has partitioned all the training data into volumes of class purity (convex hulls), a new layer is begun. The new layer is trained on the transformed data passed through the previous layers, and neurons are
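A hedged sketch of this paving-style construction is given below: each neuron is trained by a simple directed random search that minimises the split entropy from the earlier sketch and is then frozen, and neurons are added until the layer's regions are class-pure. The stopping test, search schedule and function names are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def train_neuron_random_search(X, y, n_iters=2000, step=0.5, rng=None):
    """Hill-climbing random search for a hyperplane (w, b) minimising split entropy."""
    rng = rng or np.random.default_rng(0)
    dim = X.shape[1]
    best_w, best_b = rng.normal(size=dim), rng.normal()
    best_score = split_entropy(X, y, best_w, best_b)   # from the earlier sketch
    for _ in range(n_iters):
        cand_w = best_w + step * rng.normal(size=dim)  # perturb around the current best
        cand_b = best_b + step * rng.normal()
        score = split_entropy(X, y, cand_w, cand_b)
        if score < best_score:                         # keep only improving moves
            best_w, best_b, best_score = cand_w, cand_b, score
    return best_w, best_b

def build_layer(X, y, max_neurons=10):
    """Add neurons, one at a time, until every region carved out by the layer is class-pure."""
    neurons = []
    for _ in range(max_neurons):
        w, b = train_neuron_random_search(X, y)
        neurons.append((w, b))
        codes = np.stack([(X @ w_i + b_i) >= 0 for w_i, b_i in neurons], axis=1)
        # class purity: every distinct region code contains examples of at most one class
        pure = all(len(np.unique(y[(codes == c).all(axis=1)])) <= 1
                   for c in np.unique(codes, axis=0))
        if pure:
            break
    return neurons
```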

Continuous X-OR results

Fig. 5, Fig. 6 and Fig. 7 depict the input space for neural networks consisting of 2–2–1 neurons, which have been trained using RDSE on continuous X-OR data (see Appendix A). The black areas represent a zero output of the network and the white areas represent a 1, for every continuous xy coordinate in the range 2.5–12.0 in steps of 0.095. The shading in between shows interpolated regions, along the dividing hyperplanes, which are split into ten shades between black and white representing outputs of
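The figures described here could, in principle, be reproduced by evaluating a trained network over the stated grid and quantising its output into ten grey bands, roughly as sketched below; `net` is a placeholder for any callable returning the network output at an (x, y) point, not code from the paper.

```python
import numpy as np

def landscape(net, lo=2.5, hi=12.0, step=0.095, levels=10):
    """Evaluate `net` over the grid used in the figures and quantise the
    output into `levels` grey shades between black (0) and white (1)."""
    xs = np.arange(lo, hi + step, step)
    grid = np.array([[net(np.array([x, y])) for x in xs] for y in xs])
    shades = np.floor(grid * levels).clip(0, levels - 1) / (levels - 1)
    return xs, shades
```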

Testing

BP is probably the most commonly used and well documented (Rumelhart et al., 1986a, Rumelhart et al., 1986b; Plaut et al., 1986) FFNN training algorithm. Here, the standard version, with the momentum term (Plaut et al., 1986), has been used for a comparative evaluation with RDSE. Two different, continuously valued real world data sets were used in this comparison, the diabetes in Pima Indians set and the heart disease data set.
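For reference, the momentum form of the back propagation weight update takes the shape sketched below; the learning rate and momentum values are placeholders, not the settings reported in the paper.

```python
import numpy as np

def momentum_update(w, grad, velocity, lr=0.1, momentum=0.9):
    """delta_w(t) = momentum * delta_w(t-1) - lr * dE/dw; returns (new_w, new_velocity)."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# toy usage with dummy gradients
w, v = np.zeros(3), np.zeros(3)
for grad in [np.array([0.2, -0.1, 0.05])] * 5:
    w, v = momentum_update(w, grad, v)
```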

In order to find an optimal network topology and set of learning

Diabetes results

Table 2 displays the results obtained when building RDSE networks for the diabetes in Pima Indians database. This data set consisted of 175 training examples and 243 test examples, each of which comprised 8 continuously valued attributes and belonged to one of two classes. The results are averaged over sets of 10 experiments with the same learning parameters. Each experiment in a set had a different random number seed, and so each resulting network could have yielded a different solution. The

Heart disease results

BP and RDSE were comparatively tested using the heart disease data set. This data set contains 270 examples consisting of 13 patient attributes. Of these, 120 are classed as having heart disease and the remaining 150 are classed as free from heart disease. 240 examples were used as a training set and the remainder for testing. The data set has an associated cost matrix as it is more costly

Conclusion

In Fig. 5 of the continuous X-OR results, there is very little shading around the dividing hyperplanes, showing that the common steepness parameter of 1.0 approximates a step function and will therefore give poor generalisation on test examples (Sethi, 1990). In Fig. 6 and Fig. 7, there is a much smoother interpolation between the maximal and the minimal probability areas, clearly shown by the bands of shading. This improved interpolation of the training data gives better extrapolation when

References (14)

  • Baba, N. (1989). A new approach for finding the global minimum of error function of neural networks. Neural Networks, 2, 367–373.
  • Cios, K.J. (1992). A machine learning method for generation of a neural network architecture: A continuous ID3 algorithm. IEEE Transactions on Neural Networks.
  • Hertz, J., Krogh, A., & Palmer, R.G. (1991). Introduction to the theory of neural computation (p. 27). Redwood City,...
  • Kruschke, J.K., et al. (1991). Benefits of gain: Speeded learning and minimal hidden layers in backpropagation networks. IEEE Transactions on Systems, Man, and Cybernetics.
  • McLean, D., Bandar, Z., & O'Shea, J. (1994). Improved interpolation and extrapolation from continuous training examples...
  • McLean, D. (1997). The RDSE Algorithm,...
  • Michie, D., Spiegelhalter, D.J., & Taylor, C.C. (1994). Machine learning, neural and statistical classification, Ellis...
