Contributed article

An empirical comparison of back propagation and the RDSE algorithm on continuously valued real world data
Introduction
Continuous input patterns to a neural network can be considered to lie in an n-dimensional input space, where the n attribute values define the pattern's position. When training a neural network to discriminate between two classes of examples, this space is partitioned by hyperplanes into volumes, each of which should contain examples of only one of the two classes. Each hyperplane is defined by a single neuron and, when an example is presented, the neuron's activation x is proportional to the distance of that data point from the hyperplane. The neuron output y is calculated by applying a transfer function to its activation, the most common of which is the sigmoid (Rumelhart et al., 1986a):

$$y = \frac{1}{1 + e^{-sx}}$$

where s is the steepness parameter. If s is very large then this approximates a step function, but for smaller values we get a smooth, continuously changing neuronal output (Hertz et al., 1991).
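For illustration, the following minimal Python sketch implements such a neuron; the weight, bias, and steepness values below are arbitrary illustrative choices, not values taken from any experiment reported here.

```python
import math

def neuron_output(x, weights, bias, s):
    """Output of a single sigmoid neuron.

    The activation w.x + b is proportional to the example's distance
    from the hyperplane w.x + b = 0; s is the steepness parameter.
    """
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-s * activation))

# A large s approximates a step function; a small s gives a smooth,
# gradually changing output for the same activation.
for s in (0.5, 1.0, 10.0):
    print(s, neuron_output([0.2, -0.4], [1.0, 1.0], 0.1, s))
```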
A 2-D partitioned input space can be visualised as a regularly shaped landscape in which the third dimension is defined by the trained network output for any point in the input space (as shown in Fig. 5, Fig. 6, Fig. 7). Raised areas (white in the figures) correspond to the positive class of input data and low areas correspond to the negative (or zero) class. This landscape is the network's interpolation of the training data, and for a new pattern at any position in the input space the network output can be considered as the probability of that pattern's class being positive. Usually a standard steepness parameter is used for every neuron throughout a neural network, giving a curvature which is locally dictated by the learnt weights and thus a sub-optimal interpolation of the training data. If, instead, the curvature were steep for training examples that lie close to the dividing hyperplane and shallow for those that lie further away (as shown in Fig. 1), the interpolation of the input space would fit the domain knowledge much more closely.
This can be achieved by giving each neuron an adaptive steepness parameter, which can be altered to interpolate between the training examples that its hyperplane divides. A trained network of these neurons will create a probability map (McLean et al., 1994) of the data domain based on the training examples, where areas of the map which contain examples will have a correspondingly conclusive output, and parts which do not will have an interpolated output based on the surrounding known areas. The resulting network generalises more reliably on test data, as its outputs are extrapolated from this domain map.
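To see the effect of adapting s to the spacing of the training examples, the sketch below uses one plausible rule, chosen here purely for illustration and not necessarily the rule used by RDSE: pick s so that the output reaches 0.9 at the nearest positive training example. When the examples straddling the hyperplane are far apart, the resulting curvature is shallow; when they are close, it is steep.

```python
import math

def sigmoid(a, s):
    return 1.0 / (1.0 + math.exp(-s * a))

# One 1-D "hyperplane" at x = 0 separating a negative example at -d
# and a positive example at +d.  Illustrative rule: solve
# sigmoid(d, s) = 0.9, giving s = ln(9) / d.
for d in (0.5, 2.0):  # nearest examples close together vs. far apart
    s = math.log(9.0) / d
    band = [round(sigmoid(x / 10.0, s), 2) for x in range(-20, 21, 5)]
    print(f"d={d}, s={s:.2f}:", band)
```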
Discussion
Each hyperplane is trained so as to separate positive training examples from negative examples. In order to achieve good generalisation (Sankar and Mammone, 1991; Kruschke and Movellan, 1991) and to represent the domain knowledge as cost-effectively as possible, the fewer hyperplanes used the better. The partitioned input space can be considered as a probability map of the data domain, where the network output, for any given example position in the input space, reflects the probability of its class being positive.
The RDSE algorithm
When experimenting with other neural network paradigms, the most consistent and annoying problem was getting the algorithm to converge to a solution reliably. All the gradient-descent-based algorithms have difficulties with local minima, even when simulated annealing techniques are introduced. Annealing techniques involve a large number of extra parameters which require domain-specific settings, and do not entirely alleviate the problem. This means that a large number of runs need to be performed.
Description
Paving algorithms construct a network topology to fit the training problem by adding and training a single neuron at a time in the current neuronal layer. Once the current layer has partitioned all the training data into volumes of class purity, each a convex hull, a new layer is begun. The new layer is trained on the transformed data passed through the previous layers, and neurons are added and trained one at a time as before.
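A structural sketch of this paving loop is given below. The helper callables (train_single_neuron, layer_is_pure, transform) are hypothetical stand-ins, since the snippet does not give the exact training rule, and the stopping condition shown is one plausible choice rather than RDSE's definitive criterion.

```python
def build_paving_network(data, train_single_neuron, layer_is_pure, transform):
    """Constructive 'paving' loop: neurons are added one at a time to
    the current layer until every volume it carves out contains
    examples of only one class, then a new layer is begun on the data
    transformed by the layers built so far.

    Assumes each layer needs at least one neuron to reach purity.
    """
    network = []
    current = data
    while True:
        layer = []
        while not layer_is_pure(layer, current):
            # Add and train a single neuron on the current layer's data.
            layer.append(train_single_neuron(layer, current))
        network.append(layer)
        if len(layer) == 1:
            # One plausible stopping rule: a single neuron now
            # separates the classes, so it serves as the output layer.
            break
        current = transform(layer, current)  # pass data through the layer
    return network
```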
Continuous X-OR results
Fig. 5, Fig. 6 and Fig. 7 depict the input space for neural networks consisting of 2–2–1 neurons, which have been trained using RDSE on continuous X-OR data (see Appendix A). The black areas represent a zero output of the network and the white areas represent a 1, for every continuous x–y coordinate in the range 2.5–12.0 in steps of 0.095. The shading in between shows interpolated regions, along the dividing hyperplanes, which are split into ten shades between black and white representing intermediate outputs between 0 and 1.
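The figures can be reproduced conceptually by sweeping the stated grid and quantising the network output into shade bands, as in the sketch below; net is a placeholder for any trained 2–2–1 network mapping an (x, y) coordinate to an output in [0, 1].

```python
def render_input_space(net, lo=2.5, hi=12.0, step=0.095, shades=10):
    """Sample the 2-D input space and quantise each network output
    into one of `shades` bands between 0 (black) and 1 (white)."""
    rows = []
    y = lo
    while y <= hi:
        row = []
        x = lo
        while x <= hi:
            out = net(x, y)  # network output, assumed in [0, 1]
            row.append(min(int(out * shades), shades - 1))
            x += step
        rows.append(row)
        y += step
    return rows

# Example with a toy stand-in for a trained network:
grid = render_input_space(lambda x, y: (x + y) / 24.0)
print(len(grid), "rows x", len(grid[0]), "columns")
```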
Testing
BP is probably the most commonly used and well documented (Rumelhart et al., 1986a, Rumelhart et al., 1986b; Plaut et al., 1986) FFNN training algorithm. Here, the standard version, with the momentum term (Plaut et al., 1986), has been used for a comparative evaluation with RDSE. Two different, continuously valued real world data sets were used in this comparison, the diabetes in Pima Indians set and the heart disease data set.
In order to find an optimal network topology and set of learning parameters, a range of configurations was trained and evaluated for each algorithm.
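For reference, the momentum variant of BP adds a fraction of the previous weight change to each update; a minimal sketch of the update rule follows, with illustrative learning rate and momentum values.

```python
def momentum_update(weights, grads, velocity, lr=0.1, alpha=0.9):
    """Standard BP weight update with a momentum term:
    v <- alpha * v - lr * dE/dw;  w <- w + v."""
    for i in range(len(weights)):
        velocity[i] = alpha * velocity[i] - lr * grads[i]
        weights[i] += velocity[i]
    return weights, velocity

# One update step with illustrative weights and gradients:
w, v = [0.5, -0.3], [0.0, 0.0]
w, v = momentum_update(w, [0.2, -0.1], v)
print(w, v)
```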
Diabetes results
Table 2 displays the results obtained when building RDSE networks for the diabetes in Pima Indians database. This data set consisted of 175 training examples and 243 test examples, each comprising 8 continuously valued attributes and belonging to one of two classes. The results are averaged over sets of 10 experiments with the same learning parameters. Each experiment in a set had a different random number seed, so each resulting network could have yielded a different solution.
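This averaging protocol can be expressed directly, as sketched below; run_experiment is a hypothetical stand-in for a single RDSE build-and-test run returning a test-set score.

```python
import random

def average_over_seeds(run_experiment, params, n_runs=10):
    """Average the test score over a set of runs that differ only in
    their random number seed, as in the protocol described above."""
    results = []
    for seed in range(n_runs):
        random.seed(seed)
        results.append(run_experiment(params))
    return sum(results) / len(results)

# Toy stand-in: a 'run' that returns a noisy score around 0.75.
print(average_over_seeds(lambda p: 0.75 + random.uniform(-0.05, 0.05), {}))
```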
Heart disease results
BP and RDSE were comparatively tested using the heart disease data set. This data set contains 270 examples consisting of 13 patient attributes. Of these, 120 are classed as having heart disease and the remaining 150 are classed as free from heart disease. 240 examples were used as a training set and the remainder for testing. The data set has an associated cost matrix, as it is more costly to misclassify a patient who has heart disease as healthy than the reverse.
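Scoring against a cost matrix can be sketched as below; the cost values shown are illustrative placeholders rather than the matrix actually associated with the data set.

```python
def cost_weighted_error(predictions, labels, cost):
    """Total misclassification cost, where cost[actual][predicted]
    is the penalty for predicting `predicted` when the truth is
    `actual`; correct predictions cost nothing."""
    return sum(cost[a][p] for p, a in zip(predictions, labels))

# Illustrative cost matrix: misclassifying a diseased patient (class 1)
# as healthy (class 0) is penalised more heavily than the reverse.
cost = [[0, 1],   # actual healthy: correct, false alarm
        [5, 0]]   # actual diseased: missed case, correct
print(cost_weighted_error([0, 1, 0], [1, 1, 0], cost))
```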
Conclusion
In Fig. 5 of the continuous X-OR results, there is very little shading around the dividing hyperplanes, showing that the common steepness parameter of 1.0 is approximating a step function and will therefore give poor generalisation on test examples (Sethi, 1990). In Fig. 6 and Fig. 7, there is a much smoother interpolation between the maximal and the minimal probability areas, clearly shown by the bands of shading. This improved interpolation of the training data gives better extrapolation when classifying unseen test examples.
References (14)

- A new approach for finding the global minimum of error function of neural networks. Neural Networks (1989).
- A machine learning method for generation of a neural network architecture: A continuous ID3 algorithm. IEEE Transactions on Neural Networks (1992).
- Hertz, J., Krogh, A., & Palmer, R.G. (1991). Introduction to the theory of neural computation (p. 27). Redwood City, ...
- Kruschke, J.K., & Movellan, J.R. (1991). Benefits of gain: Speeded learning and minimal hidden layers in backpropagation networks. IEEE Transactions on Systems, Man, and Cybernetics.
- McLean, D., Bandar, Z., & O'Shea, J. (1994). Improved interpolation and extrapolation from continuous training examples ...
- McLean, D. (1997). The RDSE Algorithm, ...
- Michie, D., Spiegelhalter, D.J., & Taylor, C.C. (1994). Machine learning, neural and statistical classification. Ellis ...