Neural Networks

Volume 15, Issue 7, September 2002, Pages 881-890

A feed-forward network for input that is both categorical and quantitative

https://doi.org/10.1016/S0893-6080(02)00090-4

Abstract

The data on which a multi-layer perceptron (MLP) is to be trained to approximate a continuous function may have inputs that are categorical rather than numeric or quantitative, such as color, gender, or race. A categorical variable causes a discontinuous relationship between an input variable and the output. An MLP, with connection matrices that multiply input values and sigmoid functions that further transform values, represents a continuous mapping in all input variables. An MLP therefore requires that all inputs correspond to numeric, continuously valued variables. The usual way of dealing with this problem is to replace the categorical values by numeric ones and treat them as if they were continuously valued. However, there is no meaningful correspondence between the continuous quantities generated this way and the original categorical values. Another approach is to encode the categorical portion of the input using 1-out-of-n encoding and include this code as input to the MLP.

The approach in this paper is to segregate the categorical variables from the continuous independent variables completely. The MLP is trained with multiple outputs: a separate output unit for each allowed combination of values of the categorical independent variables. During training, the categorical value or combination of categorical values determines which of the output units should carry the target value, with the remaining outputs being ‘do not care’. Three data sets were used for comparison of methods. Results show that this approach is much more effective than the conventional approach of assigning continuous values to the categorical features. For the data set with several categorical variables, the method proposed here is also more effective than the 1-out-of-n input method.

Introduction

Often the data on which a multi-layer perceptron (MLP) is to be trained to approximate a continuous function has inputs that are categorical rather than numeric or quantitative. Examples of categorical variables are gender, race, region, type of industry, different species of fish, etc. A categorical variable causes a discontinuous relationship between an input variable and the output. A quantitative variable, on the other hand, is one that assumes numerical values corresponding to points on a real line, e.g. the kilowatt-hours of electricity used per day or the number of defects in a product. An example of a data set containing a combination of quantitative and categorical data is the ‘Boston housing’ data set (Harrison & Rubinfeld, 1978), which is used for training here. In this case there are two categorical independent variables: one is the Charles River dummy variable (=1 if tract bounds river; 0 otherwise) and the other is the index of accessibility to radial highways. Another example is the data used for predicting the age of abalone. In this case one of the independent variables is categorical, with values of M for male, F for female and I for infantile.

Categorical variables may be broken up into two types: ordinal and non-ordinal. Ordinal variable values may be ordered. A non-ordinal categorical variable is a label of a category without an ordering property, as in the case of the abalone data where no ordering can be placed on the values M, F and I. The Charles River dummy variable above also has this property.

An MLP (Werbos, 1974; Rumelhart et al., 1986; Wasserman, 1993), with connection matrices that multiply input values and sigmoid functions that further transform values, however, represents a continuous mapping in all input variables. It therefore requires that all inputs correspond to numeric, continuously valued variables. One way of dealing with the problem of categorical variables, which seems intuitively unacceptable, is to replace the categorical values by numeric ones and treat them as if they were continuously valued. Thus if the set of values for a categorical variable were {red, blue, green}, the values red=1, green=2 and blue=3 might be used. However, there is no meaningful correspondence between the continuous quantities generated this way and the original categorical values: this type of encoding imposes an ordering on the values that does not exist. To ensure the generalization capability of a neural network, the data should be encoded in a form that allows for interpolation. Categorical variables should therefore be distinguished from the continuous independent variables. One way of dealing with this is to use 1-out-of-n coding on the input, e.g. for the three colours we get red=(100), green=(010) and blue=(001). This means one input unit for each categorical value. In this paper, however, the encoded values are not input to the MLP but are used to select one of n output units. This is similar to, but not quite the same as, the more drastic solution of having a separate network for each combination of categorical values.
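As a concrete illustration of the two encodings just described, the following minimal sketch (Python/NumPy; the helper names are ours, not the paper's) contrasts arbitrary numeric replacement with 1-out-of-n coding for the value set {red, green, blue}.

import numpy as np

CATEGORIES = ["red", "green", "blue"]

def numeric_encoding(value):
    # Conventional replacement by an arbitrary real number, e.g. red=1, green=2, blue=3.
    # This imposes an ordering on the values that does not exist.
    return float(CATEGORIES.index(value) + 1)

def one_out_of_n_encoding(value):
    # 1-out-of-n code: one unit per categorical value, e.g. green=(0, 1, 0).
    code = np.zeros(len(CATEGORIES))
    code[CATEGORIES.index(value)] = 1.0
    return code

print(numeric_encoding("green"))       # 2.0
print(one_out_of_n_encoding("green"))  # [0. 1. 0.]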

Bishop (1994, 1995) introduces a new class of neural network models obtained by combining a conventional neural network with a mixture density model. The complete system is called a mixture density network (MDN). This network can in theory represent arbitrary conditional probability distributions in the same way that a conventional neural network can represent arbitrary functions. The neural network component of the MDN has three groups of output units for each of the Gaussian kernel functions. These units output a mean, a variance and a multiplier for each conditional probability density in the mixture. The multiplier is the probability that the corresponding density in the mixture is applicable. Each combination of categorical values could have its own density.
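The following is a rough sketch (not from this paper) of the output-layer structure described above, assuming a one-dimensional target, a single width parameter per Gaussian kernel, and Bishop's usual choice of a softmax for the multipliers and an exponential for the widths.

import numpy as np

def mdn_parameters(z, n_kernels):
    # z: raw outputs of the neural network component, of length 3*n_kernels,
    # grouped here as [multipliers | means | widths] for the Gaussian kernels.
    alpha_raw, mu, sigma_raw = np.split(z, [n_kernels, 2 * n_kernels])
    alpha = np.exp(alpha_raw) / np.sum(np.exp(alpha_raw))  # multipliers sum to one
    sigma = np.exp(sigma_raw)                              # strictly positive widths
    return alpha, mu, sigma

def conditional_density(t, alpha, mu, sigma):
    # p(t | x) as a mixture of Gaussian kernels centred on the predicted means.
    kernels = np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return float(np.sum(alpha * kernels))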

Lee and Lee (2001) also address the problem of multi-value regression estimation with a neural network architecture. They confine the multi-value regression problems to mappings from vectors to a scalar in which a single input vector may map to several scalars. They propose a modular network approach in which each module handles only a single output rather than a set of outputs. Since the number of outputs for a single input varies, a decision then has to be made as to which modules produce the correct output values for the given input.

The remainder of the paper is organized as follows. It commences with a discussion of some neural network approaches. Next follows a description of the neural network construction and its amended training algorithm. This is followed by the results of simulations that demonstrate the validity of the approach suggested in this paper. Finally, the summary includes a comparison of the neural network approach with the statistical approach based on indicator variables, a generalization of the approach in this paper, and a description of future work.

Section snippets

Training algorithm used for the hybrid network

The hybrid network to be trained consists of a one-hidden-layer MLP which is modified to allow the encoded categorical value to modify the output of the MLP. The learnable parameters are the connection matrices from the input layer to the hidden layer and from the hidden layer to the output of the MLP. In the detailed algorithm below, based on the training algorithm described by Brouwer (1997), f(x)=1/(1+e^(−x)) with f′(x)=e^(−x)/(1+e^(−x))^2. I is the identity matrix of the required dimension. ‘∗/’ is the
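By way of illustration only, here is a minimal sketch of a single training step for such a hybrid network. It uses plain gradient descent with back-propagated errors rather than the forward propagation of gradients of Brouwer (1997) on which the paper's algorithm is based; bias terms are omitted and all names (W1, W2, s) are ours.

import numpy as np

def f(x):
    # the sigmoid used above: f(x) = 1/(1 + e^(-x)), with f'(x) = f(x)(1 - f(x))
    return 1.0 / (1.0 + np.exp(-x))

def train_step(W1, W2, x, s, target, lr=0.1):
    # x: quantitative inputs; s: 1-out-of-n code over the categorical value
    # combinations, selecting which output unit carries the target; the
    # remaining output units are 'do not care' and receive no error.
    h = f(W1 @ x)            # hidden-layer activations
    y = W2 @ h               # one (linear) output per categorical combination
    err = s * (y - target)   # error on the selected output unit only
    dW2 = np.outer(err, h)
    dh = W2.T @ err
    dW1 = np.outer(dh * h * (1.0 - h), x)
    return W1 - lr * dW1, W2 - lr * dW2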

Using the network for prediction

Let us consider alternate ways of viewing the network when it is used for prediction after training has been used to find the parameters. Note that Eq. (1) can also be written as y=(W^(2)T s)^T f(W^(1) x), where W^(2)T is the transpose of the connection matrix from the hidden layer to the output layer of the MLP. The complete network can now be described as an MLP whose weight vector from the hidden layer to the single output is w(s)=W^(2)T s. The weight vector therefore is a function of the categorical input and of a
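Under the same illustrative assumptions as the training sketch above (no biases, hypothetical names), prediction then amounts to selecting the weight vector w(s)=W^(2)T s with the categorical code and applying it to the hidden-layer activations.

import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(W1, W2, x, s):
    # s is the 1-out-of-n code for the categorical value combination;
    # w_s = W2.T @ s is the weight vector from the hidden layer to the single output.
    w_s = W2.T @ s
    return float(w_s @ f(W1 @ x))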

Simulations

Following are the results of simulations that permit comparison of the approaches discussed previously. The three approaches compared are (1) the conventional method: replacing category values by real numbers and feeding these values, together with the quantitative values, as input; (2) 1-out-of-n: encoding the categorical value combinations using 1-out-of-n encoding and feeding the coded values, together with the quantitative values, as input; (3) the separation method: the same as method 2 except that
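The sketch below shows how the network input (and, for the separation method, the output-selection code) might be assembled under each of the three methods, using the abalone sex variable as the categorical example; the helper names are illustrative and not taken from the paper.

import numpy as np

CATEGORIES = ["M", "F", "I"]   # e.g. the abalone sex variable

def conventional_input(x_quant, cat):
    # method 1: replace the category by an arbitrary real number and append it
    return np.append(x_quant, float(CATEGORIES.index(cat) + 1))

def one_out_of_n_input(x_quant, cat):
    # method 2: append the 1-out-of-n code to the quantitative inputs
    code = np.zeros(len(CATEGORIES))
    code[CATEGORIES.index(cat)] = 1.0
    return np.concatenate([x_quant, code])

def separated_input(x_quant, cat):
    # method 3 (separation method): only the quantitative part is fed to the MLP;
    # the code is used to select the output unit, as in the training sketch above
    code = np.zeros(len(CATEGORIES))
    code[CATEGORIES.index(cat)] = 1.0
    return x_quant, code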

Representing several functions by a single network

The partitioning of the data according to combinations of categorical values with each element of the partition corresponding to a different function suggests that we may attempt to represent several functions simultaneously by a single MLP. Following is an experiment to see if that is feasible. Three separate functions are stored in a single network; binary arithmetic operators +, ∗, and /.

The network that is used has no sigmoid function in the output layer. The training data consists of 300
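A sketch of how such a training set could be generated follows; the operand range and the random assignment of operators are assumptions made for illustration, not the paper's actual setup.

import numpy as np

OPERATORS = ("+", "*", "/")
rng = np.random.default_rng(0)

def make_example(rng):
    a, b = rng.uniform(0.1, 1.0, size=2)   # operands; range chosen to avoid tiny divisors
    k = int(rng.integers(len(OPERATORS)))
    s = np.zeros(len(OPERATORS))
    s[k] = 1.0                              # selects the output unit for this operator
    target = {"+": a + b, "*": a * b, "/": a / b}[OPERATORS[k]]
    return np.array([a, b]), s, target

# 300 examples (the text above mentions 300; the exact composition is assumed)
training_set = [make_example(rng) for _ in range(300)]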

Summary and further work

We have demonstrated a very effective method for treating categorical features when training an MLP by segregating the categorical features from the quantitative features. The quantitative part of the feature vector will be processed by an MLP with additional output units. These outputs are then combined with the coded form of the categorical part of the feature vector. In some instances the most appropriate way is to have a separate function and therefore a separate MLP for each categorical

Acknowledgements

The support of an NSERC grant (Natural Sciences and Engineering Research Council of Canada) and the comments made by the referees are gratefully acknowledged.

References (10)

  • Bishop, C. M. (1994). Mixture density networks. NCRG/94/004, available from...
  • Bishop, C. M. (1995). Neural networks for pattern recognition.
  • Brouwer, R. K. (1997). Training a feedforward network by feeding gradients forward rather than by backpropagation of errors. Neurocomputing.
  • Harrison, D., et al. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management.
  • Jacobs, R. A., et al. (1992). Adaptive mixtures of local experts. Neural Computation.
