Revisiting squared-error and cross-entropy functions for training neural network classifiers

Abstract

This paper investigates the efficacy of the cross-entropy and squared-error objective functions used to train feed-forward neural networks as estimators of posterior probabilities. Previous research has found no appreciable difference between classifiers trained with cross-entropy and those trained with squared-error. The approach employed here, however, shows that cross-entropy offers significant, practical advantages over squared-error.
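
For reference, the two objective functions under comparison can be written, in standard notation (assumed here, not taken from the paper) for network outputs \(y_{nk}\) and targets \(t_{nk}\) over patterns n and classes k, as

$$ E_{\mathrm{SE}} = \sum_{n}\sum_{k} \left(y_{nk} - t_{nk}\right)^{2}, \qquad E_{\mathrm{CE}} = -\sum_{n}\sum_{k} t_{nk} \ln y_{nk}. $$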

Author information

Corresponding author

Correspondence to Victor L. Berardi.

Appendix

This appendix contains information concerning the parameters used in generating the simulated distributions of the illustration problems.

1.1 Trivariate normal (z1)

Let \(\mu_{ij}\) and \(\sigma^{2}_{ij}\) be the mean and variance of normal variable i for group j. The mean and variance parameters for Group 1 are \((\mu_{11}, \sigma^{2}_{11}) = (\mu_{21}, \sigma^{2}_{21}) = (\mu_{31}, \sigma^{2}_{31}) = (10.0, 25.0)\). For Group 2, the parameters are \((\mu_{12}, \sigma^{2}_{12}) = (\mu_{22}, \sigma^{2}_{22}) = (\mu_{32}, \sigma^{2}_{32}) = (5.5, 25.0)\), and for Group 3, \((\mu_{13}, \sigma^{2}_{13}) = (\mu_{23}, \sigma^{2}_{23}) = (\mu_{33}, \sigma^{2}_{33}) = (7.5, 25.0)\). Let \(\Sigma_{1}\), \(\Sigma_{2}\), and \(\Sigma_{3}\) be the variance-covariance matrices for Groups 1, 2, and 3, respectively. For this example, \(\Sigma_{1} = \Sigma_{2} = \Sigma_{3} = \Sigma\), where

$$ \Sigma = \begin{pmatrix} 25.0 & 7.5 & 22.5 \\ 7.5 & 25.0 & 15.0 \\ 22.5 & 15.0 & 25.0 \end{pmatrix}. $$
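
A minimal sketch of how these trivariate normal groups could be simulated (not the authors' code; numpy, the random seed, and the sample size of 1,000 per group are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only

# Group means from Sect. 1.1
means = {1: [10.0, 10.0, 10.0],
         2: [5.5, 5.5, 5.5],
         3: [7.5, 7.5, 7.5]}

# Common variance-covariance matrix shared by all three groups
sigma = np.array([[25.0,  7.5, 22.5],
                  [ 7.5, 25.0, 15.0],
                  [22.5, 15.0, 25.0]])

# Draw 1,000 trivariate normal observations per group (sample size assumed)
z1 = {g: rng.multivariate_normal(mu, sigma, size=1000)
      for g, mu in means.items()}

print(z1[1].mean(axis=0))  # should be close to (10.0, 10.0, 10.0)
```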

1.2 Bivariate Bernoulli (z2)

Let \(z = (Z_{1j}, Z_{2j})\) be the bivariate Bernoulli variables for group j, where \(P(Z_{1j} = 1) = p_{1j}\), \(P(Z_{2j} = 1) = p_{2j}\), and \(\rho_{j}\) is the correlation coefficient. For Group 1, \(p_{11} = 0.8\), \(p_{21} = 0.7\), and \(\rho_{1} = 0.2\). For Group 2, \(p_{12} = 0.5\), \(p_{22} = 0.55\), and \(\rho_{2} = 0.4\). For Group 3, \(p_{13} = 0.425\), \(p_{23} = 0.4\), and \(\rho_{3} = 0.6\).
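
Correlated Bernoulli pairs with these marginals and correlations can be drawn from their joint cell probabilities, since \(\rho = \bigl(P(Z_{1}=1, Z_{2}=1) - p_{1}p_{2}\bigr)/\sqrt{p_{1}(1-p_{1})\,p_{2}(1-p_{2})}\). The sketch below is illustrative only (not the authors' code; numpy, the seed, and the sample size are assumptions):

```python
import numpy as np

def bivariate_bernoulli(p1, p2, rho, size, rng):
    # Joint probability of (1, 1) recovered from the correlation identity
    p_joint = p1 * p2 + rho * np.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    # Cell probabilities for (Z1, Z2) in the order (1,1), (1,0), (0,1), (0,0)
    cells = np.array([p_joint, p1 - p_joint, p2 - p_joint,
                      1 - p1 - p2 + p_joint])
    outcomes = np.array([(1, 1), (1, 0), (0, 1), (0, 0)])
    idx = rng.choice(4, size=size, p=cells)
    return outcomes[idx]

rng = np.random.default_rng(0)  # assumed seed
groups = {1: (0.80, 0.70, 0.2), 2: (0.50, 0.55, 0.4), 3: (0.425, 0.40, 0.6)}
z2 = {g: bivariate_bernoulli(*params, size=1000, rng=rng)
      for g, params in groups.items()}

print(z2[1].mean(axis=0))  # roughly (0.8, 0.7)
```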

1.3 Weibull (z3, z4)

For Weibull variable i, let \(\alpha_{ij}\) be the shape parameter for group j and \(\beta_{ij}\) be the scale parameter. For this example, the parameters for the first Weibull variable z3 are \(\alpha_{11} = 4.0\), \(\beta_{11} = 1.0\); \(\alpha_{12} = 1.5\), \(\beta_{12} = 1.0\); and \(\alpha_{13} = 2.0\), \(\beta_{13} = 1.0\). For the second Weibull variable z4, \(\alpha_{21} = 0.35\), \(\beta_{21} = 1.0\); \(\alpha_{22} = 0.55\), \(\beta_{22} = 1.0\); and \(\alpha_{23} = 0.6\), \(\beta_{23} = 1.0\). Therefore, z3 has a concave density function and z4 a convex one.
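
A minimal sketch of sampling the two Weibull variables with these shape and scale parameters (not the authors' code; numpy, the seed, and the sample size are assumptions). numpy's weibull draw uses unit scale, so the scale parameter is applied by multiplication:

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed

# Shape parameters alpha_ij by group; all scale parameters beta_ij = 1.0
shapes_z3 = {1: 4.0, 2: 1.5, 3: 2.0}
shapes_z4 = {1: 0.35, 2: 0.55, 3: 0.6}
beta = 1.0

# rng.weibull(a) samples a standard Weibull with shape a and unit scale
z3 = {g: beta * rng.weibull(a, size=1000) for g, a in shapes_z3.items()}
z4 = {g: beta * rng.weibull(a, size=1000) for g, a in shapes_z4.items()}
```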

1.4 Binomial (z5)

Let T be the number of Bernoulli random variables, indexed \(t = 1, 2, \ldots, T\), composing the binomial random variate, and let \(p_{j}\) be the probability that each Bernoulli random variable for group j equals 1. Then \(\mu_{j} = Tp_{j}\) and \(\sigma^{2}_{j} = Tp_{j}(1 - p_{j})\). For this example, \(p_{1} = 0.5\), \(p_{2} = 0.3\), \(p_{3} = 0.7\), and T = 10.
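
A minimal sketch of the binomial variable z5 and a check of its moments against \(Tp_{j}\) and \(Tp_{j}(1 - p_{j})\) (not the authors' code; numpy, the seed, and the sample size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed

T = 10
p = {1: 0.5, 2: 0.3, 3: 0.7}

# Each draw is the sum of T independent Bernoulli trials with probability p_j
z5 = {g: rng.binomial(T, pj, size=1000) for g, pj in p.items()}

for g, pj in p.items():
    # sample mean/variance vs. theoretical T*p_j and T*p_j*(1 - p_j)
    print(g, z5[g].mean(), T * pj, z5[g].var(), T * pj * (1 - pj))
```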


Cite this article

Kline, D.M., Berardi, V.L. Revisiting squared-error and cross-entropy functions for training neural network classifiers. Neural Comput & Applic 14, 310–318 (2005). https://doi.org/10.1007/s00521-005-0467-y
