Revisiting squared-error and cross-entropy functions for training neural network classifiers

Abstract

This paper investigates the efficacy of the cross-entropy and squared-error objective functions used to train feed-forward neural networks as estimators of posterior probabilities. Previous research has found no appreciable difference between classifiers trained with cross-entropy and those trained with squared-error. The approach employed here, however, shows that cross-entropy offers significant, practical advantages over squared-error.
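
For reference, the two objective functions under comparison can be written, in standard notation (assumed here, not taken from the paper) for network outputs \(y_{nk}\) and targets \(t_{nk}\) over patterns n and classes k, as

$$ E_{\mathrm{SE}} = \sum_{n}\sum_{k} \left(y_{nk} - t_{nk}\right)^{2}, \qquad E_{\mathrm{CE}} = -\sum_{n}\sum_{k} t_{nk} \ln y_{nk}. $$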

Author information

Corresponding author

Correspondence to Victor L. Berardi.

Appendix

This appendix contains information concerning the parameters used in generating the simulated distributions of the illustration problems.

1.1 Trivariate normal (z1)

Let \(\mu_{ij}\) and \(\sigma^{2}_{ij}\) be the mean and variance of normal variable i for group j. The mean and variance parameters for Group 1 are \((\mu_{11}, \sigma^{2}_{11}) = (\mu_{21}, \sigma^{2}_{21}) = (\mu_{31}, \sigma^{2}_{31}) = (10.0, 25.0)\). For Group 2, the parameters are \((\mu_{12}, \sigma^{2}_{12}) = (\mu_{22}, \sigma^{2}_{22}) = (\mu_{32}, \sigma^{2}_{32}) = (5.5, 25.0)\), and for Group 3, \((\mu_{13}, \sigma^{2}_{13}) = (\mu_{23}, \sigma^{2}_{23}) = (\mu_{33}, \sigma^{2}_{33}) = (7.5, 25.0)\). Let \(\Sigma_{1}\), \(\Sigma_{2}\), and \(\Sigma_{3}\) be the variance-covariance matrices for Groups 1, 2, and 3, respectively. For this example, \(\Sigma_{1} = \Sigma_{2} = \Sigma_{3} = \Sigma\), where

$$ \Sigma = \begin{pmatrix} 25.0 & 7.5 & 22.5 \\ 7.5 & 25.0 & 15.0 \\ 22.5 & 15.0 & 25.0 \end{pmatrix}. $$
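
A minimal sketch of how these trivariate normal groups could be simulated (not the authors' code; numpy, the random seed, and the sample size of 1,000 per group are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only

# Group means from Sect. 1.1
means = {1: [10.0, 10.0, 10.0],
         2: [5.5, 5.5, 5.5],
         3: [7.5, 7.5, 7.5]}

# Common variance-covariance matrix shared by all three groups
sigma = np.array([[25.0,  7.5, 22.5],
                  [ 7.5, 25.0, 15.0],
                  [22.5, 15.0, 25.0]])

# Draw 1,000 trivariate normal observations per group (sample size assumed)
z1 = {g: rng.multivariate_normal(mu, sigma, size=1000)
      for g, mu in means.items()}

print(z1[1].mean(axis=0))  # should be close to (10.0, 10.0, 10.0)
```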

1.2 Bivariate Bernoulli (z2)

Let \(z = (Z_{1j}, Z_{2j})\) be the bivariate Bernoulli variables for group j, where \(P(Z_{1j} = 1) = p_{1j}\), \(P(Z_{2j} = 1) = p_{2j}\), and \(\rho_{j}\) is the correlation coefficient. For Group 1, \(p_{11} = 0.8\), \(p_{21} = 0.7\), and \(\rho_{1} = 0.2\). For Group 2, \(p_{12} = 0.5\), \(p_{22} = 0.55\), and \(\rho_{2} = 0.4\). For Group 3, \(p_{13} = 0.425\), \(p_{23} = 0.4\), and \(\rho_{3} = 0.6\).
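
Correlated Bernoulli pairs with these marginals and correlations can be drawn from their joint cell probabilities, since \(\rho = \bigl(P(Z_{1}=1, Z_{2}=1) - p_{1}p_{2}\bigr)/\sqrt{p_{1}(1-p_{1})\,p_{2}(1-p_{2})}\). The sketch below is illustrative only (not the authors' code; numpy, the seed, and the sample size are assumptions):

```python
import numpy as np

def bivariate_bernoulli(p1, p2, rho, size, rng):
    # Joint probability of (1, 1) recovered from the correlation identity
    p_joint = p1 * p2 + rho * np.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    # Cell probabilities for (Z1, Z2) in the order (1,1), (1,0), (0,1), (0,0)
    cells = np.array([p_joint, p1 - p_joint, p2 - p_joint,
                      1 - p1 - p2 + p_joint])
    outcomes = np.array([(1, 1), (1, 0), (0, 1), (0, 0)])
    idx = rng.choice(4, size=size, p=cells)
    return outcomes[idx]

rng = np.random.default_rng(0)  # assumed seed
groups = {1: (0.80, 0.70, 0.2), 2: (0.50, 0.55, 0.4), 3: (0.425, 0.40, 0.6)}
z2 = {g: bivariate_bernoulli(*params, size=1000, rng=rng)
      for g, params in groups.items()}

print(z2[1].mean(axis=0))  # roughly (0.8, 0.7)
```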

1.3 Weibull (z3, z4)

For Weibull variable i, let \(\alpha_{ij}\) be the shape parameter for group j and \(\beta_{ij}\) be the scale parameter. For this example, the parameters for the first Weibull variable z3 are \(\alpha_{11} = 4.0\), \(\beta_{11} = 1.0\); \(\alpha_{12} = 1.5\), \(\beta_{12} = 1.0\); and \(\alpha_{13} = 2.0\), \(\beta_{13} = 1.0\). For the second Weibull variable z4, \(\alpha_{21} = 0.35\), \(\beta_{21} = 1.0\); \(\alpha_{22} = 0.55\), \(\beta_{22} = 1.0\); and \(\alpha_{23} = 0.6\), \(\beta_{23} = 1.0\). Therefore, z3 has a concave density function and z4 a convex one.
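
A minimal sketch of sampling the two Weibull variables with these shape and scale parameters (not the authors' code; numpy, the seed, and the sample size are assumptions). numpy's weibull draw uses unit scale, so the scale parameter is applied by multiplication:

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed

# Shape parameters alpha_ij by group; all scale parameters beta_ij = 1.0
shapes_z3 = {1: 4.0, 2: 1.5, 3: 2.0}
shapes_z4 = {1: 0.35, 2: 0.55, 3: 0.6}
beta = 1.0

# rng.weibull(a) samples a standard Weibull with shape a and unit scale
z3 = {g: beta * rng.weibull(a, size=1000) for g, a in shapes_z3.items()}
z4 = {g: beta * rng.weibull(a, size=1000) for g, a in shapes_z4.items()}
```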

1.4 Binomial (z5)

Let T be the number of Bernoulli random variables, indexed \(t = 1, 2, \ldots, T\), composing the binomial random variate, and let \(p_{j}\) be the probability that each Bernoulli random variable for group j equals 1. Then \(\mu_{j} = Tp_{j}\) and \(\sigma^{2}_{j} = Tp_{j}(1 - p_{j})\). For this example, \(p_{1} = 0.5\), \(p_{2} = 0.3\), \(p_{3} = 0.7\), and T = 10.
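
A minimal sketch of the binomial variable z5 and a check of its moments against \(Tp_{j}\) and \(Tp_{j}(1 - p_{j})\) (not the authors' code; numpy, the seed, and the sample size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed

T = 10
p = {1: 0.5, 2: 0.3, 3: 0.7}

# Each draw is the sum of T independent Bernoulli trials with probability p_j
z5 = {g: rng.binomial(T, pj, size=1000) for g, pj in p.items()}

for g, pj in p.items():
    # sample mean/variance vs. theoretical T*p_j and T*p_j*(1 - p_j)
    print(g, z5[g].mean(), T * pj, z5[g].var(), T * pj * (1 - pj))
```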


Cite this article

Kline, D.M., Berardi, V.L. Revisiting squared-error and cross-entropy functions for training neural network classifiers. Neural Comput & Applic 14, 310–318 (2005). https://doi.org/10.1007/s00521-005-0467-y
