Elsevier

Neurocomputing

Volume 73, Issues 1–3, December 2009, Pages 151-159

Theoretical analysis of batch and on-line training for gradient descent learning in neural networks

https://doi.org/10.1016/j.neucom.2009.05.017

Abstract

In this study, we theoretically analyze two essential training schemes for gradient descent learning in neural networks: batch and on-line training. The convergence properties of the two schemes applied to quadratic loss functions are analytically investigated. We quantify the convergence of each training scheme to the optimal weight using the absolute value of the expected difference (Measure 1) and the expected squared difference (Measure 2) between the optimal weight and the weight computed by the scheme. Although on-line training has several advantages over batch training with respect to the first measure, it does not converge to the optimal weight with respect to the second measure if the variance of the per-instance gradient remains constant. However, if the variance decays exponentially, then on-line training converges to the optimal weight with respect to Measure 2. Our analysis reveals the exact degrees to which the training set size, the variance of the per-instance gradient, and the learning rate affect the rate of convergence for each scheme.

Section snippets

Introduction and preliminaries

There are two essential training schemes for gradient descent learning in neural networks: batch training and on-line training. On-line training has also been referred to as pattern update (e.g., Atiya and Parlos [2]), sequential mode (e.g., Bishop [6], Haykin [12]), incremental learning (e.g., Bertsekas and Tsitsiklis [5], Hassoun [11], Sarle and Cary [20]), revision by case (Weiss and Kulikowski [22]), revision by pattern (Weiss and Kulikowski [22]), and sample-by-sample training (e.g.,

Framework of theoretical analysis

In order to theoretically compare the batch and on-line training schemes, we need to quantitatively analyze (3) and (5), which we write again for convenience:
$$\text{(3)}\quad W_{t+1,0} = W_{t,0} - r\,G(W_{t,0}),$$
$$\text{(5)}\quad W_{t,j+1} = W_{t,j} - r_{tN+j}\,g(X_{t,j+1}, W_{t,j}).$$
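To make the two schemes concrete, a minimal sketch is given below; it is not part of the original article. It treats the weight as a scalar, uses a constant learning rate r for batch training and, purely for illustration, a constant per-instance rate r/N for on-line training; the gradient functions G and g and the noisy per-instance gradient are placeholders chosen only for this example.

```python
import random

def batch_epoch(W, r, G):
    """One batch epoch, cf. Eq. (3): a single update with the full gradient G."""
    return W - r * G(W)

def online_epoch(W, r_step, g, instances):
    """One on-line epoch, cf. Eq. (5): one update per training instance,
    using the per-instance gradient g at the current weight."""
    for X in instances:
        W = W - r_step * g(X, W)
    return W

if __name__ == "__main__":
    rng = random.Random(0)
    N, r, W0 = 10, 0.5, 1.0
    data = [rng.gauss(0.0, 0.2) for _ in range(N)]   # stand-in "training instances"
    G = lambda W: W                                  # batch gradient of L(W) = W^2 / 2
    g = lambda X, W: W + X                           # illustrative noisy per-instance gradient
    print("batch  :", batch_epoch(W0, r, G))
    print("on-line:", online_epoch(W0, r / N, g, data))  # r/N per instance (an assumption)
```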

To facilitate the exposition of our theoretical analysis, we follow Heskes and Wiegerinck [13] and assume that each element $W$ in the search space $\mathcal{W}$ is a scalar: $W \in \mathbb{R}$ for each $W \in \mathcal{W}$. Thus a single parameter is trained by the two schemes. As noted by Heskes and Wiegerinck, it is

Two training schemes applied to quadratic loss functions

In this section we rigorously compare the batch and on-line training schemes applied to quadratic loss functions. As stated earlier, we assume $W \in \mathbb{R}$; a single parameter is trained by the two schemes. We investigate the schemes applied to loss functions of the form $L(W) = a(W-b)^2$. By shifting and scaling, these loss functions can be transformed to $L(W) = \tfrac{1}{2}W^2$, and this form will be used. Thus the globally optimal weight $W^*$ is 0, and
$$G(W) = \frac{\partial L}{\partial W} = W.$$
Let $W_{0,0}$ denote the initial weight.
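Under this quadratic loss, the batch recursion (3) can be unrolled in closed form. The short derivation below is added here for readability; it is consistent with the expressions for the batch weight used in the later sections:
$$W_{t+1,0}^{(b)} = W_{t,0}^{(b)} - r\,G\big(W_{t,0}^{(b)}\big) = (1-r)\,W_{t,0}^{(b)} \quad\Longrightarrow\quad W_{t,0}^{(b)} = W_{0,0}\,(1-r)^{t}.$$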

First we analyze

Analysis of the expected difference

As described earlier, batch training is a deterministic optimization algorithm, so (13) equals the expected difference between the optimal weight and the weight computed by batch training after $t$ epochs. We derive the expectation of the difference for on-line training. From (16),
$$E\big[W_{t,n}^{(o)}\big] = W_{0,0}\left(1-\frac{r}{N}\right)^{Nt+n} - \frac{r}{N^{Nt+n}}\,E\!\left[\sum_{s=0}^{t-1}\sum_{j=1}^{N} N^{Ns+j}(N-r)^{Nt+n-(Ns+j)}\,Y_{s,j} + \sum_{j=1}^{n} N^{Nt+j}(N-r)^{n-j}\,Y_{t,j}\right]$$
$$= W_{0,0}\left(1-\frac{r}{N}\right)^{Nt+n} - \frac{r}{N^{Nt+n}}\left[\sum_{s=0}^{t-1}\sum_{j=1}^{N} N^{Ns+j}(N-r)^{Nt+n-(Ns+j)}\,E[Y_{s,j}] + \sum_{j=1}^{n} N^{Nt+j}(N-r)^{n-j}\,E[Y_{t,j}]\right]$$
$$= W_{0,0}\left(1-\frac{r}{N}\right)^{Nt+n},$$
where the last equality
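As an informal numerical check not contained in the paper, the closed form $E[W_{t,n}^{(o)}] = W_{0,0}(1-r/N)^{Nt+n}$ can be reproduced by simulation. The sketch below assumes the per-step recursion implied by the expansion above, namely $W_{k+1} = (1-r/N)W_k - r\,Y_{k+1}$ with i.i.d. zero-mean noise $Y$; the Gaussian noise model and all numerical settings are assumptions made only for this illustration.

```python
import random

def simulate_online_mean(W0=1.0, r=0.5, N=10, epochs=5, sigma=0.2, runs=20000, seed=0):
    """Monte Carlo estimate of E[W_{t,0}] for the assumed on-line recursion
    W <- (1 - r/N) * W - r * Y, with Y ~ Normal(0, sigma^2) drawn per instance."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        W = W0
        for _ in range(epochs * N):          # N per-instance updates per epoch
            Y = rng.gauss(0.0, sigma)
            W = (1.0 - r / N) * W - r * Y
        total += W
    return total / runs

if __name__ == "__main__":
    W0, r, N, t = 1.0, 0.5, 10, 5
    print("simulated  :", simulate_online_mean(W0, r, N, t))
    print("closed form:", W0 * (1.0 - r / N) ** (N * t))
```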

Analysis of the expected squared difference

In this section, we quantitatively compare the two training schemes with regard to Measure 2, the expected squared difference between the optimal weight $W^* = 0$ and the weight computed by the training scheme. The analysis of the batch training scheme is simple; since it is a deterministic optimization algorithm, it follows from (13) that
$$E\big[(W_{t,0}^{(b)} - W^*)^2\big] = \big(W_{t,0}^{(b)}\big)^2 = W_{0,0}^2\,(1-r)^{2t}.$$
Thus batch training converges to $W^*$ with regard to Measure 2 provided $r < 2$ (recall that $r > 0$).
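To spell out the condition: the squared error decays geometrically, so
$$W_{0,0}^2\,(1-r)^{2t} \to 0 \ \text{as}\ t \to \infty \iff |1-r| < 1 \iff 0 < r < 2.$$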

Regarding on-line training, it

Variances associated with on-line training

The expected squared difference $E[(W_{t,0}^{(o)} - W^*)^2]$ analyzed in Section 5 is closely related to the variance of the weight $W_{t,0}^{(o)}$ computed by the on-line training scheme. If we assume (27) (i.e., the variance of the random per-instance gradient remains constant), then it follows from the derivations described in Section 4 (Analysis of the expected difference) and Section 5.1 (Convergence of on-line training with constant per-instance variance) that
$$\mathrm{Var}\big(W_{t,0}^{(o)}\big) = \begin{cases} \sigma^2 N^2 & \text{if } r = N, \\[2pt] 4t\sigma^2 N^3 & \text{if } r = 2N, \\[2pt] \dfrac{r\sigma^2 N^2\Big[1-\big(1-\tfrac{r}{N}\big)^{2Nt}\Big]}{2N-r} & \text{otherwise.} \end{cases}$$
This
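The closed form above can be sanity-checked numerically. The sketch below is not from the paper: it assumes the same per-step recursion as before, $W_{k+1} = (1-r/N)W_k - r\,Y_{k+1}$ with $\mathrm{Var}(Y) = \sigma^2$, which is one reading of the derivations summarized here, and compares the sample variance of $W_{t,0}^{(o)}$ with the displayed expression; the noise distribution and parameter values are illustrative assumptions.

```python
import random

def var_online_mc(r, N, t, sigma, W0=1.0, runs=20000, seed=1):
    """Sample variance of W_{t,0} under the assumed recursion
    W <- (1 - r/N) * W - r * Y, with Y ~ Normal(0, sigma^2)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(runs):
        W = W0
        for _ in range(N * t):
            W = (1.0 - r / N) * W - r * rng.gauss(0.0, sigma)
        samples.append(W)
    mean = sum(samples) / runs
    return sum((w - mean) ** 2 for w in samples) / (runs - 1)

def var_online_closed(r, N, t, sigma):
    """Closed-form variance quoted in the text (the r = N case is covered
    by the general expression; r = 2N needs the separate formula)."""
    if r == 2 * N:
        return 4 * t * sigma**2 * N**3
    return r * sigma**2 * N**2 * (1 - (1 - r / N) ** (2 * N * t)) / (2 * N - r)

if __name__ == "__main__":
    r, N, t, sigma = 0.5, 10, 5, 0.2
    print("Monte Carlo :", var_online_mc(r, N, t, sigma))
    print("closed form :", var_online_closed(r, N, t, sigma))
```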

Discussion

Our quantitative analysis shows that batch training has several advantages over on-line training when loss functions are quadratic. The analysis described in Section 4 shows that if the training set size is sufficiently large, then with regard to Measure 1, batch training converges faster to the globally optimal weight than on-line training provided that the learning rate is less than approximately 1.2785. The analysis described in Section 5.1 shows that with respect to Measure 2, batch
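The threshold of approximately 1.2785 quoted above can be recovered numerically under one natural reading of the Measure 1 results: per epoch, batch training contracts the expected difference by a factor $|1-r|$, while on-line training contracts it by $(1-r/N)^N \to e^{-r}$ as $N \to \infty$; batch training is then faster whenever $|1-r| < e^{-r}$, and for $r > 1$ the crossover solves $r - 1 = e^{-r}$. The bisection sketch below is illustrative only.

```python
import math

def crossover_rate(lo=1.0, hi=2.0, tol=1e-10):
    """Bisection for the r > 1 root of f(r) = (r - 1) - exp(-r), i.e. |1 - r| = e^{-r}."""
    f = lambda r: (r - 1.0) - math.exp(-r)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    print(f"crossover learning rate: {crossover_rate():.4f}")   # about 1.2785
```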

References (23)

  • J.F.C. Khaw et al., Optimal design of neural networks using the Taguchi method, Neurocomputing (1995)
  • D.R. Wilson et al., The general inefficiency of batch training for gradient descent learning, Neural Networks (2003)
  • M. Anthony et al., Neural Network Learning: Theoretical Foundations (1999)
  • A.F. Atiya et al., New results on recurrent network training: unifying the algorithms and accelerating convergence, IEEE Transactions on Neural Networks (2000)
  • S. Becker, Y. LeCun, Improving the convergence of backpropagation learning with second order methods, in: Proceedings...
  • Y. Bengio, Neural Networks for Speech and Sequence Recognition (1996)
  • D.P. Bertsekas et al., Neuro-Dynamic Programming (1996)
  • C.M. Bishop, Neural Networks for Pattern Recognition (1997)
  • L. Bottou et al., Speaker independent isolated digit recognition: multi-layer perceptrons vs. dynamic time-warping, IEEE Transactions on Neural Networks (2000)
  • C.C. Chuang et al., Robust support vector regression networks for function approximation with outliers, IEEE Transactions on Neural Networks (2002)
  • H. Demuth et al., Neural Network Toolbox User's Guide (1994)
Cited by (48)

    • More intelligent and robust estimation of battery state-of-charge with an improved regularized extreme learning machine

      2021, Engineering Applications of Artificial Intelligence
      Citation excerpt:

      The training process of NNs is to optimize the weights and bias iteratively based on the principle of minimizing the loss function. The commonly used optimization algorithm is the gradient descent (GD) algorithm (Gan et al., 2020; Jiao and Wang, 2021; Takéhiko, 2009). However, when using the GD algorithm for network training, it often takes a long time to obtain the optimized weights and biases due to implementing high complexity and excessive amount of gradient calculation in each iteration.

    • Deterministic convergence of complex mini-batch gradient learning algorithm for fully complex-valued neural networks

      2020, Neurocomputing
      Citation excerpt:

      Gradient training method (GTM) and its variants have been the backbone for training multilayer feedforward neural networks since the backpropagation algorithm (BPA) was proposed [1], and their effectiveness has been further verified in a recent remarkable progress of neural network research, where the deep neural networks [2] were successfully trained with the usual BPA. There are three practical modes to implement the backpropagation algorithm [3]: batch mode, online mode, and mini-batch mode. In order to obtain the accurate gradient direction, the batch mode accumulates the weight correction over all the training samples before performing the update.


    Takéhiko Nakama is currently enrolled in the Ph.D. program in Applied Mathematics and Statistics at The Johns Hopkins University in Baltimore, Maryland, USA. He completed his first Ph.D. program in 2003 by conducting neurophysiological research at The Johns Hopkins Krieger Mind/Brain Institute. He also received an M.S.E. in Mathematical Sciences from Hopkins in 2003. His research interests include stochastic processes (Markov chains in particular), analysis of algorithms, stochastic optimization (evolutionary computation in particular), neural networks, and information theory.
