Elsevier

Neurocomputing

Volume 73, Issues 1–3, December 2009, Pages 151-159

Theoretical analysis of batch and on-line training for gradient descent learning in neural networks

https://doi.org/10.1016/j.neucom.2009.05.017

Abstract

In this study, we theoretically analyze two essential training schemes for gradient descent learning in neural networks: batch and on-line training. The convergence properties of the two schemes applied to quadratic loss functions are analytically investigated. We quantify the convergence of each training scheme to the optimal weight using the absolute value of the expected difference (Measure 1) and the expected squared difference (Measure 2) between the optimal weight and the weight computed by the scheme. Although on-line training has several advantages over batch training with respect to the first measure, it does not converge to the optimal weight with respect to the second measure if the variance of the per-instance gradient remains constant. However, if the variance decays exponentially, then on-line training converges to the optimal weight with respect to Measure 2. Our analysis reveals the exact degrees to which the training set size, the variance of the per-instance gradient, and the learning rate affect the rate of convergence for each scheme.

Section snippets

Introduction and preliminaries

There are two essential training schemes for gradient descent learning in neural networks: batch training and on-line training. On-line training has also been referred to as pattern update (e.g., Atiya and Parlos [2]), sequential mode (e.g., Bishop [6], Haykin [12]), incremental learning (e.g., Bertsekas and Tsitsiklis [5], Hassoun [11], Sarle and Cary [20]), revision by case (Weiss and Kulikowski [22]), revision by pattern (Weiss and Kulikowski [22]), and sample-by-sample training (e.g.,

Framework of theoretical analysis

In order to theoretically compare the batch and on-line training schemes, we need to quantitatively analyze (3) and (5), which we write again for convenience:
$$\text{(3)}\quad W_{t+1,0} = W_{t,0} - r\,G(W_{t,0}),$$
$$\text{(5)}\quad W_{t,j+1} = W_{t,j} - r_{tN+j}\,g(X_{t,j+1}, W_{t,j}).$$
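To make the two schemes concrete, a minimal sketch is given below; it is not part of the original article. It treats the weight as a scalar, uses a constant learning rate r for batch training and, purely for illustration, a constant per-instance rate r/N for on-line training; the gradient functions G and g and the noisy per-instance gradient are placeholders chosen only for this example.

```python
import random

def batch_epoch(W, r, G):
    """One batch epoch, cf. Eq. (3): a single update with the full gradient G."""
    return W - r * G(W)

def online_epoch(W, r_step, g, instances):
    """One on-line epoch, cf. Eq. (5): one update per training instance,
    using the per-instance gradient g at the current weight."""
    for X in instances:
        W = W - r_step * g(X, W)
    return W

if __name__ == "__main__":
    rng = random.Random(0)
    N, r, W0 = 10, 0.5, 1.0
    data = [rng.gauss(0.0, 0.2) for _ in range(N)]   # stand-in "training instances"
    G = lambda W: W                                  # batch gradient of L(W) = W^2 / 2
    g = lambda X, W: W + X                           # illustrative noisy per-instance gradient
    print("batch  :", batch_epoch(W0, r, G))
    print("on-line:", online_epoch(W0, r / N, g, data))  # r/N per instance (an assumption)
```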

To facilitate the exposition of our theoretical analysis, we follow Heskes and Wiegerinck [13] and assume that each element $W$ in the search space $\mathcal{W}$ is a scalar: $W \in \mathbb{R}$ for each $W \in \mathcal{W}$. Thus a single parameter is trained by the two schemes. As noted by Heskes and Wiegerinck, it is

Two training schemes applied to quadratic loss functions

In this section we rigorously compare the batch and on-line training schemes applied to quadratic loss functions. As stated earlier, we assume $W \in \mathbb{R}$; a single parameter is trained by the two schemes. We investigate the schemes applied to loss functions of the form $L(W) = a(W-b)^2$. By shifting and scaling, these loss functions can be transformed to $L(W) = \tfrac{1}{2}W^2$, and this form will be used. Thus the globally optimal weight $W^*$ is 0, and
$$G(W) = \frac{\partial L}{\partial W} = W.$$
Let $W_{0,0}$ denote the initial weight.
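Under this quadratic loss, the batch recursion (3) can be unrolled in closed form. The short derivation below is added here for readability; it is consistent with the expressions for the batch weight used in the later sections:
$$W_{t+1,0}^{(b)} = W_{t,0}^{(b)} - r\,G\big(W_{t,0}^{(b)}\big) = (1-r)\,W_{t,0}^{(b)} \quad\Longrightarrow\quad W_{t,0}^{(b)} = W_{0,0}\,(1-r)^{t}.$$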

First we analyze

Analysis of the expected difference

As described earlier, batch training is a deterministic optimization algorithm, so (13) equals the expected difference between the optimal weight and the weight computed by batch training after $t$ epochs. We derive the expectation of the difference for on-line training. From (16),
$$E\big[W_{t,n}^{(o)}\big] = W_{0,0}\left(1-\frac{r}{N}\right)^{Nt+n} - \frac{r}{N^{Nt+n}}\,E\!\left[\sum_{s=0}^{t-1}\sum_{j=1}^{N} N^{Ns+j}(N-r)^{Nt+n-(Ns+j)}\,Y_{s,j} + \sum_{j=1}^{n} N^{Nt+j}(N-r)^{n-j}\,Y_{t,j}\right]$$
$$= W_{0,0}\left(1-\frac{r}{N}\right)^{Nt+n} - \frac{r}{N^{Nt+n}}\left[\sum_{s=0}^{t-1}\sum_{j=1}^{N} N^{Ns+j}(N-r)^{Nt+n-(Ns+j)}\,E[Y_{s,j}] + \sum_{j=1}^{n} N^{Nt+j}(N-r)^{n-j}\,E[Y_{t,j}]\right]$$
$$= W_{0,0}\left(1-\frac{r}{N}\right)^{Nt+n},$$
where the last equality
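As an informal numerical check not contained in the paper, the closed form $E[W_{t,n}^{(o)}] = W_{0,0}(1-r/N)^{Nt+n}$ can be reproduced by simulation. The sketch below assumes the per-step recursion implied by the expansion above, namely $W_{k+1} = (1-r/N)W_k - r\,Y_{k+1}$ with i.i.d. zero-mean noise $Y$; the Gaussian noise model and all numerical settings are assumptions made only for this illustration.

```python
import random

def simulate_online_mean(W0=1.0, r=0.5, N=10, epochs=5, sigma=0.2, runs=20000, seed=0):
    """Monte Carlo estimate of E[W_{t,0}] for the assumed on-line recursion
    W <- (1 - r/N) * W - r * Y, with Y ~ Normal(0, sigma^2) drawn per instance."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        W = W0
        for _ in range(epochs * N):          # N per-instance updates per epoch
            Y = rng.gauss(0.0, sigma)
            W = (1.0 - r / N) * W - r * Y
        total += W
    return total / runs

if __name__ == "__main__":
    W0, r, N, t = 1.0, 0.5, 10, 5
    print("simulated  :", simulate_online_mean(W0, r, N, t))
    print("closed form:", W0 * (1.0 - r / N) ** (N * t))
```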

Analysis of the expected squared difference

In this section, we quantitatively compare the two training schemes with regard to Measure 2, the expected squared difference between the optimal weight $W^* = 0$ and the weight computed by the training scheme. The analysis of the batch training scheme is simple; since it is a deterministic optimization algorithm, it follows from (13) that
$$E\big[(W_{t,0}^{(b)} - W^*)^2\big] = \big(W_{t,0}^{(b)}\big)^2 = W_{0,0}^2\,(1-r)^{2t}.$$
Thus batch training converges to $W^*$ with regard to Measure 2 provided $r < 2$ (recall that $r > 0$).
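To spell out the condition: the squared error decays geometrically, so
$$W_{0,0}^2\,(1-r)^{2t} \to 0 \ \text{as}\ t \to \infty \iff |1-r| < 1 \iff 0 < r < 2.$$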

Regarding on-line training, it

Variances associated with on-line training

The expected squared difference $E[(W_{t,0}^{(o)} - W^*)^2]$ analyzed in Section 5 is closely related to the variance of the weight $W_{t,0}^{(o)}$ computed by the on-line training scheme. If we assume (27) (i.e., the variance of the random per-instance gradient remains constant), then it follows from the derivations described in Section 4 (Analysis of the expected difference) and Section 5.1 (Convergence of on-line training with constant per-instance variance) that
$$\mathrm{Var}\big(W_{t,0}^{(o)}\big) = \begin{cases} \sigma^2 N^2 & \text{if } r = N, \\[2pt] 4t\sigma^2 N^3 & \text{if } r = 2N, \\[2pt] \dfrac{r\sigma^2 N^2\Big[1-\big(1-\tfrac{r}{N}\big)^{2Nt}\Big]}{2N-r} & \text{otherwise.} \end{cases}$$
This
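The closed form above can be sanity-checked numerically. The sketch below is not from the paper: it assumes the same per-step recursion as before, $W_{k+1} = (1-r/N)W_k - r\,Y_{k+1}$ with $\mathrm{Var}(Y) = \sigma^2$, which is one reading of the derivations summarized here, and compares the sample variance of $W_{t,0}^{(o)}$ with the displayed expression; the noise distribution and parameter values are illustrative assumptions.

```python
import random

def var_online_mc(r, N, t, sigma, W0=1.0, runs=20000, seed=1):
    """Sample variance of W_{t,0} under the assumed recursion
    W <- (1 - r/N) * W - r * Y, with Y ~ Normal(0, sigma^2)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(runs):
        W = W0
        for _ in range(N * t):
            W = (1.0 - r / N) * W - r * rng.gauss(0.0, sigma)
        samples.append(W)
    mean = sum(samples) / runs
    return sum((w - mean) ** 2 for w in samples) / (runs - 1)

def var_online_closed(r, N, t, sigma):
    """Closed-form variance quoted in the text (the r = N case is covered
    by the general expression; r = 2N needs the separate formula)."""
    if r == 2 * N:
        return 4 * t * sigma**2 * N**3
    return r * sigma**2 * N**2 * (1 - (1 - r / N) ** (2 * N * t)) / (2 * N - r)

if __name__ == "__main__":
    r, N, t, sigma = 0.5, 10, 5, 0.2
    print("Monte Carlo :", var_online_mc(r, N, t, sigma))
    print("closed form :", var_online_closed(r, N, t, sigma))
```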

Discussion

Our quantitative analysis shows that batch training has several advantages over on-line training when loss functions are quadratic. The analysis described in Section 4 shows that if the training set size is sufficiently large, then with regard to Measure 1, batch training converges faster to the globally optimal weight than on-line training provided that the learning rate is less than approximately 1.2785. The analysis described in Section 5.1 shows that with respect to Measure 2, batch
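The threshold of approximately 1.2785 quoted above can be recovered numerically under one natural reading of the Measure 1 results: per epoch, batch training contracts the expected difference by a factor $|1-r|$, while on-line training contracts it by $(1-r/N)^N \to e^{-r}$ as $N \to \infty$; batch training is then faster whenever $|1-r| < e^{-r}$, and for $r > 1$ the crossover solves $r - 1 = e^{-r}$. The bisection sketch below is illustrative only.

```python
import math

def crossover_rate(lo=1.0, hi=2.0, tol=1e-10):
    """Bisection for the r > 1 root of f(r) = (r - 1) - exp(-r), i.e. |1 - r| = e^{-r}."""
    f = lambda r: (r - 1.0) - math.exp(-r)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    print(f"crossover learning rate: {crossover_rate():.4f}")   # about 1.2785
```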

References (23)

  • J.F.C. Khaw et al., Optimal design of neural networks using the Taguchi method, Neurocomputing (1995)
  • D.R. Wilson et al., The general inefficiency of batch training for gradient descent learning, Neural Networks (2003)
  • M. Anthony et al., Neural Network Learning: Theoretical Foundations (1999)
  • A.F. Atiya et al., New results on recurrent network training: unifying the algorithms and accelerating convergence, IEEE Transactions on Neural Networks (2000)
  • S. Becker, Y. LeCun, Improving the convergence of backpropagation learning with second order methods, in: Proceedings...
  • Y. Bengio, Neural Networks for Speech and Sequence Recognition (1996)
  • D.P. Bertsekas et al., Neuro-Dynamic Programming (1996)
  • C.M. Bishop, Neural Networks for Pattern Recognition (1997)
  • L. Bottou et al., Speaker independent isolated digit recognition: multi-layer perceptrons vs. dynamic time-warping, IEEE Transactions on Neural Networks (2000)
  • C.C. Chuang et al., Robust support vector regression networks for function approximation with outliers, IEEE Transactions on Neural Networks (2002)
  • H. Demuth et al., Neural Network Toolbox User's Guide (1994)
Cited by (48)

    • More intelligent and robust estimation of battery state-of-charge with an improved regularized extreme learning machine

      2021, Engineering Applications of Artificial Intelligence
      Citation excerpt:

      The training process of NNs is to optimize the weights and bias iteratively based on the principle of minimizing the loss function. The commonly used optimization algorithm is the gradient descent (GD) algorithm (Gan et al., 2020; Jiao and Wang, 2021; Takéhiko, 2009). However, when using the GD algorithm for network training, it often takes a long time to obtain the optimized weights and biases due to implementing high complexity and excessive amount of gradient calculation in each iteration.

    • Deterministic convergence of complex mini-batch gradient learning algorithm for fully complex-valued neural networks

      2020, Neurocomputing
      Citation excerpt:

      Gradient training method (GTM) and its variants have been the backbone for training multilayer feedforward neural networks since the backpropagation algorithm (BPA) was proposed [1], and their effectiveness has been further verified in a recent remarkable progress of neural network research, where the deep neural networks [2] were successfully trained with the usual BPA. There are three practical modes to implement the backpropagation algorithm [3]: batch mode, online mode, and mini-batch mode. In order to obtain the accurate gradient direction, the batch mode accumulates the weight correction over all the training samples before performing the update.


    Takéhiko Nakama is currently enrolled in the Ph.D. program in Applied Mathematics and Statistics at The Johns Hopkins University in Baltimore, Maryland, USA. He completed his first Ph.D. program in 2003 by conducting neurophysiological research at The Johns Hopkins Krieger Mind/Brain Institute. He also received an M.S.E. in Mathematical Sciences from Hopkins in 2003. His research interests include stochastic processes (Markov chains in particular), analysis of algorithms, stochastic optimization (evolutionary computation in particular), neural networks, and information theory.
