An online gradient method with momentum for two-layer feedforward neural networks

https://doi.org/10.1016/j.amc.2009.02.038

Abstract

An online gradient method with momentum for two-layer feedforward neural networks is considered. The momentum coefficient is chosen in an adaptive manner to accelerate and stabilize the learning procedure of the network weights. Corresponding convergence results are proved: a weak convergence result is established under a uniform boundedness assumption on the activation function and its derivatives, and, if the stationary point set of the error function contains only finitely many elements, a strong convergence result holds as well.

Introduction

Feedforward neural networks (FNN) are widely used in applications and are often trained by the gradient method [3], [4], [6], [7], [17], [18]; as a simple example, the convergence of the gradient method for two-layer feedforward neural networks is discussed in [6], [8], [15], [16]. To speed up and stabilize the training procedure of the gradient method, a momentum term [12], [13] is often added to the weight increment formula, so that the present weight updating increment is a combination of the present gradient of the error function and the previous weight updating increment. Many researchers have developed the theory of momentum and extended its applications; see, e.g., [1], [2], [5], [9], [10], [11], [14], [20], [23].

In [21], some convergence results are given for a two-layer feedforward neural network trained in batch mode. These results are of a global nature in that they are valid for arbitrarily given initial values of the weights. The key to the convergence analysis is the monotonicity of the error function during the learning procedure, which is proved under a uniform boundedness assumption on the activation function and its derivatives. In [22], we consider an online gradient method with momentum (OGM for short) for a two-layer feedforward neural network and obtain both weak and strong convergence results. However, in [21], [22], in order to obtain strong convergence we assume that the error function is uniformly convex, which is a rather restrictive assumption. Moreover, in [22] we always assume that the training examples are linearly independent. The linear independence assumption is satisfied in some practical models, but it requires the dimension n of the training examples to be greater than the number J of training examples. If J is very large, e.g. greater than n, the linear independence assumption cannot be satisfied.

In this paper, we consider an OGM for a two-layer feedforward neural network without assuming that the training examples are linearly independent. We also discuss strong convergence of the OGM without assuming that the error function is uniformly convex.

The rest of the paper is organized as follows. In Section 2 we introduce the online gradient method with momentum, discuss its convergence conditions, and state the corresponding weak and strong convergence results. Section 3 is devoted to proving Theorem 2.4, the main result of this paper. Finally, some conclusions are drawn in Section 4.

Section snippets

OGM and its convergence

For a given set of training examples $\{\xi^j, O^j\}_{j=1}^{J} \subset \mathbb{R}^n \times \mathbb{R}$, we describe the neural network approximation problem as follows. Let $g: \mathbb{R} \to \mathbb{R}$ be a given smooth activation function. For a choice of the weight vector $w \in \mathbb{R}^n$, the actual output of the neural network is
$$\zeta^j = g(w \cdot \xi^j), \quad j = 1, \ldots, J,$$
where $w \cdot \xi^j$ denotes the inner product. Our task is to choose the weight $w$ such that the difference $|O^j - \zeta^j|$ is as small as possible. A simple and popular approach is to minimize the quadratic error function
$$E(w) := \frac{1}{2} \sum_{j=1}^{J} \bigl(O^j - \zeta^j\bigr)^2 = \sum_{j=1}^{J} g_j(w \cdot \xi^j), \qquad g_j(t) := \frac{1}{2}\,\bigl(O^j - g(t)\bigr)^2, \quad \ldots$$
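To make the setting concrete, the following is a minimal numerical sketch of the error function $E(w)$ and one online pass with a momentum term. It is not the method analyzed in this paper: the adaptive momentum coefficients $\tau_{m,k}$ are replaced by a fixed illustrative value `tau`, and the `tanh` activation, learning rate and synthetic data are assumptions made only for demonstration.

```python
# Minimal sketch of an online gradient pass with a momentum term.
# The fixed momentum coefficient tau, the tanh activation, and the
# synthetic data are illustrative assumptions, not the paper's scheme.
import numpy as np

rng = np.random.default_rng(0)

def g(t):                       # activation function (smooth, bounded derivatives)
    return np.tanh(t)

def g_prime(t):
    return 1.0 - np.tanh(t) ** 2

# Training examples {(xi_j, O_j)}, j = 1..J, in R^n x R (synthetic).
n, J = 5, 20
xi = rng.standard_normal((J, n))
O = g(xi @ rng.standard_normal(n))          # targets from a hidden "true" weight

def E(w):                        # E(w) = 1/2 * sum_j (O_j - g(w . xi_j))^2
    return 0.5 * np.sum((O - g(xi @ w)) ** 2)

eta, tau = 0.05, 0.3             # learning rate and (illustrative) momentum coefficient
w = rng.standard_normal(n)
delta_w = np.zeros(n)

for epoch in range(200):         # one epoch = one online pass through all J examples
    for j in range(J):
        grad_j = -(O[j] - g(w @ xi[j])) * g_prime(w @ xi[j]) * xi[j]
        delta_w = -eta * grad_j + tau * delta_w    # gradient step plus momentum term
        w = w + delta_w

print("final error E(w) =", E(w))
```

The weights are updated example by example (online learning) rather than once per epoch (batch learning), which is exactly the distinction drawn between [21] and the present paper.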

Proof of Theorem 2.4

To prove Theorem 2.4, as well as Lemma 2.3, we also need the following preliminary lemmas.

Using Taylor’s formula, we expand $g_j(w^{(m+1)J} \cdot \xi^j)$ at $w^{mJ} \cdot \xi^j$ as
$$
g_j(w^{(m+1)J} \cdot \xi^j) = g_j(w^{mJ} \cdot \xi^j) + g_j'(w^{mJ} \cdot \xi^j)\,(w^{(m+1)J} - w^{mJ}) \cdot \xi^j + \frac{1}{2}\, g_j''(t_{m,j})\,\bigl[(w^{(m+1)J} - w^{mJ}) \cdot \xi^j\bigr]^2
= g_j(w^{mJ} \cdot \xi^j) + g_j'(w^{mJ} \cdot \xi^j)\,(w^{(m+1)J} - w^{mJ}) \cdot \xi^j + \rho_{m,j},
$$
where $t_{m,j}$ lies between $w^{mJ} \cdot \xi^j$ and $w^{(m+1)J} \cdot \xi^j$, and
$$
\rho_{m,j} = \frac{1}{2}\, g_j''(t_{m,j})\,\bigl[(w^{(m+1)J} - w^{mJ}) \cdot \xi^j\bigr]^2.
$$
From (2.6), (3.1) we get
$$
E(w^{(m+1)J}) - E(w^{mJ}) = \sum_{j=1}^{J} g_j'(w^{mJ} \cdot \xi^j)\,(w^{(m+1)J} - w^{mJ}) \cdot \xi^j + \sum_{j=1}^{J} \rho_{m,j}.
$$
Noticing
$$
w^{(m+1)J} - w^{mJ} = \sum_{k=1}^{J} \Delta w^{mJ+k} = \sum_{k=1}^{J} \bigl(\tau_{m,k}\, \Delta w \cdots
$$
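As a complement to the argument (and not as part of the proof), the following sketch numerically checks the Taylor decomposition used above for a single term: the increment $g_j(b) - g_j(a)$ equals $g_j'(a)(b - a)$ plus a remainder $\rho$ bounded by $\tfrac{1}{2}\sup|g_j''|\,(b - a)^2$, which is the kind of estimate the convergence analysis relies on. The concrete activation $g = \tanh$, the target value, and the interval endpoints are illustrative assumptions.

```python
# Numerical sanity check of the Taylor remainder bound for g_j(t) = 1/2*(O_j - g(t))^2.
# The activation, target O_j, and the points a, b are illustrative assumptions.
import numpy as np

def g(t):  return np.tanh(t)
def gp(t): return 1.0 - np.tanh(t) ** 2

O_j = 0.7
def g_j(t):   return 0.5 * (O_j - g(t)) ** 2
def g_j_p(t): return -(O_j - g(t)) * gp(t)          # first derivative of g_j

a, b = 0.4, 0.45                                    # stand-ins for w^{mJ}.xi_j and w^{(m+1)J}.xi_j
rho = g_j(b) - g_j(a) - g_j_p(a) * (b - a)          # exact Taylor remainder

# Crude upper bound for sup|g_j''| on [a, b] by sampling (enough for a sanity check).
ts = np.linspace(a, b, 1000)
h = 1e-5
g_j_pp = (g_j_p(ts + h) - g_j_p(ts - h)) / (2 * h)  # numerical second derivative
bound = 0.5 * np.max(np.abs(g_j_pp)) * (b - a) ** 2

print(f"remainder rho = {rho:.3e},  bound = {bound:.3e},  |rho| <= bound: {abs(rho) <= bound}")
```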

Conclusions

In this paper, we consider an online gradient method with momentum for two-layer feedforward neural networks. The momentum coefficient is chosen in an adaptive manner to accelerate and stabilize the learning procedure of the network weights. We do not require the training examples to be linearly independent, and we give up the assumption that the error function is uniformly convex, which is rather restrictive in the literature. With the assumption that the activation function and its derivatives |g(t) …

Acknowledgement

The author would like to thank Professor Wei Wu for his many valuable suggestions on the topic of this paper, and also thanks the anonymous referees for their valuable comments and suggestions on the revision of this paper.

References (23)

  • Z. Luo, On the convergence of the LMS algorithm with adaptive learning rate for linear feedforward networks, Neural Computation (1991).

    This work is supported by Zhejiang Provincial Natural Science Foundation of China under Grant No. Y606009.
