Deterministic convergence of complex mini-batch gradient learning algorithm for fully complex-valued neural networks
Introduction
The gradient training method (GTM) and its variants have been the backbone for training multilayer feedforward neural networks since the backpropagation algorithm (BPA) was proposed [1], and their effectiveness has been further verified by the recent remarkable progress in neural network research, where deep neural networks [2] have been successfully trained with the standard BPA. There are three practical modes for implementing the backpropagation algorithm [3]: batch mode, online mode, and mini-batch mode. To obtain the exact gradient direction, the batch mode accumulates the weight corrections over all training samples before performing an update. The online mode, in contrast, uses an approximate gradient direction and updates the network weights immediately after each training sample is presented. The mini-batch mode is a hybrid of the batch and online approaches: during each iteration, it computes the gradient over a block of samples and updates the weights after these samples are presented. It has been shown that the online mode and the mini-batch mode with a small batch (block) size usually enjoy faster training and attain higher generalization accuracy than the batch mode, especially on large training sets [4]. The mini-batch gradient method has become a popular learning method for deep networks [5].
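The three modes differ only in how many samples contribute to each weight update. A minimal real-valued sketch (the function name, the toy least-squares gradient, and the data below are illustrative, not from the paper):

```python
import numpy as np

def gradient_descent(X, y, grad_fn, w, lr=0.1, batch_size=None, epochs=10):
    """Generic gradient training loop; batch_size selects the mode:
    len(X) (or None) -> batch mode, 1 -> online mode, else mini-batch mode."""
    n = len(X)
    b = n if batch_size is None else batch_size
    for _ in range(epochs):
        for start in range(0, n, b):
            Xb, yb = X[start:start + b], y[start:start + b]
            # gradient over the current block, then one weight update
            w = w - lr * grad_fn(Xb, yb, w)
    return w
```

With `batch_size=len(X)` this reduces to batch mode and with `batch_size=1` to online mode; intermediate block sizes give the mini-batch mode.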
The convergence of GTM can be affected by many factors, such as the learning mode, the learning rate, the activation functions, and the initial weights. For real applications, it is necessary and interesting to clarify under which conditions the convergence of GTM can be guaranteed. The batch GTM essentially corresponds to the standard gradient descent method; thus its convergence is of a deterministic nature and can be guaranteed by classical optimization theory. Online GTM and mini-batch GTM, on the other hand, behave in a stochastic manner, and their convergence is mostly shown to be of a probabilistic nature [6], [7], [8]. This probabilistic nature may cause learning to fail. Thus, it is desirable for online GTM and mini-batch GTM to also enjoy deterministic convergence. Recently, great efforts have been devoted to the deterministic convergence of online GTM, and fruitful theoretical results have been obtained [9], [10], [11], [12], [13]. However, a deterministic convergence analysis for mini-batch GTM is still lacking.
In recent decades, complex-valued neural networks (CVNNs) have been an interesting research topic for their powerful ability in application areas such as telecommunications, speech recognition, image processing, series forecasting, and others, where the data to be processed are complex-valued [14], [15], [16], [17], [18], [19], [20], [21]. Complex gradient training methods, which extend GTM from the real domain to the complex domain, have been developed to train CVNNs. The mathematical difficulties of this extension stem mainly from Liouville's theorem (an entire and bounded function in the complex domain must be constant) and the Cauchy-Riemann conditions required for a function to be entire. Liouville's theorem leads to a conflict between the boundedness and the differentiability of the activation function (AF), while the Cauchy-Riemann conditions make differentiability in the complex domain quite stringent. As a result, for CVNNs, the choice of the activation function and the derivation of gradient-based training methods are more complicated than their counterparts for real-valued neural networks (RVNNs).
CVNNs can be equipped with two types of AFs. The first type is the split-complex AF [22], where two real-valued functions separately process the real part and the imaginary part of the signal fed to the neuron. The other type is the fully complex AF [23]. Though split-complex AFs can avoid the occurrence of singular points, algorithms employing fully complex activation functions tend to yield better performance, especially in the common case where the complex signal is strongly correlated [24]. It has been shown that fully complex CVNNs with continuous elementary transcendental functions (ETFs) as AFs are universal approximators of any continuous complex mapping over a compact set in the complex domain [23], which makes ETFs popular as AFs in the CVNN research community.
The derivation of complex gradient-based training methods can be traced back to the complex least mean square (CLMS) algorithm introduced in [25]. Later, the split-complex backpropagation (BP) training algorithm for CVNNs with split-complex AFs and the fully complex BP training algorithm [26], [27] for CVNNs with fully complex AFs were derived independently by many authors. It was verified in [23] that split-complex BP is a special case of fully complex BP, which revealed why split-complex BP cannot make full use of the information in the real and imaginary components of the signal. The above derivations need to separately compute the partial derivatives of the error function with respect to the real and imaginary components of the network weights. Recently, using Wirtinger calculus [28], [29], fully complex gradient methods have been derived and rewritten in a very compact form [30], comparable to their real-valued counterparts.
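For intuition, the CLMS recursion can be sketched as follows for a hypothetical two-tap filter (the step size, data model, and tap values are illustrative; for a noiseless desired signal d = w_true^H x, the update w ← w + μ x e* follows from the Wirtinger derivative of |e|^2):

```python
import numpy as np

# Sketch of the CLMS recursion for a hypothetical 2-tap filter.
# Data model (illustrative): desired signal d = w_true^H x, no noise.
rng = np.random.default_rng(0)
w_true = np.array([0.5 - 0.2j, -0.3 + 0.8j])

w = np.zeros(2, dtype=complex)
mu = 0.1                             # illustrative step size
for _ in range(2000):
    x = rng.standard_normal(2) + 1j * rng.standard_normal(2)
    d = np.vdot(w_true, x)           # w_true^H x (vdot conjugates its 1st arg)
    e = d - np.vdot(w, x)            # a priori estimation error
    w = w + mu * x * np.conj(e)      # CLMS update: w <- w + mu * x * e*
```

In the noiseless setting the weight error contracts geometrically, so `w` approaches `w_true`.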
Due to the restrictions imposed by complex-valued activation functions, a theoretical analysis clarifying under which conditions the complex gradient training method for CVNNs converges is very important and desirable for real applications of CVNNs. However, though Wirtinger calculus allows complex GTM to be expressed in a manner similar to real GTM, the convergence of complex GTM cannot be directly obtained from the existing convergence results for real GTM, because many mathematical analysis tools in the real domain, such as the traditional mean value theorem, do not hold in the complex domain. For example, let f(w) = exp(w), w1 = 0 and w2 = 2πi; then f(w2) − f(w1) = 0, but f′(w) = exp(w) ≠ 0 for all w, so no mean value point exists. There have been some convergence results for complex GTM and its variants, such as the split-complex gradient training method [31], [32], [33], the fully complex gradient training method [34], [35], the augmented fully complex gradient training method [36], and complex filters [37], [38]. Moreover, the advantages of a complex step size for complex GTM have also been discussed [39], [40]. However, similar to the case of real GTM, deterministic convergence results for complex mini-batch GTM are still lacking.
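The failure of the real mean value theorem in the complex domain can be checked numerically with the classical example f(w) = exp(w):

```python
import cmath

# The real mean value theorem fails in C: take f(w) = exp(w) on the
# segment from w1 = 0 to w2 = 2*pi*i.
f = cmath.exp
w1, w2 = 0j, 2j * cmath.pi
diff = f(w2) - f(w1)      # exp is 2*pi*i-periodic, so this is 0
# Yet f'(w) = exp(w) and |exp(w)| = exp(Re w) > 0 everywhere, so no c on
# the segment can satisfy f(w2) - f(w1) = f'(c) * (w2 - w1).
```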
Motivated by the above issues, this paper aims to establish a deterministic convergence theory for the fully complex mini-batch gradient training method, which includes batch training and online training as two special cases. Specifically, we make the following contributions:
- Using Wirtinger calculus, three types of fully complex gradient methods will be derived: the fully complex batch gradient algorithm (FCBGA), the fully complex online gradient algorithm (FCOGA), and the fully complex mini-batch gradient algorithm (FCMBGA). Different from [26], [30], during the derivation we do not need the activation function to satisfy Schwarz's symmetry property.
- A Taylor mean value theorem in integral form will be established. With this theorem, we do not need to drop the higher-order terms of the Taylor series in our analysis [41]; instead, we can give an accurate estimate of the difference of the error function between two successive iterations.
- Deterministic convergence results for FCMBGA will be established, covering both weak convergence and strong convergence. During the convergence analysis, we do not need the activation function to be a contraction mapping, which is an indispensable condition in the convergence analysis of [37]. Moreover, our results are global in nature, in that they are valid for arbitrarily given initial values of the weights.
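For orientation, the classical single-variable Taylor formula with integral remainder, of which the theorem announced above is a complex (Wirtinger) analogue, reads (this sketch assumes f is twice continuously differentiable on a segment containing a and b; the paper's exact statement is given in the Proofs section):

```latex
f(b) = f(a) + f'(a)(b-a) + (b-a)^2 \int_0^1 (1-t)\, f''\bigl(a + t(b-a)\bigr)\, \mathrm{d}t .
```

Because the remainder is an exact integral rather than a truncated series, no higher-order terms need to be discarded in the analysis.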
The rest of this paper is organized as follows. The network structure and the learning algorithms are described in Section 3. Section 4 presents some assumptions and our main theorem. The detailed proof of the theorem is given in Section 5. Numerical simulations to support our theoretical findings are presented in Section 6. Section 7 concludes this paper.
Section snippets
Network structure and fully complex gradient algorithms
In this section, we first give a brief description of Wirtinger calculus, then describe the network structure and derive three fully complex gradient algorithms: the fully complex batch gradient algorithm (FCBGA), the fully complex online gradient algorithm (FCOGA), and the fully complex mini-batch gradient algorithm (FCMBGA). Though we are mainly concerned with FCMBGA in this paper, the derivation procedure here helps to clarify the close relationship among the three algorithms.
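As a toy illustration only (a single complex neuron y = tanh(w^T x); the paper's networks, notation, and error function may differ), the block-wise update underlying FCMBGA can be sketched as:

```python
import numpy as np

def fcmbga(X, D, w, eta=0.05, batch_size=2, epochs=500):
    """Toy fully complex mini-batch gradient descent for a single neuron
    y = tanh(w^T x) with E = 1/2 * sum |y - d|^2.  Since tanh is analytic,
    the Wirtinger gradient is dE/dw* = (y - d) * conj(f'(u)) * conj(x),
    where u = w^T x and f'(u) = 1 / cosh(u)^2."""
    n = len(X)
    for _ in range(epochs):
        for s in range(0, n, batch_size):
            g = np.zeros_like(w)
            for x, d in zip(X[s:s + batch_size], D[s:s + batch_size]):
                u = np.dot(w, x)                 # w^T x (no conjugation)
                e = np.tanh(u) - d
                g += e * np.conj(1.0 / np.cosh(u) ** 2) * np.conj(x)
            w = w - eta * g                      # one update per block
    return w
```

Setting `batch_size` to the whole sample size or to 1 recovers the batch (FCBGA) and online (FCOGA) special cases, respectively.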
Main results
In this section, we will present some convergence results for FCMBGA. Our results are also valid for FCBGA and FCOGA because they are just two special cases of FCMBGA.
The following assumptions are needed in our convergence analysis.
(A1) There exists a constant such that for all ;
(A2) The functions and are analytic in ;
(A3) ;
(A4) The set contains only finitely many points.
Remark 1
Assumption (A1) is a common condition for the convergence
Proofs
We first list several lemmas which are crucial to our convergence analysis.

Lemma 1

Let be analytic in the set S. If and the segment joining them is also in S, then

Proof

Define . As is analytic in S, the function is differentiable to any order with respect to t. Applying Taylor's theorem for real-variable functions to , we have that Since
Illustrated simulation
In this section, our theoretical results are experimentally verified on a wind speed prediction problem, a benchmark problem for CVNNs.
The wind data was collected in an urban environment over a one-day period and was represented as a vector of speed and direction in the North-East coordinate system. By combining the wind speeds in the North and East directions, the wind signal was formed as a complex signal v = v_north + i v_east. For convenience, we randomly select 1000 samples from
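In code, forming the complex wind signal from the two coordinate components amounts to the following (the numeric values are made up for illustration):

```python
import numpy as np

north = np.array([2.0, 1.5, -0.3])   # illustrative north-component speeds (m/s)
east = np.array([0.5, -1.0, 2.2])    # illustrative east-component speeds (m/s)
wind = north + 1j * east             # complex signal v = v_north + i * v_east
speed = np.abs(wind)                 # modulus  -> wind speed
direction = np.angle(wind)           # argument -> wind direction (radians)
```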
Conclusions
In this paper, we have investigated the convergence of the fully complex mini-batch gradient algorithm when the training examples are fed to the network in a fixed order. We first derived three fully complex gradient training algorithms, FCBGA, FCOGA and FCMBGA, without assuming that the activation function satisfies Schwarz's symmetry property. Then we established a unified convergence theorem which is valid for all the above algorithms. We have shown that, under mild conditions, the
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
Huisheng Zhang: Conceptualization, Methodology, Supervision, Writing - original draft. Ying Zhang: Methodology, Writing - review & editing. Shuai Zhu: Software, Investigation. Dongpo Xu: Resources, Writing - review & editing.
Acknowledgment
This work was supported by the National Natural Science Foundation of China (No. 61671099) and the Fundamental Research Funds for the Central Universities of China (No. 3132019323).
Huisheng Zhang received the M.S. degree from Xiamen University in 2003 and Ph.D. degree from Dalian University of Technology in 2009. From April 2014 to March 2015, he was financially supported by China Scholarship Council (CSC) to work as a visiting scholar at the Imperial College London (ICL), UK. He is currently a professor of Dalian Maritime University. His research interests include neural networks, signal processing and learning theory.
References (44)

- Theoretical analysis of batch and on-line training for gradient descent learning in neural networks, Neurocomputing (2009).
- The general inefficiency of batch training for gradient descent learning, Neural Netw. (2003).
- Convergence of an online gradient method with inner-product penalty and adaptive momentum, Neurocomputing (2012).
- Convergence analysis of online gradient method for BP neural networks, Neural Netw. (2011).
- Complex-valued forecasting of wind profile, Renew. Energy (2006).
- An adaptive neuro-complex-fuzzy-inferential modeling mechanism for generating higher-order TSK models, Neurocomputing (2019).
- An extension of the back-propagation algorithm to complex numbers, Neural Netw. (1997).
- Convergence analysis of the batch gradient-based neuro-fuzzy learning algorithm with smoothing L-1/2 regularization for the first-order Takagi-Sugeno system, Fuzzy Sets Syst. (2017).
- Fully complex conjugate gradient-based neural networks using Wirtinger calculus framework: Deterministic convergence and its application, Neural Netw. (2019).
- Performance analysis of the deficient length augmented CLMS algorithm for second order noncircular complex signals, Signal Process. (2018).
- Adaptive complex-valued stepsize based fast learning of complex-valued neural networks, Neural Netw.
- Learning Representations by Backpropagating Errors.
- simple neural nets for handwritten digit recognition, Neural Comput.
- Parameter convergence and learning curves for neural networks, Neural Comput.
- Convergence analyses on on-line weight noise injection-based training algorithms for MLPs, IEEE Trans. Neural Netw. Learn. Syst.
- Boundedness and convergence of online gradient method with penalty for feedforward neural networks, IEEE Trans. Neural Netw.
- Convergence of a batch gradient algorithm with adaptive momentum for neural networks, Neural Process. Lett.
- Convergence of cyclic and almost-cyclic learning with momentum for feedforward neural networks, IEEE Trans. Neural Netw.
- Convergence of gradient method with momentum for two-layer feedforward neural networks, IEEE Trans. Neural Netw.
- Generalization characteristics of complex-valued feedforward neural networks in relation to signal coherence, IEEE Trans. Neural Netw. Learn. Syst.
- The generalized complex kernel least-mean-square algorithm, IEEE Trans. Signal Process.
Ying Zhang received the M.S. degree from Northeast Normal University in 2003 and Ph.D. degree from Dalian Maritime University in 2010. She is currently an associate professor of Dalian Maritime University. Her research interests include algebra, neural networks, cryptography, and coding theory.
Shuai Zhu is currently pursuing the Master's degree at Dalian Maritime University, Dalian 116026, with research interests in neural networks and numerical analysis.
Dongpo Xu received the B.S. degree in applied mathematics from Harbin Engineering University, Harbin, China, in 2004, and the Ph.D. degree in computational mathematics from the Dalian University of Technology, Dalian, China, in 2009.He was a Visiting Scholar with the Department of Electrical and Electronic Engineering, Imperial College London, London, U.K., from 2013 to 2014. He is currently an associate professor with the School of Mathematics and Statistics, Northeast Normal University, Changchun, China. His current research interests include neural networks, machine learning, and signal processing.