Neurocomputing
Volume 407, 24 September 2020, Pages 185-193

Deterministic convergence of complex mini-batch gradient learning algorithm for fully complex-valued neural networks

https://doi.org/10.1016/j.neucom.2020.04.114

Abstract

This paper investigates the fully complex mini-batch gradient algorithm for training complex-valued neural networks. The mini-batch gradient method has been widely used in neural network training; however, its convergence analysis is usually restricted to real-valued neural networks and is probabilistic in nature. By introducing a new Taylor mean value theorem for analytic functions, we establish deterministic convergence results for the fully complex mini-batch gradient algorithm under mild conditions. Deterministic convergence here means that convergence is guaranteed rather than holding only in a probabilistic sense, and both weak convergence and strong convergence are proved. Benefiting from the newly introduced mean value theorem, our results are global in nature in that they are valid for arbitrarily given initial values of the weights. The theoretical findings are validated with a simulation example.

Introduction

The gradient training method (GTM) and its variants have been the backbone of training multilayer feedforward neural networks since the backpropagation algorithm (BPA) was proposed [1], and their effectiveness has been further verified by the recent remarkable progress in neural network research, where deep neural networks [2] were successfully trained with the standard BPA. There are three practical modes for implementing the backpropagation algorithm [3]: batch mode, online mode, and mini-batch mode. To obtain the exact gradient direction, the batch mode accumulates the weight corrections over all the training samples before performing an update. The online mode, in contrast, uses an approximate gradient direction and updates the network weights immediately after each training sample is fed. The mini-batch mode is a hybrid of the batch and online approaches: during each iteration, it computes the gradient over a block of samples and updates the weights after these samples are presented. It has been shown that the online mode and the mini-batch mode with a small batch (block) size usually enjoy faster training and higher generalization accuracy than the batch mode, especially on large training sets [4]. The mini-batch gradient method has become a popular learning method for deep networks [5].
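
As a concrete illustration of the three update modes (a minimal sketch in our own notation, with a linear model and squared loss standing in for the network), the only difference between them is the block of samples over which the gradient is accumulated before each weight update:

    import numpy as np

    def grad(w, X, y):
        """Gradient of a squared loss for a linear model; stands in for backpropagation."""
        return X.T @ (X @ w - y) / len(y)

    def train(w, X, y, eta=0.05, batch_size=None, epochs=50):
        """batch_size=None -> batch mode; 1 -> online mode; k>1 -> mini-batch mode."""
        n = len(y)
        bs = n if batch_size is None else batch_size
        for _ in range(epochs):
            for start in range(0, n, bs):
                Xb, yb = X[start:start + bs], y[start:start + bs]
                w = w - eta * grad(w, Xb, yb)   # update after each block of samples
        return w

    # example: mini-batch mode with block size 8 (all names here are illustrative)
    X = np.random.default_rng(0).standard_normal((64, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    w_hat = train(np.zeros(3), X, y, batch_size=8)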

The convergence of GTM can be affected by many factors, such as the learning mode, the learning rate, the activation functions, and the initial weights. For real applications, it is necessary and interesting to clarify under which conditions the convergence of GTM can be guaranteed. The batch GTM essentially corresponds to the standard gradient descent method, so its convergence is deterministic in nature and can be guaranteed by classical optimization theory. Online GTM and mini-batch GTM, on the other hand, behave in a stochastic manner, and their convergence has mostly been shown to be probabilistic in nature [6], [7], [8]. This probabilistic nature may cause learning to fail. It is therefore desirable for online GTM and mini-batch GTM to also enjoy deterministic convergence. Recently, great efforts have been devoted to the deterministic convergence of online GTM, and fruitful theoretical results have been obtained [9], [10], [11], [12], [13]. However, a deterministic convergence analysis for mini-batch GTM is still lacking.

In recent decades, complex-valued neural networks (CVNNs) have been an active research topic owing to their powerful ability in application areas such as telecommunications, speech recognition, image processing, series forecasting and others, where the data to be processed are complex-valued [14], [15], [16], [17], [18], [19], [20], [21]. Complex gradient training methods, which extend GTM from the real domain to the complex domain, have been developed to train CVNNs. The mathematical difficulties of this extension are mainly caused by Liouville's theorem (an entire and bounded function in the complex domain is a constant) and by the Cauchy-Riemann conditions required for an entire function. Liouville's theorem leads to a conflict between the boundedness and the differentiability of the activation function (AF), while the Cauchy-Riemann conditions make differentiability in the complex domain quite stringent. As a result, for CVNNs, the choice of the activation function and the derivation of gradient-based training methods are more complicated than their counterparts for real-valued neural networks (RVNNs).
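
For completeness, writing $f(z)=u(x,y)+iv(x,y)$ with $z=x+iy$, the Cauchy-Riemann conditions mentioned above require
$$\frac{\partial u}{\partial x}=\frac{\partial v}{\partial y},\qquad \frac{\partial u}{\partial y}=-\frac{\partial v}{\partial x},$$
and Liouville's theorem then implies that no nonconstant AF can be simultaneously entire and bounded on the whole complex plane, which is exactly the conflict noted above.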

CVNNs can be equipped with two types of AFs. The first type is the split-complex AFs [22], where two real-valued functions process the real part and the imaginary part of the signal fed to the neuron separately. The other type is the fully complex AFs [23]. Although split-complex AFs can avoid singular points, algorithms employing fully complex activation functions tend to yield better performance, especially in the common case that the complex signal is strongly correlated [24]. It has been shown that fully complex CVNNs with continuous elementary transcendental functions (ETFs) as AFs over a compact set in the complex vector field are universal approximators of any continuous complex mapping [23], which makes ETFs popular as AFs in the CVNN research community.
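
As a concrete illustration (our own notation, not drawn from the cited works), the two types of AF can be contrasted as
$$f_{\mathrm{split}}(z)=\sigma(\operatorname{Re}z)+i\,\sigma(\operatorname{Im}z),\qquad f_{\mathrm{full}}(z)=\tanh(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}},$$
where $\sigma$ is a real-valued sigmoid and $\tanh$ is a typical ETF used as a fully complex AF; the latter is analytic everywhere except at the isolated singular points $z=i(\pi/2+k\pi)$, $k\in\mathbb{Z}$.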

The derivation of complex gradient-based training methods can be traced back to the complex least mean square (CLMS) algorithm introduced in [25]. Later, the split-complex backpropagation (BP) training algorithm for CVNNs with split-complex AFs and the fully complex BP training algorithm [26], [27] for CVNNs with fully complex AFs were derived independently by many authors. It was verified in [23] that split-complex BP is a special case of fully complex BP, which reveals why split-complex BP cannot make full use of the information in the real and imaginary components of the signal. The above derivations require separately computing the partial derivatives of the error function with respect to the real and imaginary components of the network weights. Recently, using Wirtinger calculus [28], [29], fully complex gradient methods have been derived and rewritten in a very compact form [30], comparable to their real-valued counterparts.
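
For reference, and stated in our own notation rather than the paper's, the Wirtinger derivatives with respect to $z=x+iy$ are
$$\frac{\partial}{\partial z}=\frac{1}{2}\Big(\frac{\partial}{\partial x}-i\frac{\partial}{\partial y}\Big),\qquad \frac{\partial}{\partial z^{*}}=\frac{1}{2}\Big(\frac{\partial}{\partial x}+i\frac{\partial}{\partial y}\Big),$$
and, since a real-valued error function $E$ satisfies $\partial E/\partial \mathbf{w}^{*}=(\partial E/\partial \mathbf{w})^{*}$, the compact steepest-descent update reads
$$\mathbf{w}^{m+1}=\mathbf{w}^{m}-\eta\,\frac{\partial E}{\partial \mathbf{w}^{*}}\Big|_{\mathbf{w}=\mathbf{w}^{m}}.$$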

Owing to the restrictions on complex-valued activation functions, a theoretical analysis clarifying under which conditions the complex gradient training method for CVNNs converges is very important and desirable for real applications of CVNNs. However, although Wirtinger calculus allows complex GTM to be expressed in a manner similar to real GTM, the convergence of complex GTM cannot be directly obtained from the existing convergence results for real GTM, because many mathematical analysis tools in the real domain, such as the traditional mean value theorem, do not hold in the complex domain. For example, let $f(z)=e^{z}$ and $z_2=z_1+2\pi i$; then $f(z_2)-f(z_1)=0$, but $(z_2-z_1)f'(w)=2\pi i\,e^{w}\neq 0$ for all $w$. There have been some convergence results for complex GTM and its variants, such as the split-complex gradient training method [31], [32], [33], the fully complex gradient training method [34], [35], the augmented fully complex gradient training method [36], and complex filters [37], [38]. Moreover, the advantages of a complex step size for complex GTM have also been discussed [39], [40]. However, similar to the case of real GTM, deterministic convergence results for complex mini-batch GTM are still lacking.
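
A quick numerical check of this counterexample (an illustrative snippet, not part of the paper):

    import cmath, math

    f = cmath.exp                        # f(z) = e^z, so f'(z) = e^z as well
    z1 = 0.3 + 0.7j
    z2 = z1 + 2j * math.pi               # z2 = z1 + 2*pi*i

    print(abs(f(z2) - f(z1)))            # ~1e-15: the left-hand side vanishes
    w = 0.1 - 0.2j                       # an arbitrary candidate mean value point
    print(abs((z2 - z1) * f(w)))         # 2*pi*|e^w| > 0: no w can close the gap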

Motivated by the above issues, this paper aims to establish a deterministic convergence theory for fully complex mini-batch gradient training method, which includes the batch training and the online training as two special cases. Specifically, we will make the following contributions:

  • Using Wirtinger calculus, three types of fully complex gradient methods are derived: the fully complex batch gradient algorithm (FCBGA), the fully complex online gradient algorithm (FCOGA), and the fully complex mini-batch gradient algorithm (FCMBGA). Unlike [26], [30], our derivation does not require the activation function to satisfy Schwarz's symmetry property.

  • A Taylor mean value theorem in integral form is established. With this theorem, we do not need to drop the higher-order terms of the Taylor series in our analysis [41]; instead, we can give an accurate estimate of the change of the error function between two successive iterations.

  • Deterministic convergence results for FCMBGA are established, covering both weak convergence and strong convergence. In the convergence analysis, we do not require the activation function to be a contraction mapping, which is an indispensable condition in the convergence analysis of [37]. Moreover, our results are global in nature in that they are valid for arbitrarily given initial values of the weights.

The rest of this paper is organized as follows. The network structure and the learning algorithms are described in Section 2. Section 3 presents our assumptions and the main theorem. The detailed proof of the theorem is given in Section 4. Numerical simulations supporting our theoretical findings are presented in Section 5. Section 6 concludes this paper.

Section snippets

Network structure and fully complex gradient algorithms

In this section, we first give a brief description of Wirtinger calculus, and then describe the network structure and derive three fully complex gradient algorithms: the fully complex batch gradient algorithm (FCBGA), the fully complex online gradient algorithm (FCOGA), and the fully complex mini-batch gradient algorithm (FCMBGA). Although we are mainly concerned with FCMBGA in this paper, the derivation procedure helps to clarify the close relationship among the three algorithms.
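
To indicate how such an update can look in practice, the following minimal sketch implements a mini-batch step for a single fully complex neuron with a tanh AF; the network structure, variable names, data, and batch size are illustrative assumptions, not the paper's exact model:

    import numpy as np

    def fcmbga_step(w, Z, d, eta, f=np.tanh, df=lambda u: 1 - np.tanh(u) ** 2):
        """One fully complex mini-batch gradient step for a single complex neuron (a sketch).

        w: complex weight vector; Z: (batch, dim) complex inputs; d: complex targets.
        Descends E = sum |f(Z w) - d|^2 along the Wirtinger (conjugate) gradient.
        """
        u = Z @ w                                        # net inputs for the mini-batch
        e = f(u) - d                                     # complex output errors
        grad_conj = Z.conj().T @ (np.conj(df(u)) * e)    # dE/dw* by the chain rule
        return w - eta * grad_conj

    # usage: cycle through the training set block by block (batch size 10)
    rng = np.random.default_rng(0)
    Z = rng.standard_normal((100, 3)) + 1j * rng.standard_normal((100, 3))
    d = np.tanh(Z @ np.array([0.2 + 0.1j, -0.3j, 0.5]))
    w = np.zeros(3, dtype=complex)
    for epoch in range(50):
        for s in range(0, len(d), 10):
            w = fcmbga_step(w, Z[s:s + 10], d[s:s + 10], eta=0.05)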

Main results

In this section, we will present some convergence results for FCMBGA. Our results are also valid for FCBGA and FCOGA because they are just two special cases of FCMBGA.

The following assumptions are needed in our convergence analysis.

  • (A1) There exists a constant $c_1>0$ such that $\|\mathbf{w}^m\|\le c_1$ for all $m=0,1,\ldots$;

  • (A2) The functions $f(z)$ and $g(z)$ are analytic in $\mathbb{C}$;

  • (A3) $\eta_n>0$, $\sum_{n=0}^{\infty}\eta_n=\infty$, $\sum_{n=0}^{\infty}\eta_n^{2}<\infty$ (an illustrative example follows this list);

  • (A4) The set $\Phi=\{\mathbf{w}:\nabla_{\mathbf{w}}E(\mathbf{w})=0\}$ contains only finitely many points.
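
A familiar learning-rate schedule satisfying (A3), given here as an illustration (the constant $a>0$ is arbitrary), is the harmonic-type step size
$$\eta_n=\frac{a}{n+1}:\qquad \sum_{n=0}^{\infty}\frac{a}{n+1}=\infty,\qquad \sum_{n=0}^{\infty}\frac{a^{2}}{(n+1)^{2}}=\frac{a^{2}\pi^{2}}{6}<\infty.$$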

Remark 1

Assumption (A1) is a common condition for the convergence …

Proofs

We first list several lemmas which are crucial to our convergence analysis.

Lemma 1

Let $\varphi: S\subseteq\mathbb{C}\to\mathbb{C}$ be analytic in the set $S$. If $z, z_0\in S$ and the segment joining them also lies in $S$, then
$$\varphi(z)=\varphi(z_0)+\varphi'(z_0)(z-z_0)+(z-z_0)^{2}\int_0^1(1-t)\,\varphi''\big(z_0+t(z-z_0)\big)\,dt.$$

Proof

Define $u(t)=z_0+t(z-z_0)$, $0\le t\le 1$. As $\varphi$ is analytic in $S$, the function $\phi(t)=\varphi(u(t))$ is differentiable to any order with respect to $t$. Applying Taylor's theorem for real-variable functions to $\phi(t)$, we have that
$$\phi(1)=\phi(0)+\phi'(0)+\int_0^1(1-t)\,\phi''(t)\,dt.$$
Since
$$\phi'(t)=\frac{d}{dt}\,\varphi\big(z_0+t(z-z_0)\big)=(z-z_0)\,\varphi'\big(z_0+t(z-z_0)\big)\ldots$$
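
Differentiating once more and substituting into the Taylor formula above yields the lemma (a reconstruction of the remaining step under the stated assumptions, not the authors' verbatim text):
$$\phi''(t)=(z-z_0)^{2}\,\varphi''\big(z_0+t(z-z_0)\big),\qquad \phi(1)=\varphi(z),\quad \phi(0)=\varphi(z_0),\quad \phi'(0)=(z-z_0)\,\varphi'(z_0),$$
so that
$$\varphi(z)=\varphi(z_0)+(z-z_0)\,\varphi'(z_0)+(z-z_0)^{2}\int_0^1(1-t)\,\varphi''\big(z_0+t(z-z_0)\big)\,dt.$$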

Illustrated simulation

In this section, our theoretical results are experimentally verified by a wind speed prediction problem, which is a benchmark for CVNNs.

The wind data was collected in an urban environment over a one-day period and represented as a vector of speed and direction in the North-East coordinate system. Combining the wind speeds in the north and east directions forms the complex wind signal z = north + i·east. For convenience, we randomly select 1000 samples from …
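
A minimal sketch of how such a complex wind signal and the training samples can be assembled (synthetic placeholder data and a sliding-window setup of our own choosing stand in for the measured record and the paper's actual preprocessing):

    import numpy as np

    rng = np.random.default_rng(0)
    # placeholders for the measured one-day wind record (north/east speed components)
    t = np.linspace(0, 2 * np.pi, 2000)
    north = np.sin(t) + 0.1 * rng.standard_normal(t.size)
    east = np.cos(t) + 0.1 * rng.standard_normal(t.size)
    z = north + 1j * east                 # complex wind signal z = north + i*east

    # one-step-ahead prediction: sliding window of past values as network input
    L = 4                                 # window length, illustrative
    Z = np.array([z[k - L:k] for k in range(L, len(z))])
    d = z[L:]                             # target: next value of the signal

    # randomly select 1000 samples for training, as in the experiment
    idx = rng.choice(len(d), size=1000, replace=False)
    Z_train, d_train = Z[idx], d[idx]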

Conclusions

In this paper we have investigated the convergence of the fully complex mini-batch gradient algorithm when the training examples are fed into the network in a fixed order. We first derived three fully complex gradient training algorithms, FCBGA, FCOGA and FCMBGA, without assuming that the activation function satisfies Schwarz's symmetry property. We then established a unified convergence theorem that is valid for all of these algorithms. We have shown that, under mild conditions, the …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Huisheng Zhang: Conceptualization, Methodology, Supervision, Writing - original draft. Ying Zhang: Methodology, Writing - review & editing. Shuai Zhu: Software, Investigation. Dongpo Xu: Resources, Writing - review & editing.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (No. 61671099) and the Fundamental Research Funds for the Central Universities of China (No. 3132019323).


References (44)

  • Y. Zhang et al., Adaptive complex-valued stepsize based fast learning of complex-valued neural networks, Neural Netw. (2020)

  • D.E. Rumelhart et al., Learning Representations by Backpropagating Errors (1988)

  • D.C. Ciresan et al., Deep, big, simple neural nets for handwritten digit recognition, Neural Comput. (2010)

  • X. Peng, L. Li, F. Wang, Accelerating minibatch stochastic gradient descent using typicality sampling, IEEE Trans....

  • T.L. Fine et al., Parameter convergence and learning curves for neural networks, Neural Comput. (1999)

  • J. Sum et al., Convergence analyses on on-line weight noise injection-based training algorithms for MLPs, IEEE Trans. Neural Netw. Learn. Syst. (2012)

  • H. Zhang et al., Boundedness and convergence of online gradient method with penalty for feedforward neural networks, IEEE Trans. Neural Netw. (2009)

  • H. Shao et al., Convergence of a batch gradient algorithm with adaptive momentum for neural networks, Neural Process. Lett. (2011)

  • J. Wang et al., Convergence of cyclic and almost-cyclic learning with momentum for feedforward neural networks, IEEE Trans. Neural Netw. (2011)

  • N. Zhang et al., Convergence of gradient method with momentum for two-layer feedforward neural networks, IEEE Trans. Neural Netw. (2006)

  • A. Hirose et al., Generalization characteristics of complex-valued feedforward neural networks in relation to signal coherence, IEEE Trans. Neural Netw. Learn. Syst. (2012)

  • R. Boloix-Tortosa et al., The generalized complex kernel least-mean-square algorithm, IEEE Trans. Signal Process. (2019)

    Huisheng Zhang received the M.S. degree from Xiamen University in 2003 and Ph.D. degree from Dalian University of Technology in 2009. From April 2014 to March 2015, he was financially supported by China Scholarship Council (CSC) to work as a visiting scholar at the Imperial College London (ICL), UK. He is currently a professor of Dalian Maritime University. His research interests include neural networks, signal processing and learning theory.

    Ying Zhang received the M.S. degree from Northeast Normal University in 2003 and Ph.D. degree from Dalian Maritime University in 2010. She is currently an associate professor of Dalian Maritime University. Her research interests include algebra, neural networks, cryptography, and coding theory.

    Shuai Zhu is currently pursuing the Master's degree at Dalian Maritime University, Dalian 116026, with research interests in neural networks and numerical analysis.

    Dongpo Xu received the B.S. degree in applied mathematics from Harbin Engineering University, Harbin, China, in 2004, and the Ph.D. degree in computational mathematics from the Dalian University of Technology, Dalian, China, in 2009. He was a Visiting Scholar with the Department of Electrical and Electronic Engineering, Imperial College London, London, U.K., from 2013 to 2014. He is currently an associate professor with the School of Mathematics and Statistics, Northeast Normal University, Changchun, China. His current research interests include neural networks, machine learning, and signal processing.
