A Hessian-Free Gradient Flow (HFGF) method for the optimisation of deep learning neural networks
Introduction
Deep neural networks (DNNs) have wide applications in many research fields, including autonomous driving (Sallab et al., 2017), speech recognition (Agrawal et al., 2019; Krishna et al., 2019), computer vision (Bao et al., 2019), natural language processing (Young et al., 2018), and bioinformatics (Min et al., 2017). The performance of a deep neural network depends strongly on its training process, which has been the focus of much recent research (Yan et al., 2019; Xu et al., 2019). Training a neural network is essentially the optimisation of a complicated, non-convex loss function with respect to its parameters. Because of the high dimensionality of the input data and the complicated functional forms of DNNs, optimising DNNs to high precision at acceptable computational speed poses a serious challenge in modern applications, as demands grow in both model dimensionality (the number of inputs and outputs) and the internal complexity of the networks.
Optimisation methods for neural networks fall into two categories: stochastic methods and deterministic methods (Le et al., 2011). Stochastic methods have been heavily adopted in industrial applications because of their low computational cost and easy implementation (Fouskakis and Draper, 2002), and research into them has been ongoing since their inception six decades ago (Robbins and Monro, 1951; Jin et al., 2019; Wen et al., 2019; Denevi et al., 2019). From basic Stochastic Gradient Descent (SGD), a multitude of variants has been developed, including SGD with momentum (Ruder, 2016) and SGD with adaptive learning rates (Klein et al., 2009). Stochastic methods are widely used in the hope of locating the global minimum, i.e. the globally optimal tuning of the network's internal parameters. Since DNN objective functions are non-convex in nature, finding the global solution as the dimensionality of the parameter space grows is a combinatorially hard computational problem. Stochastic methods have the advantage of simplicity; their disadvantages are that the search direction often zig-zags and the minimum reached is often inexact. Moreover, they offer no guarantee of global optimality and cannot even certify that the point they converge to is a local minimum.
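As a concrete illustration of the stochastic family mentioned above, the classic SGD-with-momentum update can be sketched as follows. This is a minimal sketch on a generic quadratic objective, not tied to any DNN framework; the function and parameter names are illustrative only.

```python
import numpy as np

def sgd_momentum(grad, x0, lr=0.1, beta=0.9, n_steps=300):
    """Heavy-ball momentum update: v <- beta*v - lr*grad(x); x <- x + v."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_steps):
        v = beta * v - lr * grad(x)
        x = x + v
    return x

# Usage: minimise the convex quadratic f(x) = ||x||^2, whose gradient is 2x;
# the iterates spiral (zig-zag) towards the minimiser at the origin.
x_min = sgd_momentum(lambda x: 2.0 * x, x0=[3.0, -2.0])
```

On a non-convex DNN loss the same update carries no global-optimality guarantee, which is exactly the limitation discussed above.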
With recent advances in hardware capability, deterministic methods are rising in significance. The mainstream deterministic method used extensively for training DNNs is L-BFGS (Liu and Nocedal, 1989) and its variants (Zhu et al., 1997), which remain widely used in current research (Yatawatta et al., 2019; Carrara et al., 2019). Their disadvantage is that, because a DNN's objective function is non-convex, they cannot guarantee global optimality (Floudas and Pardalos, 2013); they can, however, guarantee convergence to a local minimum of the least-squares cost function.
Stochastic methods have been widely researched in the past. Since Robbins and Monro (1951) established the theoretical effectiveness of SGD, many variants have become the centre of attention. Duchi et al. (2011) justified the convergence of adaptive learning-rate methods in a convex setting. Later, Ward et al. (2018) proved that one such method, AdaGrad-Norm, converges to a stationary point even in a non-convex setting. Many optimisers in the adaptive learning-rate family have been studied, including RMSprop (Hinton et al., 2012), AdaDelta (Zeiler, 2012), Adam (Kingma and Ba, 2014), AdaFTRL (Orabona and Pál, 2015), SGD-BB (Tan et al., 2016), AdaBatch (Défossez and Bach, 2017), SC-Adagrad (Mukkamala and Hein, 2017), AMSGrad (Reddi et al., 2018), and Padam (Chen and Gu, 2018). Beyond convergence, work such as Du et al. (2018) has also proved global optimality of the converged point, but only in a theoretical context.
Research into deterministic methods has been limited by the computational cost of storing second-order information. Ghorbani et al. (2019) used the Hessian matrix of the loss function to understand the dynamics of neural-network optimisation and studied its eigenvalue spectrum in depth, although those results have not yet been applied in a real-world case study. This work nevertheless accentuated the importance of Hessian information for achieving optimality, showing how the Hessian matrix determines the speed of convergence and the generalisation properties, and it has inspired research into second-order methods for training neural networks (Ghorbani et al., 2019).
This paper proposes a new method that adopts approximated second-order information in a quasi-Newton scheme to optimise DNNs. The method is derived from a linearised version of the gradient flow method and uses finite differences to approximate the action of the Hessian matrix. We test the optimiser on the MNIST and OILDROPLET datasets; the former uses a deep convolutional neural network (CNN) and the latter a conventional deep neural network (DNN). The CNN architecture is state of the art and the DNN is fine-tuned, but the focus is on the optimiser's performance compared with conventional optimisers such as SGD, Adam, and L-BFGS.
In this paper, Section 2 derives the proposed quasi-Newton method, named HFGF. Section 3 provides a proof of convergence for the novel method. Section 4 evaluates and analyses the performance of the method on testing functions. Section 5 applies the method to the real-life MNIST and OILDROPLET datasets, adopting an optimised DNN architecture in each case.
Section snippets
Algorithm overview
We propose a method that adopts approximated second-order information to perform an optimisation task. To derive the optimisation method, we first write the general update rule as follows:
$x_{k+1} = x_k + \Delta t_k \, d_k.$
The gradient descent method uses the negative gradient vector as search direction:
$d_k = -\nabla f(x_k).$
The limit $\Delta t \to 0$ gives the smooth trajectory of the gradient flow method:
$\frac{dx(t)}{dt} = -\nabla f(x(t)).$
Linearising the right-hand side about $x_k$, we obtain:
$\frac{dx(t)}{dt} \approx -\nabla f(x_k) - \nabla^2 f(x_k)\,\left(x(t) - x_k\right).$
Rewriting the gradient vector and
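The linearised right-hand side involves the Hessian $\nabla^2 f(x_k)$ only through its product with a vector, and such products can be formed "Hessian-free" via a finite difference of gradients, $H(x)v \approx (\nabla f(x+\varepsilon v) - \nabla f(x))/\varepsilon$. The sketch below illustrates that standard approximation; the names are illustrative and this is not the authors' implementation.

```python
import numpy as np

def hessian_vector_product(grad, x, v, eps=1e-6):
    """Finite-difference Hessian-vector product:
    H(x) v ~= (grad(x + eps*v) - grad(x)) / eps.
    Requires two gradient evaluations; the Hessian is never formed or stored."""
    x = np.asarray(x, dtype=float)
    v = np.asarray(v, dtype=float)
    return (grad(x + eps * v) - grad(x)) / eps

# Usage: for f(x) = x0^2 + 3*x1^2 the Hessian is diag(2, 6),
# so the product with v = [1, 1] should be [2, 6].
grad = lambda x: np.array([2.0 * x[0], 6.0 * x[1]])
hv = hessian_vector_product(grad, x=[1.0, 2.0], v=[1.0, 1.0])
```

For a quadratic the approximation is exact up to rounding; for general non-convex losses the step size `eps` trades truncation error against floating-point cancellation.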
Convergence analysis
Training ANNs and DNNs can be viewed as solving a large-scale optimisation problem of the form
$\min_{x \in \mathbb{R}^n} f(x),$
where $x \in \mathbb{R}^n$ is a real-valued $n$-dimensional vector of system variables to be optimised so as to minimise the scalar function $f(x)\colon \mathbb{R}^n \to \mathbb{R}$. We adopt the inexact Newton method to derive the proof of convergence.
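For intuition about the inexact Newton framework invoked here: an inexact Newton method solves the Newton system $H(x)\,d = -\nabla f(x)$ only approximately, e.g. with a few conjugate-gradient iterations, and when the Hessian is accessed only through finite-difference gradient products the whole step stays Hessian-free. The following is a minimal sketch under those assumptions, with illustrative names; it is not the HFGF algorithm itself.

```python
import numpy as np

def inexact_newton_step(grad, x, eps=1e-6, cg_iters=10, tol=1e-10):
    """Approximately solve H(x) d = -grad(x) by conjugate gradients,
    using finite differences of the gradient in place of the Hessian."""
    x = np.asarray(x, dtype=float)
    g = grad(x)
    hv = lambda v: (grad(x + eps * v) - grad(x)) / eps  # matrix-free H(x) v
    d = np.zeros_like(g)
    r = -g - hv(d)      # residual of H d = -g (hv(0) is 0)
    p = r.copy()
    for _ in range(cg_iters):
        if np.dot(r, r) < tol:
            break       # residual small enough: "inexact" stopping rule
        Hp = hv(p)
        alpha = np.dot(r, r) / np.dot(p, Hp)
        d = d + alpha * p
        r_new = r - alpha * Hp
        beta = np.dot(r_new, r_new) / np.dot(r, r)
        p = r_new + beta * p
        r = r_new
    return d

# Usage: for the quadratic f(x) = x0^2 + 3*x1^2 the exact Newton step
# from any x is -x, jumping straight to the minimiser at the origin.
grad = lambda x: np.array([2.0 * x[0], 6.0 * x[1]])
step = inexact_newton_step(grad, x=np.array([1.0, 2.0]))
```

Convergence proofs for such schemes typically bound the relative residual of the inner solve, which is the role of the `tol` parameter above.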
In this paper, we assume that f has an optimal value f(x*) at x*. We will use the following assumption about the objective function for the rest of this
Analysis of the algorithm
This section analyses the cost and performance of the HFGF algorithm. To test performance, we introduce several testing functions that are highly complex and non-convex in nature, chosen to simulate the complicated landscape of a neural network's objective function. The results were generated on a 2.3 GHz Intel Core i5 processor with 8 GB of 2133 MHz LPDDR3 memory.
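The specific testing functions used in the paper are not listed in this excerpt, but the Rosenbrock function is a standard example of such a complex, non-convex benchmark; it appears here purely as an illustration of the kind of landscape involved.

```python
import numpy as np

def rosenbrock(x, a=1.0, b=100.0):
    """Classic benchmark: f(x) = sum_i b*(x[i+1] - x[i]^2)^2 + (a - x[i])^2.
    A narrow, curved valley makes gradient methods zig-zag; the global
    minimum f = 0 sits at x = (a, ..., a)."""
    x = np.asarray(x, dtype=float)
    return np.sum(b * (x[1:] - x[:-1] ** 2) ** 2 + (a - x[:-1]) ** 2)

val_at_min = rosenbrock([1.0, 1.0, 1.0])    # global minimum: 0.0
val_elsewhere = rosenbrock([0.0, 0.0, 0.0]) # away from the minimum: 2.0
```

Benchmarks of this kind are catalogued in surveys such as the benchmark-function literature cited in the references.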
Applications
The analysis above provides a theoretical understanding of the algorithm's performance. In this section, the optimiser is applied to several test cases to determine its real-world performance: the derived algorithm is applied to the large-scale optimisation of DNNs to test its speed, robustness, and accuracy, confirming the efficacy of the optimiser on large-scale problems.
Conclusion and future work
We have demonstrated the derivation of a novel quasi-Newton optimisation method, named Hessian-Free Gradient Flow (HFGF) and designed for the optimisation of DNNs, together with a proof of convergence. We first tested HFGF on standard testing functions and compared it with other common optimisation algorithms to assess its convergence. We then performed a time analysis to identify the most time-consuming steps in the proposed algorithm. We also briefly
CRediT authorship contribution statement
Sushen Zhang: Methodology, Software, Validation, Formal analysis, Data curation, Writing - original draft, Writing - review & editing. Ruijuan Chen: Conceptualization, Methodology, Validation, Writing - original draft, Writing - review & editing. Wenyu Du: Software, Resources. Ye Yuan: Conceptualization, Methodology, Resources, Validation, Supervision. Vassilios S. Vassiliadis: Conceptualization, Methodology, Validation, Supervision, Project administration.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
The first author would like to thank the Cambridge Overseas Trust and the China Scholarship Council for their funding of this research.
References (41)
- Differential gradient methods. J. Math. Anal. Appl. (1978).
- Generalized conjugate-gradient acceleration of nonsymmetrizable iterative methods. Linear Algebra Appl. (1980).
- Agrawal et al., 2019. Deep learning based classification for assessment of emotion recognition in speech. Available at SSRN 3356238.
- Bao et al., 2019. Computer vision and deep learning-based data anomaly detection method for structural health monitoring. Struct. Health Monitor.
- Brown and Bartholomew-Biggs, 1989. Some effective methods for unconstrained optimization based on the solution of systems of ordinary differential equations. J. Optim. Theory Appl.
- Carrara et al., 2019. Adversarial image detection in deep neural networks. Multimed. Tools Appl.
- Chen, J., Gu, Q., 2018. Padam: closing the generalization gap of adaptive gradient methods in training deep neural...
- Défossez and Bach, 2017. AdaBatch: efficient gradient aggregation rules for sequential and parallel stochastic gradient methods.
- Dembo, Eisenstat and Steihaug, 1982. Inexact Newton methods. SIAM J. Numer. Anal.
- Denevi et al., 2019. Learning-to-learn stochastic gradient descent with biased regularization.
- Du et al., 2018. Gradient descent finds global minima of deep neural networks.
- Duchi et al., 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res.
- Floudas and Pardalos, 2013. State of the art in global optimization: computational methods and applications.
- Fouskakis and Draper, 2002. Stochastic optimization: a review. Int. Stat. Rev.
- Ghorbani et al., 2019. An investigation into neural net optimization via Hessian eigenvalue density.
- Hinton et al., 2012. Neural networks for machine learning, lecture 6a: overview of mini-batch gradient descent.
- Jamil and Yang, 2013. A literature survey of benchmark functions for global optimization problems.
- Jin et al., 2019. Stochastic gradient descent escapes saddle points efficiently.
- Kingma and Ba, 2014. Adam: a method for stochastic optimization.
- Klein et al., 2009. Adaptive stochastic gradient descent optimisation for image registration. Int. J. Comput. Vis.