A Hessian-Free Gradient Flow (HFGF) method for the optimisation of deep learning neural networks

https://doi.org/10.1016/j.compchemeng.2020.107008

Highlights

  • Design of a novel, very large-scale unconstrained local optimisation algorithm.

  • Gradient Flow based Quasi-Newton algorithm with full convergence proof.

  • Hessian-free approach, tested on nonlinear problems of up to 1 million variables.

  • Designed specifically for training very large-scale neural networks and datasets.

  • Tested successfully on industrial chemical and computer vision datasets.

Abstract

This paper presents a novel optimisation method, termed Hessian-free Gradient Flow (HFGF), for the optimisation of deep neural networks. The algorithm combines design characteristics of the Truncated Newton, Conjugate Gradient and Gradient Flow methods. It employs a finite difference approximation scheme to make the algorithm Hessian-free, and uses the Armijo condition to enforce descent. The method is first tested on standard testing functions of high dimensionality. Performance on these functions demonstrates the potential of the algorithm for large-scale optimisation problems. The algorithm is then tested on classification and regression tasks using real-world datasets. Comparable performance to conventional optimisers is obtained in both cases.

Introduction

Deep neural networks (DNNs) have wide applications in many research fields, including autonomous driving Sallab et al. (2017), speech recognition Agrawal et al. (2019); Krishna et al. (2019), computer vision Bao et al. (2019), natural language processing Young et al. (2018), and bioinformatics Min et al. (2017). The performance of deep neural networks depends strongly on the training process, which has been the focus of much recent research Yan et al. (2019); Xu et al. (2019). Training a neural network is essentially the optimisation of a complicated, non-convex loss function with respect to its parameters. Due to the large dimensionality of the input data and the complicated functional forms of DNNs, optimising DNNs to high precision and at acceptable computational speed poses a serious challenge in modern applications, because demands on both model dimensionality (the number of inputs and outputs) and the internal complexity of the models continue to grow.

The optimisation methods for neural networks can be divided into two categories: stochastic methods and deterministic methods Le et al. (2011). Stochastic methods have been heavily adopted in industrial applications because of their lower computational cost and easy implementation Fouskakis and Draper (2002). Research into stochastic methods has remained active since their inception six decades ago Robbins and Monro (1951); Jin et al. (2019); Wen et al. (2019); Denevi et al. (2019). From the most basic Stochastic Gradient Descent (SGD), a multitude of methods have been developed, including SGD with momentum Ruder (2016) and SGD with adaptive learning rates Klein et al. (2009). Stochastic optimisation methods are widely used in the hope of locating the global minimum of the loss with respect to the network's internal parameters. Since DNN objective functions are inherently non-convex, locating the global solution becomes a combinatorially hard computational problem as the dimensionality of the parameter space grows. Stochastic methods have the advantage of simple implementation. Their disadvantage is that the search direction often zig-zags and the minimum reached is often inexact. Moreover, they offer no guarantee of global optimality and cannot even certify that the point they converge to is a local minimum.

With recent developments in the computational capability of hardware, deterministic methods are rising in significance. The mainstream deterministic optimisation method used extensively in the training of DNNs is the L-BFGS method Liu and Nocedal (1989) and its variants Zhu et al. (1997), which have been widely used in current research Yatawatta et al. (2019); Carrara et al. (2019). Their disadvantage is that, when optimising a DNN's objective function, these methods cannot guarantee global optimality because of its non-convexity Floudas and Pardalos (2013). However, they can guarantee convergence to a local minimum of the least-squares fitting cost function.

Stochastic methods have been widely researched in the past. Since Robbins and Monro (1951) proved the effectiveness of SGD in theory, many variants of stochastic methods have become the centre of attention. Duchi et al. (2011) justified the convergence of adaptive learning rate methods in a convex setting. Later, Ward et al. (2018) proved that one of the adaptive learning rate methods, AdaGrad-Norm, converges to a stationary point in a non-convex setting. Optimisers belonging to the adaptive learning rate family have been extensively researched, including RMSprop Hinton et al. (2012), AdaDelta Zeiler (2012), Adam Kingma and Ba (2014), AdaFTRL Orabona and Pál (2015), SGD-BB Tan et al. (2016), AdaBatch Défossez and Bach (2017), SC-Adagrad Mukkamala and Hein (2017), AMSGRAD Reddi et al. (2016), and Padam Chen and Gu (2018). Beyond convergence results for these stochastic methods, work such as Du et al. (2018) has proved global optimality of the converged point, although only in a theoretical setting.

Research into deterministic methods has been limited by the computational cost associated with storing second-order information. Although Ghorbani et al. (2019) used the Hessian matrix of the loss function to understand the dynamics of neural network optimisation and studied its eigenvalue spectrum in depth, those results have not yet been applied in a real-world case study. Nevertheless, that work accentuated the importance of Hessian information for achieving optimality, and showed how the Hessian matrix determines the speed of convergence and the generalisation properties. This focus on the Hessian matrix has inspired research into applying second-order methods to the training of neural networks Ghorbani et al. (2019).

This paper proposes a new method that adopts approximated second-order information, following a quasi-Newton scheme, to optimise DNNs. The method is derived from a linearised version of the Gradient Flow method and uses finite differences to approximate the required Hessian information. We test the effectiveness of the optimiser on the MNIST and OILDROPLET datasets; the former adopts a deep convolutional neural network (CNN) and the latter a conventional deep neural network (DNN). The CNN architecture used is state-of-the-art and the DNN is fine-tuned, but the focus is on the optimiser's performance compared with conventional optimisers such as SGD, Adam and L-BFGS.
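
For context, the standard device that makes such a scheme Hessian-free is the forward-difference approximation of a Hessian-vector product, which requires only gradient evaluations. The generic form is given below with a small perturbation $\varepsilon > 0$; the precise discretisation used in the paper is derived in its Section 2 and may differ:

$$\nabla^2 f(x)\, v \;\approx\; \frac{\nabla f(x + \varepsilon v) - \nabla f(x)}{\varepsilon}, \qquad 0 < \varepsilon \ll 1 .$$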

In this paper, Section 2 derives the proposed quasi-Newton method named HFGF. Section 3 provides a proof of convergence for the novel method. Section 4 evaluates and analyses the performance of the method on testing functions. Section 5 applies the method to real-life datasets of MNIST and OILDROPLET, adopting optimised DNN architecture in each case.

Section snippets

Algorithm overview

We propose a method that adopts approximated second-order information to perform an optimisation task. To derive the optimisation method, we first write the general update rule as follows: $x_{k+1} = x_k + \Delta t \cdot \Delta x_k$. The gradient descent method uses the negative gradient vector as the search direction: $x_{k+1} = x_k - \Delta t \cdot \nabla f(x_k)$. The limit $\Delta t \to 0$ gives the smooth trajectory of the gradient flow method: $\frac{dx}{dt} = -\nabla f(x)$. Linearising the right-hand side about $x_k$, we obtain: $\frac{dx}{dt} \approx -\left[\nabla f(x_k) + \nabla^2 f(x_k)\,(x_{k+1} - x_k)\right]$. Rewriting the gradient vector and
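
The snippet above breaks off before the full HFGF update is stated. Purely as an illustration of a step of this general form, the sketch below combines a forward-finite-difference Hessian-vector product, a truncated conjugate-gradient solve and an Armijo backtracking line search; the function names (`loss_fn`, `grad_fn`), tolerances and fallbacks are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def hessian_free_step(loss_fn, grad_fn, x, eps=1e-6, cg_tol=1e-4,
                      cg_maxiter=50, armijo_c=1e-4, dt=1.0):
    """One illustrative Hessian-free quasi-Newton step (not the authors' code).

    Approximately solves H(x) d = -grad f(x) with conjugate gradients, where
    H(x) v is replaced by the finite difference [grad f(x + eps*v) - grad f(x)] / eps,
    then backtracks until the Armijo sufficient-decrease condition holds.
    """
    g = grad_fn(x)

    def hess_vec(v):
        # Forward-difference approximation of the Hessian-vector product.
        return (grad_fn(x + eps * v) - g) / eps

    # Truncated conjugate-gradient solve of H d = -g (inexact Newton style).
    d = np.zeros_like(x)
    r = -g.copy()                 # residual of H d = -g at d = 0
    p = r.copy()
    rs_old = r @ r
    for _ in range(cg_maxiter):
        if np.sqrt(rs_old) <= cg_tol * np.linalg.norm(g):
            break
        Hp = hess_vec(p)
        curv = p @ Hp
        if curv <= 0:             # negative curvature: stop with current iterate
            break
        alpha = rs_old / curv
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    if not np.any(d):             # fall back to steepest descent if CG made no progress
        d = -g

    # Backtracking line search enforcing the Armijo condition.
    f0, slope = loss_fn(x), g @ d
    t = dt
    while loss_fn(x + t * d) > f0 + armijo_c * t * slope and t > 1e-12:
        t *= 0.5
    return x + t * d
```

Because the Hessian only ever appears through products with a vector, the memory cost of such a step stays O(n), which is what makes the approach viable at the million-variable scale quoted in the highlights.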

Convergence analysis

Training ANNs and DNNs can be viewed as equivalent to solving a large-scale optimisation problem of the form $\min_{x} f(x)$, where $x \in \mathbb{R}^n$ is a real-valued $n$-dimensional vector of system variables to be adjusted so as to minimise the scalar function $f: \mathbb{R}^n \to \mathbb{R}$. We adopt the inexact Newton method to derive the proof of convergence.
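
For reference, the inexact Newton framework of Dembo et al. (1982) invoked here only requires the Newton system to be solved approximately, with the residual controlled by a forcing sequence $\{\eta_k\}$; the statement below is the standard condition from that literature, not a result specific to this paper:

$$\left\lVert \nabla^2 f(x_k)\, d_k + \nabla f(x_k) \right\rVert \;\le\; \eta_k \left\lVert \nabla f(x_k) \right\rVert, \qquad 0 \le \eta_k < 1 .$$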

In this paper, we assume that $f$ attains an optimal value $f(x^*)$ at $x^*$. We will use the following assumption about the objective function for the rest of this

Analysis of the algorithm

This section analyses the cost and performance of the HFGF algorithm. To test the performance, we introduce several testing functions that are highly complex and non-convex in nature; they are chosen to mimic the highly complicated landscape of neural network objective functions. The results are generated using a 2.3 GHz Intel Core i5 processor with 8 GB of 2133 MHz LPDDR3 memory.
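
The snippet does not list which testing functions were used. As a representative example of the kind of high-dimensional, non-convex benchmark catalogued in Jamil et al. (2013), a sketch of the n-dimensional Rosenbrock function is given below; it is illustrative only and not necessarily part of the paper's test set.

```python
import numpy as np

def rosenbrock(x):
    """n-dimensional Rosenbrock function: non-convex and ill-conditioned,
    with global minimum f = 0 at x = (1, ..., 1). Illustrative benchmark only."""
    x = np.asarray(x)
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

# Example: evaluate at a random point in 1,000,000 dimensions,
# matching the problem scale quoted in the highlights.
x0 = np.random.default_rng(0).standard_normal(1_000_000)
print(rosenbrock(x0))
```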

Applications

The analysis above provides a theoretical understanding of the algorithm's performance. In this section, the optimiser is applied to several test cases to determine its real-world performance. The derived algorithm is applied to the large-scale optimisation of DNNs to test its speed, robustness and accuracy. In particular, DNNs are optimised to confirm the efficacy of the optimiser on large-scale problems.
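
The experimental configuration is not reproduced in this snippet. As a hedged sketch of how the baseline optimisers named in the introduction (SGD, Adam, L-BFGS) are typically driven in a deep-learning framework, a closure-style PyTorch loop such as the one below could serve; the model, data loader and hyperparameters are placeholders, and the HFGF optimiser itself is not shown since its implementation does not appear in this snippet.

```python
import torch

def train(model, loader, loss_fn, epochs=10, use_lbfgs=False):
    """Illustrative training loop for the baseline optimisers only
    (Adam or L-BFGS); hyperparameters are placeholders, not the paper's."""
    if use_lbfgs:
        opt = torch.optim.LBFGS(model.parameters(), lr=1.0, history_size=10)
    else:
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for _ in range(epochs):
        for inputs, targets in loader:
            def closure():
                # Re-evaluate the loss and gradients for the current batch.
                opt.zero_grad()
                loss = loss_fn(model(inputs), targets)
                loss.backward()
                return loss
            if use_lbfgs:
                opt.step(closure)   # L-BFGS re-evaluates the loss internally
            else:
                closure()
                opt.step()
    return model
```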

Conclusion and future work

We have demonstrated the derivation of a novel quasi-Newton optimisation method, with a proof of convergence. The method is named Hessian-free Gradient Flow (HFGF) and has been designed for the optimisation of DNNs. We first tested the HFGF method on standard testing functions and then compared it with other common optimisation algorithms to assess its convergence. We then performed a timing analysis to identify the most time-consuming steps in the proposed algorithm. We also briefly

CRediT authorship contribution statement

Sushen Zhang: Methodology, Software, Validation, Formal analysis, Data curation, Writing - original draft, Writing - review & editing. Ruijuan Chen: Conceptualization, Methodology, Validation, Writing - original draft, Writing - review & editing. Wenyu Du: Software, Resources. Ye Yuan: Conceptualization, Methodology, Resources, Validation, Supervision. Vassilios S. Vassiliadis: Conceptualization, Methodology, Validation, Supervision, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

The first author would like to thank the Cambridge Overseas Trust and the China Scholarship Council for their funding of this research.

References (41)

  • C.A. Botsaris, Differential gradient methods, J. Math. Anal. Appl. (1978)
  • D.M. Young et al., Generalized conjugate-gradient acceleration of nonsymmetrizable iterative methods, Linear Algebra Appl. (1980)
  • A. Agrawal et al., Deep learning based classification for assessment of emotion recognition in speech, Available at SSRN 3356238 (2019)
  • Y. Bao et al., Computer vision and deep learning–based data anomaly detection method for structural health monitoring, Struct. Health Monitor. (2019)
  • A. Brown et al., Some effective methods for unconstrained optimization based on the solution of systems of ordinary differential equations, J. Opt. Theory Appl. (1989)
  • F. Carrara et al., Adversarial image detection in deep neural networks, Multimed. Tool. Appl. (2019)
  • J. Chen et al., Padam: closing the generalization gap of adaptive gradient methods in training deep neural... (2018)
  • A. Défossez et al., Adabatch: efficient gradient aggregation rules for sequential and parallel stochastic gradient methods (2017)
  • R.S. Dembo et al., Inexact Newton methods, SIAM J. Numer. Anal. (1982)
  • G. Denevi et al., Learning-to-learn stochastic gradient descent with biased regularization (2019)
  • S.S. Du et al., Gradient descent finds global minima of deep neural networks (2018)
  • J. Duchi et al., Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res. (2011)
  • C.A. Floudas et al., State of the art in global optimization: computational methods and applications (2013)
  • D. Fouskakis et al., Stochastic optimization: a review, Int. Stat. Rev. (2002)
  • B. Ghorbani et al., An investigation into neural net optimization via Hessian eigenvalue density (2019)
  • G. Hinton et al., Neural networks for machine learning, lecture 6a: overview of mini-batch gradient descent (2012)
  • M. Jamil et al., A literature survey of benchmark functions for global optimization problems (2013)
  • C. Jin et al., Stochastic gradient descent escapes saddle points efficiently (2019)
  • D.P. Kingma et al., Adam: a method for stochastic optimization (2014)
  • S. Klein et al., Adaptive stochastic gradient descent optimisation for image registration, Int. J. Comput. Vis. (2009)