
Pattern Recognition Letters

Volume 112, 1 September 2018, Pages 145-151

Variational closed-form deep neural net inference

https://doi.org/10.1016/j.patrec.2018.07.001

Highlights

  • A novel Bayesian neural net construction allowing closed-form variational inference.

  • Closed-form updates are made tractable by decomposing ReLU into two components.

  • The resulting inference scheme converges quickly and is well suited to online learning.

  • State-of-the-art learning curve when applied to Bayesian active learning.

  • Outperforms deterministic neural nets in scarce data regimes.

Abstract

We introduce a Bayesian construction for deep neural networks that is amenable to mean field variational inference operating solely by closed-form update rules; hence, no learning rate needs to be tuned manually. We show that this property makes it possible to perform effective deep learning with our model in three setups where conventional neural nets are known to perform suboptimally: i) online learning, ii) learning from small data, and iii) active learning. We compare our approach to earlier Bayesian neural network inference techniques, spanning from expectation propagation to gradient-based variational Bayes, as well as to deterministic neural nets with various activation functions. We observe that our approach improves on all these alternatives on two mainstream vision benchmarks and two medical data sets: diabetic retinopathy screening and exudate detection from eye fundus images.

Introduction

We introduce a novel construction which for the first time makes mean field variational Bayesian (VB) inference applicable to Bayesian neural networks (BNNs). Unlike all previous VB inference methods [4], [12], [13], [16], our construction allows the variational parameters to be updated by closed-form rules only (i.e. directly to their local optimum), removing the need to tune a learning rate. We achieve such a neural network by decomposing the Rectified Linear Unit (ReLU) [7] activation function into two factors: an approximation to the Heaviside step function and the identity function. Fig. 1 illustrates the idea.
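Written out, the decomposition reads as follows; the smooth step $s(\cdot)$ is a placeholder for the approximation to the Heaviside function, whose exact form is not reproduced in this excerpt:

    \mathrm{relu}(x) \;=\; \max(0, x)
      \;=\; \underbrace{H(x)}_{\text{step factor}} \cdot \underbrace{x}_{\text{identity factor}}
      \;\approx\; s(x)\, x,
    \qquad
    H(x) \;=\; \begin{cases} 1, & x > 0,\\ 0, & x \le 0. \end{cases}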

We first evaluate our model on online learning, where data arrive as a stream in an externally determined order and the model does not have the chance to revisit a data point seen earlier. While neural nets are extremely successful in image classification, their competitive advantage comes only after long training periods lasting hundreds of epochs. We perform experiments on two benchmark data sets on which neural nets are known to set the state of the art: MNIST and CIFAR-10. We observe that in the online learning scenario, our Bayesian construction gives the steepest learning curve compared to other BNN inference methods and to conventional nets with various activation functions and a manually tuned learning rate.
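For concreteness, the online protocol can be sketched as below. This is a minimal illustration with hypothetical predict/update method names, not the actual experimental code: each sample is predicted on before it is learned from, and is never revisited.

    # Minimal single-pass (prequential) online-learning loop:
    # predict on each incoming sample first, then update on it, never revisit it.
    def online_learning_curve(model, stream):
        correct, seen, curve = 0, 0, []
        for x, y in stream:                           # samples arrive in an externally fixed order
            correct += int(model.predict(x) == y)     # hypothetical predict() interface
            seen += 1
            curve.append(correct / seen)              # running accuracy traces the learning curve
            model.update(x, y)                        # one update per sample; the sample is then discarded
        return curve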

Secondly, we perform deep learning on a small number of raw eye fundus images to predict diabetic retinopathy. This is an ecologically valid setup, as working with small sample sizes, each sample being a patient, is commonplace in medical imaging. We report that our seven-layer Bayesian deep network achieves the highest accuracy.

Lastly, we exploit the prediction variance provided by our Bayesian model to perform active learning, which is a great challenge for conventional neural nets. Using an information theoretic active learning criterion [10], our network successfully discovers more interesting cases (mostly the rare positives) in the early phases of learning and exhibits the steepest learning curve compared to its Bayesian and conventional competitors. Our use case here is detection of a diabetes symptom on the eye, called exudates, from eye fundus images on the public E-ophtha data set. Using trivial thresholding and connected component analysis, we generate a large set of proposal regions and our Bayesian deep net detects the true exudates from these proposals.
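The acquisition step can be illustrated with the standard Monte Carlo estimate of the information-theoretic criterion of [10]; the function name and array shapes below are our own illustrative assumptions, not the paper's code.

    import numpy as np

    def information_gain_scores(prob_samples, eps=1e-12):
        """Mutual-information acquisition score in the spirit of [10].

        prob_samples: (S, N, C) array of S Monte Carlo draws of the predictive
        class probabilities for N candidate proposal regions and C classes.
        Returns one score per candidate; higher means more informative to label.
        """
        mean_probs = prob_samples.mean(axis=0)                                           # (N, C) posterior predictive
        entropy_of_mean = -(mean_probs * np.log(mean_probs + eps)).sum(-1)               # H[ E_theta p(y|x) ]
        mean_of_entropy = -(prob_samples * np.log(prob_samples + eps)).sum(-1).mean(0)   # E_theta H[ p(y|x) ]
        return entropy_of_mean - mean_of_entropy

    # e.g. query the most informative proposals next:
    # next_idx = np.argsort(-information_gain_scores(prob_samples))[:batch_size]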

We can consolidate our contributions as follows: i) the first neural network construction that allows VB with closed-form update rules, eliminating the problematic need to tune a learning rate, ii) a demonstration of effective deep learning from few samples without data augmentation, and iii) significantly steeper active learning than conventional neural nets equipped with a deterministic active learning criterion.

Section snippets

Notational conventions and definitions

We denote a probability density function by $p(\cdot)$. We use $\delta_a$ for the Kronecker delta, which equals 1 if its argument $a$ is true and 0 otherwise. The expectation of a function $f(x)$ with respect to $p(x)$ is $\mathbb{E}_{p(x)}[f(x)]$, and we use $\langle x \rangle$ as shorthand for $\mathbb{E}_{p(x)}[x]$. For a density composed of independent factors $Q = q(x_1)\, q(x_2) \cdots q(x_P)$, we write $\mathbb{E}_{Q \setminus q(x_j)}[\cdot]$ for the expectation of the argument with respect to all factors in $Q$ except $q(x_j)$. A scalar variable is …
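Spelled out as an integral, this leave-one-out expectation (our restatement of the notation above) is

    \mathbb{E}_{Q \setminus q(x_j)}\big[\, g(x_1, \ldots, x_P) \,\big]
      \;=\; \int g(x_1, \ldots, x_P) \prod_{i \neq j} q(x_i)\, \mathrm{d}x_{\neg j},

where $x_{\neg j}$ collects all variables except $x_j$; this is the quantity appearing in standard mean field updates of the form $q(x_j) \propto \exp\{\mathbb{E}_{Q \setminus q(x_j)}[\log p(\cdot)]\}$.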

Bayesian neural networks: an overview

A neural network is composed of neurons which take an input vector $h_n^l$, pass it through linear weights $w_r^l$, and then apply a non-linear activation function $\sigma(\cdot)$, resulting in $h_{rn}^{l+1} = \sigma\!\left({w_r^l}^{\top} h_n^l\right)$. The outputs of the sibling neurons $r$ of layer $l$ for data point $n$, put together as $h_n^{l+1} = [h_{1n}^{l+1}, \ldots, h_{Rn}^{l+1}]$, form the input vector of a subsequent set of neurons. This deterministic modeling is prone to overfitting caused by the need to fit a very large number of parameters to data. Easy fixes such as …
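As a point of reference, the deterministic layer described above amounts to the following computation; this is a minimal sketch and the variable names are ours.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def layer_forward(h, W):
        """One deterministic layer: h_{rn}^{l+1} = relu(w_r^T h_n^l) for every neuron r.

        h: (D,) input vector of one data point at layer l.
        W: (R, D) matrix whose rows are the weight vectors w_r of the R sibling neurons.
        """
        f = W @ h          # pre-activations f_r = w_r^T h
        return relu(f)     # element-wise non-linearity gives the next layer's input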

The Variational ReLU Net

The central idea of deep learning is to devise a neuron and build a layered network of many of these neurons. Given a $D$-dimensional input $x_n$ and a scalar output $f_{rn}^l$, the widely adopted approach is to first pass the input through a linear weight vector, $f_{rn}^l = {w_r^l}^{\top} h_n^l$, and then apply a non-linear mapping $h_{rn}^{l+1} = \sigma(f_{rn}^l)$ to obtain the output activation $h_{rn}^{l+1}$. It has been repeatedly reported in the past that the well-known non-linearities, such as softmax, tanh, and ReLU, do not allow closed-form …
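The decomposition from the introduction then acts inside each layer as sketched below; the steep sigmoid standing in for the Heaviside step is our assumption for illustration, not necessarily the approximation used in the paper.

    import numpy as np

    def soft_step(f, c=10.0):
        # Smooth stand-in for the Heaviside step H(f); exact ReLU is recovered as c -> infinity.
        return 1.0 / (1.0 + np.exp(-c * f))

    def decomposed_relu_layer(h, W):
        """Same layer as before, with relu(f) factored into (step factor) * (identity factor)."""
        f = W @ h                  # f_{rn}^l = w_r^T h_n^l
        return soft_step(f) * f    # h_{rn}^{l+1} = sigma(f) ~= s(f) * f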

Results

We focus on three setups where the conventional deep learning paradigm performs suboptimally: i) online learning, ii) learning from small sample sets, and iii) active learning. It is noteworthy that we have designed all our experiments for a comparative analysis between our model and its competitors. Our main concern is to illustrate how the Bayesian approach copes with these conditions and how our Bayesian construction stands out among its alternatives. To quantify the pure learning …

Discussion

The key advantage of our novel Bayesian neural net construction is that it admits mean field variational inference with closed-form update rules. In online learning, this property provides learning speed that is comparable to or better than that of non-probabilistic conventional neural nets. We also demonstrate that the Bayesian nature of our construction allows training deep networks on very small sample sizes, while both conventional neural nets and alternative BNN inference methods largely …

References (24)

  • C. Blundell et al., Weight uncertainty in neural networks, ICML (2015)
  • D. A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units...
  • C. Dugas et al., Incorporating second-order functional knowledge for better option pricing, NIPS (2001)
  • B. Frey et al., Variational learning in nonlinear Gaussian belief networks, Neural Comput. (1999)
  • R. Frigola et al., Variational Gaussian process state-space models, NIPS (2014)
  • S. Ghosh et al., Assumed density filtering methods for learning Bayesian neural networks, AAAI (2016)
  • X. Glorot et al., Deep sparse rectifier neural networks, AISTATS (2011)
  • K. He et al., Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, ICCV (2015)
  • J. Hernández-Lobato et al., Probabilistic backpropagation for scalable learning of Bayesian neural networks, ICML (2015)
  • N. Houlsby et al., Collaborative Gaussian processes for preference learning, NIPS (2012)
  • T. Jaakkola et al., Bayesian parameter estimation via variational methods, Stat. Comput. (2000)
  • D. Kingma et al., Auto-encoding variational Bayes, ICLR (2014)