Variational closed-form deep neural net inference
Introduction
We introduce a novel construction which, for the first time, makes mean-field variational Bayesian (VB) inference applicable to Bayesian neural networks (BNNs). Unlike all previous VB inference methods [4], [12], [13], [16], our construction allows the variational parameters to be updated solely through closed-form rules (i.e. to the local optimum), nullifying the need to tune a suitable learning rate. We achieve such a neural network by decomposing the Rectified Linear Unit (ReLU) [7] activation function into two factors: one an approximation to the Heaviside step function, the other the identity function. Fig. 1 illustrates the idea.
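For reference, the exact factorization behind this idea can be stated as follows; the smooth approximation that replaces the Heaviside factor during inference is developed in the remainder of the paper and is not reproduced here:

$$
\mathrm{ReLU}(u) = \max(0, u) = H(u)\cdot u, \qquad
H(u) = \begin{cases} 1, & u > 0, \\ 0, & u \le 0, \end{cases}
$$

i.e. the Heaviside step gate $H(u)$ switches the identity factor $u$ on or off.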
We first evaluate our model on online learning, where data arrive in a stream whose order is determined by the environment, and the model does not get the chance to revisit a data point seen earlier. While neural nets are extremely successful in image classification, their competitive advantage emerges only after long training periods lasting hundreds of epochs. We perform experiments on two benchmark data sets on which neural nets are known to set the state of the art: MNIST and CIFAR-10. We observe that in the online learning scenario, our Bayesian construction gives the steepest learning curve compared to other BNN inference methods and to conventional nets with various activation functions and a manually tuned learning rate.
Secondly, we perform deep learning on a small number of raw eye fundus images to predict diabetic retinopathy. This is an ecologically valid setup, as it is commonplace in medical imaging to work on small sample sizes, each sample being a patient. We report that our seven-layer Bayesian deep network achieves the highest accuracy.
Lastly, we exploit the prediction variance provided by our Bayesian model to perform active learning, which is a great challenge for conventional neural nets. Using an information-theoretic active learning criterion [10], our network successfully discovers the more interesting cases (mostly the rare positives) in the early phases of learning and exhibits the steepest learning curve compared to its Bayesian and conventional competitors. Our use case here is the detection of a diabetes symptom on the eye, called exudates, from eye fundus images on the public E-ophtha data set. Using trivial thresholding and connected component analysis, we generate a large set of proposal regions, and our Bayesian deep net detects the true exudates among these proposals.
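As an illustration only (the exact acquisition rule from [10], as adapted to this model, is not reproduced in this excerpt), a Monte Carlo estimate of such an information-theoretic criterion for a Bayesian classifier could be sketched as below; the sampling interface `sample_predictive` is hypothetical.

```python
import numpy as np

def bald_score(prob_samples):
    """Monte Carlo estimate of the mutual information between the model
    parameters and the predicted label (the BALD criterion of [10]).

    prob_samples: array of shape (S, C) with S sampled predictive
    class-probability vectors for one candidate region, e.g. obtained
    by drawing S weight samples from the (variational) posterior.
    """
    eps = 1e-12
    mean_probs = prob_samples.mean(axis=0)                  # marginal predictive
    entropy_of_mean = -np.sum(mean_probs * np.log(mean_probs + eps))
    mean_of_entropies = -np.mean(
        np.sum(prob_samples * np.log(prob_samples + eps), axis=1))
    return entropy_of_mean - mean_of_entropies              # higher = more informative

# Usage sketch: score every unlabeled proposal region and query the top ones.
# scores = [bald_score(sample_predictive(region, num_draws=20)) for region in pool]
```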
Our contributions can be consolidated as follows: i) the first neural network construction that allows VB inference with closed-form update rules, removing the problematic need to tune a learning rate, ii) an illustration of effective deep learning from few samples without data augmentation, and iii) significantly steeper active learning than conventional neural nets equipped with a deterministic active learning criterion.
Section snippets
Notational conventions and definitions
We denote a probability density function with $p(\cdot)$. We use $\delta_a$ for the Kronecker delta function, which takes 1 if its argument $a$ is true and 0 otherwise. The expectation of a function $f(x)$ with respect to $p(x)$ is $\mathbb{E}_{p(x)}[f(x)] = \int f(x)\, p(x)\, dx$. We use $\langle x \rangle$ as a short-hand notation for $\mathbb{E}_{p(x)}[x]$. For a density function $Q = \prod_j q(x_j)$ composed of a set of independent variables, by $\langle \cdot \rangle_{-q(x_j)}$ we denote the expectation of the term in the argument with respect to all factors in $Q$ except $q(x_j)$. A scalar variable is
Bayesian neural networks: an overview
A neural network is composed of neurons which take an input vector $\mathbf{x}_n$, pass it through linear weights $\mathbf{w}_l$, and then apply a non-linear activation function $\sigma(\cdot)$, resulting in the activation $h_{nl} = \sigma(\mathbf{w}_l^{\top} \mathbf{x}_n)$. The outputs of the sibling neurons $r$ of layer $l$ for data point $n$, put together, form the input vector of a subsequent set of neurons. This deterministic modeling is prone to overfitting, caused by the need to fit a very large number of parameters to data. Easy fixes such as
The Variational ReLU Net
The central idea of deep learning is to devise a neuron and build a layered network of many of these neurons. Given a $D$-dimensional input $\mathbf{x}_n$ and a scalar output $y_n$, the widely adopted approach is to first pass the input through a linear weight vector $\mathbf{w}$ and apply a non-linear mapping to achieve the output activation $h_n = \sigma(\mathbf{w}^{\top} \mathbf{x}_n)$. It has been repeatedly reported in the past that the well-known non-linearities, such as softmax, tanh, and ReLU, do not allow closed-form
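As a rough sketch of the factorized forward pass this section builds on (variable names are ours, and the probabilistic treatment of the gate and the weights is developed in the full text), a single decomposed ReLU neuron can be written as:

```python
import numpy as np

def relu_neuron_decomposed(x, w):
    """Single-neuron forward pass with the ReLU written as
    (Heaviside gate) * (identity), the factorization the Variational
    ReLU Net builds on. Illustrative only."""
    u = np.dot(w, x)               # linear pre-activation w^T x
    gate = (u > 0).astype(float)   # Heaviside factor; the paper replaces this
                                   # hard step with a probabilistic approximation
    return gate * u                # identity factor: gate * u == max(0, u)
```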
Results
We focus on three setups where the conventional deep learning paradigm performs suboptimally: i) online learning, ii) learning from small sample sets, and iii) active learning. It is noteworthy that we have designed all our experiments for a comparative analysis between our model and its competitors. Our main concern is to illustrate how the Bayesian approach can cope with these conditions and how our Bayesian construction stands out among its alternatives. To quantify the pure learning
Discussion
The key advantage of our novel Bayesian neural net construction is that it lends itself to mean-field variational inference with closed-form update rules. In online learning, this property provides either comparable or better learning speed than non-probabilistic conventional neural nets. We also demonstrate that the Bayesian nature of our construction allows training deep networks on very small sample sizes, while both the conventional neural nets and alternative BNN inference methods largely
References (24)
- C. Blundell et al., Weight uncertainty in neural networks, ICML (2015)
- D. A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units...
- C. Dugas et al., Incorporating second-order functional knowledge for better option pricing, NIPS (2001)
- B. J. Frey et al., Variational learning in nonlinear Gaussian belief networks, Neural Comput. (1999)
- R. Frigola et al., Variational Gaussian process state-space models, NIPS (2014)
- S. Ghosh et al., Assumed density filtering methods for learning Bayesian neural networks, AAAI (2016)
- X. Glorot et al., Deep sparse rectifier neural networks, AISTATS (2011)
- K. He et al., Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, ICCV (2015)
- J. M. Hernández-Lobato et al., Probabilistic backpropagation for scalable learning of Bayesian neural networks, ICML (2015)
- N. Houlsby et al., Collaborative Gaussian processes for preference learning, NIPS (2012)
- T. S. Jaakkola et al., Bayesian parameter estimation via variational methods, Stat. Comput. (2000)
- D. P. Kingma et al., Auto-encoding variational Bayes, ICLR (2014)