1 Introduction

Over the past two decades, neural networks have seen an enormous rise in popularity and are currently being used in almost every area of science and industry. In light of this widespread usage, it has become increasingly clear that trustworthy uncertainty estimates are essential (Gal 2016).

Fig. 1

A comparison of the confidence intervals of our likelihood-ratio approach (DeepLR), an ensembling approach, and MC-Dropout on the two-moon data set. The colorbar represents the width of 95\(\%\) confidence intervals, where yellow indicates greater uncertainty. The orange and the blue circles indicate the location of the data points of the two different classes. Crucially, DeepLR presents high levels of uncertainty in regions far away from the data, generating confidence intervals of [0.00, 1.00], unlike the other methods that display extreme certainty in those regions (Color figure online)

Many uncertainty estimation methods have been developed using Bayesian techniques (Neal 2011; MacKay 1992; Gal 2016), ensembling techniques (Heskes 1997; Lakshminarayanan et al. 2017), or applications of frequentist techniques such as the delta method (Kallus and McInerney 2022).

Many of the resulting confidence intervals (for the frequentist methods) and credible regions (for the Bayesian methods) have two common issues. Firstly, most methods result in symmetric intervals around the prediction, which can be overly restrictive and can lead to very low coverage in biased regions (Sluijterman et al. 2022). Secondly, most methods rely heavily on asymptotic theorems (such as the central limit theorem or the Bernstein-von Mises theorem) and can therefore only be trusted in the asymptotic regime where we have many more data points than model parameters, the exact opposite of the scenario in which we typically find ourselves in machine learning.

Contribution

In this paper, we demonstrate how the likelihood-ratio test can be leveraged to combat the two previously mentioned issues. We provide a first implementation of a likelihood-ratio-based approach, called DeepLR, which can produce asymmetric intervals and is more appropriately justified in scenarios where we have more parameters than data points. Furthermore, these intervals exhibit desirable behavior in regions far removed from the data, as evidenced in Fig. 1.

Organisation This paper is structured into four sections, with this introduction being the first. Section 2 explains our method in detail and also contains the related work section which is simultaneously used to highlight the advantages and disadvantages of our method. Section 3 presents experimental results that illustrate the desirable properties of a likelihood-ratio-based approach. Finally, Sect. 4 summarizes and discusses the results and outlines possible directions for future work.

2 DeepLR: deep likelihood-ratio-based confidence intervals

In this section, we present our method, named DeepLR, for constructing confidence intervals for neural networks using the likelihood-ratio test. We first formalize the problem that we are considering in Sect. 2.1. Section 2.2 explains the general idea behind constructing a confidence interval via the likelihood-ratio test. Subsequently, in Sect. 2.3, we outline the high level idea for translating this general procedure to neural networks. The details regarding the distribution and the calculation of the test statistic are provided in Sects. 2.4 and 2.5. Finally, in Sect. 2.6, we compare our method to related work while simultaneously highlighting its strengths and limitations.

2.1 Problem formulation

We consider a data set \(\mathcal {D} = \{(X_{1}, Y_{1}), \ldots , (X_{n}, Y_{n}) \}\), consisting of n independent observations of the random variable pair (X, Y). We consider networks that provide an estimate for the conditional density of \(Y \mid X\). This is achieved by assuming a distribution and having the network output the parameter(s) of that distribution.

Three well-known types of networks that fall in this class are: (1) A regression setting where the network outputs a mean estimate and is trained using a mean-squared error loss. This is equivalent to assuming a normal distribution with homoscedastic noise. (2) Alternatively, the network could output both a mean and a variance estimate and be optimized by minimizing the negative loglikelihood assuming a normal distribution. (3) In a classification setting, the network could output logits that are transformed to class probabilities while assuming a categorical distribution.
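To make the second type concrete, the following is a minimal sketch, assuming TensorFlow/Keras as the framework, of a network that outputs a mean and a log-variance and is trained by minimizing the Gaussian negative loglikelihood. The architecture and names are illustrative and not taken from our implementation.

```python
import tensorflow as tf

def gaussian_nll(y_true, y_pred):
    # y_pred[:, 0] is the predicted mean, y_pred[:, 1] the predicted log-variance.
    mu, log_var = y_pred[:, 0], y_pred[:, 1]
    y = tf.reshape(y_true, [-1])
    return 0.5 * (log_var + tf.square(y - mu) / tf.exp(log_var))

inputs = tf.keras.Input(shape=(1,))
hidden = tf.keras.layers.Dense(32, activation="elu")(inputs)
outputs = tf.keras.layers.Dense(2)(hidden)  # [mean, log-variance]
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss=gaussian_nll)
```

Predicting a log-variance rather than a variance is a common choice because it keeps the variance positive without an explicit constraint.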

The network is parametrized by \(\theta \in \mathbb {R}^{p}\), where p is typically much larger than n. With \(\Theta\), we denote the set containing all the \(\theta\) that are reachable for a network with a specific training process. This includes choices such as training time, batch size, optimizer, and regularization techniques. With \(p_{\theta }\), we denote the predicted conditional density. Additionally, we assume that the true conditional density is given by \(p_{\theta _{0}}\) for some \(\theta _{0}\) in \(\Theta\). In other words, we assume that our model is well specified.

The objective of our method is to construct a confidence interval for one of the output nodes of the network for a specific input of interest \(X_{0}\). We denote this output of interest with \(f_{\theta _{0}}(X_{0})\) for the remainder of the paper. In the context of a regression setting, this output of interest is the true regression function value at \(X_{0}\) and in a classification setting it is the true class probability for input \(X_{0}\).

We define a \((1-\alpha ) \cdot 100\%\) confidence interval for \(f_{\theta _{0}}(X_{0})\) as an interval, \(\text {CI}(f_{\theta _{0}}(X_{0}))\), which is random since it depends on the random realization of the data, such that the probability (taken with respect to the random data set) that \(\text {CI}(f_{\theta _{0}}(X_{0}))\) contains the true value \(f_{\theta _{0}}(X_{0})\) is \((1-\alpha )\cdot 100\%\).

2.2 Confidence interval based on the likelihood ratio

We explain the general idea behind constructing a confidence interval with the likelihood-ratio by working through a well-known example. We consider n observations \(Y_{i}\) that are assumed to be normally distributed with unknown mean \(\mu\) and unknown variance \(\sigma ^{2}\). Our goal is to create a confidence interval for \(\mu\) by using the likelihood-ratio test.

The duality between a confidence interval and hypothesis testing states that we can create a \((1-\alpha )\cdot 100\%\) confidence interval for \(\mu\) by including all the values c for which the hypothesis \(\mu =c\) cannot be rejected at a \((1-\alpha )\cdot 100\%\) confidence level. We must therefore test for what values c we can accept the hypothesis \(\mu =c\).

The general approach to test a hypothesis is to create a test statistic of which we know the distribution under the null hypothesis and to reject this hypothesis if the probability of finding the observed test statistic or a more extreme value is smaller than \(\alpha\).

As our test statistic, we take two times the log of the likelihood ratio:

$$T(c):= 2\left( \sup _{\Theta }\left( \sum _{i=1}^{n}\log (L(Y_{i};\theta ))\right) - \sup _{\Theta _{0}}\left( \sum _{i=1}^{n}\log (L(Y_{i}; \theta ))\right) \right) ,$$

where L denotes the likelihood function \(\theta \mapsto L(Y_{i};\theta )\), \(\Theta\) is the full parameter space and \(\Theta _{0}\) the restricted parameter space. In our example, we have

$$\Theta = \{(\mu , \sigma ^{2}) \mid \mu \in \mathbb {R}, \sigma ^{2} \in \mathbb {R}_{>0}\},$$

and

$$\Theta _{0} = \{(c, \sigma ^{2}) \mid \sigma ^{2} \in \mathbb {R}_{>0}\}.$$

Wilks (1938) proved that T(c) weakly converges to a \(\chi ^{2}(1)\) distribution under the null hypothesis that \(\mu =c\). We therefore reject if \(T(c) > \chi ^{2}_{1-\alpha }(1)\) and our confidence interval for \(\mu\) becomes the set \(\{c \mid T(c) \le \chi ^{2}_{1-\alpha }(1)\}\), where \(\chi ^{2}_{1-\alpha }(1)\) is the \((1-\alpha )\)-quantile of a \(\chi ^{2}(1)\) distribution.

In our example, this results in the well-known interval

$$\bar{Y} \pm z_{1-\alpha /2}\sqrt{ \frac{1}{n} \frac{1}{n-1}\sum _{i=1}^{n}(Y_{i} - \bar{Y})^{2}},$$

where \(z_{1-\alpha /2}\) is the \((1-\alpha /2)\)-quantile of a standard-normal distribution.
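As a concrete check, a short numerical sketch (ours, not part of the original example) confirms that scanning values of c against the \(\chi^{2}\) threshold recovers this interval up to the asymptotic approximation. Profiling out \(\sigma^{2}\) gives the closed form \(T(c) = n \log \big ( \sum _{i}(Y_{i}-c)^{2} / \sum _{i}(Y_{i}-\bar{Y})^{2} \big )\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=50)
n, ybar = len(y), y.mean()

def T(c):
    # Profile loglikelihood: for fixed mu=c the MLE of sigma^2 is mean((y-c)^2),
    # which yields T(c) = n * log( sum (y-c)^2 / sum (y-ybar)^2 ).
    return n * np.log(np.sum((y - c) ** 2) / np.sum((y - ybar) ** 2))

grid = np.linspace(ybar - 2, ybar + 2, 10_001)
accepted = grid[np.array([T(c) for c in grid]) <= stats.chi2.ppf(0.95, df=1)]
print("LR interval:  ", accepted.min(), accepted.max())

se = np.sqrt(np.sum((y - ybar) ** 2) / (n - 1) / n)
z = stats.norm.ppf(0.975)
print("Wald interval:", ybar - z * se, ybar + z * se)
```

For moderate n the two intervals agree closely; for small n they start to differ, which is precisely the regime where the likelihood-ratio construction is preferable.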

2.3 High-level idea of DeepLR

Our goal is to apply the likelihood-ratio testing procedure outlined in the previous subsection to construct a confidence interval for \(f_{\theta _{0}}(X_{0})\), the value of one of the output nodes given input \(X_{0}\). We create this confidence interval by including all the values c for which the hypothesis \(f_{\theta _{0}}(X_{0})=c\) cannot be rejected. The testing of the hypothesis is done with the likelihood-ratio test. Specifically, we use two times the log likelihood ratio as our test statistic:

$$\begin{aligned} T(c)&:= 2\bigg (\sup _{\Theta }\big (\sum _{i=1}^{n}\log (L(X_{i}, Y_{i};\theta ))\big ) - \sup _{\Theta _{0}(c)}\big (\sum _{i=1}^{n}\log (L(X_{i}, Y_{i};\theta ))\big )\bigg ) \nonumber \\&= 2\bigg (\sup _{\Theta }\big (\sum _{i=1}^{n}\log (p_{\theta }(Y_{i} \mid X_{i}))\big ) - \sup _{\Theta _{0}(c)}\big (\sum _{i=1}^{n}\log (p_{\theta }(Y_{i} \mid X_{i}))\big )\bigg ), \end{aligned}$$
(1)

and we construct a confidence interval for \(f_{\theta _{0}}(X_{0})\) by including all values c for which the test statistic is not larger than \(\chi ^{2}_{1-\alpha }(1)\):

$$\begin{aligned} \text {CI}(f_{\theta _{0}}(X_{0})) = \{c \mid T(c) \le \chi ^{2}_{1-\alpha }(1)\}. \end{aligned}$$
(2)

Here, we consider \(\Theta \subset \mathbb {R}^{p}\) to be the set containing all reachable parameters, and \(\Theta _{0}(c) =\{\theta \in \Theta \mid f_{\theta }(X_{0}) = c\}\). The set \(\Theta\) is explicitly not equal to all parameter combinations. Due to explicit regularization (e.g., early stopping) and implicit regularization (e.g., lazy training; Chizat et al. 2019), not all parameter combinations can be reached. The set \(\Theta\) should therefore be seen as the set containing the parameters of all neural networks that can be found given the optimizer, training time, regularization techniques, and network architecture.

Intuitively, this approach answers the question: What values could the network have reached at location \(X_{0}\) while still explaining the data well? This is a sensible question for a highly flexible and typically overparameterized machine-learning approach. After training, the model ends up with a certain prediction at location \(X_{0}\). However, since the model is typically very complex, it is likely that the model could just as well have made other predictions at that location while still explaining the data well. Therefore, all those other function values should also be considered as possibilities. Inherently, all modeling choices are taken into account by asking this question. A more flexible model, for instance, is likely able to reach more values without affecting the likelihood of the training data, leading to a larger confidence interval.

The construction of the confidence interval in Eq. (2) assumes that the test statistic, T(c), has a \(\chi ^{2}(1)\) distribution. We discuss this assumption in the following subsection. The subsection thereafter describes how to calculate the test statistic.

2.4 Distribution of the test statistic

In the classical setting, Wilks (1938) proved that the likelihood-ratio test statistic asymptotically has a \(\chi ^{2}(1)\) distribution when the submodel has one degree of freedom less than the full model. We are, however, not in this classical regime. We have many more parameters than data points and therefore need a similar result for this setting.

It has been shown that the likelihood-ratio test statistic converges to a \(\chi ^{2}\) distribution for a wide range of settings, which is referred to as the Wilks phenomenon by later authors (Fan et al. 2001; Boucheron and Massart 2011). For a semi-parametric model, which more closely resembles our situation, it has been shown that the test statistic also converges in distribution to a \(\chi ^{2}\) distribution under appropriate regularity conditions (Murphy and van der Vaart 1997).

We prove a similar result for our setting in Appendix A. The theorem states that, under suitable assumptions, our test statistic has a \(\chi ^{2}(1)\) distribution. Intuitively, this results from the fact that we added a single constraint, namely that \(f_{\theta }(X_{0})=c\). We emphasize that even if the test statistic does not exactly follow a \(\chi ^{2}(1)\) distribution, the qualitative characteristics of the confidence intervals will remain evident, albeit with inaccurate coverage levels.

2.5 Calculating the test statistic

Calculating the two terms in Eq. (1) presents certain challenges. The first term, \(\sup _{\Theta }\big (\sum _{i=1}^{n}\log (p_{\theta }(Y_{i} \mid X_{i}))\big )\), is relatively straightforward. We train a network that maximizes the likelihood, which gives the conditional densities \(p_{\hat{\theta }}(Y_{i} \mid X_{i})\).

The second term is substantially more complex. Ideally, we would optimize over the set \(\Theta _{0}(c) = \{\theta \in \Theta \mid f_{\theta }(X_{0})=c\}\). This is problematic for two reasons. Firstly, it is unclear how we can add this constraint to the network and secondly, this would necessitate training our network for every distinct value c that we wish to test. Even when employing an efficient bisection algorithm, this could easily result in needing to train upwards of 10 additional networks.

We address this problem as follows. We first create a network that is perturbed in the direction of a relatively large value (\(c_{\text {max}}\)) at \(X_{0}\) and a network that is perturbed in the direction of a relatively small value \((c_{\text {min}})\) at \(X_{0}\), while maximizing the likelihood of the data. We denote the resulting network parameters with \(\hat{\theta }_{\pm }\):

$$\hat{\theta }_{+} = \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\underset{f_{\theta }(X_{0})\approx c_{\text {max}}}{\theta \in \Theta ,}} L(\mathcal {D};\theta ), \quad \text {and} \quad \hat{\theta }_{-} = \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\underset{f_{\theta }(X_{0})\approx c_{\text {min}}}{\theta \in \Theta ,}} L(\mathcal {D};\theta ).$$

Subsequently, the network that maximizes the likelihood under the constraint \(f_{\theta }(X_{0})=c\) is approximated using a linear combination. Specifically, suppose we want the network that maximizes the likelihood of the training data while passing through c at \(X_{0}\). In the case that \(c>f_{\hat{\theta }}(X_{0})\), we approximate this network by taking a linear combination of the outputs such that

$$(1-\lambda ) f_{\hat{\theta }}(X_{0})+ \lambda f_{\hat{\theta }_{+}}(X_{0}) = c,$$

and we define \(p_{c}\) as the density that we get by using the same linear combinations for the distributional parameters that are predicted by the networks parametrized by \(\hat{\theta }\) and \(\hat{\theta }_{+}\). We then approximate the second term in Eq. (1) as follows:

$$\begin{aligned} \sup _{\Theta _{0}(c)}\big (\sum _{i=1}^{n}\log (p_{\theta }(Y_{i} \mid X_{i}))\big ) \approx \sum _{i=1}^{n} \log (p_{c}(Y_{i} \mid X_{i})). \end{aligned}$$
(3)

This procedure is visualized in Fig. 2 for a regression setting. Details on the second step, finding the perturbed networks, are provided below for both a regression and a binary-classification setting.
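The following sketch shows how Eq. (3) could be evaluated in a homoscedastic regression setting. The helpers f_hat and f_plus are hypothetical stand-ins for the prediction functions of the original and positively perturbed networks; as in our experiments, the variance estimate is the mean-squared error of the residuals and is updated for every evaluation of the test statistic.

```python
import numpy as np

def log_lik(mu, y, sigma2):
    # Gaussian loglikelihood of targets y under means mu and variance sigma2.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (y - mu) ** 2 / (2 * sigma2))

def test_statistic(c, f_hat, f_pert, X, y, X0):
    # Choose lambda such that the interpolated prediction at X0 equals c.
    lam = (c - f_hat(X0)) / (f_pert(X0) - f_hat(X0))
    mu_c = (1 - lam) * f_hat(X) + lam * f_pert(X)   # interpolated means, cf. Eq. (3)
    sigma2_hat = np.mean((y - f_hat(X)) ** 2)       # residual variance, original fit
    sigma2_c = np.mean((y - mu_c) ** 2)             # re-estimated for this evaluation
    return 2 * (log_lik(f_hat(X), y, sigma2_hat) - log_lik(mu_c, y, sigma2_c))
```

For values of c below the original prediction, the negatively perturbed network takes the place of f_pert.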

Fig. 2

Illustration of the steps of our method for the positive direction in a regression setting. Steps 2, 3, and 4 are also carried out in the negative direction

Regression Suppose we completed step 1 and we have a network that maximizes the data’s likelihood. For step 2, our objective is to adjust this network such that it goes through a relatively large value at \(X_{0}\) while continuing to maximize the data’s likelihood. Moreover, we aim to achieve this in as stable a manner as possible, given that minor differences can significantly influence the test statistic, particularly for large data sets.

We accomplish this by copying the original network and training it with the objective of maintaining the original predictions, as those maximized the likelihood, while predicting \(f_{\hat{\theta }}(X_{0}) + 1\) for input \(X_{0}\). For the perturbation in the negative direction we use \(-1\). These values \(\pm 1\) are chosen assuming that the targets are normalized prior to training.

We obtain this training objective by using a modified training set, \(\tilde{\mathcal {D}}\), that is constructed by replacing the targets \(Y_{i}\) in the original training set with the predictions \(f_{\hat{\theta }}(X_{i})\) of the original network and adding the data point \((X_{0}, f_{\hat{\theta }}(X_{0}) +1)\).

The resulting problem is very imbalanced. We want the network to change the prediction at location \(X_{0}\), which is only present in the data once. This makes the training very unstable since, especially when training for a small number of epochs, it matters greatly which specific batch contains the new point. To remedy this, we use a combination of upsampling and downweighting of the new data point \((X_{0}, f_{\hat{\theta }}(X_{0}) +1)\). Merely upsampling the new data point, i.e., adding it multiple times, is undesirable as this can introduce significant biases (Van den Goorbergh et al. 2022). Hence, we add many copies of \((X_{0}, f_{\hat{\theta }}(X_{0}) + 1)\) but reweight the loss contributions of these added data points such that they have a total contribution to the loss that is equivalent to that of a single data point.

We propose to add \(2n/\text {batch size}\) extra data points such that each batch is expected to contain 2 new data points. The same training procedure is used as for the original network. We found slightly larger or smaller numbers of added data points to perform very similarly. This setting worked for a wide variety of data sets and architectures. In Appendix B, we experimentally check the distribution of the test statistic in a controlled simulation experiment.
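A sketch of this construction, with NumPy arrays and illustrative variable names of our own choosing, is given below; the commented fit call indicates how the reweighting could be passed to a Keras model via sample weights.

```python
import numpy as np

def augmented_regression_data(X, f_hat, X0, batch_size=32, direction=+1.0):
    # Replace the targets by the original network's predictions and add m
    # upsampled but downweighted copies of (X0, f_hat(X0) + direction).
    n = len(X)
    m = max(1, int(2 * n / batch_size))  # roughly 2 new points expected per batch
    X_tilde = np.concatenate([X, np.repeat(X0[None, :], m, axis=0)])
    y_tilde = np.concatenate(
        [f_hat(X), np.full(m, f_hat(X0[None, :])[0] + direction)]
    )
    # Each copy carries weight 1/m, so together they count as one data point.
    weights = np.concatenate([np.ones(n), np.full(m, 1.0 / m)])
    return X_tilde, y_tilde, weights

# The perturbed network could then be trained with, e.g., Keras:
# model_plus.fit(X_tilde, y_tilde, sample_weight=weights,
#                epochs=400, batch_size=32)
```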

Binary classification Consider a data set where the targets are either 1 or 0. Our network outputs logits, denoted with \(f_{\hat{\theta }}(X)\), that are transformed to probabilities via a sigmoid function. The procedure is nearly identical for this binary classification setting: We create a positively perturbed network, parametrized by \(\hat{\theta }_{+}\), and a negatively perturbed network, parametrized by \(\hat{\theta }_{-}\).

The only difference is in the construction of the augmented data sets. We again replace the targets \(Y_{i}\) by the predictions of the original network, \(f_{\hat{\theta }}(X_{i})\), but now we add multiple copies of the data point \((X_{0}, 1)\) for the positive direction, and multiple copies of the data point \((X_{0}, 0)\) for the negative direction.

The entire method is summarized in Algorithm 1. In summary, we want to test what values the network can reach while still explaining the data well. We do this by perturbing the network in a positive direction and a negative direction and subsequently testing which linear combinations would still explain the data reasonably well, i.e., linear combinations with a test statistic smaller than \(\chi ^{2}_{1-\alpha }(1)\).


Algorithm 1: Pseudocode for the construction of \({\text{CI}}\left( {f_{{\theta _{0} }} \left( {X_{0} } \right)} \right)\) using DeepLR
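Since Algorithm 1 appears only as a figure, the following is a minimal sketch of the overall procedure, reusing the test_statistic helper sketched above. It assumes the acceptance boundary lies between the original and the perturbed prediction; in general the linear combination may also extrapolate slightly beyond the perturbed prediction.

```python
from scipy import stats

def deep_lr_interval(f_hat, f_plus, f_minus, X, y, X0, alpha=0.05, tol=1e-3):
    # Requires test_statistic(c, f_hat, f_pert, X, y, X0) as sketched in Sect. 2.5.
    threshold = stats.chi2.ppf(1 - alpha, df=1)

    def find_boundary(f_pert):
        # Bisect between the original prediction (T ~ 0) and the perturbed one.
        lo, hi = f_hat(X0), f_pert(X0)
        while abs(hi - lo) > tol:
            mid = 0.5 * (lo + hi)
            if test_statistic(mid, f_hat, f_pert, X, y, X0) <= threshold:
                lo = mid  # still accepted: move the boundary outward
            else:
                hi = mid  # rejected: move back toward the prediction
        return lo

    return find_boundary(f_minus), find_boundary(f_plus)
```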

2.6 Related work

In this subsection, we place our work in the context of existing work while simultaneously highlighting the strengths and limitations of our approach. Our aim is not to give a complete overview of all uncertainty quantification methods, for which we refer to the various reviews and surveys on the subject (Abdar et al. 2021; Gawlikowski et al. 2022; He and Jiang 2023). Instead, we discuss several broad groups in which most methods can be categorized: ensembling methods, Bayesian methods, frequentist methods, and distance-aware methods.

Ensembling methods train multiple models and use the variance of the predictions as an estimate for model uncertainty (Heskes 1997; Lakshminarayanan et al. 2017; Zhang et al. 2017; Wenzel et al. 2020; Jain et al. 2020; Dwaracherla et al. 2022). While extremely easy to implement, they can be computationally expensive due to the need to train multiple networks. Moreover, the resulting confidence intervals can behave poorly in regions with a limited amount of data where the predictor is likely biased. Additionally, ensemble members may interpolate in a very similar manner, potentially leading to unreasonably narrow confidence intervals.

Bayesian approaches place a prior distribution on the model parameters and aim to simulate from the resulting posterior distribution given the observed data (MacKay 1992; Neal 2012; Hernández-Lobato and Adams 2015). Since this posterior is generally intractable, it is often approximated, with MC-Dropout being a notable example (Gal and Ghahramani 2016; Gal et al. 2017).

A downside is that these methods can be challenging to train, potentially resulting in a lower accuracy. Our proposed approach does not change the optimization procedure and therefore has no accuracy loss. Another downside is that, while Bayesian credible sets asymptotically become confidence sets by the Bernstein-von Mises theorem (see Van der Vaart 2000, Chapter 10), this theorem generally does not apply to neural networks, where the dimension of the parameter space typically exceeds the number of data points. Moreover, the prior distribution is often chosen out of computational convenience instead of being motivated by domain knowledge.

Distance-aware methods have a more pragmatic nature. They use the dissimilarity of a new input compared to the training data as a metric for model uncertainty (Lee et al. 2018; Van Amersfoort et al. 2020; Ren et al. 2021). As we will see in the next section, our method also exhibits this distance-aware property, albeit for a different reason: The further away from the training data, the easier it becomes for the network to change the predictions without negatively affecting the likelihood of the training data.

Frequentist methods use classical parametric statistics to obtain model uncertainty estimates. The typical approach, explained in more detail in textbooks on parametric statistics (e.g., Seber and Wild 2003), involves obtaining (an estimate of) the variance of the model parameters using asymptotic theory and then converting this variance to the variance of the model predictions using the delta method. This approach has been used by various authors to create confidence intervals for neural networks (Kallus and McInerney 2022; Nilsen et al. 2022; Deng et al. 2023; Khosravi et al. 2011).

Confidence intervals of this type are often referred to as Wald-type intervals. These intervals are necessarily symmetric. Various authors have noted that, for classical models, Wald-type intervals often behave worse than likelihood-ratio-type intervals in the low-data regime (Hall and La Scala 1990; Andersen et al. 2012; Murphy 1995; Murphy and van der Vaart 1997). Specifically, when the loglikelihood cannot be effectively approximated with a quadratic function, Wald-type intervals may behave very poorly (Pawitan 2001, Chapter 2). The significant advantage of Wald-type intervals in the classical setting is the easier computation. However, while only a single model needs to be fitted, the necessary inversion of a high-dimensional \(p \times p\) matrix and the quadratic approximation of the likelihood strongly rely on being in the asymptotic regime.

Conversely, the construction of the DeepLR confidence interval does not rely on a quadratic approximation of the likelihood, which is in general only valid asymptotically. While we still utilize asymptotic theory to determine the distribution of our test statistic (see our proof of Theorem 1 in Appendix A), we do not impose the extremely strong requirement that the second derivative of the loglikelihood must converge and be invertible. We only require this second derivative to behave nicely in a single direction, which is a much weaker requirement.

Another benefit of likelihood-ratio-based confidence intervals is that they are transformation invariant (Pawitan 2001). In other words, a different parametrization of the distribution does not alter the resulting confidence intervals. This is not the case for most Wald-type intervals or Bayesian approaches.

The main limitation of DeepLR in its current form is the computational cost. We compare the computational costs of the various approaches in Table 1. Our method comes at no extra training cost but requires two additional networks to be trained for every confidence interval. Ensembling methods also need to train multiple networks, typically from five to ten, but these networks can be reused for different confidence intervals. Frequentist methods typically come at no additional training cost but require the inversion of a \(p \times p\) matrix to construct the confidence interval. The cost of Bayesian methods varies drastically from method to method. The training process can be substantially more involved and the construction of a credible set requires multiple samples from the (approximate) posterior.

Table 1 A comparison of the computational costs of different types of methods that create confidence intervals or credible regions

In summary, a likelihood-ratio-based method is distance aware, transformation invariant, has no accuracy loss, and is capable of creating asymmetric confidence intervals. However, it comes with the downside of being computationally expensive. Producing a confidence interval for a single input requires the training of two additional networks.

3 Experimental results

In this section, we present the results of various experiments that showcase the desirable properties of DeepLR, such as its distance-aware nature and capability to create asymmetric confidence intervals. The high computational cost of our method prohibits any large-scale experiments. Nevertheless, the following experiments demonstrate the effectiveness of a likelihood-ratio-based approach.

All code used in the following experiments can be found at https://github.com/LaurensSluyterman/Likelihood_ratio_intervals

3.1 Toy examples

To start, we present two one-dimensional toy examples that effectively illustrate the behavior of our method. Specifically, these examples illustrate the capability of our method to produce asymmetric confidence intervals that expand in regions with a limited number of data points, both during interpolation and extrapolation.

Regression The data set consists of 80 realisations of the random variable pair (X, Y). Half of the x-values are sampled uniformly from the interval \([-1, -0.2]\), while the remaining half are sampled uniformly from the interval [0.2, 1]. The y-values are subsequently sampled using

$$Y \mid X = x \sim {\mathcal {N}\left( 2x^{2}, 0.1^{2}\right) }.$$

On this training set, we train a network by minimizing the mean-squared error, which is equivalent to maximizing the loglikelihood while assuming a normal distribution with homoscedastic variance. We use the mean-squared-error of the residuals as an estimate for the variance (this estimate is updated for every evaluation of the test statistic). The network is trained for 400 epochs, using a default Adam optimizer (Kingma and Ba 2014) and a batch size of 32. The network consists of 3 hidden layers with 40, 30, and 20 hidden units respectively. All layers have elu activation functions (Clevert et al. 2015) with \(l_{2}\)-regularization applied in each dense layer with a constant value of 1e-4.
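For reference, a sketch of the data generation and network described above, assuming TensorFlow/Keras as the framework and an arbitrary random seed:

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
# 80 points with a gap around zero, Y | X=x ~ N(2x^2, 0.1^2).
x = np.concatenate([rng.uniform(-1, -0.2, 40), rng.uniform(0.2, 1, 40)])
y = rng.normal(2 * x**2, 0.1)

reg = tf.keras.regularizers.l2(1e-4)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(40, activation="elu", kernel_regularizer=reg),
    tf.keras.layers.Dense(30, activation="elu", kernel_regularizer=reg),
    tf.keras.layers.Dense(20, activation="elu", kernel_regularizer=reg),
    tf.keras.layers.Dense(1, kernel_regularizer=reg),
])
model.compile(optimizer="adam", loss="mse")  # equivalent to homoscedastic Gaussian NLL
model.fit(x[:, None], y, epochs=400, batch_size=32, verbose=0)
```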

Figure 3 gives the 95\(\%\) confidence intervals for the mean predictions. Notably, in the biased region around 0, the intervals become highly asymmetric. In contrast, most other methods produce symmetric intervals around the original network. In a region with a bias, this can easily lead to intervals with very poor coverage. This is exemplified by the biased symmetric intervals generated by an ensemble consisting of 10 ensemble members (we used the ensembling strategy employed by Lakshminarayanan et al. (2017)).

In Appendix C, we demonstrate that our likelihood-ratio-based approach also works for this example with the popular XGBoost model (Chen and Guestrin 2016). While our paper focuses on applying the methodology to neural networks, this highlights that likelihood-ratio-based uncertainty quantification methods could also be developed for other types of models.

Fig. 3

This figure illustrates DeepLR for a regression problem. The blue dots indicate the locations of the training points, the dotted orange line represents the true function, and the solid green line represents the predicted regression function of the network. The shaded blue region gives the 95\(\%\) CI of the regression function. DeepLR exhibits two desirable properties when compared to an ensemble approach. Firstly, the intervals expand in regions where data is sparse. Secondly, the intervals can be asymmetric, allowing for the compensation of potential bias (Color figure online)

Binary classification The data set consists of 60 realisations of the random variable pair (X, Y), where half of the x-values are sampled uniformly from the interval [0, 0.2] and the other half are sampled uniformly from the interval [0.8, 1]. The y-values are subsequently simulated using

$$\begin{aligned} Y \mid X =x \sim \text {Ber}(p(x)), \quad \text {with }p(x)=0.5 + 0.4\cos (6x). \end{aligned}$$

On this training set, we train a fully connected network with three hidden layers consisting of 30 hidden units each with elu activation functions. The final layer outputs a logit that is transformed using a sigmoid to yield a class probability. The network is trained for 300 epochs using a binary cross-entropy loss function and the Adam optimizer with a batch size of 32.
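An analogous sketch of the data generation and network for this example, again assuming TensorFlow/Keras and an arbitrary seed:

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
# 60 points in two clusters, Y | X=x ~ Ber(0.5 + 0.4 cos(6x)).
x = np.concatenate([rng.uniform(0, 0.2, 30), rng.uniform(0.8, 1, 30)])
y = rng.binomial(1, 0.5 + 0.4 * np.cos(6 * x))

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(30, activation="elu") for _ in range(3)]
    + [tf.keras.layers.Dense(1)]  # logit; the sigmoid is applied inside the loss
)
model.compile(optimizer="adam",
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
model.fit(x[:, None], y, epochs=300, batch_size=32, verbose=0)
```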

The resulting 95\(\%\) confidence intervals of the predicted probability of class 1 are given in Fig. 4. We carried out the experiment with two amounts of regularization to illustrate how this affects the result. For comparison, we also implemented an ensembling approach and MC-Dropout (see Lakshminarayanan et al. (2017) and Gal and Ghahramani (2016) for details). We used ten ensemble members and a standard dropout rate of 0.2. All networks were trained using the same training procedure.

We observe the same desirable properties as in the regression example. The intervals get much larger in regions with a limited amount of data, also when interpolating, and can become asymmetric. Additionally, the intervals get smaller when we increase the amount of regularization. The model class becomes smaller (fewer parameters can be reached), making it more difficult for the model to change the predictions without affecting the likelihood. This, in turn, leads to smaller confidence intervals. If the model is overly regularized, it will become misspecified (\(\theta _{0} \notin \Theta\)) and the resulting confidence intervals will not be correct.

The other approaches do not share the same qualitative properties. The ensembling approach results in confidence intervals that are far too narrow. All ensemble members behave more or less the same, especially when interpolating. The MC-Dropout credible regions do not expand in the regions in between the data and also only moderately expand when extrapolating.

Fig. 4

This figure illustrates DeepLR for a binary classification problem. The blue dots represent the training data, the dotted orange line represents the true probability of class 1 and the solid green line represents the predicted probability of class 1. The shaded blue region provides the 95\(\%\) CI for the predicted probability of class 1. The top two figures (a) and (b) demonstrate the behavior of DeepLR for varying amounts of regularization. The more regularization, the smaller the class of admissible functions becomes, which naturally results in smaller intervals. Additionally, the top left figure demonstrates that the intervals expand when interpolating, a feature not shared by the dropout and ensembling approach (Color figure online)

3.2 Two-moon example

The data set consists of 80 data points, generated using the make_moons function from the scikit-learn package, which creates a binary classification problem with two interleaving half circles (Pedregosa et al. 2011).

We utilize the same network architecture as in the toy classification example. The network is trained for 500 epochs using the Adam optimizer with default learning rate and batch size of 32, while also applying \(l_2\)-regularization in each layer with a constant value of 1e-3.
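For completeness, the data could be generated as follows; the noise level and random seed are our assumptions, as they are not stated above.

```python
from sklearn.datasets import make_moons

# 80 points on two interleaving half circles; noise=0.1 is an illustrative choice.
X, y = make_moons(n_samples=80, noise=0.1, random_state=0)
```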

Figure 1 presents the 95\(\%\) confidence intervals for the predicted class probabilities. The results illustrate that DeepLR becomes extremely uncertain in regions farther from the data (i.e., the confidence interval for the class probability spans the full range of [0, 1]).

In contrast, both an ensembling approach and MC-Dropout report excessively high certainty in the upper left and lower right regions. The ensemble's behavior can be attributed to all ensemble members extrapolating in the same direction, causing them to report more or less the same class probability. For MC-Dropout, a saturated sigmoid causes the narrow credible intervals.

This comparison underscores the unique capability of DeepLR to provide more accurate uncertainty estimates in regions less well represented by data, a crucial capability in practical applications.

3.3 MNIST binary example

For a more difficult task, we train a small convolutional network on the first two classes of the MNIST data set, consisting of handwritten digits. In this binary classification task, the 0's are labeled as class 0 and the 1's as class 1.

The CNN architecture consists of two pairs of convolutional layers (with 28 filters and 3 × 3 kernels) and max-pooling layers (2 × 2 kernel), followed by a densely connected network with two hidden layers with 30 hidden units each and elu activation functions.

The network is trained for 10 epochs, using the SGD optimizer with a batch size of 32, a default learning rate, \(l_2\)-regularization with a constant value of 1e-5, and a binary cross-entropy loss function. The amounts of training time and regularization were determined from a manual grid search using an 80/20 split of the training data. For the actual experiment, the entire training set was utilized.

Figure 5 presents 95\(\%\) CIs for a number of different training points, test points, and OoD points. As shown, the OoD points have wider confidence intervals than the training and test points, reflecting greater uncertainty.

In an additional experiment, we rotated one of the test points and created confidence intervals for the rotated images. As Fig. 6 illustrates, increasing the rotation angle results in larger confidence intervals. This behavior is also seen with ensembling and MC-Dropout, but to a significantly lesser extent.

These larger intervals can be explained as follows: Our approach essentially asks the intuitive question of how much the network can change the prediction for this new input without overly affecting the likelihood of the training data. A heavily rotated image of a number 1 deviates greatly from the typical input, utilizing pixels that are almost never used by the training inputs. It is therefore relatively straightforward for the network to change the prediction of this rotated input without dramatically lowering the likelihood of the training data. This, in turn, results in very large confidence intervals, accurately indicating that it is a very unfamiliar input.

Fig. 5

95\(\%\) CIs for different training points (first column), test points (second column), and OoD points (third column). Each subcaption displays the \(95\%\) CI for the probability of class 1 made with DeepLR (LR), MC-dropout (DR), and ensembling (EN). For in-distribution points (zeros and ones), all three methods provide extremely narrow intervals. For the OoD points, our method provides much larger CIs, a property also present in the other methods, albeit to a lesser extent

Fig. 6

This figure provides 95\(\%\) CIs for different amounts of rotation. For rotations of 0 and 30 degrees, all methods produce very narrow confidence intervals, implying high certainty. At 60 degrees of rotation, DeepLR outputs high uncertainty whereas the ensembling approach and MC-dropout remain fairly certain. At 90 degrees of rotation, DeepLR outputs very high uncertainty, a confidence interval of [0.05, 1.00], a behavior also observed to a lesser extent in both ensembling and MC-dropout

3.4 CIFAR binary example

We extend the previous experiment to the CIFAR10 data set, using the first two classes, planes and cars, as a binary classification problem.

We use a CNN consisting of two pairs of convolutional layers (with 32 filters and 3 × 3 kernels) and max-pooling layers (2 × 2 kernel), followed by a densely connected network with three hidden layers with 30 hidden units each and elu activation functions.

The CNN is trained for 15 epochs using the SGD optimizer with default learning rate, a batch size of 32, and \(l_2\)-regularization with a constant value of 1e-5. The training time and regularization are determined the same way as for the MNIST experiment, using an 80/20 split of the training data.

Figure 7 presents 95\(\%\) confidence intervals for several training, test, and OoD points. Our method is uncertain for out-of-distribution inputs. However, contrary to the MNIST example, we also see that the model is uncertain for various in-distribution points. Figure 7e and g provide examples of such uncertain predictions. The exact reason why the model is uncertain for those inputs remains speculative. Possible explanations might be the open hood of the car in (e) or the large amount of blue sky in (g). An interesting avenue for future work would be to investigate what specific features cause DeepLR to output greater uncertainty.

Fig. 7

This figure gives 95\(\%\) CIs for various training points (first column), test points (second column), and OoD points (third column). Each subcaption provides the CIs for the probability of class 1 (cars) made with DeepLR (LR), MC-dropout (DR), and ensembling (EN). All methods demonstrate greater uncertainty than in the MNIST example, which is sensible as the CIFAR data set is significantly harder. Our method is very uncertain for all OoD points, which is not always the case for MC-dropout (c and i) and ensembling (i). We also observe relatively uncertain predictions by all methods for some in-distribution points, notably e and g

3.5 Brain tumor example

As we mentioned before, there are scenarios where the high computational cost can be justified. As an example, we consider the task of detecting brain tumors in MRI images. Compared to the very high cost of acquiring the MRI image, the extra computational cost of retraining the network in order to get an uncertainty estimate is justified.

As a training set, we use the Br35h data set (Hamada 2020), consisting of 1500 MRI images that contain a tumor and 1500 MRI images that do not contain a tumor. All images, see Fig. 8 for examples, are taken in the axial (horizontal) plane above the level of the eyes. As OoD inputs, we take two images from a second MRI data set (Sartaj 2020). The first OoD image is in the sagittal (vertical) plane. The second OoD image is in the axial plane but contains the eyes.

For this experiment, we use a slightly larger CNN consisting of four blocks of convolutional layers with max-pooling and batch-normalization (the first two blocks have 3 × 3 kernels with 8 filters, and the last two have 3 × 3 kernels with 16 filters), followed by three densely connected layers with 20, 10, and 5 hidden units respectively, batch-normalization, and elu activation functions.

The CNN is trained for 30 epochs using the SGD optimizer with default learning rate, a batch size of 32, and \(l_{2}\)-regularization with a constant value of 1e-5. The training time and regularization are determined in the same way as during the MNIST and CIFAR experiments. The resulting model has roughly 95\(\%\) accuracy on the unseen test set.

Contrary to the previous experiments, we keep the convolutional layers fixed during the retraining to reduce the computational cost. This illustrates that it is possible to use cheaper procedures to approximate the test statistic.

The results for six different inputs are visualized in Fig. 8. DeepLR produces extremely narrow intervals for all four in-distribution points and very wide intervals for both OoD points. While the ensembling approach also produces large intervals for the OoD points, the intervals are also wide for the test points and, to a lesser extent, for the training points. MC-dropout produces wide intervals for all inputs except for one of the two training inputs (a). Notably, the MC-dropout intervals are not wider for the OoD inputs than for the test inputs.

Fig. 8

95\(\%\) CIs for different training points (first column), test points (second column), and OoD points (third column). Each subcaption displays the \(95\%\) CI for the probability of class 1 (a tumor) made with DeepLR (LR), MC-dropout (DR), and ensembling (EN). DeepLR produces very narrow CIs for both the two training and the two test inputs. For the OoD inputs, the intervals become substantially wider. MC-dropout produces wide intervals for all three types of inputs. Ensembling produces very narrow intervals for both training inputs but a relatively large interval for one of the test inputs (b). Both OoD inputs have a very wide CI

3.6 Adversarial example

In addition to the previous experiments, we briefly tested how the method deals with adversarial examples (Goodfellow et al. 2014). Adversarial examples are modified inputs that are specifically designed to mislead the model, typically by adding small perturbations to the input. While virtually imperceptible to humans, these perturbations can dramatically alter the model’s prediction.

We constructed adversarial versions of the two most confident test inputs in Fig. 7 using the FGSM method (Goodfellow et al. 2014), which works by using the gradients of the network's loss function with respect to the input to create an input that maximizes the loss.
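A minimal sketch of this perturbation, assuming a TensorFlow model that outputs logits as in the CIFAR experiment; the step size epsilon is an illustrative choice.

```python
import tensorflow as tf

def fgsm(model, x, y, epsilon=0.01):
    # Fast gradient sign method: step in the direction of the sign of the
    # gradient of the loss with respect to the input.
    loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, model(x))
    grad = tape.gradient(loss, x)
    # In practice one would also clip the result to the valid pixel range.
    return x + epsilon * tf.sign(grad)
```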

As illustrated in Fig. 9, DeepLR demonstrates a higher uncertainty for the two adversarial inputs. This effect can be explained as follows. Although nearly indistinguishable to a human observer, an adversarial example is significantly different to a neural network. The difference was so large that both adversarial examples were wrongly classified by the network. Since these adversarial examples differ significantly from the training data, the network can more easily change the prediction at this location without changing the other predictions, and thus the likelihood of the training data, too much. This results in much larger confidence intervals.

These results provide an encouraging sign that our method may also be able to handle adversarial examples, offering the unique capability to create confidence intervals, detect OoD examples, and offer robustness against adversarial examples, all with a single method. Unfortunately, the high computational cost of the method prevents more large-scale comparisons with other methods to more firmly establish this potential.

Fig. 9

Adversarial inputs generated by the FGSM method result in more uncertain predictions. The same network was used as for the CIFAR example illustrated in Fig. 7. The intuition behind the larger CIs is as follows. While appearing very similar to humans, the adversarial examples are significantly different to a neural network. This allows the network to change the prediction for the adversarial example without changing the predictions of the training data, resulting in larger confidence intervals

4 Discussion and conclusion

In this paper, we demonstrated the potential of a likelihood-ratio-based uncertainty estimate for neural networks. This approach is capable of producing asymmetric confidence intervals and is better motivated in cases where we have fewer data points than parameters, i.e., in most deep learning applications.

The experimental results verify the theoretical advantages of a likelihood-ratio-based approach. The intervals are larger in regions with fewer data points, can become asymmetric in biased regions, and grow larger for OoD and adversarial inputs. However, since the method is not specifically designed for OoD detection or robustness against adversarial attacks, we do not claim it to be competitive in this regard with tailor-made alternatives.

While we made an effort to reduce it, our method still has some variance and can produce slightly different intervals upon repetition due to the randomness of the optimization procedure. This effect is greater for larger data sets where small differences can have a large effect on the test statistic.

Furthermore, it is essential that the model is well specified. This is the case for any model and is not specific to our method. If the true density \(p_{\theta _{0}}\) cannot be reached, the resulting confidence intervals will surely be wrong. A model can be misspecified if it is overly regularized or if incorrect distributional assumptions are made (e.g., incorrectly assuming Gaussian noise).

In its current form, the high computational cost makes DeepLR unsuitable for many deep learning applications. A self-driving car that is approaching an intersection cannot stop and wait for an hour until it has an uncertainty estimate. Nevertheless, a trustworthy uncertainty estimate may be critical in certain situations, or only a limited number of confidence intervals may be required. For instance, for medical applications, the extra computational time may be worthwhile. Alternatively, for some applications within astrophysics, only very few confidence intervals may be needed. If only a single interval is needed, for instance, our method is cheaper than an ensemble.

4.1 Future work

Overall, our findings highlight the potential of a likelihood-ratio-based approach as a new branch of uncertainty estimation methods. We hope that our work will inspire further research in this direction. Several areas for improvement include:

  • Reduced computational cost: A clear limitation of the current implementation is the cost. For every confidence interval, we need to train two additional networks. We acknowledge that this is infeasible for many, although not all, applications and we hope that the proof of concept in this paper motivates further research in this direction that may result in reduced computational cost.

  • Improved approximation of the test statistic: The calculation of the test statistic uses an approximation for the second term in Eq. (1). Further research could focus on finding better and possibly cheaper approximations. It may also be worthwhile to investigate the use of a Bartlett correction. There are multiple ways to approximate the test statistic and we do not claim that our proposal is the optimal one. In fact, in this paper we use three slightly different approaches. For the brain-tumor example, we only retrained parts of the network. For the XGBoost example in Appendix C, we simply retrained the entire model from scratch, and for the other examples, we perturbed the networks starting from the original one. This new branch of uncertainty-estimation methods should focus on improving the approximation of the test statistic.

  • Application to other machine-learning models: We applied this approach to neural networks. However, the methodology should also be applicable to other models, for example random forests. Especially for models that are relatively cheap to train, this approach could be very promising. In Appendix C, we demonstrated that a similar approach is also viable for the XGBoost model.

  • Extension beyond binary classification: We currently approximate the test statistic by retraining the network two times and using a linear combination of the original and the perturbed network. This does not easily extend to multiple classes. In that case, we would need to retrain the network in many more directions and we would no longer have a one-dimensional interpolation problem.

  • Development of the theory on the distribution of the test statistic: It would be interesting to develop the theory surrounding the distribution of the test statistic in greater generality, possibly also when explicitly considering a regularization term. We constructed a reparametrization that showed that, under some assumptions, the test statistic has a \(\chi ^{2}(1)\) distribution. It would be interesting to study these assumptions further.

  • A better understanding of what causes DeepLR to become uncertain: We saw, for example, that various planes and cars had rather large accompanying confidence intervals. It would be interesting to study what causes certain inputs to be more uncertain than others.