1 Introduction

Bayesian statistical inference requires the evaluation of the likelihood function, which plays a central role since it states the probability of the observed data under a particular model. Bayes' theorem then allows a priori knowledge about the studied phenomenon to be incorporated into the posterior distribution. For simple models, an analytic expression for the likelihood is available, which facilitates the evidence assessment, and the posterior can be computed exactly. For complex models, however, an exact formula for the likelihood function is often intractable [13].

To deal with this intractability, likelihood-free techniques such as Approximate Bayesian Computation (ABC) have emerged. ABC-based methods evaluate an auxiliary model at different parameter values drawn from a prior distribution to generate simulations that are compared to the observed data [13]. In particular, this comparison can be performed using statistics that summarize the information carried by large sets of features and observations [5, 10]. However, selecting proper and sufficient summary statistics can be difficult for complex models, which has motivated alternative approaches that rely on kernel functions to embed and compare distributions in a Reproducing Kernel Hilbert Space (RKHS) [4, 9]. Recent advances in ABC-based inference have introduced kernel methods to obtain more accurate posterior estimations. The authors in [6] developed a surrogate model as a synthetic likelihood, within a Gaussian process-based framework, to define an adequate number of simulations in the ABC procedure. Mitrovic et al. [7] modeled the functional relationship between simulations and the optimal choice of summary statistics, using kernel ridge regression for conditional distributions to encode the structure of a generative model. Nonetheless, the techniques mentioned above require the estimation of several parameters related to the similarity computation among simulations to approximate the posterior; hence, expensive tuning procedures such as grid search and cross-validation must be carried out. Moreover, the user requires substantial knowledge of both the ABC algorithm and the studied data to tune the free parameters properly, which strongly influences the quality of the posterior approximation.

Here, we introduce an automatic version of the ABC algorithm to support Bayesian inference. Our approach, named AML-ABC, comprises a metric learning stage based on Centered Kernel Alignment (CKA) to assess the matching between similarities defined over parameters and simulations [3]. A Mahalanobis distance is learned through CKA, and a graph representation is used to highlight local dependencies in both the parameter and simulation spaces of ABC. Results on synthetic and real-world inference problems demonstrate that our automatic extension of ABC infers competitive posteriors without requiring any manual fixing of free parameters.

The remainder of this paper is organized as follows: Sect. 2 describes the mathematical background. Section 3 presents the experimental set-up and the obtained results. Finally, conclusions are outlined in Sect. 4.

2 Materials and Methods

ABC Fundamentals. In any Bayesian inference task, the central aim is the computation of a posterior \(p(\theta |y)\), using a prior distribution \(\zeta (\theta )\) and a likelihood function \(p(y|\theta )\), where \(y\in \mathcal {Y}\) stands for the observed data and \(\theta \in \Theta \) holds the model parameters. Nonetheless, when the likelihood is intractable, neither the exact nor a sampled posterior \(p(\theta |y)\propto p(y|\theta )\zeta (\theta )\) can be computed. ABC approaches aim to facilitate such inference by simulating the likelihood through a generative model represented by a conditional probability \(p(x|\theta ),\) where \(x\in \mathcal {Y}\) is a random variable standing for the simulated data [13]. Fundamentally, an ABC-based framework relies on the acceptance and rejection of simulated samples x using a distance function \(\mathrm{d}_{\mathcal {Y}}:\mathcal {Y}\times \mathcal {Y}\rightarrow \mathbb {R}^+.\) In turn, an approximate posterior can be estimated such that \(\hat{p}(\theta |y;\epsilon )\propto \hat{p}(y|\theta ;\epsilon )\zeta (\theta ),\) where \(\hat{p}(y|\theta ;\epsilon )=\int _{\mathcal {B}\left( y;\epsilon \right) }{p(x|\theta )}dx,\) \(\mathcal {B}\left( y;\epsilon \right) =\{x:\mathrm{d}_{\mathcal {Y}}(x,y)<\epsilon \},\) and \(\epsilon \in \mathbb {R}^+.\) Note that setting the value of \(\epsilon \) is a crucial step for obtaining an accurate posterior.
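To make the rejection mechanism concrete, the following minimal sketch implements plain rejection ABC for a hypothetical one-parameter Gaussian model; the simulator, the prior range, and the tolerance are illustrative assumptions and not part of the method discussed later.

```python
# Minimal rejection-ABC sketch (illustrative only): a hypothetical Gaussian
# simulator, a Euclidean distance d_Y on sample means, and a tolerance eps.
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, size=100):
    # Generative model p(x|theta): a Gaussian with unknown mean (assumption).
    return rng.normal(loc=theta, scale=1.0, size=size)

def d_Y(x, y):
    # Distance between simulated and observed data (here, on their means).
    return abs(x.mean() - y.mean())

y_obs = simulator(theta=2.0)          # pretend this is the observed data
eps = 0.05                            # acceptance tolerance
accepted = []
for _ in range(20000):
    theta = rng.uniform(-5.0, 5.0)    # draw from the prior zeta(theta)
    x = simulator(theta)
    if d_Y(x, y_obs) < eps:           # keep theta if x falls inside B(y; eps)
        accepted.append(theta)

print(f"{len(accepted)} accepted samples, posterior mean ~ {np.mean(accepted):.2f}")
```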

On the other hand, it is often difficult to apply a distance directly on \(\mathcal {Y}\) due to the large number of samples and features in real data. In such a case, some strategies use a mapping \(s=\vartheta (x)\) before calculating the distance, where \(s\in \mathcal {S}\), \(\mathcal {S}\) is a feature space, and \(\vartheta :\mathcal {Y}\rightarrow \mathcal {S}\) [5]. However, using \(\vartheta (x)\) often entails a loss of information for complex models. Hence, some ABC-based approaches approximate \(\hat{p}(y|\theta ;\epsilon )\) as the convolution of the true likelihood \(p(y|\theta )\) and a kernel function \(\kappa :\mathcal {Y}\times \mathcal {Y}\rightarrow \mathbb {R},\) which imposes a constraint on the rejection of samples via the inner product \(\kappa (x,y)=\langle \phi (x),\phi (y)\rangle _{\mathcal {H}}\) in a Reproducing Kernel Hilbert Space (RKHS) \(\mathcal {H},\) where \(\phi :\mathcal {Y}\rightarrow \mathcal {H}\) [9]. In practice, given N samples \(\{x_n\sim P_{X_n}\}^N_{n=1}\) drawn from \(p(x|\theta _n),\) with \(\theta _n\sim \zeta (\theta ),\) and the observation \(y\sim P_Y,\) a weighted sample set \(\Psi =\left\{ (\theta _n,w_n)\right\} ^N_{n=1}\) is calculated by fixing:

$$\begin{aligned} w_n=\frac{\kappa _G\left( \mathrm{d}_{\mathcal {H}}(P_{X_n},P_Y);\epsilon \right) }{\sum \nolimits ^N_{n'=1}{\kappa _G\left( \mathrm{d}_{\mathcal {H}}(P_{X_{n'}},P_Y);\epsilon \right) }}, \end{aligned}$$
(1)

where \(\kappa _G\) is a Gaussian kernel with bandwidth \(\epsilon .\) Eq. (1) is used to approximate \(p(\theta |y)\) via the posterior expectation \(\hat{p}(\theta |y)=\sum \nolimits ^N_{n=1}{w_n\kappa _G(\mathrm{d}_e(\theta ,\theta _n);\sigma _\theta )},\) where \(\mathrm{d}_e\) stands for the Euclidean distance and \(\sigma _\theta \in \mathbb {R}^+\). Moreover, \(\mathrm{d}_{\mathcal {H}}(P_{X_n},P_Y)\) is a distance between distributions.
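As an illustration of Eq. (1), the sketch below computes the soft weights and the kernel-smoothed posterior estimate for a one-dimensional parameter. Here \(\mathrm{d}_{\mathcal {H}}\) is instantiated as a Gaussian-kernel MMD estimate (one possible choice, in the spirit of K2-ABC [9]), and the parameterization of \(\kappa _G\) as \(\exp (-\mathrm{d}^2/2\epsilon ^2)\) is an assumption.

```python
# Sketch of the soft-ABC weighting in Eq. (1) for 1-D data and a 1-D parameter.
import numpy as np

def mmd2(x, y, gamma=1.0):
    # Biased squared-MMD estimate with a Gaussian RBF kernel (assumed choice for d_H).
    def k(a, b):
        return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def soft_abc_weights(sims, y_obs, eps):
    # w_n proportional to exp(-d_H(P_{X_n}, P_Y)^2 / (2 eps^2)), normalized to one.
    d = np.array([np.sqrt(max(mmd2(x, y_obs), 0.0)) for x in sims])
    w = np.exp(-0.5 * (d / eps) ** 2)
    return w / w.sum()

def posterior_density(theta_grid, thetas, w, sigma_theta):
    # hat p(theta|y) = sum_n w_n * kappa_G(d_e(theta, theta_n); sigma_theta),
    # evaluated on a grid (normalization constant omitted).
    diffs = theta_grid[:, None] - thetas[None, :]
    kde = np.exp(-0.5 * (diffs / sigma_theta) ** 2)
    return (kde * w[None, :]).sum(axis=1)
```

In this sketch, `sims` is a list of simulated data sets (one per prior candidate), `thetas` holds the corresponding candidates, and `eps` plays the role of the bandwidth whose manual choice AML-ABC seeks to avoid.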

Automatic ABC Using Metric Learning. To avoid the influence of the \(\epsilon \) value and the kernel parameters when computing the ABC-based posterior as in Eq. (1), we introduce an Automatic Metric Learning (AML) approach in the context of ABC, termed AML-ABC, to enhance and automate the inference task. The idea behind AML-ABC is to include the information contained in the candidates \(\{\theta _n\}_{n=1}^N\) to improve the comparison stage carried out over simulations and observations. Let \(\varPsi = \{(\theta _n,x_n)\}_{n=1}^N\) be the set of N candidates \(\theta _n\in \mathbb {R}^P\sim \zeta (\theta )\) and their corresponding simulations \(x_n\in \mathbb {R}^Q\sim p(x|\theta _n)\). Further, let the kernel function \(\kappa _\theta :\Theta \times \Theta \rightarrow \mathbb {R}^+\) be a similarity measure between candidates in \(\Theta \), which defines the kernel matrix \(\mathbf{K}_{{\varvec{\theta }}}\in \mathbb {R}^{N\times N}\) holding elements:

$$\begin{aligned} \kappa _\theta (\theta _n,\theta _{n'})= {\left\{ \begin{array}{ll} \exp (-\text {d}_{\Theta }^2(\theta _n,\theta _{n'})), &{} \theta _n\in \Omega _{n'} \\ 0, &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(2)

where \(\Omega _{n'}\) is the set holding the M nearest neighbors of \(\theta _{n'}\) in the sense of the distance \(\text {d}_{\Theta }:\Theta \times \Theta \rightarrow \mathbb {R}^+\). In this paper, to account for different scales among the components of \(\theta _n\), we rely on the Mahalanobis distance \(\text {d}_{\Theta }^2(\theta _n,\theta _{n'})=(\theta _n-\theta _{n'})^{\text {T}}{\varvec{\Sigma }}_{\Theta }^{-1}(\theta _n-\theta _{n'})\), where \({\varvec{\Sigma }}_{\Theta }\in \mathbb {R}^{P\times P}\) is the sample covariance matrix of \(\{\theta _n\}_{n=1}^N\). Concerning the feature space \(\mathcal {S}\), we assess the similarity via the kernel \(\kappa _s:\mathcal {S}\times \mathcal {S}\rightarrow \mathbb {R}^+\), \(\kappa _s(\vartheta (x_n),\vartheta (x_{n'}))=\exp (-\text {d}_{\mathcal {S}}^2(\vartheta (x_n),\vartheta (x_{n'})))\), to build the matrix \(\mathbf{K}_{{\varvec{s}}}\in \mathbb {R}^{N\times N},\) where \(\text {d}_{\mathcal {S}}^2:\mathcal {S}\times \mathcal {S}\rightarrow \mathbb {R}^+\) and \(\vartheta :\mathcal {Y}\rightarrow \mathcal {S}\) is a feature mapping. To perform the pairwise comparison between simulations in \(\mathcal {S}\), we use a Mahalanobis distance of the form [2]:

$$\begin{aligned} \text {d}_{\mathcal {S}}^2(\vartheta (x_n),\vartheta (x_{n'}))=(\vartheta (x_n)-\vartheta (x_{n'}))^{\text {T}}{\varvec{A}}{\varvec{A}}^{\text {T}}(\vartheta (x_n)-\vartheta (x_{n'})), \end{aligned}$$
(3)

where \({\varvec{\Sigma }}_{\mathcal {S}}^{-1}={\varvec{A}}{\varvec{A}}^{\text {T}}\) stands for the inverse covariance matrix of \(\vartheta (x_n)\in \mathbb {R}^D\) and \({\varvec{A}}\in \mathbb {R}^{D\times d}\). In this sense, we use the information concerning the similarity over candidates in \(\Theta \), represented via \(\mathbf{K}_{{\varvec{\theta }}}\), to guide the notion of similarity over simulations and the observation in \(\mathcal {S}\), represented via \(\mathbf{K}_{{\varvec{s}}}\). In particular, we use a CKA-based measure between the above kernel matrices as [3]:

$$\begin{aligned} \hat{\rho }(\mathbf{K}_{{\varvec{\theta }}},\mathbf{K}_{{\varvec{s}}})= \frac{\langle \bar{\mathbf{K}}_{{\varvec{\theta }}},\bar{\mathbf{K}}_{{\varvec{s}}}\rangle _{\text {F}}}{\sqrt{\langle \bar{\mathbf{K}}_{{\varvec{\theta }}},\bar{\mathbf{K}}_{{\varvec{\theta }}}\rangle _{\text {F}}\langle \bar{\mathbf{K}}_{{\varvec{s}}},\bar{\mathbf{K}}_{{\varvec{s}}}\rangle _{\text {F}}}}, \end{aligned}$$
(4)

where \(\bar{\mathbf{K}}\) stands for the centered kernel matrix \(\bar{\mathbf{K}} = \tilde{{\varvec{I}}}\mathbf{K}\tilde{{\varvec{I}}}\), with \(\tilde{{\varvec{I}}}= {\varvec{I}}-{\varvec{1}}{\varvec{1}}^{\top }/N\) the empirical centering matrix, \({\varvec{I}}\in \mathbb {R}^{N \times N}\) the identity matrix, and \({\varvec{1}}\in \mathbb {R}^{N}\) the all-ones vector. Moreover, the notation \(\langle \cdot ,\cdot \rangle _{\text {F}}\) represents the Frobenius inner product between matrices. In Eq. (4), \(\hat{\rho }(\cdot ,\cdot )\) is a data-driven estimator that quantifies the similarity between the parameter space and the feature space. To find the projection matrix \({\varvec{A}}\), we consider the following optimization problem:

$$\begin{aligned} \hat{{\varvec{A}}} = \text {arg}\max _{{\varvec{A}}} \log \left( \hat{\rho }(\mathbf{K}_{{\varvec{s}}}({\varvec{A}}), \mathbf{K}_{{\varvec{\theta }}}) \right) , \end{aligned}$$
(5)

where the logarithm is used for mathematical convenience. The optimization problem in Eq. (5) can be solved using a gradient descent-based approach [2]; a numerical sketch covering Eqs. (2)–(5) is given below, just before Algorithm 1. With the learned projection, we form the weighted sample set \(\Psi =\left\{ (\theta _n,w_n)\right\} ^N_{n=1}\) by fixing \(w_n=\kappa _E(z,z_n)/{\sum \nolimits ^N_{n'=1}{\kappa _E(z,z_{n'})}}\), where \(\kappa _E:\mathbb {R}^d\times \mathbb {R}^d\rightarrow \mathbb {R}\) is a similarity kernel defined as:

$$\begin{aligned} \kappa _E(z,z_n)= {\left\{ \begin{array}{ll} \exp (-||z-z_n||_2^2), &{} z_n\in \Upsilon \\ 0, &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(6)

where \(z=\vartheta (y)^{\top }{\varvec{\hat{A}}}\), \(z_n=\vartheta (x_n)^{\top }{\varvec{\hat{A}}}\), and \(\Upsilon \) is the set holding the M nearest neighbors of z among \(\{z_n\}_{n=1}^N\) in the sense of the Euclidean distance. Algorithm 1 summarizes the AML-ABC approach.
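Before the algorithm listing, the following self-contained sketch illustrates the metric-learning core of AML-ABC, i.e., Eqs. (2)–(5): the nearest-neighbor Mahalanobis kernel \(\mathbf{K}_{{\varvec{\theta }}}\), the projected kernel \(\mathbf{K}_{{\varvec{s}}}({\varvec{A}})\), the CKA estimator, and the maximization of its logarithm. The pseudo-inverse for the candidate covariance, the generic L-BFGS routine with numerical gradients (instead of the dedicated gradients of [2]), and the initialization scale are implementation assumptions.

```python
# Sketch of Eqs. (2)-(5): build K_theta, define K_s(A), compute CKA, and
# maximize log CKA over the projection A (assumptions noted in the text above).
import numpy as np
from scipy.optimize import minimize

def knn_mahalanobis_kernel(Theta, M):
    # Eq. (2): Gaussian kernel on pairwise Mahalanobis distances, keeping only
    # the M nearest neighbors of each candidate (pseudo-inverse for stability).
    S_inv = np.linalg.pinv(np.atleast_2d(np.cov(Theta, rowvar=False)))
    diff = Theta[:, None, :] - Theta[None, :, :]            # N x N x P
    D2 = np.einsum('ijp,pq,ijq->ij', diff, S_inv, diff)
    N = D2.shape[0]
    K = np.zeros((N, N))
    for j in range(N):
        nn = np.argsort(D2[:, j])[1:M + 1]                  # skip the point itself
        K[nn, j] = np.exp(-D2[nn, j])
    return K

def K_s_of_A(V, A):
    # Eq. (3): d_S^2 = ||A^T (v_n - v_n')||_2^2 for features v_n = vartheta(x_n),
    # so the Mahalanobis form reduces to a Euclidean distance after projection.
    Z = V @ A                                               # N x d
    D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-D2)

def cka(Ka, Kb):
    # Eq. (4): empirical CKA with the centering matrix I - 1 1^T / N.
    N = Ka.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    Ka, Kb = H @ Ka @ H, H @ Kb @ H
    return np.sum(Ka * Kb) / (np.sqrt(np.sum(Ka * Ka) * np.sum(Kb * Kb)) + 1e-12)

def learn_A(V, K_theta, d, seed=0):
    # Eq. (5): maximize log CKA(K_s(A), K_theta) over A in R^{D x d}.
    D = V.shape[1]
    A0 = 0.1 * np.random.default_rng(seed).standard_normal((D, d))
    def neg_log_cka(a_flat):
        rho = cka(K_s_of_A(V, a_flat.reshape(D, d)), K_theta)
        return -np.log(max(rho, 1e-12))
    res = minimize(neg_log_cka, A0.ravel(), method="L-BFGS-B")
    return res.x.reshape(D, d)
```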

[Algorithm 1: AML-ABC]
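Given the learned projection, the final weighting step of Algorithm 1 (Eq. (6)) reduces to a nearest-neighbor Gaussian weighting in the projected space; a minimal sketch follows, where the posterior mean in the closing comment is one possible summary.

```python
# Sketch of Eq. (6): project simulation features and the observation with the
# learned A_hat, keep the M nearest projected simulations, and normalize the
# Gaussian weights over them.
import numpy as np

def aml_abc_weights(V_sims, v_obs, A_hat, M):
    Z = V_sims @ A_hat                  # z_n = vartheta(x_n)^T A_hat, shape N x d
    z = v_obs @ A_hat                   # z   = vartheta(y)^T   A_hat, shape d
    d2 = ((Z - z) ** 2).sum(axis=1)     # squared Euclidean distances to z
    nn = np.argsort(d2)[:M]             # Upsilon: M nearest neighbors of z
    w = np.zeros(len(V_sims))
    w[nn] = np.exp(-d2[nn])             # kappa_E(z, z_n) for z_n in Upsilon
    return w / w.sum()

# Example posterior summary: expected parameter value under the ABC weights,
# e.g. theta_hat = (w[:, None] * Thetas).sum(axis=0).
```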

3 Experiments and Results

To test the AML-ABC performance, we consider two experiments following [9]: a toy problem concerning synthetic data from a mixture model, and a Bayesian inference problem for a real ecological dynamic system. To accomplish a fully automatic inference, we set the number of nearest neighbors M as the median of the optimal number of neighbors per point according to the Local Neighborhood Selection (LNS) algorithm introduced in [1]. Moreover, the K2-ABC method is selected as a benchmark due to its favorable performance compared to other methods [9].

For the toy problem, we analyze a mixture of uniform distributions of the form \(p(x|{\varvec{\pi }})=\sum _{c=1}^{C} \pi _c{\mathscr {U}(c-1,c)},\) where \({\varvec{\pi }}=\{\pi _c\}_{c=1}^C\) are the mixing coefficients holding \(\sum _{c=1}^{C}\pi _c=1\), and C is the number of components. Here, the aim is to estimate the posterior \(p({\varvec{\pi }}|y)\) for \(C=5\), given synthetic observations y drawn from the mixture with true parameters (target) \({\varvec{\pi ^*}}=\{0.25,0.04,0.33,0.04,0.34\}\). For concrete testing, we draw \(N=1000\) samples from a symmetric Dirichlet prior, \({\varvec{\pi }}\sim \text {Dirichlet}({\varvec{1}}),\) and then use the mixture model to form the simulated data by drawing 400 observations for each prior candidate. Moreover, we employ a histogram with 10 bins as the feature mapping in AML-ABC, and kernel widths \(\gamma =0.1\) and \(\epsilon =0.001\) in K2-ABC [9]. As a quantitative assessment, we use the Euclidean distance \(\mathcal {E}=||{\varvec{\pi ^*}}-{\varvec{\hat{\pi }}}||_2\), where \({\varvec{\hat{\pi }}}\) is the expected value of the posterior using the weights \(\{w_n\}_{n=1}^N\) obtained by each method.
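For reproducibility, a sketch of this toy setup is given below (Dirichlet prior, uniform-mixture simulator, and 10-bin histogram features); the random seed and the histogram range are illustrative assumptions.

```python
# Sketch of the toy-problem setup: Dirichlet prior over mixing coefficients,
# a 5-component uniform mixture as simulator, and a 10-bin histogram as the
# feature mapping vartheta.
import numpy as np

rng = np.random.default_rng(0)
C, N, n_obs = 5, 1000, 400
pi_true = np.array([0.25, 0.04, 0.33, 0.04, 0.34])

def sample_mixture(pi, size):
    # Draw from p(x|pi) = sum_c pi_c * U(c-1, c).
    comp = rng.choice(C, size=size, p=pi)
    return comp + rng.uniform(0.0, 1.0, size=size)

def features(x, bins=10):
    # Histogram feature mapping vartheta on [0, C], normalized to a density.
    h, _ = np.histogram(x, bins=bins, range=(0.0, C), density=True)
    return h

y = sample_mixture(pi_true, n_obs)                      # observed data
Pis = rng.dirichlet(np.ones(C), size=N)                 # prior candidates
V = np.stack([features(sample_mixture(p, n_obs)) for p in Pis])
v_obs = features(y)
# V, v_obs, and Pis then feed the K_theta / CKA / weighting steps sketched above.
```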

For the real dataset experiment, we consider the problem of inferring the dynamics of an adult blowfly population, as introduced in [14]. In particular, the population dynamics are modelled by a discretised differential equation of the form \(N_{t+1}=PN_{t-\tau }\exp (-N_{t-\tau }/N_0)e_t+N_t\exp (-\delta \epsilon _t)\), with \(N_{t+1}\) denoting the population at time \(t+1\), which is determined by the time-lagged observations \(N_t\) and \(N_{t-\tau }\), and where \(e_t\) and \(\epsilon _t\) stand for Gamma-distributed noise, \(e_t\sim \mathcal {G}(1/\sigma _p^2,\sigma _p^2)\) and \(\epsilon _t\sim \mathcal {G}(1/\sigma _d^2,\sigma _d^2)\). Here, the aim is to estimate the posterior of the parameters \({\varvec{\theta }}=\{P,N_0,\sigma _d,\sigma _p,\tau ,\delta \}\) given observed data comprising a time series of 180 observations. We adopt log-normal priors over \({\varvec{\theta }}\) [6]: \(\log P\sim \mathcal {N}(2,2^2),\) \(\log N_0\sim \mathcal {N}(6,1),\) \(\log \sigma _d\sim \mathcal {N}(-0.5,1),\) \(\log \sigma _p\sim \mathcal {N}(-0.5,1),\) \(\log \tau \sim \mathcal {N}(2.7,1),\) \(\log \delta \sim \mathcal {N}(-1,0.4^2)\). For AML-ABC, we draw \(N=5000\) samples from the prior and then evaluate the model to form the simulated data by drawing 180 observations for each prior candidate. Besides, as the feature mapping, we select the 10 statistics used in [9]. Moreover, we use the Euclidean distance \(\mathscr {E}=||\vartheta (y)-\vartheta (x_n|{\varvec{\hat{\theta }}})||_2\) as the quantitative assessment, where \(x_n|{\varvec{\hat{\theta }}}\) is a simulation from the model given the expected value of the posterior obtained by each method. In particular, due to the fluctuations produced by \(\epsilon _t\) and \(e_t\), we draw 100 simulations for each method and compute the median and standard deviation of \(\mathscr {E}\) [9].
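A sketch of the blowfly simulator used to generate the 180-point series for each candidate is shown below; the constant burn-in history and the rounding of \(\tau \) to an integer lag are simplifying assumptions not specified above.

```python
# Sketch of the blowfly simulator: the discretised model of [14] with unit-mean
# Gamma noise, iterated from a constant initial history (assumption).
import numpy as np

def simulate_blowfly(P, N0, sigma_d, sigma_p, tau, delta, T=180, seed=0):
    rng = np.random.default_rng(seed)
    tau = int(round(tau))
    N = np.full(T + tau, 180.0)                     # constant burn-in history
    for t in range(tau, T + tau - 1):
        e_t = rng.gamma(shape=1.0 / sigma_p**2, scale=sigma_p**2)
        eps_t = rng.gamma(shape=1.0 / sigma_d**2, scale=sigma_d**2)
        N[t + 1] = (P * N[t - tau] * np.exp(-N[t - tau] / N0) * e_t
                    + N[t] * np.exp(-delta * eps_t))
    return N[tau:]

# Prior candidates are drawn in the log-domain, e.g. log P ~ N(2, 2^2):
# theta = np.exp(np.random.default_rng(1).normal([2, 6, -0.5, -0.5, 2.7, -1],
#                                                [2, 1, 1, 1, 1, 0.4]))
```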

Fig. 1. Uniform mixture model results. (a) Estimated mean posterior of the mixing coefficients using each method. (b) Weights of the 5 nearest neighbors in AML-ABC.

Fig. 2. Non-linear ecological dynamic system results. (a)–(f) Prior distribution (solid line) and AML-ABC-based posterior estimation (dashed line) of the model parameters in the log-space. (g) Some realizations from the model using the expected value of the parameters found via AML-ABC.

Toy Problem Results. Since this is a controlled experiment with known parameters \({\varvec{\pi ^*}},\) we can determine the best possible performance of AML-ABC by running the inference stage in Algorithm 1 with \(\tilde{w}_n = \kappa _E({\varvec{\pi ^*}},{\varvec{\pi }}_n)\). We refer to this approach as Best. This setting is equivalent to assuming that the CKA between \(\mathbf{K}_{{\varvec{\theta }}}\) and \(\mathbf{K}_{{\varvec{s}}}\) is perfect (\(\mathbf{K}_{{\varvec{\theta }}} = \mathbf{K}_{{\varvec{s}}}\)). Figure 1 shows the Best performance along with the K2-ABC and AML-ABC results on the uniform mixture problem. In Fig. 1(a), the expected value of the posterior computed by all methods is close to the target. In particular, we obtain \(\mathcal {E}_{\text {Best}}=0.030\), \(\mathcal {E}_{\text {K2-ABC}}=0.063\), and \(\mathcal {E}_{\text {AML-ABC}}=0.064\). These results show that AML-ABC is competitive with K2-ABC, with the significant advantage of automatically selecting its free parameters. In addition, to provide a better understanding of the AML-ABC effectiveness, Fig. 1(b) shows the weights of the 5 nearest neighbors (according to the LNS algorithm) used to compute the posteriors. As noted, most of the simulations chosen by AML-ABC match the candidates selected by Best, even though our approach never observes the target.
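For clarity, the oracle weighting that defines Best can be written as below; it simply replaces the learned projection by direct access to the true parameters \({\varvec{\pi ^*}}\), and the nearest-neighbor restriction mirroring Eq. (6) is our reading of \(\kappa _E\) in this oracle setting.

```python
# Sketch of the "Best" oracle baseline: weight each candidate by its Gaussian
# similarity to the true mixing coefficients pi*, restricted to the M nearest
# candidates (only usable in controlled experiments where pi* is known).
import numpy as np

def best_weights(Pis, pi_true, M):
    d2 = ((Pis - pi_true) ** 2).sum(axis=1)   # squared distances to the target
    nn = np.argsort(d2)[:M]                   # M candidates closest to pi*
    w = np.zeros(len(Pis))
    w[nn] = np.exp(-d2[nn])
    return w / w.sum()
```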

Table 1. Performance of different ABC schemes over the blowfly dataset.

Real Dataset Results. Inferring the model parameters for the blowfly dataset is a challenging task, since the system dynamics can easily move from stable to chaotic regimes [6, 14]. This poses an interesting scenario to test the performance and robustness of AML-ABC. Figures 2(a) to (f) show the prior and the posterior approximation for each parameter, fixing \(\sigma _\theta \) according to [12]. Notice how our proposal updates the beliefs about the model parameters, leading to more concentrated posteriors. In the case of \(\log \sigma _p \), two modes reflect different intervals of probable values driving the noise realization associated with egg production in the blowfly population; however, a predominant mode assigns higher probability to this parameter. Moreover, Fig. 2(g) shows the closest and farthest simulations to the observed data among the 100 realizations used to compute \(\mathscr {E}\), revealing a posterior in the stable regime. Finally, Table 1 reports the performance of AML-ABC compared to the ABC-based methods tested on the blowfly dataset by the authors in [9]; clearly, the proposed method is competitive with K2-ABC.

4 Conclusions

We propose AML-ABC, an automatic enhancement of the well-known ABC algorithm for Bayesian inference. In particular, we include a metric learning approach based on CKA to quantify the matching between the parameter and simulation spaces. A Mahalanobis distance is learned through CKA, and a graph representation is employed to reveal local relationships among parameter and simulation samples. Notably, AML-ABC does not require tuning any free parameter. Moreover, results on a synthetic dataset and a real-world ecological system show that the introduced AML-ABC is competitive with other non-automatic state-of-the-art ABC methods. Future work includes the extension of AML-ABC to multi-dimensional problems and the inclusion of dissimilarity measures other than the Mahalanobis distance to deal with complex and/or noisy data.