
Approximate Bayesian computation using asymptotically normal point estimates


Abstract

Approximate Bayesian computation (ABC) provides inference of the posterior distribution, even for models with intractable likelihoods, by replacing the exact (intractable) model likelihood with a tractable approximate likelihood. Historically, the development of point-estimation methods has usually preceded the development of posterior estimation methods. We propose and study new ABC methods based on asymptotically normal and consistent point-estimators of the model parameters. Specifically, for the classical ABC method, we propose and study two alternative bootstrap methods for estimating the tolerance tuning parameter, based on resampling from the asymptotic normal distribution of the given point-estimator. This tolerance estimator can be computed quickly, even for models whose exact likelihoods are computationally costly to sample from directly, provided that the summary statistic is specified as a consistent point-estimator of the model parameters with an estimated asymptotic normal distribution that can typically be sampled from easily. Furthermore, this paper introduces and studies a new ABC method based on approximating the exact intractable likelihood by the asymptotic normal density of the point-estimator, motivated by the Bernstein-Von Mises theorem. Unlike the classical ABC method, this new approach does not require tuning parameters, aside from the summary statistic (the parameter point estimate). Each of the new ABC methods is illustrated and compared through a simulation study of tractable and intractable likelihood models, and through a Bayesian intractable-likelihood analysis of a real 23,000-node network dataset involving stochastic search model selection.


References

  • Barndorff-Nielsen O (1978) Information and exponential families in statistical theory. Wiley, New York
  • Beaumont M, Zhang W, Balding D (2002) Approximate Bayesian computation in population genetics. Genetics 162:2025–2035
  • Bernstein S (1917) Theory of probability. Gostekhizdat, Moscow
  • Biau G, Cérou F, Guyader A (2015) New insights into approximate Bayesian computation. Ann Inst Henri Poincaré Probab Stat 51:376–403
  • Bickel P, Yahav J (1969) Some contributions to the asymptotic theory of Bayes solutions. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 11:257–276
  • Box G (1976) Science and statistics. J Am Stat Assoc 71:791–799
  • Brouste A, Fukasawa M, Hino H, Iacus S, Kamatani K, Koike Y, Masuda H, Nomura R, Ogihara T, Shimuzu Y, Uchida M, Yoshida N (2014) The YUIMA project: a computational framework for simulation and inference of stochastic differential equations. J Stat Softw 57:1–51
  • Caimo A, Friel N (2011) Bayesian inference for exponential random graph models. Social Netw 33:41–55
  • Casella G, Berger R (2002) Statistical inference, 2nd edn. Duxbury, Pacific Grove, CA
  • Clarke B, Ghosh J (1995) Posterior convergence given the mean. Ann Stat 23:2116–2144
  • Clarté G, Robert C, Ryder R, Stoehr J (2021) Component-wise approximate Bayesian computation via Gibbs-like steps. Biometrika 108:591–607
  • Cox D, Hinkley D (1974) Theoretical statistics. Chapman and Hall, London
  • Dawid A (1970) On the limiting normality of posterior distributions. Math Proc Cambridge Philos Soc 67:625–633
  • Doksum K, Lo A (1990) Consistent and robust Bayes procedures for location based on partial information. Ann Stat 18:443–453
  • Drossos C, Philippou A (1980) A note on minimum distance estimates. Ann Inst Stat Math 32:121–123
  • Drovandi C, Pettitt A (2011) Likelihood-free Bayesian estimation of multivariate quantile distributions. Comput Stat Data Anal 55:2541–2556
  • Drovandi C, Pettitt A, Faddy M (2011) Approximate Bayesian computation using indirect inference. J R Stat Soc Ser C 60:317–337
  • Efron B, Hinkley D (1978) Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information. Biometrika 65:457–483
  • Fenton L (1960) The sum of log-normal probability distributions in scatter transmission systems. IRE Trans Commun Syst 8:57–67
  • Ferguson T (1996) A course in large sample theory. Chapman & Hall, London
  • Frazier D, Martin G, Robert C, Rousseau J (2018) Asymptotic properties of approximate Bayesian computation. Biometrika 105:593–607
  • Gelman A, Carlin J, Stern H, Dunson D, Vehtari A, Rubin D (2013) Bayesian data analysis, 3rd edn. Chapman and Hall, Boca Raton, Florida
  • George E, McCulloch R (1993) Variable selection via Gibbs sampling. J Am Stat Assoc 88:881–889
  • Gnanadesikan R, Kettenring J (1972) Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28:81–124
  • Grazian C, Fan Y (2020) A review of approximate Bayesian computation methods via density estimation: inference for simulator-models. WIREs Comput Stat 12:e1486
  • Haario H, Saksman E, Tamminen J (2001) An adaptive Metropolis algorithm. Bernoulli 7:223–242
  • Huber P (1967) The behavior of maximum likelihood estimates under nonstandard conditions. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1: statistics. University of California Press, Berkeley, CA, pp 221–233
  • Hwang H, So B, Kim Y (2005) On limiting posterior distributions. TEST 14:567–580
  • Jin F, Lee L (2018) Lasso maximum likelihood estimation of parametric models with singular information matrices. Econometrics 6:8
  • Karabatsos G, Leisen F (2018) An approximate likelihood perspective on ABC methods. Stat Surv 12:66–104
  • Kleijn B, van der Vaart A (2012) The Bernstein-Von-Mises theorem under misspecification. Electron J Stat 6:354–381
  • Krivitsky P (2017) Using contrastive divergence to seed Monte Carlo MLE for exponential-family random graph models. Comput Stat Data Anal 107:149–161
  • Laplace P (1820) Théorie Analytique Des Probabilités, 3rd edn. Courcier, Paris
  • Le Cam L (1953) On some asymptotic properties of maximum likelihood estimates and related Bayes’ estimates. Univ California Publ Stat 1:277–330
  • Le Cam L (1960) Locally asymptotically normal families of distributions. Univ Calif Publ Stat 3:37–98
  • Le Cam L, Yang G (1990) Asymptotics in statistics: some basic concepts. Springer, New York
  • Lee LF (1993) Asymptotic distribution of the maximum likelihood estimator for a stochastic frontier function model with a singular information matrix. Economet Theor 9:413–430
  • Lehmann E, Casella G (1998) Theory of point estimation, 2nd edn. Springer-Verlag, New York
  • Leskovec J, Kleinberg J, Faloutsos C (2007) Graph evolution: densification and shrinking diameters. ACM Trans Knowl Discov Data 1:1–41
  • Lewis J, MacEachern S, Lee Y (2021) Bayesian restricted likelihood methods: conditioning on insufficient statistics in Bayesian regression. Bayesian Analysis, pp 1–70
  • Li W, Fearnhead P (2018) On the asymptotic efficiency of approximate Bayesian computation estimators. Biometrika 105:285–299
  • Lintusaari J, Gutmann M, Dutta R, Kaski S, Corander J (2017) Fundamentals and recent developments in approximate Bayesian computation. Syst Biol 66:e66
  • Marin JM, Pudlo P, Robert C, Ryder R (2012) Approximate Bayesian computational methods. Stat Comput 22:1167–1180
  • Marjoram P, Molitor J, Plagnol V, Tavaré S (2003) Markov chain Monte Carlo without likelihoods. Proc Natl Acad Sci 100:15324–15328
  • Maronna R, Zamar R (2002) Robust estimates of location and dispersion for high-dimensional datasets. Technometrics 44:307–317
  • Maronna R, Martin R, Yohai V, Salibián-Barrera M (2019) Robust statistics: theory and methods (with R), 2nd edn. Wiley, Hoboken, NJ, USA
  • Mengersen K, Pudlo P, Robert C (2013) Bayesian computation via empirical likelihood. Proc Natl Acad Sci 110:1321–1326
  • Müller U (2013) Risk of Bayesian inference in misspecified models, and the sandwich covariance matrix. Econometrica 81:1805–1849
  • Prangle D (2020) gk: An R package for the g-and-k and generalised g-and-h distributions. R J 12:7–20
  • Price L, Drovandi C, Lee A, Nott D (2018) Bayesian synthetic likelihood. J Comput Graph Stat 27:1–11
  • Pritchard J, Seielstad M, Perez-Lezaun A, Feldman M (1999) Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol Biol Evol 16:1791–1798
  • Ratmann O, Camacho A, Hu S, Colijn C (2018) Informed choices: how to calibrate ABC with hypothesis testing. In: Sisson S, Fan Y, Beaumont M (eds) Handbook of approximate Bayesian computation. CRC Press, Boca Raton, FL
  • Robert C, Casella G (2004) Monte Carlo statistical methods, 2nd edn. Springer, New York
  • Roberts G, Rosenthal J (2009) Examples of adaptive MCMC. J Comput Graph Stat 18:349–367
  • Robins J, van der Vaart A, Ventura V (2000) Asymptotic distribution of p-values in composite null models. J Am Stat Assoc 95:1143–1156
  • Rodrigues G, Prangle D, Sisson S (2018) Recalibration: a post-processing method for approximate Bayesian computation. Comput Stat Data Anal 126:53–66
  • Rotnitzky A, Cox D, Bottai M, Robins J (2000) Likelihood-based inference with singular information matrix. Bernoulli, pp 243–284
  • Royall R, Tsou TS (2003) Interpreting statistical evidence by using imperfect models: robust adjusted likelihood functions. J Roy Stat Soc B 65:391–404
  • Rubin D (1984) Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann Stat 12:1151–1172
  • Saxena K, Alam K (1982) Estimation of the non-centrality parameter of a chi squared distribution. Ann Stat 10:1012–1016
  • Schwartz L (1965) On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 4:10–26
  • Schweinberger M, Krivitsky P, Butts C, Stewart J (2020) Exponential-family models of random graphs: inference in finite, super and infinite population scenarios. Stat Sci 35:627–662
  • Seber G (1984) Multivariate observations. Wiley, New York
  • Silverman B (1986) Density estimation for statistics and data analysis. Chapman and Hall, Boca Raton, Florida
  • Sisson S, Fan Y, Tanaka M (2007) Sequential Monte Carlo without likelihoods. Proc Natl Acad Sci 104:1760–1765
  • Sisson S, Fan Y, Beaumont M (2018) Handbook of approximate Bayesian computation. Chapman and Hall/CRC Press, Boca Raton, FL
  • Strasser H (1981) Consistency of maximum likelihood and Bayes estimates. Ann Stat 9:1107–1113
  • Stromberg A (1997) Robust covariance estimates based on resampling. J Stat Plan Inf 57:321–334
  • Sunnåker M, Busetto A, Numminen E, Corander J, Foll M, Dessimoz C (2013) Approximate Bayesian computation. PLoS Comput Biol 9:1–10
  • van der Vaart A (1998) Asymptotic statistics. Cambridge University Press, New York
  • van der Vaart E, Beaumont M, Johnston A, Sibly R (2015) Calibration and evaluation of individual-based models using approximate Bayesian computation. Ecol Model 312:182–190
  • Vihola M, Franks J (2020) On the use of approximate Bayesian computation Markov chain Monte Carlo with inflated tolerance and post-correction. Biometrika 107:381–395
  • Von Mises R (1931) Wahrscheinlichkeitsrechnung. Springer Verlag, Berlin
  • Walker A (1968) On the asymptotic behaviour of posterior distributions. J Roy Stat Soc B 31:80–88
  • Wang X, George E (2007) Adaptive Bayesian criteria in variable selection for generalized linear models. Stat Sin 17:667–690
  • Yuan A, Clarke B (2004) Asymptotic normality of the posterior given a statistic. Can J Stat 32:119–137
  • Zhu W, Marin J, Leisen F (2016) A bootstrap likelihood approach to Bayesian computation. Aust N Z J Stat 58:227–244


Acknowledgements

This work was supported in part by National Science Foundation grants SES-0242030 and SES-1156372, and National Institutes of Health grant 1R01AA028483-01. The author thanks an anonymous reviewer for suggestions to improve the robustness of Algorithm 4 and for suggesting the AN likelihood method; and thanks both reviewers and the Associate Editor for other suggestions that helped improve the presentation of this manuscript.

Author information

Corresponding author

Correspondence to George Karabatsos.



Appendices

Appendix A: Bernstein-Von Mises Theorem

Here, based on relatively recent work (Kleijn and van der Vaart 2012), we review the main results underlying proofs of the Bernstein-Von Mises (BVM) Theorem, which asserts that, under mild conditions, the posterior distribution obeys a central limit theorem even if the model is misspecified for the given dataset; that is, the given dataset \(X^{(n)}\) is sampled from a fixed distribution \(P_{0}\) such that, for the statistical model \((P_{\theta }:\theta \in \Theta )\), there is no true parameter \(\theta _{0}\in \Theta \) satisfying \(P_{\theta _{0}}=P_{0}\). Previous proofs of the BVM Theorem assumed a well-specified (correct) statistical model, that is, a model \((P_{\theta }:\theta \in \Theta )\) with true parameter \(\theta _{0}\in \Theta \) satisfying \(P_{\theta _{0}}=P_{0}\) for the distribution \(P_{0}\) that generated the given dataset \(X^{(n)}\) (e.g., Laplace 1820; Bernstein 1917; Von Mises 1931; Le Cam 1953; Walker 1968; Dawid 1970; Bickel and Yahav 1969; Schwartz 1965; Le Cam and Yang 1990; van der Vaart 1998). However, here, when summarizing the key results of the BVM Theorem, we accommodate the view held by many statistical modelers, namely that “all models are wrong” but some models may be useful or more useful than others (Box 1976, p. 792). These results can be easily extended to the case where the model is well-specified (correct) for the given dataset \(X^{(n)}\).

Henceforth, we let \(\Theta \), an open subset of \({\mathbb {R}}^{k}\), parameterize statistical models \(\{P_{\theta }^{(n)}:\theta \in \Theta \}\) on some measurable space \(({\mathcal {X}},{\mathcal {B}})\). For simplicity, assume that for each n there exists a single measure that dominates all measures \(P_{\theta }^{(n)}\) as well as a “true probability measure” \(P_{0}^{(n)}\), and assume that there exist densities \(p_{\theta }^{(n)}\) and \(p_{0}^{(n)}\) such that the maps \((\theta ,x)\rightarrow p_{\theta }^{(n)}(x)\) are measurable. Also, denote by \(X^{(n)}\) a data observation, which for i.i.d. observations is the vector \(X^{(n)}=(X_{1},\ldots ,X_{n})\overset{\text {i.i.d.}}{\sim }P_{0}\) (with density \(p_{0}\) relative to a dominating measure). Then the statistical model consists of the n-fold product measures \(P_{\theta }^{(n)}=P_{\theta }^{n}\), and is described as the collection of probability measures \(\{P_{\theta }^{n} :\theta \in \Theta \}\) on the sample space \(({\mathcal {X}}^{n},{\mathcal {B}}^{n})\).

For every Borel set B, the posterior distribution of the random variable \(\vartheta \) is given by:

$$\begin{aligned} \Pi _{n}(B\mid X^{(n)})=\frac{ {\textstyle \int _{B}} p_{\theta }^{(n)}(X^{(n)})\pi (\theta )\text {d}\theta }{ {\textstyle \int _{\Theta }} p_{\theta }^{(n)}(X^{(n)})\pi (\theta )\text {d}\theta },\text { for every }B\in {\mathcal {B}}(\Theta ), \end{aligned}$$

given a prior probability measure \(\Pi \) (with density \(\pi \)) specified on \(\Theta \).

Now we present the main results of the BVM theorem (Kleijn and van der Vaart 2012, Theorem 2.1), based on results for samples \(X^{(n)}\) from a fixed, nonrandom true probability measure (distribution) \(P_{0}\). We label the conditions by numbered acronyms so that, later, we can easily describe the consequences of the specific conditions. Also, throughout we use the following notation conventions: \(\leadsto \) denotes convergence in distribution (weak convergence); \(\overset{P}{\rightarrow }\) denotes convergence in probability; for any given measurable function \(f:{\mathcal {X}}\mapsto {\mathbb {R}}^{k}\), its expectation over a distribution P is denoted \(\mathrm {E}_{P}f(X)=\int f\,\text {d}P=\int f(x)\,\text {d}P(x)\); a given sample of observations \(X_{1},\ldots ,X_{n}\) has empirical measure (distribution) \({\mathbb {P}}_{n}(\cdot )=\frac{1}{n} {\textstyle \sum \nolimits _{i=1}^{n}} \delta _{X_{i}}(\cdot )\), which assigns probability 1/n to each of the n observations; and the empirical process is a centered and scaled version of the empirical measure, given by \({\mathbb {G}}_{n}f=\sqrt{n}({\mathbb {P}}_{n}f-Pf)=\tfrac{1}{\sqrt{n}} {\textstyle \sum \nolimits _{i=1}^{n}} (f(X_{i})-\mathrm {E}_{P}f(X_{i}))\).
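
As a concrete illustration of this notation, the following minimal Python sketch computes \({\mathbb {P}}_{n}f\) and \({\mathbb {G}}_{n}f\) for a simulated i.i.d. sample; the choice of \(P_{0}\) (standard normal) and of \(f(x)=x\) is purely hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(loc=0.0, scale=1.0, size=n)   # X_1, ..., X_n i.i.d. from P_0 (standard normal here)

def f(x):
    return x                                  # any measurable f; f(x) = x is an arbitrary choice

P_n_f = np.mean(f(X))                         # empirical measure applied to f: (1/n) * sum_i f(X_i)
P_f = 0.0                                     # E_{P_0} f(X) = 0 for the standard normal
G_n_f = np.sqrt(n) * (P_n_f - P_f)            # empirical process: G_n f = sqrt(n) (P_n f - P f)

print(P_n_f, G_n_f)                           # G_n f is approximately N(0, Var_{P_0} f(X)) = N(0, 1)
```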

We now state the main results of the BVM theorem. The first result states that if (1) the given statistical model satisfies local asymptotic normality (Le Cam 1960) around a parameter value \(\theta _{0}\in \Theta \); (2) the prior probability measure \(\Pi \) on the parameter space \(\Theta \) adequately supports \(\theta _{0}\); and (3) the posterior distribution \(\Pi _{n}\) is asymptotically consistent for \(\theta _{0}\) with convergence rate \(\delta _{n}\); then the posterior distribution converges to a normal distribution in total variation. Specifically, suppose that the following three conditions hold:

  • (LAN1) the given statistical model \(\{P_{\theta }^{(n)} :\theta \in \Theta \}\) satisfies a Local Asymptotic Normality (LAN) condition around a given inner point \(\theta _{0}\in \Theta \) relative to a norming rate \(\delta _{n}\rightarrow 0\): there exist asymptotically normal random vectors \(\Delta _{n,\theta _{0}}\) and nonsingular matrices \(V_{\theta _{0}}\) such that the sequence \(\Delta _{n,\theta _{0}}\) is bounded in probability, and:

    $$\begin{aligned} \sup _{h\in K}\left| \log \frac{p_{\theta _{0}+h\delta _{n}}^{(n)} }{p_{\theta _{0}}^{(n)}}(X^{(n)})-h^{\intercal }V_{\theta _{0}}\Delta _{n,\theta _{0}}+\tfrac{1}{2}h^{\intercal }V_{\theta _{0}}h\right| \rightarrow 0, \end{aligned}$$
    (4.1)

    for every compact set \(K\subset {\mathbb {R}}^{k}\), in (outer) \(P_{0}^{(n)} \)-probability. That is (van der Vaart 1998, pp. 103–104), for every converging sequence \(h_{n}\rightarrow h\),

    $$\begin{aligned} \log \frac{p_{\theta _{0}+h\delta _{n}}^{(n)}}{p_{\theta _{0}}^{(n)}} =h^{\intercal }V_{\theta _{0}}\Delta _{n,\theta _{0}}-\tfrac{1}{2}h^{\intercal }V_{\theta _{0}}h+o_{P_{0}}(1), \end{aligned}$$

    (based on the Radon-Nikodym derivative), and

    $$\begin{aligned} \log \frac{p_{\theta _{0}+h\delta _{n}}^{(n)}}{p_{\theta _{0}}^{(n)}} \leadsto {\mathcal {N}}\left( -\tfrac{1}{2}h^{\intercal }V_{\theta _{0} }h,h^{\intercal }V_{\theta _{0}}h\right) . \end{aligned}$$

    In other words, the likelihood (relative to the dominating measure \(P_{\theta _{0}}^{(n)}\)) of the statistical model reparameterized as \(\{P_{\theta _{0}+\delta _{n}h}^{(n)}:h\in \Theta \}\) by the local parameter \(h=(\vartheta -\theta _{0})/\delta _{n}\) is given by:

    $$\begin{aligned} \text {d}P_{\theta _{0}+h\delta _{n}}^{(n)}=p_{\theta _{0}+h\delta _{n}}^{(n)} =\exp (h^{\intercal }V_{\theta _{0}}\Delta _{n,\theta _{0}}-\tfrac{1}{2}h^{\intercal }V_{\theta _{0}}h+\cdots )\text {d}P_{\theta _{0}}^{(n)} \end{aligned}$$

    where \(\Delta _{n,\theta _{0}}\) is “asymptotically sufficient” given known \(\theta _{0}\), ignoring the remainder term \(\cdots \);

  • (PRIOR2) the prior probability measure \(\Pi \) defined on the parameter space \(\Theta \) has a density \(\pi \) that is continuous and positive on a neighborhood of \(\theta _{0}\);

  • (RATE3) The posterior distribution of the random parameter \(\vartheta \), which for every Borel set B is defined by:

    $$\begin{aligned} \Pi _{n}(\vartheta \in B\mid X^{(n)})= {\displaystyle \mathop \int \nolimits _{B}} p_{\theta }^{(n)}(X^{(n)})\pi (\theta )\text {d}\theta \big / {\displaystyle \mathop \int \nolimits _{\Theta }} p_{\theta }^{(n)}(X^{(n)})\pi (\theta )\text {d}\theta , \end{aligned}$$

    is consistent with convergence rate \(\delta _{n}\), in the sense that:

    $$\begin{aligned} P_{0}^{(n)}\Pi _{n}(\left\| \vartheta -\theta _{0}\right\| >\delta _{n} M_{n}\mid X^{(n)})\rightarrow 0 \end{aligned}$$

    holds for every sequence of constants \(M_{n}\rightarrow \infty \). Then, as a consequence of all three conditions (LAN1), (PRIOR2), and (RATE3), the sequence of posterior distributions \(\Pi _{n}\) converges to a normal distribution, in total variation, as follows:

    $$\begin{aligned} \sup _{B}\left| \Pi _{n}((\vartheta -\theta _{0})/\delta _{n}\in B\mid X^{(n)})-{\mathcal {N}}(B\mid \Delta _{n,\theta _{0}},V_{\theta _{0}}^{-1} )\right| \overset{P_{0}}{\rightarrow }0. \end{aligned}$$
    (4.2)

Furthermore, given an i.i.d. sample of data, \(X^{(n)}=(X_{1},\ldots ,X_{n})\overset{\text {i.i.d.}}{\sim }P_{0}\), with distribution \(P_{0} ^{(n)}=P_{0}^{n}\), and density \(p_{0}=P_{0}^{\prime }\) relative to a dominating measure (Kleijn and van der Vaart 2012, Lemmas 2.1-2.2, Theorems 2.2, 3.1, 7.2), suppose that the following conditions hold, most of which are smoothness conditions:

  • (Smooth1) the function \(\theta \mapsto \log p_{\theta }(X_{1})\) is differentiable at \(\theta _{0}\) in \(P_{0}\)-probability with derivative \({\dot{\ell }}_{\theta _{0}}(X_{1})\), where \({\dot{\ell }}_{\theta _{0}} =\frac{\partial }{\partial \theta _{0}}\log p_{\theta _{0}}\) is the score function of the model at \(\theta _{0}\);

  • (Smooth2) there exist an open neighborhood \(U\) of \(\theta _{0}\) and a square-integrable function \(m_{\theta _{0}}\) such that for all \(\theta _{1},\theta _{2}\in U\):

    $$\begin{aligned} \left| \log \dfrac{p_{\theta _{1}}}{p_{\theta _{2}}}\right| \le m_{\theta _{0}}\left\| \theta _{1}-\theta _{2}\right\| \text {, (} P_{0}\text {-}a.s.\text {);} \end{aligned}$$
  • (Smooth3) the Kullback-Leibler divergence of \(P_{\theta _{0}}\) from \(P_{0}\) is finite and minimized at \(\theta _{0}\in \Theta \):

    $$\begin{aligned} -P_{0}\log \dfrac{p_{\theta _{0}}}{p_{0}}=\underset{\theta \in \Theta }{\inf } -P_{0}\log \dfrac{p_{\theta }}{p_{0}}<\infty , \end{aligned}$$

    and the following Kullback-Leibler divergence with respect to \(P_{0}\) has a second-order Taylor-expansion around \(\theta _{0}\text {:}\)

    $$\begin{aligned} -P_{0}\log \dfrac{p_{\theta }}{p_{\theta _{0}}}=\tfrac{1}{2}(\theta -\theta _{0})^{\intercal }V_{\theta _{0}}(\theta -\theta _{0})+o(\left\| \theta -\theta _{0}\right\| ^{2})\text { (}\theta \rightarrow \theta _{0}\text {),} \end{aligned}$$

    where \(V_{\theta _{0}}\) is a positive-definite \(k_{\theta }\times k_{\theta }\) matrix of second derivatives given by:

    $$\begin{aligned} V_{\theta _{0}}&=\left. -\tfrac{\partial ^{2}}{\partial \theta ^{2}}P_{0} \log \tfrac{p_{\theta }}{p_{\theta _{0}}}\right| _{\theta =\theta _{0}}=\left. \tfrac{\partial ^{2}}{\partial \theta ^{2}}\{P_{0}\log p_{\theta _{0}}-P_{0}\log p_{\theta }\}\right| _{\theta =\theta _{0}}=\left. -\tfrac{\partial ^{2} }{\partial \theta ^{2}}P_{0}\log p_{\theta }\right| _{\theta =\theta _{0}}\\&=\tfrac{\partial ^{2}}{\partial \theta _{0}^{2}}P_{0}\log \tfrac{p_{0} }{p_{\theta _{0}}}=\tfrac{\partial ^{2}}{\partial \theta _{0}^{2}}\{P_{0}\log p_{0}-P_{0}\log p_{\theta _{0}}\}=-\tfrac{\partial ^{2}}{\partial \theta _{0}^{2} }P_{0}\log p_{\theta _{0}}=-P_{0}\ddot{\ell }_{\theta _{0}}. \end{aligned}$$
  • (Smooth4) \(P_{0}\frac{p_{\theta }}{p_{\theta _{0}}}<\infty \) for all \(\theta \) in a neighborhood of \(\theta _{0}\);

  • (Smooth5) \(P_{0}(e^{sm_{\theta _{0}}})<\infty \) for some \(s>0\);

  • (Invertible6) the matrix \(P_{0}{\dot{\ell }}_{\theta _{0}}\dot{\ell }_{\theta _{0}}^{\intercal }\) is invertible;

  • (Tests7) for every \(\epsilon >0\) there exists a sequence of tests \((\phi _{n})\), such that:

    $$\begin{aligned} \begin{array}{ccc} P_{0}^{n}\phi _{n}\rightarrow 0,&\,&\underset{\{\theta :||\theta -\theta _{0}||\ge \epsilon \}}{\sup }Q_{\theta }^{n}(1-\phi _{n})\rightarrow 0; \end{array} \end{aligned}$$
    (4.3)

    where a test \(\phi _{n}\) is a measurable function \(\phi _{n}:{\mathcal {X}} ^{n}\mapsto [0,1]\), and \(Q_{\theta }(A)=P_{0}\tfrac{p_{\theta }}{p_{\theta _{0}}}{\mathbf {1}}_{A}\). A sequence of tests with property (4.3) exists if \(\Theta \) is compact, \(\theta _{0}\) is the unique point of minimum of \(\theta \mapsto -P_{0}\log p_{\theta }\), \(P_{0}(p_{\theta } /p_{\theta _{0}})<\infty \) for all \(\theta \in \Theta \), and the map \(\theta \mapsto P_{0}\left( \dfrac{p_{\theta }}{p_{\theta _{1}}^{s}p_{\theta _{0}}^{1-s}}\right) \) is continuous at \(\theta _{1}\) for every s in a left neighborhood of 1, for every \(\theta _{1}\) (a sufficient condition is that for every \(\theta _{1}\in \Theta \) the maps \(\theta \mapsto p_{\theta }/p_{\theta _{1}}\) and \(\theta \mapsto p_{\theta }/p_{\theta _{0}}\) are continuous in \(L_{1}(P_{0})\) at \(\theta =\theta _{1}\)) (by Theorem 3.2 of Kleijn and van der Vaart 2012).

Then, as a consequence of the three conditions (Smooth1)–(Smooth3),

  • (ConsequenceIID1) the LAN condition holds with \(\delta _{n}=n^{-1/2}\), and “centering sequence”:

    $$\begin{aligned} \Delta _{n,\theta _{0}}=V_{\theta _{0}}^{-1}{\mathbb {G}}_{n}{\dot{\ell }}_{\theta _{0} }=V_{\theta _{0}}^{-1}\left\{ \frac{1}{\sqrt{n}} {\displaystyle \sum \limits _{i=1}^{n}} ({\dot{\ell }}_{\theta _{0}}(X_{i})-\text {E}_{P_{0}}{\dot{\ell }}_{\theta _{0}} (X_{i}))\right\} , \end{aligned}$$

    where \({\mathbb {G}}_{n}=\sqrt{n}({\mathbb {P}}_{n}-P_{0})\) is the empirical process, based on the empirical measure, \({\mathbb {P}}_{n}(\cdot )=\frac{1}{n} {\textstyle \sum \nolimits _{i=1}^{n}} \delta _{X_{i}}(\cdot )\), which assigns probability 1/n to each of the n sample observations \(X_{1},\ldots ,X_{n}\). It follows that for every random sequence \((h_{n})_{n\ge 1}\) in \({\mathbb {R}}^{d}\) that is bounded in \(P_{0} \)-probability:

    $$\begin{aligned} \log {\displaystyle \prod \limits _{i=1}^{n}} \frac{p_{\theta _{0}+h_{n}/\sqrt{n}}}{p_{\theta _{0}}}(X_{i})=h_{n}^{\intercal }{\mathbb {G}}_{n}{\dot{\ell }}_{\theta _{0}}-\tfrac{1}{2}h_{n}^{\intercal } V_{\theta _{0}}h_{n}+o_{P_{0}}(1). \end{aligned}$$
  • (ConsequenceIID2) the score function, \({\dot{\ell }}_{\theta _{0} }=\frac{\partial }{\partial \theta }\log p_{\theta }\), is bounded as follows:

    $$\begin{aligned} ||{\dot{\ell }}_{\theta _{0}}(X)||\text { }\le m_{\theta _{0}}(X),\text { } (P_{0}-a.s.); \end{aligned}$$
  • (ConsequenceIID3) \(P_{0}{\dot{\ell }}_{\theta _{0}}=\frac{\partial }{\partial \theta _{0}}P_{0}\log p_{\theta _{0}}=0\);

  • (ConsequenceIID4) there exists a sequence of estimators \({\widehat{\theta }}_{n}\), which is weakly consistent (\({\widehat{\theta }} _{n}\overset{P_{0}}{\rightarrow }\theta _{0}\)) (e.g., MLEs) with:

    $$\begin{aligned} {\mathbb {P}}_{n}\log p_{{\widehat{\theta }}_{n}}\ge \sup _{\theta }{\mathbb {P}}_{n}\log p_{\theta }-o_{P_{0}}(n^{-1}), \end{aligned}$$

    which satisfies the asymptotic expansion:

    $$\begin{aligned} \sqrt{n}({\widehat{\theta }}_{n}-\theta _{0})=\dfrac{1}{\sqrt{n}} {\textstyle \sum \limits _{i=1}^{n}} V_{\theta _{0}}^{-1}{\dot{\ell }}_{\theta _{0}}(X_{i})+o_{P_{0}}(1), \end{aligned}$$

    and therefore,

    $$\begin{aligned} \sqrt{n}({\widehat{\theta }}_{n}-\theta _{0})\leadsto {\mathcal {N}}(0,V_{\theta _{0}}^{-1}P_{0}{\dot{\ell }}_{\theta _{0}}{\dot{\ell }}_{\theta _{0}}^{\intercal }V_{\theta _{0}}^{-1}), \end{aligned}$$

    because \(n^{-1/2} {\textstyle \sum \nolimits _{i=1}^{n}} {\dot{\ell }}_{\theta _{0}}(X_{i})\leadsto Z\sim {\mathcal {N}}(0,P_{0}{\dot{\ell }} _{\theta _{0}}{\dot{\ell }}_{\theta _{0}}^{\intercal })\) by the Central Limit Theorem, and because Cov\((V_{\theta _{0}}^{-1}Z)=V_{\theta _{0}}^{-1}P_{0} {\dot{\ell }}_{\theta _{0}}{\dot{\ell }}_{\theta _{0}}^{\intercal }V_{\theta _{0}}^{-1}\) for fixed \(V_{\theta _{0}}^{-1}\). A customary choice of point estimator \({\widehat{\theta }}_{n}\) is the MLE. Also, under the further condition that (RATE3) holds, there exist consistent point-estimators \({\widehat{\theta }}_{n}\) with \(\delta _{n}^{-1}({\widehat{\theta }}_{n}-\theta _{0})=O_{P_{0}}(1)\), converging to \(\theta _{0}\) at rate \(\delta _{n}\);

  • (ConsequenceIID5) \(\sqrt{n}({\widehat{\theta }} _{n}-\theta _{0})\) and \(\Delta _{n,\theta _{0}}\) differ only by a term of order \(o_{P_{0}}(1)\), so that the asymptotic equivalence \(\sqrt{n}(\widehat{\theta }_{n}-\theta _{0})-\Delta _{n,\theta _{0}}\overset{P_{0} }{\rightarrow }0\) holds. Thus, after ignoring the LAN remainder term \(\cdots \), the estimator \({\widehat{\theta }}_{n}\) is an affine function of the asymptotically sufficient statistic \(\Delta _{n,\theta _{0}}\) for every n and \(\theta _{0}\), and hence \({\widehat{\theta }}_{n}\) is itself asymptotically sufficient in the original statistical model \(\{P_{\theta }^{n}:\theta \in \Theta \}\) because \({\widehat{\theta }}_{n}\) does not depend on \(\theta _{0}\). Because the total-variation distance \(\left\| {\mathcal {N}}\left( \cdot \mid \mu ,\Sigma \right) -{\mathcal {N}}\left( \cdot \mid \nu ,\Sigma \right) \right\| \) is bounded by a multiple of \(\left\| \mu -\nu \right\| \) as \(\mu \rightarrow \nu \), and is invariant under rescaling and shifts, the BVM theorem concludes that:

    $$\begin{aligned} \sup _{B}\left| \Pi _{n}(\vartheta \in B\mid X^{(n)})-{\mathcal {N}} (B\mid {\widehat{\theta }}_{n},\tfrac{1}{n}V_{\theta _{0}}^{-1})\right| \overset{P_{0}}{\rightarrow }0. \end{aligned}$$

Also, as a consequence of conditions (Smooth1)–(Smooth5), (Invertible6), (Tests7),

  • (ConsequenceIID6) For every sequence \((M_{n})\) such that \(M_{n}\rightarrow \infty \) there exists a sequence of tests \((\omega _{n})\) such that for some constants \(D>0\), \(\epsilon >0\) and large enough n:

    $$\begin{aligned} \begin{array}{ccc} P_{0}^{n}\omega _{n}\rightarrow 0,&\,&Q_{\theta }^{n}(1-\omega _{n})\le e^{-nD(||\theta -\theta _{0}||^{2}\wedge \epsilon ^{2})}, \end{array} \end{aligned}$$

    for all \(\theta \in \Theta \) such that \(||\theta -\theta _{0}||\) \(\ge M_{n} /\sqrt{n}\).

Also, as a consequence of conditions (Smooth1)–(Smooth5), (Invertible6), (Tests7), and (PRIOR2),

  • (ConsequenceIID7) the posterior distribution \(\Pi _{n}\) converges at rate \(1/\sqrt{n}\):

    $$\begin{aligned} \Pi _{n}(\left\| \vartheta -\theta _{0}\right\| >M_{n}/\sqrt{n}\mid X^{(n)})\overset{P_{0}}{\rightarrow }0,\text { for all }M_{n}\rightarrow \infty ; \end{aligned}$$

    and under the further condition that (LAN1) holds:

  • (ConsequenceIID8) Under the loss function \(L(x)=||x||^{2}\) or \(L(x)=||x||\) (with \( {\textstyle \int } ||\theta ||^{q}\)d\(\Pi _{n}(\theta )<\infty \)), respectively, the point estimator \({\widehat{\theta }}_{n}\) is the posterior mean and median (resp.) satisfying:

    $$\begin{aligned} \sqrt{n}({\widehat{\theta }}_{n}-\theta _{0})\overset{P_{0}}{\leadsto }\arg \min _{t} {\displaystyle \int } L(t-h)\text {d}{\mathcal {N}}(h\mid X,V_{\theta _{0}}^{-1}), \end{aligned}$$

    with \(\sqrt{n}({\widehat{\theta }}_{n}-\theta _{0})\overset{P_{0}}{\leadsto } X\sim {\mathcal {N}}(0,V_{\theta _{0}}^{-1}P_{0}{\dot{\ell }}_{\theta _{0}} {\dot{\ell }}_{\theta _{0}}^{\intercal }V_{\theta _{0}}^{-1})\).

In order to establish rates of convergence and limiting behavior of the posterior distribution, some of the conditions above make several requirements on the model involving the true distribution \(P_{0}\). These conditions may be too restrictive depending on the specific model and true distribution; for example, they exclude models in which \(-P_{0}\log \frac{p_{\theta }}{p_{\theta _{0}}}=\infty \) for \(\theta \) in neighborhoods of \(\theta _{0}\). However, such conditions can be dropped by assuming that Kullback-Leibler neighborhoods of the point of convergence receive enough prior mass and asymptotically consistent uniform tests for \(P_{0}\) versus such subsets exist, to provide alternative ways to exclude ‘undesirable’ subsets of the model beforehand, and consequently, to maintain the proofs of the results regarding rates of convergence and limiting behavior of the posterior distribution. Further, consequently, it is possible to derive a misspecified version of the consistency theorem of Schwartz (1965). The following proven results hold for the parametric models considered above, but also hold for non-parametric models.

A key lemma (Kleijn and van der Vaart 2012, Lemma 4.1) is that if \(V\subset \Theta \) is a (measurable) subset of the model, and if for some \(\epsilon >0\):

$$\begin{aligned} \Pi \left( \theta \in \Theta :-P_{0}\log \frac{p_{\theta }}{p_{\theta _{0}}} \le \epsilon \right) >0, \end{aligned}$$
(4.4)

and there exist constants \(\gamma >0\), \(\beta >\epsilon \), and a sequence \((\phi _{n})\) of test-functions such that:

$$\begin{aligned} \begin{array}{ccc} P_{0}^{n}\phi _{n}\le e^{-n\gamma },&\,&\underset{\theta \in V}{\sup } Q_{\theta }^{n}(1-\phi _{n})\le e^{-n\beta }, \end{array} \end{aligned}$$
(4.5)

for large enough \(n\ge 1\), then the posterior distribution converges as \(\Pi (V\left| X_{1},\ldots ,X_{n}\right. )\rightarrow 0\), \(P_{0}\)-almost-surely, thereby excluding ‘undesirable’ subsets of the model. Further, in many situations, the Kullback-Leibler property of the prior (4.4) holds for every \(\epsilon >0\), in which case (Kleijn and van der Vaart 2012, Corollary 4.1) if there exists a test-sequence \((\phi _{n})\) such that:

$$\begin{aligned}{}\begin{array}{ccc} P_{0}^{n}\phi _{n}\rightarrow 0,&\,&\underset{\theta \in V}{\sup }Q_{\theta } ^{n}(1-\phi _{n})\rightarrow 0, \end{array} \end{aligned}$$
(4.6)

or if a likelihood test condition holds (Kleijn and van der Vaart 2012; Lehmann and Casella 1998, Lemma 4.2) such that there exists a sequence \((M_{n})\) of positive numbers such that \(M_{n}\rightarrow \infty \) and

$$\begin{aligned} P_{0}^{n}\left( \underset{\theta \in V}{\inf }-{\mathbb {P}}_{n}\log \dfrac{p_{\theta }}{p_{\theta _{0}}}<\dfrac{1}{n}M_{n}\right) \rightarrow 0, \end{aligned}$$

then the posterior distribution converges as \(\Pi (V\left| X_{1} ,\ldots ,X_{n}\right. )\rightarrow 0\), \(P_{0}\)-almost-surely, thereby excluding ‘undesirable’ subsets of the model beforehand. Another consequence of this key lemma is a misspecified form of Schwartz consistency (Kleijn and van der Vaart 2012, Corollary 4.2), which establishes that if, for all \(\epsilon >0\), the Kullback-Leibler property (4.4) is satisfied, and if for all \(\eta >0\) there exists a test-sequence (\(\phi _{n}\)) such that:

$$\begin{aligned}{}\begin{array}{ccc} P_{0}^{n}\phi _{n}\rightarrow 0,&\,&\underset{\theta :d(\theta ,\Theta _{0} )>\eta }{\sup }Q_{\theta }^{n}(1-\phi _{n})\rightarrow 0, \end{array} \end{aligned}$$

where \(\Theta _{0}\) is the set of points in the model at minimal Kullback-Leibler divergence with respect to the true distribution \(P_{0}\), \(\Theta _{0}=\{\theta \in \Theta :-P_{0}\log (p_{\theta }/p_{0})=\inf _{\theta ^{\prime }\in \Theta }-P_{0}\log (p_{\theta ^{\prime }}/p_{0})\}\), and \(d(\theta ,\Theta _{0})\) is the infimum of \(||\theta -\theta _{0}||\) over \(\theta _{0}\in \Theta _{0}\) (thus, here, we do not assume the existence of a unique minimizer of the Kullback-Leibler divergence), then posterior consistency holds: \(\Pi (d(\theta ,\Theta _{0})>\eta \left| X_{1},\ldots ,X_{n}\right. )\rightarrow 0\), \(P_{0}\)-almost-surely, for every \(\eta >0\). Indeed, considering \({\widehat{\theta }} _{n}\) as the MLE, any set of conditions that gives consistency of the MLE also gives concentration of the posterior (Strasser 1981).

The main conclusion of the BVM theorem is that, for i.i.d. data and \(h=\sqrt{n}(\vartheta -\theta _{0})\), the sequence of posterior densities:

$$\begin{aligned} \pi _{n}(h\mid X^{(n)})=\frac{ {\textstyle \prod \nolimits _{i=1}^{n}} p_{\theta _{0}+h/\sqrt{n}}(X_{i})\pi (\theta _{0}+h/\sqrt{n})}{ {\textstyle \int } {\textstyle \prod \nolimits _{i=1}^{n}} p_{\theta _{0}+h/\sqrt{n}}(X_{i})\pi (\theta _{0}+h/\sqrt{n})\text {d}h}, \end{aligned}$$

is asymptotically equivalent in distribution to:

$$\begin{aligned} \text {d}{\mathcal {N}}(X,V_{\theta _{0}}^{-1})(h)=\frac{\text {d}{\mathcal {N}} (h,V_{\theta _{0}}^{-1})(X)}{ {\textstyle \int } \text {d}{\mathcal {N}}(h,V_{\theta _{0}}^{-1})(X)\text {d}h}, \end{aligned}$$

the posterior density of the Gaussian location model \(({\mathcal {N}} (h,V_{\theta _{0}}^{-1}):h\in {\mathbb {R}}^{k})\) under an improper prior density, because for large n the prior \(\pi (\theta _{0}+h/\sqrt{n})\) becomes a constant \(\pi (\theta _{0})\) and then cancels from the posterior density \(\pi _{n}(h\mid X^{(n)})\), given that the prior \(\pi \) is continuous about \(\theta _{0}\) under the (PRIOR2) condition.

Now, when the given statistical model \((P_{\theta }:\theta \in \Theta )\) is well-specified, that is the dataset \(X^{(n)}\) is generated from a distribution \(P_{0}\) such that \(P_{0}=P_{\theta _{0}}\) for some true model parameter \(\theta _{0}\in \Theta \), then the Lipschitz condition (Smooth2) can be replaced by the slightly weaker condition of Differentiability in quadratic mean (DQM): there exists a measurable vector-valued function \({\dot{\ell }}_{\theta _{0}}\) such that:

$$\begin{aligned} \int [\sqrt{p_{\theta }}-\sqrt{p_{\theta _{0}}}-\frac{1}{2}(\theta -\theta _{0})^{\intercal }{\dot{\ell }}_{\theta _{0}}\sqrt{p_{\theta _{0}}}]^{2}\text {d} \mu =o(\left\| \theta -\theta _{0}\right\| ^{2})\text { as }\theta \rightarrow \theta _{0}, \end{aligned}$$

implying the existence of the Fisher information matrix \({\mathcal {I}} _{\theta _{0}}=P_{0}{\dot{\ell }}_{\theta _{0}}{\dot{\ell }}_{\theta _{0}}^{\intercal } \). Then all of the above results of the BVM theorem hold with \(V_{\theta _{0} }={\mathcal {I}}_{\theta _{0}}=-P_{0}\ddot{\ell }_{\theta _{0}}=P_{0}\dot{\ell }_{\theta _{0}}{\dot{\ell }}_{\theta _{0}}^{\intercal }\) and \(V_{\theta _{0}} ^{-1}={\mathcal {I}}_{\theta _{0}}^{-1}=V_{\theta _{0}}^{-1}P_{0}{\dot{\ell }} _{\theta _{0}}{\dot{\ell }}_{\theta _{0}}^{\intercal }V_{\theta _{0}}^{-1}\), and the asymptotically sufficient statistic \({\widehat{\theta }}_{n}\) is a best-regular (efficient) estimator sequence (e.g., the MLE), i.e., it satisfies \(\sqrt{n}({\widehat{\theta }}_{n}-\theta _{0})-\Delta _{n,\theta _{0}}\overset{P_{0}}{\rightarrow }0\).

Essentially, the BVM Theorem asserts that if i.i.d. data \(X^{(n)} =(X_{1},\ldots ,X_{n})\) are sampled from a fixed distribution \(P_{0}\); the model likelihood \(p_{\theta }^{(n)}\) is sufficiently regular around the model parameter value \(\theta _{0}\in \Theta \) which minimizes the Kullback-Leibler (KL) divergence of \(P_{\theta _{0}}\) from \(P_{0}\); and the prior density \(\pi \) is smooth and positive around \(\theta _{0}\); then the posterior distribution \(\Pi _{n}(\theta \in B\mid X^{(n)})\) converges to a normal distribution centered on the MLE or other best-regular estimator sequence \({\widehat{\theta }}_{n}\), with covariance equal to 1/n times the inverse of the second-derivative matrix of the KL divergence around \(\theta _{0}\) when the model is misspecified (or with covariance equal to 1/n times the inverse Fisher information matrix, \({\mathcal {I}}_{\theta _{0}}^{-1}\), when the statistical model is well-specified).

The Bernstein-Von Mises (BVM) Theorem thus concludes that, as the data sample size (n) increases, the influence of the prior becomes negligible, and the sequence of posterior distributions \(\Pi _{n}(\theta \mid X^{(n)})\) increasingly resembles a sequence of ‘sharpening’ normal distributions centered on the MLEs.
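
The following minimal Python sketch illustrates this convergence numerically for a conjugate Beta-Bernoulli example; the model, prior, and all numerical settings are hypothetical and not taken from the paper. The exact posterior is compared with the normal approximation \({\mathcal {N}}({\widehat{\theta }}_{n},\tfrac{1}{n}{\mathcal {I}}_{{\widehat{\theta }}_{n}}^{-1})\) suggested by the BVM theorem.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, theta0 = 500, 0.3
x = rng.binomial(1, theta0, size=n)              # i.i.d. Bernoulli(theta0) data (well-specified model)

# Exact posterior under a Beta(1, 1) prior is Beta(1 + sum(x), 1 + n - sum(x))
a_post, b_post = 1 + x.sum(), 1 + n - x.sum()

# BVM normal approximation: N(theta_hat, I(theta_hat)^{-1} / n), with theta_hat the MLE
theta_hat = x.mean()
fisher = 1.0 / (theta_hat * (1.0 - theta_hat))   # Fisher information per observation
bvm_sd = np.sqrt(1.0 / (n * fisher))

grid = np.linspace(0.2, 0.4, 5)
print(stats.beta.pdf(grid, a_post, b_post))      # exact posterior density on the grid
print(stats.norm.pdf(grid, theta_hat, bvm_sd))   # normal approximation (close for large n)
```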

When the statistical model is well-specified, the MLE \({\widehat{\theta }}_{n}\) has a sampling covariance matrix given by \(\frac{1}{n}{\mathcal {I}}_{\theta _{0} }^{-1}\), which equals the limiting posterior covariance matrix, where \({\mathcal {I}}_{\theta _{0}}\) is the Fisher information matrix at the true data-generating model parameter, \(\theta _{0}\). Then informally, we obtain the remarkable symmetry:

$$\begin{aligned}{}\begin{array}{ccccc} \Pi _{n}(\cdot \mid {\widehat{\theta }}_{n})\approx {\mathcal {N}}(\cdot \mid {\widehat{\theta }}_{n},\tfrac{1}{n}{\mathcal {I}}_{{\widehat{\theta }}_{n}}^{-1})&\,&\text {and}&\,&P_{{\widehat{\theta }}_{n}\left| \vartheta =\theta \right. }\approx {\mathcal {N}}(\cdot \mid \theta ,\tfrac{1}{n}{\mathcal {I}}_{\theta }^{-1}), \end{array} \end{aligned}$$

since conditioning \({\widehat{\theta }}_{n}\) on \(\vartheta =\theta \) gives the usual “frequentist” distribution of \({\widehat{\theta }}_{n}\) under \(\theta \). Also, any random sets \({\widehat{B}}_{n}\) such that \(\Pi _{n}({\widehat{B}}_{n}\mid X_{1},\ldots ,X_{n})=1-\alpha \) for each n satisfy:

$$\begin{aligned} {\mathcal {N}}\left( (n{\mathcal {I}}_{\theta _{0}})^{1/2}({\widehat{B}}_{n}-\widehat{\theta }_{n})\mid 0,I\right) \rightarrow 1-\alpha \end{aligned}$$

(in probability). Such sets \({\widehat{B}}_{n}\) are of the form \({\widehat{B}} _{n}={\widehat{\theta }}_{n}+{\mathcal {I}}_{\theta _{0}}^{-1/2}{\widehat{C}}_{n} /\sqrt{n}\) for sets \({\widehat{C}}_{n}\) that receive asymptotically probability \(1-\alpha \) under the standard Gaussian distribution. Thus, when the statistical model is well-specified, the \((1-\alpha )\)-credible sets \({\widehat{B}}_{n}\) are asymptotically equivalent to the Wald \((1-\alpha )\)-confidence sets based on the asymptotically normal estimators \({\widehat{\theta }}_{n}\), and consequently they are valid \(1-\alpha \) confidence sets.
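
The following minimal Python sketch, again using a hypothetical conjugate Beta-Bernoulli example, illustrates this asymptotic agreement by comparing the equal-tailed 95% posterior credible interval with the Wald 95% confidence interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, theta0, alpha = 2000, 0.3, 0.05
x = rng.binomial(1, theta0, size=n)

theta_hat = x.mean()
se = np.sqrt(theta_hat * (1.0 - theta_hat) / n)            # (n * I(theta_hat))^{-1/2}
z = stats.norm.ppf(1.0 - alpha / 2.0)
wald = (theta_hat - z * se, theta_hat + z * se)            # Wald (1 - alpha) confidence interval

a_post, b_post = 1 + x.sum(), 1 + n - x.sum()              # exact posterior under a Beta(1, 1) prior
credible = stats.beta.ppf([alpha / 2.0, 1.0 - alpha / 2.0], a_post, b_post)

print(wald, credible)                                      # the two intervals nearly coincide for large n
```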

The BVM theorem also concludes that, when the statistical model is misspecified, the sampling (sandwich) covariance matrix of the MLE (or other best-regular estimator) \({\widehat{\theta }}_{n}\), given by \(\tfrac{1}{n}V_{\theta _{0}} ^{-1}P_{0}{\dot{\ell }}_{\theta _{0}}{\dot{\ell }}_{\theta _{0}}^{\intercal } V_{\theta _{0}}^{-1}\), mismatches the limiting normal posterior covariance matrix, \(\tfrac{1}{n}V_{\theta _{0}}^{-1}\). Then Bayesian credible sets are not valid confidence sets at the nominal level.
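
A minimal Python sketch of this mismatch, under purely illustrative assumptions (fitting an exponential model to lognormal data, so the model is misspecified), compares the Monte Carlo sampling variance of the MLE with the sandwich formula and with the model-based \(V_{\theta _{0}}^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 2000

# Misspecified setting: the data are lognormal, but an exponential(theta) model is fitted,
# whose MLE is theta_hat = 1 / mean(X); theta_0 = 1 / E_{P_0} X minimizes the KL divergence.
mu, sigma = 0.0, 0.75
EX = np.exp(mu + sigma**2 / 2.0)                        # E_{P_0} X for the lognormal
VarX = (np.exp(sigma**2) - 1.0) * np.exp(2.0 * mu + sigma**2)
theta0 = 1.0 / EX

mles = np.array([1.0 / rng.lognormal(mu, sigma, size=n).mean() for _ in range(reps)])

sandwich = theta0**4 * VarX        # V^{-1} P_0 (l-dot)^2 V^{-1}, with V_{theta0} = 1 / theta0^2
model_based = theta0**2            # V^{-1}: n times the limiting posterior variance
print(n * mles.var(ddof=1), sandwich, model_based)      # sampling variance tracks the sandwich, not V^{-1}
```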

Appendix B: Justification of the bootstrap in the Bayesian context

Here we establish the consistency of both tolerance estimators \({\widehat{\varepsilon }}_{n}\), and the bootstrap estimator \(\widehat{\mathrm {cov}}({\widehat{\theta }}_{n})\) of the covariance matrix of the MLE \({\widehat{\theta }}_{n}\) (or other consistent point estimator), by recalling a theorem related to the delta method for the bootstrap. Then, we relate these consistency results to the Bernstein-Von-Mises Theorem (Kleijn and van der Vaart 2012), outlined in Appendix A, which asserts that, asymptotically (in the sample size \(n\rightarrow \infty \)), the sampling distribution of an asymptotically consistent point estimator coincides with the posterior distribution of the parameter.

First, the delta method for the bootstrap (van der Vaart 1998, Theorem 23.5, p. 331) asserts that given a measurable map \(\phi :{\mathbb {R}}^{k}\mapsto {\mathbb {R}}^{m}\) defined and continuously differentiable in a neighborhood of \(\theta _{0}\), and given statistics \({\widehat{\theta }}_{n}\), with bootstrap versions \({\widehat{\theta }}_{n}^{*}\) generated by a bootstrap sampling scheme, taking their values in the domain of \(\phi \) and converging almost surely to \(\theta _{0}\),

$$\begin{aligned}{}\begin{array}{lccc} \text {If both} &{} \sqrt{n}({\widehat{\theta }}_{n}-\theta _{0})\leadsto T &{} \text {and} &{} \sqrt{n}({\widehat{\theta }}_{n}^{*}-{\widehat{\theta }} _{n})\leadsto T\\ &{} &{} &{} \text {conditionally almost surely,}\\ \text {then both} &{} \sqrt{n}(\phi ({\widehat{\theta }}_{n})-\phi (\theta _{0}))\leadsto \phi _{\theta }^{\prime }(T) &{} \text {and} &{} \sqrt{n}(\phi ({\widehat{\theta }}_{n}^{*})-\phi ({\widehat{\theta }}_{n}))\leadsto \phi _{\theta }^{\prime }(T)\\ &{} &{} &{} \text {conditionally almost surely,} \end{array} \nonumber \\ \end{aligned}$$
(4.7)

where throughout, \(\leadsto \) symbolizes convergence in distribution. This theorem gives key conditions under which the bootstrap yields sensible results for a wide variety of functions \(\phi \).
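
The following minimal Python sketch illustrates display (4.7) under illustrative assumptions: the estimator is a sample mean, the bootstrap is the ordinary empirical (resampling) bootstrap, and \(\phi (\theta )=\exp (\theta )\) is an arbitrary smooth map; none of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n, B, theta0 = 400, 2000, 1.0
X = rng.normal(theta0, 1.0, size=n)          # theta_hat_n = sample mean, a consistent estimator of theta0
theta_hat = X.mean()

phi = np.exp                                  # a smooth map phi; phi(theta) = exp(theta) is an arbitrary choice

# Empirical bootstrap: resample the data, recompute theta_hat*, and apply phi
boot = np.array([phi(rng.choice(X, size=n, replace=True).mean()) for _ in range(B)])
boot_scaled = np.sqrt(n) * (boot - phi(theta_hat))        # sqrt(n) (phi(theta_hat*) - phi(theta_hat))

# Delta-method limit: phi'(theta0) * T with T ~ N(0, 1) here, so the limiting s.d. is exp(theta0)
print(boot_scaled.std(ddof=1), np.exp(theta0))            # the two spreads should roughly agree
```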

Now, recall the two alternative tolerance estimators (2.2) of the mean tolerance parameter (2.1), which is based on \(\phi \) being the distance function,

$$\begin{aligned} \phi ({\widehat{\theta }}_{n}^{*})=||s(y^{*};{\widehat{\theta }}_{n}^{*}(X^{*}))-s(x)||,\end{aligned}$$

being a measurable map that is continuously differentiable in a neighborhood of \(\theta _{0}\), the value of the parameter which minimizes the Kullback-Leibler divergence. Also, each tolerance estimator is based on a bootstrap sampling scheme which draws bootstrap samples \({\widehat{\theta }}_{n}^{*}\) directly from the asymptotic normal distribution of the point estimator \({\widehat{\theta }}_{n}\), a bootstrap sampling scheme justified by the asymptotic normality of \({\widehat{\theta }}_{n}\) (van der Vaart 1998, p. 326). The theorem’s condition \(\sqrt{n}({\widehat{\theta }}_{n}-\theta _{0})\leadsto T\) holds, given that M-estimators under regularity conditions are asymptotically normal, with \(\sqrt{n}({\widehat{\theta }}_{n}-\theta _{0})\leadsto T\sim {\mathcal {N}}(0,V_{\theta _{0}}^{-1}P_{0}{\dot{\ell }}_{\theta _{0}} {\dot{\ell }}_{\theta _{0}}^{\intercal }V_{\theta _{0}}^{-1})\) (van der Vaart 1998, Theorem 5.23, p. 53), where \(P_{0}{\dot{\ell }}_{\theta _{0}}{\dot{\ell }}_{\theta _{0}}^{\intercal }= {\textstyle \int } {\dot{\ell }}_{\theta _{0}}(x){\dot{\ell }}_{\theta _{0}}(x)^{\intercal }\mathrm {d}P_{0}(x)\) and \(V_{\theta _{0}}=-P_{0}\ddot{\ell }_{\theta _{0}}=- {\textstyle \int } \tfrac{\partial ^{2}}{\partial \theta _{0}^{2}}\log p_{\theta _{0}}(x)\mathrm {d} P_{0}(x)\), with \(\ell _{\theta }(x)=\log p_{\theta }(x)\), and \({\dot{\ell }}_{\theta }=\frac{\partial }{\partial \theta }\ell _{\theta }\) the score function at \(\theta \in \Theta \). Here, if the model is well-specified, that is, when the dataset x is generated from a true distribution \(P_{0}\) supported by the model, such that \(P_{0}=P_{\theta _{0}}\) for some \(\theta _{0}\in \Theta \), then under mild conditions the Fisher information matrix \({\mathcal {I}}_{\theta _{0}}\) exists, with \({\mathcal {I}}_{\theta _{0}}=V_{\theta _{0}}=-P_{0}\ddot{\ell }_{\theta _{0}}=P_{0}{\dot{\ell }}_{\theta _{0}}{\dot{\ell }}_{\theta _{0}}^{\intercal }\) and \(V_{\theta _{0}}^{-1}={\mathcal {I}}_{\theta _{0}}^{-1}=V_{\theta _{0}}^{-1}P_{0}{\dot{\ell }}_{\theta _{0}}{\dot{\ell }}_{\theta _{0}}^{\intercal }V_{\theta _{0}}^{-1}\) (see Sect. 1 and Appendix A for more details). The theorem’s condition \(\sqrt{n}({\widehat{\theta }}_{n}^{*}-{\widehat{\theta }}_{n})\leadsto T\) holds because of the weak consistency of the general M-estimator (e.g., MLE) under regularity conditions, \({\widehat{\theta }}_{n}\overset{P}{\rightarrow }\theta _{0}\) (van der Vaart 1998, Theorem 5.7, p. 45). Then, by Slutsky’s lemma, \(\sqrt{n}({\widehat{\theta }}_{n}^{*}-{\widehat{\theta }}_{n})\) and \(\sqrt{n}({\widehat{\theta }}_{n}-\theta _{0})\) are asymptotically equivalent, in the sense that \((\sqrt{n}({\widehat{\theta }}_{n}-\theta _{0})-\sqrt{n}({\widehat{\theta }}_{n}^{*}-{\widehat{\theta }}_{n}))\overset{P}{\rightarrow }0\), equivalently \((\sqrt{n}({\widehat{\theta }}_{n}-\theta _{0})-\sqrt{n}({\widehat{\theta }}_{n}^{*}-{\widehat{\theta }}_{n}))\leadsto 0\) (van der Vaart 1998, Theorem 2.7, p. 10). So by Slutsky’s theorem (Ferguson 1996, Theorem 6(b), p. 39), \(\sqrt{n}({\widehat{\theta }}_{n}-\theta _{0})\leadsto T\) and this asymptotic equivalence together imply \(\sqrt{n}({\widehat{\theta }}_{n}^{*}-{\widehat{\theta }}_{n})\leadsto T\).
Hence, in summary, all of the conditions of the delta method for bootstrap theorem (4.7) are met, which in turn implies both \(\sqrt{n}(\phi ({\widehat{\theta }}_{n})-\phi (\theta _{0}))\leadsto \phi _{\theta }^{\prime }(T)\) and \(\sqrt{n}(\phi ({\widehat{\theta }}_{n}^{*})-\phi ({\widehat{\theta }}_{n}))\leadsto \phi _{\theta }^{\prime }(T)\). In turn, after taking a simple transformation of \(\sqrt{n}(\phi ({\widehat{\theta }} _{n}^{*})-\phi ({\widehat{\theta }}_{n}))\) to obtain \(\phi ({\widehat{\theta }} _{n}^{*})\), this weak convergence is equivalent to the convergence \(\mathrm {E}_{P_{0}}\phi ({\widehat{\theta }}_{n}^{*})\rightarrow \mathrm {E} _{P_{0}}\phi (\theta _{0})\), according to the Portmanteau Lemma (van der Vaart 1998, Lemma 2.2, p. 6,7), since \(\phi \) is a continuous bounded function. Then \(\mathrm {E}_{P_{0}} \phi ({\widehat{\theta }}_{n}^{*})\) is an asymptotically consistent estimator of the true tolerance, \(\varepsilon =\mathrm {E}_{P_{0}}\phi (\theta _{0})\).
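
To make this concrete, here is a minimal Python sketch of one plausible reading of such a tolerance estimator, under purely illustrative assumptions (a Gaussian-mean model, the sample mean as the point estimator and summary statistic, and the average distance as the tolerance); the paper's Algorithms and Eqs. (2.1)-(2.2) define the actual estimators.

```python
import numpy as np

rng = np.random.default_rng(5)
n, B = 300, 1000
x = rng.normal(2.0, 1.0, size=n)                 # observed data (hypothetical Gaussian-mean model)

theta_hat = x.mean()                             # MLE, used as the ABC summary statistic s(x)
cov_hat = x.var(ddof=1) / n                      # estimated asymptotic variance of theta_hat

# Draw theta* from the estimated asymptotic normal distribution of theta_hat, and average the
# summary-statistic distances phi(theta*) = ||theta* - theta_hat|| to estimate the tolerance.
theta_star = rng.normal(theta_hat, np.sqrt(cov_hat), size=B)
eps_hat = np.mean(np.abs(theta_star - theta_hat))
print(eps_hat)                                   # bootstrap estimate of the mean tolerance
```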

When the point estimator \({\widehat{\theta }}_{n}\) does not readily provide an estimate \(\widehat{\mathrm {cov}}({\widehat{\theta }}_{n})\) of its sampling covariance matrix, the empirical bootstrap method is used to calculate this covariance matrix estimate \(\widehat{\mathrm {cov}}(\widehat{\theta }_{n})\), the estimator of \(\mathrm {E}_{P_{0}}\phi ({\widehat{\theta }}_{n}^{*}(X^{*}))\) given by (1.3), as mentioned in Sect. 1. Here, the function is \(\phi ({\widehat{\theta }}_{n}^{*})=({\widehat{\theta }}_{n} ^{*}(X^{*})-\theta _{0})({\widehat{\theta }}_{n}^{*}(X^{*})-\theta _{0})^{\intercal }\), and the expectation is taken with respect to parameter estimate samples \({\widehat{\theta }}_{n}^{*}(X^{*})\) obtained from random datasets \(X^{*}\sim P_{0}\), each of size n, drawn from the true distribution \(P_{0}\) that generated the given sample dataset x, with \(P_{0}\) replaced by the plug-in estimator \({\widehat{P}}_{0}={\mathbb {P}}_{n}\), the empirical measure \({\mathbb {P}}_{n}(\cdot )=\frac{1}{n} {\textstyle \sum \nolimits _{i=1}^{n}} \delta _{X_{i}}(\cdot )\) that assigns probability 1/n to each of the n sample observations \(X_{1},\ldots ,X_{n}\) forming the dataset x. Then, assuming that \(\phi \) is a measurable map that is continuously differentiable in a neighborhood of \(\theta _{0}\) (the value of the parameter which minimizes the Kullback-Leibler divergence), the same arguments used for both tolerance estimators above can also be applied to this empirical bootstrap sampling scheme, to establish the delta method for the bootstrap and the consistency of the bootstrap estimator of the covariance matrix of \({\widehat{\theta }}_{n}\). When the model is well-specified, the estimator \(\widehat{\mathrm {cov} }({\widehat{\theta }}_{n})\) provides a consistent estimator of the inverse Fisher information matrix, \({\mathcal {I}}_{\theta _{0}}^{-1}=V_{\theta _{0}}^{-1}P_{0} {\dot{\ell }}_{\theta _{0}}{\dot{\ell }}_{\theta _{0}}^{\intercal }V_{\theta _{0}}^{-1}\).
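
A minimal Python sketch of such an empirical bootstrap covariance estimate follows; the Gaussian-mean model and the sample mean as the point estimator are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(6)
n, B = 300, 500
X = rng.normal(2.0, 1.0, size=n)                       # the observed i.i.d. sample forming the dataset x

def point_estimator(sample):
    return sample.mean()                               # stand-in for any consistent point estimator

# Empirical bootstrap: resample from the empirical measure P_n and recompute the estimator
boot = np.array([point_estimator(rng.choice(X, size=n, replace=True)) for _ in range(B)])
cov_hat = boot.var(ddof=1)                             # bootstrap estimate of cov(theta_hat_n)
print(cov_hat, X.var(ddof=1) / n)                      # compare with the plug-in variance of the sample mean
```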

If \({\widehat{\theta }}_{n}\) is the MLE, then \(({\widehat{\theta }}_{n},{\widehat{\varepsilon }}_{n})\) is an MLE due to the invariance property of the MLE, which states that any function of the MLE \({\widehat{\theta }}_{n}\) is the MLE of that function (Casella and Berger 2002, Theorem 7.2.10, pp. 320–321). Such an invariance property also holds for any general consistent minimum distance estimators (Drossos and Philippou 1980), which are special M-estimators (e.g., van der Vaart 1998, Ch. 5). Then, for either tolerance estimator, the Bernstein-Von-Mises theorem (summarized in Appendix A) establishes the convergence of the posterior distribution of \(\theta \) to the asymptotic normal distribution of \({\widehat{\theta }}_{n}\), based on the approximate likelihood \( {\textstyle \int } f_{s}(s(x)+v\mid \theta )K_{{\widehat{\varepsilon }}_{n}}(v)\text {d}v= {\textstyle \int } p_{\theta }(y){\mathbf {1}}_{\left\| s(y)-s(x)\right\| \le {\widehat{\varepsilon }}_{n}}\text {d}y\) and the consistent estimator \({\widehat{\varepsilon }}_{n}\) of the tolerance \(\varepsilon \).
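
Putting the pieces together, the following minimal Python sketch shows classical ABC rejection with the point estimate as the summary statistic and a pre-computed tolerance of the kind estimated above; the Gaussian-mean model, the N(0, 10^2) prior, and all numerical settings are illustrative assumptions rather than the paper's algorithms.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
x = rng.normal(2.0, 1.0, size=n)                 # observed data (hypothetical Gaussian-mean model)
s_x = x.mean()                                   # summary statistic: the point estimate
eps_hat = 0.08                                   # tolerance, e.g. a bootstrap estimate as sketched above

accepted = []
for _ in range(20000):
    theta = rng.normal(0.0, 10.0)                # draw theta from an illustrative N(0, 10^2) prior
    y = rng.normal(theta, 1.0, size=n)           # simulate a synthetic dataset from the model at theta
    if abs(y.mean() - s_x) <= eps_hat:           # accept theta if the summaries are within the tolerance
        accepted.append(theta)

accepted = np.array(accepted)
print(accepted.mean(), accepted.std(ddof=1))     # ABC posterior mean and s.d. for theta
```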


Cite this article

Karabatsos, G. Approximate Bayesian computation using asymptotically normal point estimates. Comput Stat 38, 531–568 (2023). https://doi.org/10.1007/s00180-022-01226-3
