Abstract
Although Bayesian deep neural network models are ubiquitous in classification problems, their Markov chain Monte Carlo (MCMC) based implementation suffers from high computational cost, limiting the use of this powerful technique in large-scale studies. Variational Bayes (VB) has emerged as a competitive alternative that overcomes some of these computational issues. This paper focuses on variational Bayesian deep neural network estimation methodology and discusses the related statistical theory and algorithmic implementations in the context of classification. For dense deep neural network-based classification, the paper compares and contrasts the consistency and contraction rates of the true posterior with those of the corresponding variational posterior. Based on the complexity of the deep neural network (DNN), this paper provides an assessment of the loss in classification accuracy due to the use of VB and guidelines on the characterization of the prior distributions and the variational family. The difficulty of the numerical optimization required to obtain the variational Bayes solution is also quantified as a function of the complexity of the DNN. The development is motivated by an important biomedical engineering application, namely building predictive tools for the transition from mild cognitive impairment to Alzheimer’s disease. The predictors are multi-modal and may involve complex interactive relations.







Availability of data and materials
The data is publicly available.
Code availability
The computational code is available.
References
Bai, J., Song, Q., Cheng, G.: Efficient variational inference for sparse deep learning with theoretical guarantee. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 466–476. Curran Associates, Inc. (2020)
Barron, A., Schervish, M.J., Wasserman, L.: The consistency of posterior distributions in nonparametric problems. Ann. Stat. 27(2), 536–561 (1999)
Bhattacharya, S., Maiti, T.: Statistical foundation of variational Bayes neural networks. Neural Netw. 137, 151–173 (2021)
Bishop, C.M.: Bayesian neural networks. J. Braz. Comput. Soc. 4(1), 61–68 (1997)
Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Blei, D.M., Lafferty, J.D.: A correlated topic model of science. Ann. Appl. Stat. 1(1), 17–35 (2007)
Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D.: Weight uncertainty in neural network. In: Proceedings of Machine Learning Research, vol. 37, pp. 1613–1622. PMLR (2015)
Cannings, T.I., Samworth, R.J.: Random-projection ensemble classification. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 79(4), 959–1035 (2017)
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems (2015)
Chérief-Abdellatif, B.-E.: Convergence rates of variational inference in sparse deep learning. In: Daumé III, H., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 119, pp. 1831–1842. PMLR (2020)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(61), 2121–2159 (2011)
Hinton, G.E., Van Camp, D.: Keeping the neural networks simple by minimizing the description length of the weights. In: Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT’93, pp. 5–13. ACM Press (1993)
Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York (2009)
Graves, A.: Practical variational inference for neural networks. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24, pp. 2348–2356. Curran Associates, Inc. (2011)
Graves, A.: Generating sequences with recurrent neural networks (2014). arXiv:1308.0850
Gurney, K.: An Introduction to Neural Networks. Taylor & Francis Inc., USA (1997). (ISBN 1857286731)
Hinton, G., Srivastava, N., Swersky, K.: Lecture 6a Overview of Mini-batch Gradient Descent (2012). http://www.cs.toronto.edu/hinton/coursera/lecture6/lec6.pdf
Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)
Hubin, A., Storvik, G., Frommlet, F.: Deep Bayesian regression models (2018). arXiv:1806.02160
Javid, K., Handley, W., Hobson, M.P., Lasenby, A.: Compromise-free Bayesian neural networks (2020). arXiv:2004.12211
Kingma, D.P., Salimans, T., Welling, M.: Variational dropout and the local reparameterization trick. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 2575–2583. Curran Associates, Inc. (2015)
Korolev, I.: Alzheimer’s disease: a clinical and basic science review. Med. Stud. Res. J. 4(1), 24–33 (2014)
Korolev, I.O., Symonds, L.L., Bozoki, A.C., Initiative, A.D.N.: Predicting progression from mild cognitive impairment to Alzheimer’s dementia using clinical, MRI, and plasma biomarkers via probabilistic pattern classification. PLoS ONE 11(2), e0138866 (2016)
Lampinen, J., Vehtari, A.: Bayesian approach for neural networks-review and case studies. Neural Netw. Off. J. Int. Neural Netw. Soc. 14(3), 257–274 (2001)
Lee, H.K.H.: Consistency of posterior distributions for neural networks. Neural Netw. 13(6), 629–642 (2000)
Li, X., Li, C., Chi, J., Ouyang, J.: Variance reduction in black-box variational inference by adaptive importance sampling. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, International Joint Conferences on Artificial Intelligence Organization, pp. 2404–2410 (2018)
Liang, F., Li, Q., Zhou, L.: Bayesian neural networks for selection of drug sensitive genes. J. Am. Stat. Assoc. 113(523), 955–972 (2018)
Liu, Z., Maiti, T., Bender, A.: A role for prior knowledge in statistical classification of the transition from MCI to Alzheimer’s disease. Unpublished report (2020)
Matthews, A.G. de G., Hron, J., Rowland, M., Turner, R.E., Ghahramani, Z.: Gaussian process behaviour in wide deep neural networks. In: International Conference on Learning Representations (2018)
McKinney, W.: Data structures for statistical computing in python. In: van der Walt, S., Millman, J. (eds.) Proceedings of the 9th Python in Science Conference, pp. 56–61 (2010)
McMahan, H.B.: A survey of algorithms and analysis for adaptive online learning. J. Mach. Learn. Res. 18(90), 1–50 (2017)
Mullachery, V., Khera, A., Husain, A.: Bayesian neural networks (2018). arXiv:1801.07710
Nagapetyan, T., Duncan, A.B., Hasenclever, L., Vollmer, S.J., Szpruch, L., Zygalakis, K.: The true cost of stochastic gradient Langevin dynamics (2017). arXiv:1706.02692
Neal, R.M.: Bayesian training of backpropagation networks by the hybrid Monte-Carlo method (1992). https://www.cs.toronto.edu/~radford/ftp/bbp.pdf
Paisley, J., Blei, D., Jordan, M.: Variational Bayesian inference with stochastic search. In: Proceedings of the 29th International Conference on International Conference on Machine Learning, ICML’12, pp. 1363–1370. ACM Press (2012)
Pati, D., Bhattacharya, A., Yang, Y.: On statistical optimality of variational Bayes. In: Storkey, A., Perez-Cruz, F. (eds.) Proceedings of Machine Learning Research, vol. 84, pp. 1579–1588. PMLR (2018)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Petersen, R.C., Roberts, R.O., Knopman, D.S., Boeve, B.F., Geda, Y.E., Ivnik, R.J., Smith, G.E., Jack, C.R.: Mild cognitive impairment: ten years later. Arch. Neurol. 66(12), 1447–1455 (2009). https://doi.org/10.1001/archneurol.2009.266
Pollard, D.: Empirical processes: Theory and applications. NSF-CBMS Regional Conference Series in Probability and Statistics 2, i–86 (1990)
Polson, N.G., Ročková, V.: Posterior concentration for sparse deep learning. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. (2018)
Ranganath, R., Gerrish, S., Blei, D.M.: Black box variational inference (2013). arXiv:1401.0118
Ross, S.M.: Simulation, 5th edn. Academic Press (2013). (ISBN 9780124158252)
Schmidt-Hieber, J.: Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 48(4), 1875–1897 (2020)
Singh, B., De, S., Zhang, Y., Goldstein, T., Taylor, G.: Layer-specific adaptive learning rates for deep networks (2015). arXiv:1510.04609
Sun, S., Chen, C., Carin, L.: Learning structured weight uncertainty in Bayesian neural networks. In: Proceedings of Machine Learning Research, vol. 54, pp. 1283–1292. PMLR (2017)
Sun, S., Zhang, G., Shi, J., Grosse, R.B.: Functional variational Bayesian neural networks. In: 7th International Conference on Learning Representations, ICLR 2019. OpenReview.net (2019)
Sun, Y., Song, Q., Liang, F.: Consistent sparse deep learning: theory and computation. J. Am. Stat. Assoc. 0 (ja):1–42 (2021)
Taghia, J.: Lecture Notes. Part III: black-box variational inference (2018). http://www.it.uu.se/research/systems_and_control/education/2018/pml/lectures/VILectuteNotesPart3.pdf
Torben, S., Sumeetpal Sidhu, S.: Trace-class Gaussian priors for Bayesian learning of neural networks with MCMC. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 85(1), 46–66 (2023)
van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, New York (1996)
Wan, R., Zhong, M., Xiong, H., Zhu, Z.: Neural control variates for variance reduction (2018). arXiv:1806.00159
Wang, Y., Blei, D.M.: Frequentist consistency of variational Bayes. J. Am. Stat. Assoc. 114(527), 1147–1161 (2019)
Welling, M., Teh, Y.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pp. 681–688. ACM Press (2011)
Wing Hung, W., Xiaotong, S.: Probability inequalities for likelihood ratios and convergence rates of sieve MLES. Ann. Stat. 23(2), 339–362 (1995)
Wu, A., Nowozin, S., Meeds, E., Turner, R.E., Hernández-Lobato, J.M., Gaunt, A.L.: Deterministic variational inference for robust Bayesian neural networks (2019). https://openreview.net/forum?id=B1l08oAct7
Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms (2017). arXiv:1708.07747
Yang, K., Maiti, T.: Statistical aspects of high-dimensional sparse artificial neural network models. Mach. Learn. Knowl. Extr. 2(1), 1–19 (2020)
Yang, Y., Pati, D., Bhattacharya, A.: \(\alpha \)-variational inference with statistical guarantees. Ann. Stat. 48(2), 886–905 (2020)
Zhang, D., Shen, D.: Multi-modal multi-task learning for joint prediction of clinical scores in Alzheimer’s disease. In: Tianming, L., Dinggang, S., Luis, I., Xiaodong, T. (eds.) Multimodal Brain Image Analysis, pp. 60–67. Springer Berlin Heidelberg, Berlin, Heidelberg (2011)
Zhang, D., Shen, D., Initiative, A.D.N.: Predicting future clinical changes of mci patients using longitudinal and multimodal biomarkers. PLoS ONE 7(3), e0033182 (2012)
Zhang, F., Gao, C.: Convergence rates of variational posterior distributions. Ann. Stat. 48(4), 2180–2207 (2020)
Zhu, C., Cheng, Y., Gan, Z., Huang, F., Liu, J., Goldstein, T.: Adaptive learning rates with maximum variation averaging (2020). arXiv:2006.11918
Funding
This work is partially supported by the grants NSF-1924724, NSF-1952856, and NSF-2124605.
Author information
Authors and Affiliations
Contributions
Equal contribution from all three authors.
Corresponding author
Ethics declarations
Conflict of interest
None.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Appendices
Appendix A Algorithms of variational implementation.
With q and p as in (12) and (10), respectively, the variational implementation proceeds by iteratively maximizing the evidence lower bound (ELBO) over the variational family.
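As a self-contained illustration of this optimization (a sketch only, not the paper's exact algorithm), the code below fits a mean-field Gaussian variational approximation to a one-hidden-layer logistic network by single-sample reparameterized gradient descent on the negative ELBO. The layer sizes, prior scale, softplus parameterization of the variational standard deviations, learning rate, synthetic data, and the finite-difference likelihood gradient (a slow stand-in for backpropagation) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer logistic network; sizes are illustrative only.
p, k = 4, 8                      # number of inputs and hidden units
K = (p + 1) * k + (k + 1)        # total number of network weights theta_n

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def softplus(u):
    return np.log1p(np.exp(u))

def unpack(theta):
    """Split the flat weight vector into (A_0, b_0, A_1, b_1)."""
    A0 = theta[:p * k].reshape(k, p)
    b0 = theta[p * k:p * k + k]
    A1 = theta[p * k + k:p * k + 2 * k]
    b1 = theta[-1]
    return A0, b0, A1, b1

def neg_log_lik(theta, X, y):
    """Bernoulli negative log-likelihood of the network eta_theta(x)."""
    A0, b0, A1, b1 = unpack(theta)
    eta = sigmoid(X @ A0.T + b0) @ A1 + b1
    pr = np.clip(sigmoid(eta), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(pr) + (1 - y) * np.log(1 - pr))

def num_grad(f, theta, eps=1e-5):
    """Central finite differences: a slow stand-in for backpropagation."""
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        g[j] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

# Synthetic binary data standing in for the real predictors.
n = 200
X = rng.uniform(0, 1, size=(n, p))
y = (rng.uniform(size=n) < sigmoid(2 * X[:, 0] - 3 * X[:, 1] + 0.5)).astype(float)

# Prior p = MVN(0, zeta^2 I); variational family q = MVN(mu, diag(softplus(rho)^2)).
zeta = 1.0
mu = np.zeros(K)
rho = np.full(K, -3.0)
lr = 5e-3

for t in range(1000):
    sig = softplus(rho)
    eps_ = rng.standard_normal(K)
    theta = mu + sig * eps_                             # reparameterization trick
    g_theta = num_grad(lambda th: neg_log_lik(th, X, y), theta)
    g_kl_mu = mu / zeta**2                              # d KL(q, p) / d mu
    g_kl_sig = sig / zeta**2 - 1.0 / sig                # d KL(q, p) / d sigma
    g_mu = g_theta + g_kl_mu                            # gradient of the negative ELBO
    g_rho = (g_theta * eps_ + g_kl_sig) * sigmoid(rho)  # chain rule through softplus
    mu -= lr * g_mu
    rho -= lr * g_rho

A0, b0, A1, b1 = unpack(mu)
eta_hat = sigmoid(X @ A0.T + b0) @ A1 + b1
print("training accuracy at the variational mean:", np.mean((eta_hat > 0) == y))
```

In practice the likelihood gradient would be computed by automatic differentiation, the full-data term replaced by mini-batch estimates, and the constant learning rate replaced by an adaptive scheme such as those cited in the references.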
Appendix B Preliminaries
Definition 1
\(MVN(\varvec{\mu },\varvec{\Sigma })\) is used to denote the density function of the multivariate normal distribution with mean \(\varvec{\mu }\) and variance-covariance matrix \(\varvec{\Sigma }\).
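Explicitly, for a \(K_n\)-dimensional argument \(\varvec{\theta }\), this is the usual multivariate normal density

$$\begin{aligned} MVN(\varvec{\mu },\varvec{\Sigma })(\varvec{\theta })=(2\pi )^{-K_n/2}|\varvec{\Sigma }|^{-1/2}\exp \left( -\frac{1}{2}(\varvec{\theta }-\varvec{\mu })^\top \varvec{\Sigma }^{-1}(\varvec{\theta }-\varvec{\mu })\right) . \end{aligned}$$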
Definition 2
For a vector \(\varvec{\alpha }\) and a function g,
-
1.
\(||\varvec{\alpha }||_1=\sum _i |\alpha _i|\), \(||\varvec{\alpha }||_2=\sqrt{\sum _i \alpha _i^2}\), \(||\varvec{\alpha }||_\infty =\max _i |\alpha _i|\).
-
2.
\(||g||_1=\int _{\varvec{x}\in \chi } |g(\varvec{x})|d\varvec{x}\), \(||g||_2=\sqrt{\int _{\varvec{x}\in \chi } g(\varvec{x})^2d\varvec{x}}\), \(||g||_\infty =\sup _{\varvec{x}\in \chi } |g(\varvec{x})|\)
Definition 3
(Bracketing number and entropy) For any two functions l and u, define the bracket [l, u] as the set of all functions f such that \(l\le f\le u\) pointwise. Let ||.|| be a metric. Define an \(\varepsilon -\)bracket as a bracket with \(||u-l||\le \varepsilon \). Define the bracketing number of a set of functions \(\mathcal {F}^*\) as the minimum number of \(\varepsilon -\)brackets needed to cover \(\mathcal {F}^*\), and denote it by \(N_{[]}(\varepsilon ,\mathcal {F}^*,||.||)\). Finally, the Hellinger bracketing entropy, denoted by \(H_{[]}(\varepsilon ,\mathcal {F}^*,||.||)\), is the natural logarithm of the bracketing number (Pollard 1990).
Definition 4
(Covering number and entropy) Let (V, ||.||) be a normed space, and \(\mathcal {F} \subset V\). \(\{V_1,\ldots , V_n \}\) is an \(\varepsilon -\)covering of \(\mathcal {F}\) if \(\mathcal {F} \subset \cup _{i=1}^n B(V_i,\varepsilon )\), or equivalently, \(\forall \) \(\theta \in \mathcal {F}\), \(\exists \) i such that \(||\theta -V_i||<\varepsilon \). The covering number of \(\mathcal {F}\) is denoted by \(N(\varepsilon ,\mathcal {F},||.||)=\min \{n: \exists \, \varepsilon -\text { covering over }\mathcal {F}\text { of size } n \}\). Finally, the Hellinger covering entropy, denoted by \(H(\varepsilon , \mathcal {F},||.||)\), is the natural logarithm of the covering number (Pollard 1990).
Lemma 5 gives a bound on the integral of the Hellinger entropy. Lemma 6 shows that the prior gives negligible probability outside the sieve \(\mathcal {F}_n\). Lemma 7 shows that if the prior gives sufficient mass to the KL neighborhoods of the true density, then the marginal density is well bounded. Lemma 8 shows that if the parameters of two neural networks are close, then so are the networks themselves. Lemmas 5, 6, 7 and 8 will serve as important tools towards the proof of consistency of the true posterior.
Lemma 5
With \(H_{[]}(u,\widetilde{\mathcal {F}}_n,||.||_2)\) as in Definition 3, for \(H_{[]}(u,\widetilde{\mathcal {F}}_n,||.||_2)\le K_n\log (M_n/u)\),
Proof
See proof of lemma 7.14 in Bhattacharya and Maiti (2021). \(\square \)
Lemma 6
Suppose, \(\int _{\mathcal {F}_n^c} p(\varvec{\theta }_{n}) d\varvec{\theta }_{n} \le e^{-n\varepsilon }, n \rightarrow \infty \) for any \(\varepsilon >0\). Then, for every \(\tilde{\varepsilon }<\varepsilon \).
Proof
See proof of lemma 7.16 in Bhattacharya and Maiti (2021). \(\square \)
Lemma 7
Suppose \(\mathcal {N}_\varepsilon =\{\varvec{\theta }_{n}: d_{\text {KL}}(\ell _0,\ell _{\varvec{\theta }_{n}})<\varepsilon \}\) and \( \int _{\mathcal {N}_\varepsilon } p(\varvec{\theta }_{n})d\varvec{\theta }_{n}\ge e^{-n\varepsilon }, n\rightarrow \infty \) then for any \(\nu >0\),
Proof
See proof of lemma 7.12 in Bhattacharya and Maiti (2021). \(\square \)
Lemma 8
Let \(\eta _{\varvec{\theta }_{n}^*}(\varvec{x})=\varvec{b}_L^*+\varvec{A}_L^* \psi (\varvec{b}_{L-1}^*+\varvec{A}_{L-1}^* \psi ( \cdots \psi (\varvec{b}_1^*+\varvec{A}_1^*\psi (\varvec{b}_0^*+\varvec{A}_0^* \varvec{x})))\) be a fixed neural network. Let \(\eta _{\varvec{\theta }_{n}}(\varvec{x})=\varvec{b}_L+\varvec{A}_L \psi (\varvec{b}_{L-1}+\varvec{A}_{L-1} \psi ( \cdots \psi (\varvec{b}_1+\varvec{A}_1\psi (\varvec{b}_0+\varvec{A}_0 \varvec{x})))\) be a neural network such that
where \(\tilde{k}_{vn}=k_{vn}+1\). Then,
Proof
In the proof, we suppress the dependence on n. Define the projection \(P_V\) as \(P_V \eta _{\varvec{\theta }}(\varvec{x})=\varvec{b}_{V-1}+\varvec{A}_{V-1} \psi ( \cdots \psi (\varvec{b}_1+\varvec{A}_1\psi (\varvec{b}_0+\varvec{A}_0\varvec{x})))\). We claim that
We prove this by induction. For the base case \(v=1\), let \(\tilde{\varepsilon }=\varepsilon /\sum _{v=0}^L \tilde{k}_v\prod _{v'=v+1}^L a^*_{v'}\); then
where the second line holds since \(\psi (u)\le 1\) and the third step is shown next. Let \(u=-\varvec{b}_{0}[s]-\varvec{A}_{0}[s]^\top \varvec{x}\) and \(u_\delta =\varvec{b}_{0}[s]+\varvec{A}_{0}[s]^\top \varvec{x}-\varvec{b}_{0}^*[s]-{\varvec{A}_{0}^*[s]}^\top \varvec{x}\), then for \(|u_\delta |<1\)
since \(e^u/((1+e^u)(1+e^{u-1}))\le 1/2\) and \(|e^{u_\delta }-1|\le 2|u_\delta |\) for \(|u_\delta |<1\). Now, \(|u_\delta |\le |\varvec{b}_{0}[s] -\varvec{b}_{0}^*[s]|+\sum _{s'=0}^{p_n} |\varvec{A}_0[s][s']-\varvec{A}_0^*[s][s']| \le (p_n+1)\tilde{\varepsilon }<1\).
Suppose the result holds for \(V-1\); we show it for V as follows:
where the second step follows since \(\psi (u)\le 1\) and the third step follows by relation (B3) provided \(|P_{V-1} \eta _{\varvec{\theta }}(\varvec{x})[s]-P_{V-1} \eta _{\varvec{\theta }^*}(\varvec{x})[s]|\le 1\). But this holds using relation (B2) with \(v=V-1\).
Thus proceeding further we get
This completes the proof. \(\square \)
Lemma 9 shows that if the expected KL divergence between two densities is small, then the expected log-likelihood ratio between the two densities is also well bounded, where the expectation is taken with respect to the variational member q. Lemma 10 shows that if two functions are close, then so is the logistic loss between them. Lemmas 8, 9 and 10 together will serve as tools towards establishing that the variational and true posteriors are close in the KL-distance.
Lemma 9
Suppose q satisfies \(\int d_{\textrm{KL}}(\ell _0,\ell _{\varvec{\theta }_{n}}) q(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\le \varepsilon ,\) then for any \(\nu >0\),
Proof
See proof of lemma 7.13 in Bhattacharya and Maiti (2021). \(\square \)
Lemma 10
If \(|\eta _0(\varvec{x})-\eta _{\varvec{\theta }_{n}}(\varvec{x})|\le \varepsilon \), then \(|h_{\varvec{\theta }_{n}}(\varvec{x})|\le 2\varepsilon \) where
Proof
Note that,
where the second step follows by using \(\sigma (x)=e^{x}/(1+e^x) \le 1\) and the proof of the third step is shown below. \(\square \)
Let \(p=\sigma (\eta _0(\varvec{x}))\), then \(0\le p \le 1\) and \( r=\eta _{\varvec{\theta }_{n}}(\varvec{x})-\eta _0(\varvec{x})\), then
Lemma 11 gives a bound on the first order derivatives of a neural network. Lemma 12 gives a bound on the Hellinger entropy under the sieve \(\mathcal {F}_n\). Lemmas 11 and 12 will serve as tools to bound the Hellinger entropy of the functional sieve space \(\widetilde{\mathcal {F}}_n\) based on \(\mathcal {F}_n\).
Lemma 11
For \(\eta _{\varvec{\theta }_{n}}(\varvec{x})=\varvec{b}_L+\varvec{A}_L \psi (\varvec{b}_{L-1}+\varvec{A}_{L-1} \psi ( \cdots \psi (\varvec{b}_1+\varvec{A}_1\psi (\varvec{b}_0+\varvec{A}_0 \varvec{x})))\),
where \(a_{v'n}=\sup _{v=0, \ldots , k_{(v'+1)n}} ||\varvec{A}_{v'}[v]||_1\).
Proof
We suppress the dependence on n. Let \(P_{V}=\varvec{b}_V+\varvec{A}_V\psi (\cdots \varvec{b}_1+\varvec{A}_1\psi (\varvec{b}_0+\varvec{A}_0 \varvec{x})))\). Define \(G_{V,V}=\varvec{1}_{k_V+1}\) and for \(V=0,\ldots , L\), \(V'=0,\ldots , V-1\), let
where \(\odot \) denotes component wise multiplication.
With \(\psi (P_{-1})=\varvec{x}\), we define
By the above form and the fact that \(\psi (u),\psi '(u),|x_i|\le 1\), it can easily be checked by induction that \(|G_{v,L}|\le \prod _{v'=v+1}^L a_{v'}\), which completes the proof. \(\square \)
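As a quick numerical sanity check of the bound \(|G_{v,L}|\le \prod _{v'=v+1}^L a_{v'}\), the sketch below compares finite-difference derivatives of a small sigmoid network with the product of the downstream row-wise \(\ell _1\) norms; the layer widths, random weights and tolerance are arbitrary illustrative choices, assuming inputs in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Small random sigmoid network: 5 inputs, hidden widths 7 and 6, 4 units before the affine output.
widths = [5, 7, 6, 4]
A = [rng.normal(size=(widths[v + 1], widths[v])) for v in range(len(widths) - 1)]
b = [rng.normal(size=widths[v + 1]) for v in range(len(widths) - 1)]
A_L = rng.normal(size=widths[-1])     # final affine layer producing the scalar eta
b_L = rng.normal()

def eta(x, A, b, A_L, b_L):
    h = x
    for Av, bv in zip(A, b):
        h = sigmoid(bv + Av @ h)
    return b_L + A_L @ h

# a_{v'}: maximum row-wise L1 norm of A_{v'}, as in the statement of Lemma 11.
a = [np.max(np.sum(np.abs(Av), axis=1)) for Av in A] + [np.sum(np.abs(A_L))]

x = rng.uniform(0, 1, size=widths[0])          # inputs restricted to [0, 1]
delta = 1e-6
base = eta(x, A, b, A_L, b_L)

# Check |d eta / d A_v[i][j]| <= prod_{v' > v} a_{v'} for every weight of every hidden layer.
for v in range(len(A)):
    bound = np.prod(a[v + 1:])
    for i in range(A[v].shape[0]):
        for j in range(A[v].shape[1]):
            Ap = [Av.copy() for Av in A]
            Ap[v][i, j] += delta
            deriv = (eta(x, Ap, b, A_L, b_L) - base) / delta
            assert abs(deriv) <= bound + 1e-6, (v, i, j, deriv, bound)

print("all finite-difference weight derivatives respect the bound prod_{v'>v} a_{v'}")
```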
Lemma 12
Let, \(\widetilde{\mathcal {F}}_n=\{\sqrt{\ell }: \ell _{\varvec{\theta }_{n}}(y,\varvec{x}), \varvec{\theta }_{n} \in \mathcal {F}_n\}\) where \(\ell _{\varvec{\theta }_{n}}(y,\varvec{x})\) is given by
and \(\mathcal {F}_n\) is given by
Then, with \(H_{[]}(u,\widetilde{\mathcal {F}}_n,||.||_2)\) as in Definition 3,
Proof
In this proof, we suppress the dependence on n. By lemma 4.1 in Pollard (1990),
For \(\varvec{\theta }_1, \varvec{\theta }_2 \in \mathcal {F}\), let \(\widetilde{\ell }(u)=\sqrt{\ell _{u\varvec{\theta }_1+(1-u)\varvec{\theta }_2}(\varvec{x},y)}\). Following Equation (52) in Bhattacharya and Maiti (2021),
where the upper bound is \(F(\varvec{x},y)=(CK)^L\). This is because \(|\partial \widetilde{\ell }/\partial \theta _j|\), the derivative of \(\sqrt{\ell }\) with respect to \(\theta _j\), is bounded above by \(|\partial \eta _{\varvec{\theta }}(\varvec{x})/\partial \theta _j|\), as shown below.
Thus, using \(e^{\eta _{\varvec{\theta }}(\varvec{x})}/(1+e^{\eta _{\varvec{\theta }}(\varvec{x})})\le 1\) and Lemma 11, we get
In view of (B6) and theorem 2.7.11 in van der Vaart and Wellner (1996), we have
where \(N_{[]}\) and \(H_{[]}\) denote the bracketing number and bracketing entropy as in Definition 3. Using Lemma 5 with \(M=K^{L+1}C^{L+2}\), we get
Therefore,
The proof follows by noting that \(\log (\sqrt{2}\varepsilon ) \ge \log \varepsilon \). \(\square \)
Proposition 13 establishes a bound on the log-likelihood ratio when the neural network lies outside the Hellinger neighborhood of the true density function. Proposition 14 shows that the prior gives negligible probability outside the sieve. Proposition 15 shows that the prior gives sufficiently large probability to KL-neighborhoods of the true density function. Propositions 13, 14 and 15 taken together will be used to establish the posterior consistency of the true posterior.
Proposition 13
Let \(n\epsilon _n^2\rightarrow \infty \). Suppose \(K_n\log n =o(n^b\epsilon _n^2)\), for some \(0<b<1\), \(L_n\sim \log n\) and \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\) where \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) and \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\). Then for every \(\varepsilon >0\),
Proof
It suffices to show
The expression on the left above is bounded above by
Using lemma 12 with \(\varepsilon =\varepsilon \epsilon _n\) and \(C_n=e^{n^b\epsilon _n^2/K_n}\),
where \(H_{[]}(u,\widetilde{\mathcal {F}}_n,||.||_2)\) is as in Definition 3. The first inequality in the third step follows because \(L_n\sim \log n\), \(K_n\log n=o(n^b\epsilon _n^2)\), \(K_n\log C_n =n^b \epsilon _n^2\) and \( -\log \epsilon _n^2\le \log n\). The second inequality in the third step is by \((n^b \log n)/n=o(1)\).
By theorem 1 in Wing Hung and Xiaotong (1995), for some constant \(C>0\), we have
Using proposition 14 with \(\varepsilon =2\varepsilon \), we have
Therefore, using Lemma 6 with \(\varepsilon =2\varepsilon ^2\epsilon _n^2\) and \(\tilde{\varepsilon }={\varepsilon }^2 \epsilon _n^2\), we have
Combining (B8) and (B9), (B7) follows. \(\square \)
Proposition 14
Let \(n\epsilon _n^2\rightarrow \infty \). Let \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\) where \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) and \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\). Suppose for some \(0<b<1\), \(K_n\log n=o(n^b\epsilon _n^2)\), then for \(C_n=e^{n^b \epsilon _n^2/K_n}\) and \(\mathcal {F}_n\) as in (33), for any \(\varepsilon >0\),
Proof
Let \(\mathcal {F}_{jn}=\{\theta _{jn}: |\theta _{jn}|\le C_n\}\), then \(\mathcal {F}_n=\cap _{j=1}^{K_n} \mathcal {F}_{jn}\implies \mathcal {F}_n^c= \cup _{j=1}^{K_n}\mathcal {F}_{jn}^c\). This implies \(\int _{\varvec{\theta }_{n} \in \mathcal {F}_n^c}p(\varvec{\theta }_{n})d\varvec{\theta }_{n}\le \sum _{j=1}^{K_n}\int _{\mathcal {F}_{jn}^c}(e^{-\frac{(\theta _{jn}-\mu _{jn})^2}{2\zeta _{jn}^2}}/\sqrt{2\pi \zeta _{jn}^2})d\theta _{jn}\). Thus,
Note that \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\) implies \(||\varvec{\mu }_n||_\infty =o(\sqrt{n}\epsilon _n)\). Also, \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) implies that for some \(M>0\) and \(d\ge 1\),
where the last convergence holds since \(K_n\log n=o(n^b \epsilon _n^2)\). This further implies \(R_n=(n^b \epsilon _n^2)/(K_n\log n)-(d+1) \rightarrow \infty \). Thus, using Mill’s ratio, we get:
where the last asymptotic inequality holds because
In the above step, the first asymptotic equivalence is by (B10), and the second inequality holds since \(K_n\le n\). The last inequality is by \(R_n \rightarrow \infty \) and \((\log n)/n\rightarrow 0\). \(\square \)
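For reference, the Mill’s ratio bound invoked above (and again in the proof of Proposition 17) is the standard Gaussian tail inequality, applied after standardizing each coordinate:

$$\begin{aligned} 1-\Phi (t)=P(Z>t)\le \frac{\phi (t)}{t}=\frac{1}{t\sqrt{2\pi }}e^{-t^2/2}, \quad Z\sim N(0,1),\ t>0. \end{aligned}$$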
Proposition 15
Let \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\) with \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\), \(||\varvec{\zeta }^*_n||_\infty =O(1)\). Let \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1\le \varepsilon \epsilon _n^2/4\), \(n\epsilon _n^2 \rightarrow \infty \). Define,
If \(K_n\log n =o(n\epsilon _n^2)\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\), \(\log (\sum _{v=0}^{L_n} k_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})=O(\log n)\), \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\),
Proof
Let \(\eta _{\varvec{\theta }^*_n}(\varvec{x})=\varvec{b}_L^*+\varvec{A}_L^* \psi (\varvec{b}_{L-1}^*+\varvec{A}_{L-1}^* \psi ( \cdots \psi (\varvec{b}_1^*+\varvec{A}_1^*\psi (\varvec{b}_0^*+\varvec{A}_0^* \varvec{x})))\) be the neural network such that
Such a neural network exists since \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1\le \varepsilon \epsilon _n^2/4\). Next define \(\mathcal {M}_{\varepsilon \epsilon _n^2}\) as:
where \(\tilde{k}_{vn}=k_{vn}+1\). For every \(\varvec{\theta }_{n} \in \mathcal {M}_{\varepsilon \epsilon _n^2}\), by Lemma 8, we have
Combining (B12) and (B13), we get for \(\varvec{\theta }_{n}\in \mathcal {M}_{\varepsilon \epsilon _n^2}\), \(||\eta _{\varvec{\theta }_{n}}-\eta _{0}||_1 \le \varepsilon \epsilon _n^2/2\).
This, in view of Lemma 10, gives \(d_{\textrm{KL}}(\ell _0,\ell _{\varvec{\theta }_{n}}) \le \varepsilon \epsilon _n^2\); hence \(\varvec{\theta }_{n} \in \mathcal {N}_{\varepsilon \epsilon _n^2}\) for every \(\varvec{\theta }_{n} \in \mathcal {M}_{\varepsilon \epsilon _n^2}\). Therefore,
Let \(\delta _n=\varepsilon \epsilon _n^2/(2\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})\), then
where the second last equality holds by the mean value theorem.
Note that \(\widehat{\theta }_{jn} \in [\theta _{jn}^*-1,\theta _{jn}^*+1]\) since \(\delta _n \rightarrow 0\), therefore
where the last inequality follows since \((a+b)^2\le 2(a^2+b^2)\). Also,
since \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\), \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\) and \(||\varvec{\zeta }_n^*||_\infty =O(1)\) and \(n\epsilon _n^2 \rightarrow \infty \). Also,
where the last follows since \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\), \(\log (\sum _{v=0}^{L_n} k_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})=O(\log n)\) and \(1/n\epsilon _n^2=o(1)\) which implies \(-2\log \epsilon _n=o(\log n)\).
where the last inequality follows since \(K_n\log n=o(n\epsilon _n^2)\),
Combining (B15) and (B16) and substituting into (B14), the proof follows. \(\square \)
Proposition 16 establishes that under a suitable choice of the variational family q and the prior p, the KL distance between p and q is suitably bounded. Proposition 17 shows that the integral of the logistic loss between the neural network model and the true model with respect to the variational family q is small. Propositions 15, 16 and 17 taken together will be used to establish that the KL-distance between the true posterior and the variational posterior is suitably bounded.
Proposition 16
For \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) and \(||\varvec{\zeta }^*_n||_\infty =O(1)\), let \(q(\varvec{\theta }_{n})=MVN(\varvec{\theta }^*_n,I_{K_n}/n^{2+2d})\) and \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\). Let \(K_n\log n\) \(=o(n\epsilon _n^2)\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\), \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\) and \(n\epsilon _n^2 \rightarrow \infty \), then for any \(\nu >0\),
Proof
where the second last inequality uses \(\varvec{\zeta }^*_n=1/\varvec{\zeta }_n\). The last equality follows since \(\log ||\varvec{\zeta }_n||_{\infty }=O(\log n)\), \(||\varvec{\zeta }_n^*||_\infty =O(1)\), \(K_n\log n=o(n\epsilon _n^2)\), \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\) and \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\). \(\square \)
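The computation above rests on the standard closed form of the KL divergence between multivariate normals with diagonal covariance matrices (stated here for convenience): for \(q=MVN(\varvec{m}_1,\text {diag}(\varvec{s}_1^2))\) and \(p=MVN(\varvec{m}_2,\text {diag}(\varvec{s}_2^2))\),

$$\begin{aligned} d_{\textrm{KL}}(q,p)=\sum _{j=1}^{K_n}\left[ \log \frac{s_{2j}}{s_{1j}}+\frac{s_{1j}^2+(m_{1j}-m_{2j})^2}{2s_{2j}^2}-\frac{1}{2}\right] , \end{aligned}$$

with \(\varvec{m}_1=\varvec{\theta }^*_n\), \(\varvec{s}_1^2=n^{-(2+2d)}\varvec{1}_{K_n}\), \(\varvec{m}_2=\varvec{\mu }_n\) and \(\varvec{s}_2^2=\varvec{\zeta }_n^2\) here.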
Proposition 17
Let \(q(\varvec{\theta }_{n}) \sim MVN(\varvec{\theta }^*_n,I_{K_n}/n^{2+2d})\) where \(d>d^*>0\) and \(\sum _{v=0}^{L_n} k_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}=O(n^{d^*})\). Define
Let \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1\le \varepsilon \epsilon _n^2/4\) where \(n\epsilon _n^2 \rightarrow \infty \). If \(K_n\log n=o(n\epsilon _n^2)\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\),
Proof
Since \(h(\varvec{\theta }_{n})\) is a KL-distance, \(h(\varvec{\theta }_{n})>0\). We establish an upper bound:
where the first inequality is a consequence of Lemma 10 and the last inequality follows since \(||\eta _{\varvec{\theta }_{n}^*}-\eta _0||_1=o(\epsilon _n^2)\).
Let \(S=\{\varvec{\theta }_{n}:\cap _{j=1}^{K_n}|\theta _{jn}-\theta _{jn}^*|\le \varepsilon \epsilon _n^2/(\sum _{v=0}^L \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}) \}\), then
Let \(S^c=\cup _{j=1}^{K_n}S_j^c\), \(S_j=\{|\theta _{jn}-\theta _{jn}^*|\le u_n\}\), \(u_n=\varepsilon \epsilon _n^2/(\sum _{v=0}^L \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})\). We first compute \(Q(S^c)\) as follows:
Using (B19) in the last term of (B18), we get
where the second step follows by Mill’s ratio, \(K_n=o(n\epsilon _n^2)\) and \(\sum _{v=0}^{L_n} k_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}=O(n^d)\), which implies \(n^{1+d}u_n \rightarrow \infty \). The third step holds because
since \((\sum _{v=0}^L \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n} )^2 \log n=O(n^{2d^*} \log n)=o(n^{2d})\).
For the second term in (B18), let \({S'}=\{|\varvec{b}_L[s]-\varvec{b}_L^*[s]|>u_n\}\)
\(\tilde{S}^c\) is the union of all \(S_j^c\), \(j=1, \ldots , K_n\) except the one corresponding to \(\varvec{b}_{L}[s]\).
Also, \(E_{q(\varvec{b}_L[s])}|\varvec{b}_L[s]-\varvec{b}_L^*[s]|=\sqrt{2/\pi }(1/n^{1+d})\). Thus
where the first equality in the above step follows by observing that \(Q(\tilde{S}^c)\) behaves analogously to \(Q(S^c)\), which was computed in (B19), and the second equality in the above step follows due to Mill’s ratio and \(\sum _{v=0}^L \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}=O(n^{d^*})\), which implies \(n^{1+d} u_n \rightarrow \infty \). The third inequality in the above step is a consequence of the fact that \(K_n\le n^{1+d}\).
Combining (B20), (B23) and (B24), we get
Note that the third term in (B18) can be handled similarly to the second term, and it can be shown that
where the last equality in the second step follows by \(K_n=o(n \epsilon _n^2)\) and the argument in (B21) by which \(e^{-(n^{1+d}u_n-2\log n)}=o(1)\).
Combining (B20) and (B26) with (B18), the proof follows. \(\square \)
Using Propositions 15, 16 and 17, the following Proposition 18 establishes that the KL-distance between the true posterior and the variational posterior is suitably bounded.
Proposition 18
Let \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\), \(||\varvec{\zeta }_n||_\infty =O(n)\) and \(||\varvec{\zeta }^*_n||_\infty =O(1)\).
1. Let \(L_n=L\), \(p_n=p\) independent of n. If \(K_n\log n=o(n)\) and \(||\varvec{\mu }_n||_2^2=o(n)\), then
$$\begin{aligned} d_{\textrm{KL}}(\pi ^*,\pi (.|\varvec{y}_{n},\varvec{X}_{n}))=o_{P_0^n}(n) \end{aligned}$$ (B27)
2. Let \(K_n\log n=o(n\epsilon _n^2)\), \(L_n \sim \log n\) and \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\). If there exists a neural network such that \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1=o(n\epsilon _n^2)\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\) and \(\log (\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})=O(\log n)\), then
$$\begin{aligned} d_{\textrm{KL}}(\pi ^*,\pi (.|\varvec{y}_{n},\varvec{X}_{n}))=o_{P_0^n}(n\epsilon _n^2) \end{aligned}$$ (B28)
Proof
For any \(q \in \mathcal {Q}_n\),
Since \(\pi ^*\) minimizes the KL-distance to \(\pi (.|\varvec{y}_{n},\varvec{X}_{n})\) in the family \(\mathcal {Q}_n\), for any \(\kappa >0\)
\(\square \)
Proof of part 1
Note, \(K_n\log n=o(n)\), \(||\mu _n||_2^2 =o(n)\), \(||\varvec{\zeta }_n||_\infty =O(n)\) and \(||\varvec{\zeta }^*_n||_\infty =O(1)\). Let \(q(\varvec{\theta }_{n})=MVN(\varvec{\theta }^*_n, \varvec{I}_{K_n}/\sqrt{n})\), where \(\varvec{\theta }^*_n\) is defined next. For \(N\ge 1\), let \(\eta _{\varvec{\theta }^*_N}\) be a finite neural network approximation satisfying \(||\eta _{\varvec{\theta }^*_N}-\eta _0||_1 \le \varepsilon /4\). Since \(\eta _0\) is a continuous function defined on the compact set \([0,1]^p\), the existence of such a neural network is guaranteed by Theorem 2.1 in Hornik et al. (1989). Let \(\varvec{\theta }_{n}^*\) coincide with \(\varvec{\theta }_N^*\) on the coefficients of this finite network and be zero on all remaining coefficients.
Step 1 (a): Using proposition 16, with \(\epsilon _n=1\), we get for any \(\nu >0\),
where the above step follows since \(||\varvec{\theta }^*_n||_2^2=||\varvec{\theta }^*_N||_2^2=o(n)\).
Step 1 (b): Next, note that
Since \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1 \le \varepsilon /4\), using proposition 17 with \(\epsilon _n=1\) and \(\varepsilon =\varepsilon \), \(\int d_{\textrm{KL}}(\ell _0,\ell _{\varvec{\theta }_{n}})q(\varvec{\theta }_{n})d\varvec{\theta }_{n} \le \varepsilon \) which follows by noting that \(||\varvec{\theta }^*_n||_2^2=||\varvec{\theta }^*_N||_2^2=o(n)\) and \(\log (\sum _{v=0}^{L} \tilde{k}_{vN}\prod _{v'=v+1}^{L} a^*_{v'N})=O(\log n)\).
Therefore, by Lemma 9,
Step 1 (c): Since \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1\le \varepsilon /4\), using proposition 15 with \(\epsilon _n=1\) and \(\nu =\varepsilon \),
which follows by \(||\varvec{\theta }^*_n||_2^2=||\varvec{\theta }^*_N||_2^2=o(n)\) and \(\log (\sum _{v=0}^{L} \tilde{k}_{vn}\prod _{v'=v+1}^{L} a^*_{v'n})=O(\log n)\). Therefore, using Lemma 7, we get
Step 1 (d): From (B30) and (B29) we get
where the last inequality is a consequence of (B31), (B33) and (B34).
Since \(\varepsilon \) is arbitrary, taking \(\varepsilon \rightarrow 0\) completes the proof. \(\square \)
Proof of part 2
Note, \(K_n\log n=o(n\epsilon _n^2)\), \(||\mu _n||_2^2 =o(n\epsilon _n^2)\), \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) and \(||\varvec{\zeta }^*_n||_\infty =O(1)\). Let \(q(\varvec{\theta }_{n})=MVN(\varvec{\theta }^*_n, \varvec{I}_{K_n}/n^{2+2d}), d>d^*\) where \(\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}=O(n^{d^*})\), \(d^*>0\). We next define \(\varvec{\theta }_{n}^*\) as follows:
Let \(\eta _{\varvec{\theta }^*_n}\) be the neural network satisfying
The existence of such a neural network is guaranteed since \(||\eta _{\varvec{\theta }^*_n}-\eta _0||_1=o(\epsilon _n^2)\).
Step 2 (a): Since \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\), by proposition 16,
Step 2 (b): By proposition 17, since \(||\eta _{\varvec{\theta }^*_n}-\eta _0||_1\le \varepsilon \epsilon _n^2/4\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\) and \((\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})\log n=o(n\epsilon _n^2)\), we have
Therefore, by Lemma 9,
Step 2 (c): By proposition 15, since \(||\eta _{\varvec{\theta }^*_n}-\eta _0||_1\le \varepsilon \epsilon _n^2/4\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\) and \(\log (\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})=O(\log n)\), we have
Therefore, using Lemma 7, we get
Step 2 (d): From (B30) and (B29) we get
where the last inequality is a consequence of (B36), (B37) and (B38).
Since \(\varepsilon \) is arbitrary, taking \(\varepsilon \rightarrow 0\) completes the proof. \(\square \)
Appendix C Consistency of the variational posterior
Proof of Theorem 1
We assume Relation (32) holds with \(A_n\) and \(B_n\) the same as in (31).
By assumptions (A1) and (A2), the prior parameters satisfy
Note \(K_n \sim n^a\), \(0<a<1\), which implies \(K_n\log n=o(n)\). By proposition 18, part 1,
By step 1 (c) in the proof of proposition 18
Since \(K_n \sim n^a\), we have \(K_n\log n=o(n^b)\) for \(a<b<1\). Using proposition 13 with \(\epsilon _n=1\),
Thus, using (C40), (C41) and (C42) in (32), we get
\(\square \)
Proof of Theorem 2
We assume Relation (32) holds with \(A_n\) and \(B_n\) the same as in (31).
Let \(K_n\sim n^a\) and \(\epsilon _n^2\sim n^{-\delta }\), \(0<\delta <1-a\). This implies \(K_n\log n=o(n\epsilon _n^2)\).
By assumptions (A1) and (A4), the prior parameters satisfy
Also by assumption (A3),
By proposition 18, part 2,
By step 2 (c) in the proof of proposition 18
Since \(K_n \sim n^a\), we have \(K_n\log n=o(n^b \epsilon _n^2)\) for \(a+\delta<b<1\). Using proposition 13, it follows that
Thus, using (C43), (C44) and (C45) in (32), we get
\(\square \)
Proof of Corollary 1
Let \(\hat{\ell }_n(y,\varvec{x})=\int \ell _{\varvec{\theta }_{n}}(y,\varvec{x}) \pi ^*(\varvec{\theta }_{n})d\varvec{\theta }_{n}\).
Taking \(\varepsilon \rightarrow 0\), we get \(d_{\text {H}}(\hat{\ell }_n,\ell _0)=o_{P_0^n}(1)\). Let
then, note that \(\hat{\eta }(\varvec{x})=\log ( \hat{\ell }_n(1,\varvec{x})/\hat{\ell }_n(0,\varvec{x}))\). Further,
This implies
In the above equation, the sixth and the seventh step hold because \(\sqrt{1-x}\le 1-x/2\) and \(|p_1-p_2|\le |\sqrt{p_1}+\sqrt{p_2}||\sqrt{p_1}-\sqrt{p_2}|\le 2|\sqrt{p_1}-\sqrt{p_2}|\) respectively. The fifth step holds because
By (C47) and the Cauchy–Schwarz inequality,
The proof follows in view of (35). \(\square \)
Proof of Corollary 2
We assume Relation (32) holds with \(A_n\) and \(B_n\) the same as in (31).
Let \(K_n\sim n^a\) and \(\epsilon _n^2\sim n^{-\delta }\), \(0<\delta <1-a\). This implies \(K_n\log n=o(n\epsilon _n^2)\).
Also, \(K_n\log n=o(n^b \epsilon _n^2)\), \(a+\delta<b<1\). This implies \(K_n\log n =o(n^b (\epsilon _n^2)^\kappa )\), \(0\le \kappa \le 1\). Thus, using proposition 13 with \(\epsilon _n=\epsilon _n^{\kappa }\), we get
This together with (C43), (C44) and (32) implies \(\pi ^*(\mathcal {U}_{\varepsilon \epsilon _n^\kappa }^c)=o_{P_0^n}(\epsilon _n^{2-2\kappa })\).
Let \(\hat{\ell }_n(y,\varvec{x})=\int \ell _{\varvec{\theta }_{n}}(y,\varvec{x}) \pi ^*(\varvec{\theta }_{n})d\varvec{\theta }_{n}\), then
Dividing by \(\epsilon _n^\kappa \) on both sides we get
By (C48), for every \(0\le \kappa \le 2/3\),
The proof follows in view of (35). \(\square \)
Appendix D Consistency of the true posterior
From (11), note that
Theorem 19
Suppose conditions of Theorem 1 hold. Then,
1. $$\begin{aligned} P_0^n\left( \pi (\mathcal {U}_{\varepsilon }^c|\varvec{y}_{n},\varvec{X}_{n})\le 2e^{-n\varepsilon ^2/2}\right) \rightarrow 1, n \rightarrow \infty \end{aligned}$$
2. $$\begin{aligned} P_0^n(|R(\hat{C})-R(C^\textrm{Bayes})|\le 8\sqrt{2}\varepsilon )\rightarrow 1,n \rightarrow \infty \end{aligned}$$
Proof
By assumptions (A1) and (A2), the prior parameters satisfy
Note \(K_n \sim n^a\), \(0<a<1\) which implies \(K_n\log n=o(n)\). Thus, the conditions of proposition 15 hold with \(\epsilon _n=1\).
\(n \rightarrow \infty \), which follows from (B34) (see step 1 (c) in the proof of proposition 18). Since \(K_n\log n=o(n^b)\), \(a<b<1\), proposition 13 holds with \(\epsilon _n=1\).
where the last equality follows from (B7) with \(\epsilon _n=1\) in the proof of proposition 13. Using (D51) and (D52) with (D50), we get
Take \(\nu =\varepsilon ^2/2\) to complete the proof of part 1. For part 2, mimicking the steps in the proof of corollary 1,
where the second last inequality is a consequence of part 1 of Theorem 19. The remaining part of the proof follows by (C48) and (35). \(\square \)
Theorem 20
Suppose conditions of Theorem 2 hold. Then,
1. $$\begin{aligned} P_0^n\left( \pi (\mathcal {U}_{\varepsilon \epsilon _n}^c|\varvec{y}_{n},\varvec{X}_{n})\le 2e^{-n\epsilon _n^2\varepsilon ^2/2}\right) \rightarrow 1, n \rightarrow \infty \end{aligned}$$
2. $$\begin{aligned} P_0^n(|R(\hat{C})-R(C^\textrm{Bayes})|\le 8\sqrt{2}\varepsilon \epsilon _n)\rightarrow 1,n \rightarrow \infty \end{aligned}$$
Proof
By assumptions (A1) and (A4), the prior parameters satisfy
Also by assumption (A3),
Note \(K_n \sim n^a\), \(0<a<1\) and \(\epsilon _n\sim n^{-\delta }\), \(0<\delta <1-a\); hence \(K_n\log n=o(n\epsilon _n^2)\). Thus, the conditions of proposition 15 hold.
for \( n \rightarrow \infty \), where the above convergence follows from (B38) in step 2 (c) of the proof of proposition 18. Also, \(K_n\log n=o(n^b \epsilon _n^2)\) for \(a+\delta<b<1\); thus the conditions of proposition 13 hold.
where the last equality follows from (B7) in the proof of proposition 13.
Using (D53) and (D54) with (D50), we get \(P_0^n\left( \pi (\mathcal {U}_{\varepsilon \epsilon _n}^c|\varvec{y}_{n},\varvec{X}_{n})\ge 2e^{-n\epsilon _n^2(\varepsilon ^2-\nu )}\right) \rightarrow 0, n \rightarrow \infty \). Take \(\nu =\varepsilon ^2/2\) to complete the proof of part 1. For part 2, mimicking the steps in the proof of corollary 2,
where the second last inequality is a consequence of part 1 of Theorem 20 and the last equality follows since \(\epsilon _n \sim n^{-\delta }\). Dividing by \(\epsilon _n\) on both sides, we get
The remaining part of the proof follows by (C48) and (35). \(\square \)
Appendix E Tables for real data
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bhattacharya, S., Liu, Z. & Maiti, T. Comprehensive study of variational Bayes classification for dense deep neural networks. Stat Comput 34, 17 (2024). https://doi.org/10.1007/s11222-023-10338-9