Abstract
Although Bayesian deep neural network models are ubiquitous in classification problems, their Markov chain Monte Carlo (MCMC) based implementation suffers from high computational cost, limiting the use of this powerful technique in large-scale studies. Variational Bayes (VB) has emerged as a competitive alternative that overcomes some of these computational issues. This paper focuses on variational Bayesian deep neural network estimation methodology and discusses the related statistical theory and algorithmic implementations in the context of classification. For dense deep neural network-based classification, the paper compares and contrasts the consistency and contraction rates of the true posterior with those of the corresponding variational posterior. Based on the complexity of the deep neural network (DNN), this paper provides an assessment of the loss in classification accuracy due to the use of VB and guidelines on the characterization of the prior distributions and the variational family. The difficulty of the numerical optimization required to obtain the variational Bayes solution is also quantified as a function of the complexity of the DNN. The development is motivated by an important biomedical engineering application, namely building predictive tools for the transition from mild cognitive impairment to Alzheimer’s disease. The predictors are multi-modal and may involve complex interactive relations.







Availability of data and materials
The data is publicly available.
Code availability
The computational code is available.
References
Bai, J., Song, Q., Cheng, G.: Efficient variational inference for sparse deep learning with theoretical guarantee. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 466–476. Curran Associates, Inc. (2020)
Barron, A., Schervish, M.J., Wasserman, L.: The consistency of posterior distributions in nonparametric problems. Ann. Stat. 27(2), 536–561 (1999)
Bhattacharya, S., Maiti, T.: Statistical foundation of variational Bayes neural networks. Neural Netw. 137, 151–173 (2021)
Bishop, C.M.: Bayesian neural networks. J. Braz. Comput. Soc. 4(1), 61–68 (1997)
Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Blei, D.M., Lafferty, J.D.: A correlated topic model of science. Ann. Appl. Stat. 1(1), 17–35 (2007)
Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D.: Weight uncertainty in neural network. In: Proceedings of Machine Learning Research, vol. 37, pp. 1613–1622. PMLR (2015)
Cannings, T.I., Samworth, R.J.: Random-projection ensemble classification. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 79(4), 959–1035 (2017)
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems (2015)
Chérief-Abdellatif, B.-E.: Convergence rates of variational inference in sparse deep learning. In: Daumé III, H., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 119, pp. 1831–1842. PMLR (2020)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(61), 2121–2159 (2011)
Hinton, G.E., Van Camp, D.: Keeping the neural networks simple by minimizing the description length of the weights. In: Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT’93, pp. 5–13. ACM Press (1993)
Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York (2009)
Graves, A.: Practical variational inference for neural networks. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24, pp. 2348–2356. Curran Associates, Inc. (2011)
Graves, A.: Generating sequences with recurrent neural networks (2014). arXiv:1308.0850
Gurney, K.: An Introduction to Neural Networks. Taylor & Francis Inc., USA (1997). (ISBN 1857286731)
Hinton, G., Srivastava, N., Swersky, K.: Lecture 6a Overview of Mini-batch Gradient Descent (2012). http://www.cs.toronto.edu/hinton/coursera/lecture6/lec6.pdf
Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)
Hubin, A., Storvik, G., Frommlet, F.: Deep Bayesian regression models (2018). arXiv:1806.02160
Javid, K., Handley, W., Hobson, M.P., Lasenby, A.: Compromise-free Bayesian neural networks (2020). arXiv:2004.12211
Kingma, D.P., Salimans, T., Welling, M.: Variational dropout and the local reparameterization trick. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 2575–2583. Curran Associates, Inc. (2015)
Korolev, I.: Alzheimer’s disease: a clinical and basic science review. Med. Stud. Res. J. 4(1), 24–33 (2014)
Korolev, I.O., Symonds, L.L., Bozoki, A.C., Initiative, A.D.N.: Predicting progression from mild cognitive impairment to Alzheimer’s dementia using clinical, MRI, and plasma biomarkers via probabilistic pattern classification. PLoS ONE 11(2), e0138866 (2016)
Lampinen, J., Vehtari, A.: Bayesian approach for neural networks-review and case studies. Neural Netw. Off. J. Int. Neural Netw. Soc. 14(3), 257–274 (2001)
Lee, H.K.H.: Consistency of posterior distributions for neural networks. Neural Netw. 13(6), 629–642 (2000)
Li, X., Li, C., Chi, J., Ouyang, J.: Variance reduction in black-box variational inference by adaptive importance sampling. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, International Joint Conferences on Artificial Intelligence Organization, pp. 2404–2410 (2018)
Liang, F., Li, Q., Zhou, L.: Bayesian neural networks for selection of drug sensitive genes. J. Am. Stat. Assoc. 113(523), 955–972 (2018)
Liu, Z., Maiti, T., Bender, A.: A role for prior knowledge in statistical classification of the transition from MCI to Alzheimer’s disease. Unpublished report (2020)
Matthews, A.G. de G., Hron, J., Rowland, M., Turner, R.E., Ghahramani, Z.: Gaussian process behaviour in wide deep neural networks. In: International Conference on Learning Representations (2018)
McKinney, W.: Data structures for statistical computing in python. In: van der Walt, S., Millman, J. (eds.) Proceedings of the 9th Python in Science Conference, pp. 56–61 (2010)
McMahan, H.B.: A survey of algorithms and analysis for adaptive online learning. J. Mach. Learn. Res. 18(90), 1–50 (2017)
Mullachery, V., Khera, A., Husain, A.: Bayesian neural networks (2018). arXiv:1801.07710
Nagapetyan, T., Duncan, A.B., Hasenclever, L., Vollmer, S.J., Szpruch, L., Zygalakis, K.: The true cost of stochastic gradient Langevin dynamics (2017). arXiv:1706.02692
Neal, R.M.: Bayesian training of backpropagation networks by the hybrid Monte-Carlo method (1992). https://www.cs.toronto.edu/~radford/ftp/bbp.pdf
Paisley, J., Blei, D., Jordan, M.: Variational Bayesian inference with stochastic search. In: Proceedings of the 29th International Conference on International Conference on Machine Learning, ICML’12, pp. 1363–1370. ACM Press (2012)
Pati, D., Bhattacharya, A., Yang, Y.: On statistical optimality of variational Bayes. In: Storkey, A., Perez-Cruz, F. (eds.) Proceedings of Machine Learning Research, vol. 84, pp. 1579–1588. PMLR (2018)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Petersen, R.C., Roberts, R.O., Knopman, D.S., Boeve, B.F., Geda, Y.E., Ivnik, R.J., Smith, G.E., Jack, C.R.: Mild cognitive impairment: ten years later. Arch. Neurol. 66(12), 1447–1455 (2009). https://doi.org/10.1001/archneurol.2009.266
Pollard, D.: Empirical processes: Theory and applications. NSF-CBMS Regional Conference Series in Probability and Statistics 2, i–86 (1990)
Polson, N.G., Ročková, V.: Posterior concentration for sparse deep learning. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. (2018)
Ranganath, R., Gerrish, S., Blei, D.M.: Black box variational inference (2013). arXiv:1401.0118
Ross, S.M.: Simulation, 5th edn. Academic Press (2013). (ISBN 9780124158252)
Schmidt-Hieber, J.: Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 48(4), 1875–1897 (2020)
Singh, B., De, S., Zhang, Y., Goldstein, T., Taylor, G.: Layer-specific adaptive learning rates for deep networks (2015). arXiv:1510.04609
Sun, S., Chen, C., Carin, L.: Learning structured weight uncertainty in Bayesian neural networks. In: Proceedings of Machine Learning Research, vol. 54, pp. 1283–1292. PMLR (2017)
Sun, S., Zhang, G., Shi, J., Grosse, R.B.: Functional variational Bayesian neural networks. In: 7th International Conference on Learning Representations, ICLR 2019. OpenReview.net (2019)
Sun, Y., Song, Q., Liang, F.: Consistent sparse deep learning: theory and computation. J. Am. Stat. Assoc. 0 (ja):1–42 (2021)
Taghia, J.: Lecture Notes. Part III: black-box variational inference (2018). http://www.it.uu.se/research/systems_and_control/education/2018/pml/lectures/VILectuteNotesPart3.pdf
Torben, S., Sumeetpal Sidhu, S.: Trace-class Gaussian priors for Bayesian learning of neural networks with MCMC. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 85(1), 46–66 (2023)
van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, New York (1996)
Wan, R., Zhong, M., Xiong, H., Zhu, Z.: Neural control variates for variance reduction (2018). arXiv:1806.00159
Wang, Y., Blei, D.M.: Frequentist consistency of variational Bayes. J. Am. Stat. Assoc. 114(527), 1147–1161 (2019)
Welling, M., Teh, Y.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pp. 681–688. ACM Press (2011)
Wing Hung, W., Xiaotong, S.: Probability inequalities for likelihood ratios and convergence rates of sieve MLES. Ann. Stat. 23(2), 339–362 (1995)
Wu, A., Nowozin, S., Meeds, E., Turner, R.E., Hernández-Lobato, J.M., Gaunt, A.L.: Deterministic variational inference for robust Bayesian neural networks (2019). https://openreview.net/forum?id=B1l08oAct7
Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms (2017). arXiv:1708.07747
Yang, K., Maiti, T.: Statistical aspects of high-dimensional sparse artificial neural network models. Mach. Learn. Knowl. Extr. 2(1), 1–19 (2020)
Yang, Y., Pati, D., Bhattacharya, A.: \(\alpha \)-variational inference with statistical guarantees. Ann. Stat. 48(2), 886–905 (2020)
Zhang, D., Shen, D.: Multi-modal multi-task learning for joint prediction of clinical scores in Alzheimer’s disease. In: Tianming, L., Dinggang, S., Luis, I., Xiaodong, T. (eds.) Multimodal Brain Image Analysis, pp. 60–67. Springer Berlin Heidelberg, Berlin, Heidelberg (2011)
Zhang, D., Shen, D., Initiative, A.D.N.: Predicting future clinical changes of mci patients using longitudinal and multimodal biomarkers. PLoS ONE 7(3), e0033182 (2012)
Zhang, F., Gao, C.: Convergence rates of variational posterior distributions. Ann. Stat. 48(4), 2180–2207 (2020)
Zhu, C., Cheng, Y., Gan, Z., Huang, F., Liu, J., Goldstein, T.: Adaptive learning rates with maximum variation averaging (2020). arXiv:2006.11918
Funding
This work is partially supported by the grants NSF-1924724, NSF-1952856, and NSF-2124605.
Author information
Authors and Affiliations
Contributions
Equal contribution from all three authors.
Corresponding author
Ethics declarations
Conflict of interest
None.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Appendices
Appendix A Algorithms of variational implementation.
With q and p as in (12) and (10), respectively, the variational implementation proceeds by iteratively maximizing the evidence lower bound (ELBO) over the variational family.
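As a self-contained illustration of this optimization (a sketch only, not the paper's exact algorithm), the code below fits a mean-field Gaussian variational approximation to a one-hidden-layer logistic network by single-sample reparameterized gradient descent on the negative ELBO. The layer sizes, prior scale, softplus parameterization of the variational standard deviations, learning rate, synthetic data, and the finite-difference likelihood gradient (a slow stand-in for backpropagation) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer logistic network; sizes are illustrative only.
p, k = 4, 8                      # number of inputs and hidden units
K = (p + 1) * k + (k + 1)        # total number of network weights theta_n

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def softplus(u):
    return np.log1p(np.exp(u))

def unpack(theta):
    """Split the flat weight vector into (A_0, b_0, A_1, b_1)."""
    A0 = theta[:p * k].reshape(k, p)
    b0 = theta[p * k:p * k + k]
    A1 = theta[p * k + k:p * k + 2 * k]
    b1 = theta[-1]
    return A0, b0, A1, b1

def neg_log_lik(theta, X, y):
    """Bernoulli negative log-likelihood of the network eta_theta(x)."""
    A0, b0, A1, b1 = unpack(theta)
    eta = sigmoid(X @ A0.T + b0) @ A1 + b1
    pr = np.clip(sigmoid(eta), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(pr) + (1 - y) * np.log(1 - pr))

def num_grad(f, theta, eps=1e-5):
    """Central finite differences: a slow stand-in for backpropagation."""
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        g[j] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

# Synthetic binary data standing in for the real predictors.
n = 200
X = rng.uniform(0, 1, size=(n, p))
y = (rng.uniform(size=n) < sigmoid(2 * X[:, 0] - 3 * X[:, 1] + 0.5)).astype(float)

# Prior p = MVN(0, zeta^2 I); variational family q = MVN(mu, diag(softplus(rho)^2)).
zeta = 1.0
mu = np.zeros(K)
rho = np.full(K, -3.0)
lr = 5e-3

for t in range(1000):
    sig = softplus(rho)
    eps_ = rng.standard_normal(K)
    theta = mu + sig * eps_                             # reparameterization trick
    g_theta = num_grad(lambda th: neg_log_lik(th, X, y), theta)
    g_kl_mu = mu / zeta**2                              # d KL(q, p) / d mu
    g_kl_sig = sig / zeta**2 - 1.0 / sig                # d KL(q, p) / d sigma
    g_mu = g_theta + g_kl_mu                            # gradient of the negative ELBO
    g_rho = (g_theta * eps_ + g_kl_sig) * sigmoid(rho)  # chain rule through softplus
    mu -= lr * g_mu
    rho -= lr * g_rho

A0, b0, A1, b1 = unpack(mu)
eta_hat = sigmoid(X @ A0.T + b0) @ A1 + b1
print("training accuracy at the variational mean:", np.mean((eta_hat > 0) == y))
```

In practice the likelihood gradient would be computed by automatic differentiation, the full-data term replaced by mini-batch estimates, and the constant learning rate replaced by an adaptive scheme such as those cited in the references.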
Appendix B Preliminaries
Definition 1
\(MVN(\varvec{\mu },\varvec{\Sigma })\) is used to denote the density function of the multivariate normal distribution with mean \(\varvec{\mu }\) and variance-covariance matrix \(\varvec{\Sigma }\).
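Explicitly, for a \(K_n\)-dimensional argument \(\varvec{\theta }\), this is the usual multivariate normal density

$$\begin{aligned} MVN(\varvec{\mu },\varvec{\Sigma })(\varvec{\theta })=(2\pi )^{-K_n/2}|\varvec{\Sigma }|^{-1/2}\exp \left( -\frac{1}{2}(\varvec{\theta }-\varvec{\mu })^\top \varvec{\Sigma }^{-1}(\varvec{\theta }-\varvec{\mu })\right) . \end{aligned}$$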
Definition 2
For a vector \(\varvec{\alpha }\) and a function g,
-
1.
\(||\varvec{\alpha }||_1=\sum _i |\alpha _i|\), \(||\varvec{\alpha }||_2=\sqrt{\sum _i \alpha _i^2}\), \(||\varvec{\alpha }||_\infty =\max _i |\alpha _i|\).
-
2.
\(||g||_1=\int _{\varvec{x}\in \chi } |g(\varvec{x})|d\varvec{x}\), \(||g||_2=\sqrt{\int _{\varvec{x}\in \chi } g(\varvec{x})^2d\varvec{x}}\), \(||g||_\infty =\sup _{\varvec{x}\in \chi } |g(\varvec{x})|\)
Definition 3
(Bracketing number and entropy) For any two functions l and u, define the bracket [l, u] as the set of all functions f such that \(l\le f\le u\) pointwise. Let ||.|| be a metric. Define an \(\varepsilon -\)bracket as a bracket with \(||u-l||\le \varepsilon \). Define the bracketing number of a set of functions \(\mathcal {F}^*\) as the minimum number of \(\varepsilon -\)brackets needed to cover \(\mathcal {F}^*\), and denote it by \(N_{[]}(\varepsilon ,\mathcal {F}^*,||.||)\). Finally, the Hellinger bracketing entropy, denoted by \(H_{[]}(\varepsilon ,\mathcal {F}^*,||.||)\), is the natural logarithm of the bracketing number (Pollard 1990).
Definition 4
(Covering number and entropy) Let (V, ||.||) be a normed space, and \(\mathcal {F} \subset V\). \(\{V_1,\ldots , V_n \}\) is an \(\varepsilon -\)covering of \(\mathcal {F}\) if \(\mathcal {F} \subset \cup _{i=1}^n B(V_i,\varepsilon )\), or equivalently, \(\forall \) \(\theta \in \mathcal {F}\), \(\exists \) i such that \(||\theta -V_i||<\varepsilon \). The covering number of \(\mathcal {F}\) is denoted by \(N(\varepsilon ,\mathcal {F},||.||)=\min \{n: \exists \, \varepsilon -\text { covering over }\mathcal {F}\text { of size } n \}\). Finally, the Hellinger covering entropy, denoted by \(H(\varepsilon , \mathcal {F},||.||)\), is the natural logarithm of the covering number (Pollard 1990).
Lemma 5 gives a bound on the integral of the Hellinger entropy. Lemma 6 shows that the prior gives negligible probability outside the sieve \(\mathcal {F}_n\). Lemma 7 shows that if the prior gives sufficient mass to the KL neighborhoods of the true density, then the marginal density is well bounded. Lemma 8 shows that if the parameters of two neural networks are close, then so are the networks themselves. Lemmas 5, 6, 7 and 8 will serve as important tools towards the proof of consistency of the true posterior.
Lemma 5
With \(H_{[]}(u,\widetilde{\mathcal {F}}_n,||.||_2)\) as in Definition 3, for \(H_{[]}(u,\widetilde{\mathcal {F}}_n,||.||_2)\le K_n\log (M_n/u)\),
Proof
See proof of lemma 7.14 in Bhattacharya and Maiti (2021). \(\square \)
Lemma 6
Suppose, \(\int _{\mathcal {F}_n^c} p(\varvec{\theta }_{n}) d\varvec{\theta }_{n} \le e^{-n\varepsilon }, n \rightarrow \infty \) for any \(\varepsilon >0\). Then, for every \(\tilde{\varepsilon }<\varepsilon \).
Proof
See proof of lemma 7.16 in Bhattacharya and Maiti (2021). \(\square \)
Lemma 7
Suppose \(\mathcal {N}_\varepsilon =\{\varvec{\theta }_{n}: d_{\text {KL}}(\ell _0,\ell _{\varvec{\theta }_{n}})<\varepsilon \}\) and \( \int _{\mathcal {N}_\varepsilon } p(\varvec{\theta }_{n})d\varvec{\theta }_{n}\ge e^{-n\varepsilon }, n\rightarrow \infty \) then for any \(\nu >0\),
Proof
See proof of lemma 7.12 in Bhattacharya and Maiti (2021). \(\square \)
Lemma 8
Let \(\eta _{\varvec{\theta }_{n}^*}(\varvec{x})=\varvec{b}_L^*+\varvec{A}_L^* \psi (\varvec{b}_{L-1}^*+\varvec{A}_{L-1}^* \psi ( \cdots \psi (\varvec{b}_1^*+\varvec{A}_1^*\psi (\varvec{b}_0^*+\varvec{A}_0^* \varvec{x})))\) be a fixed neural network. Let \(\eta _{\varvec{\theta }_{n}}(\varvec{x})=\varvec{b}_L+\varvec{A}_L \psi (\varvec{b}_{L-1}+\varvec{A}_{L-1} \psi ( \cdots \psi (\varvec{b}_1+\varvec{A}_1\psi (\varvec{b}_0+\varvec{A}_0 \varvec{x})))\) be a neural network such that
where \(\tilde{k}_{vn}=k_{vn}+1\). Then,
Proof
In the proof, we suppress the dependence on n. Define the projection \(P_V\) as \(P_V \eta _{\varvec{\theta }}(\varvec{x})=\varvec{b}_{V-1}+\varvec{A}_{V-1} \psi ( \cdots \psi (\varvec{b}_1+\varvec{A}_1\psi (\varvec{b}_0+\varvec{A}_0\varvec{x})))\). We claim that
We prove this by induction. For the base case \(v=1\), let \(\tilde{\varepsilon }=\varepsilon /\sum _{v=0}^L \tilde{k}_v\prod _{v'=v+1}^L a^*_{v'}\); then
where the second line holds since \(\psi (u)\le 1\) and the third step is shown next. Let \(u=-\varvec{b}_{0}[s]-\varvec{A}_{0}[s]^\top \varvec{x}\) and \(u_\delta =\varvec{b}_{0}[s]+\varvec{A}_{0}[s]^\top \varvec{x}-\varvec{b}_{0}^*[s]-{\varvec{A}_{0}^*[s]}^\top \varvec{x}\), then for \(|u_\delta |<1\)
since \(e^u/((1+e^u)(1+e^{u-1}))\le 1/2\) and \(|e^{u_\delta }-1|\le 2|u_\delta |\) for \(|u_\delta |<1\). Now, \(|u_\delta |\le |\varvec{b}_{0}[s] -\varvec{b}_{0}^*[s]|+\sum _{s'=0}^{p_n} |\varvec{A}_0[s][s']-\varvec{A}_0^*[s][s']| \le (p_n+1)\tilde{\varepsilon }<1\).
Suppose the result holds for \(V-1\); we show it for V as follows:
where the second step follows since \(\psi (u)\le 1\) and the third step follows by relation (B3) provided \(|P_{V-1} \eta _{\varvec{\theta }}(\varvec{x})[s]-P_{V-1} \eta _{\varvec{\theta }^*}(\varvec{x})[s]|\le 1\). But this holds using relation (B2) with \(v=V-1\).
Thus proceeding further we get
This completes the proof. \(\square \)
Lemma 9 shows that if the expected KL divergence between two densities is small, then the expected log-likelihood ratio between the two densities is also well bounded, where the expectation is taken with respect to the variational member q. Lemma 10 shows that if two functions are close, then so is the logistic loss between them. Lemmas 8, 9 and 10 together will serve as tools towards establishing that the variational and true posteriors are close in the KL-distance.
Lemma 9
Suppose q satisfies \(\int d_{\textrm{KL}}(\ell _0,\ell _{\varvec{\theta }_{n}}) q(\varvec{\theta }_{n}) d\varvec{\theta }_{n}\le \varepsilon ,\) then for any \(\nu >0\),
Proof
See proof of lemma 7.13 in Bhattacharya and Maiti (2021). \(\square \)
Lemma 10
If \(|\eta _0(\varvec{x})-\eta _{\varvec{\theta }_{n}}(\varvec{x})|\le \varepsilon \), then \(|h_{\varvec{\theta }_{n}}(\varvec{x})|\le 2\varepsilon \) where
Proof
Note that,
where the second step follows by using \(\sigma (x)=e^{x}/(1+e^x) \le 1\) and the proof of the third step is shown below. \(\square \)
Let \(p=\sigma (\eta _0(\varvec{x}))\), then \(0\le p \le 1\) and \( r=\eta _{\varvec{\theta }_{n}}(\varvec{x})-\eta _0(\varvec{x})\), then
Lemma 11 gives a bound on the first order derivatives of a neural network. Lemma 12 gives a bound on the Hellinger entropy under the sieve \(\mathcal {F}_n\). Lemmas 11 and 12 will serve as tools to bound the Hellinger entropy of the functional sieve space \(\widetilde{\mathcal {F}}_n\) based on \(\mathcal {F}_n\).
Lemma 11
For \(\eta _{\varvec{\theta }_{n}}(\varvec{x})=\varvec{b}_L+\varvec{A}_L \psi (\varvec{b}_{L-1}+\varvec{A}_{L-1} \psi ( \cdots \psi (\varvec{b}_1+\varvec{A}_1\psi (\varvec{b}_0+\varvec{A}_0 \varvec{x})))\),
where \(a_{v'n}=\sup _{v=0, \ldots , k_{(v'+1)n}} ||\varvec{A}_{v'}[v]||_1\).
Proof
We suppress the dependence on n. Let \(P_{V}=\varvec{b}_V+\varvec{A}_V\psi (\cdots \varvec{b}_1+\varvec{A}_1\psi (\varvec{b}_0+\varvec{A}_0 \varvec{x})))\). Define \(G_{V,V}=\varvec{1}_{k_V+1}\) and for \(V=0,\ldots , L\), \(V'=0,\ldots , V-1\), let
where \(\odot \) denotes component wise multiplication.
With \(\psi (P_{-1})=\varvec{x}\), we define
By the above form and the fact that \(\psi (u),\psi '(u),|x_i|\le 1\), it can easily be checked by induction that \(|G_{v,L}|\le \prod _{v'=v+1}^L a_{v'}\), which completes the proof. \(\square \)
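As a quick numerical sanity check of the bound \(|G_{v,L}|\le \prod _{v'=v+1}^L a_{v'}\), the sketch below compares finite-difference derivatives of a small sigmoid network with the product of the downstream row-wise \(\ell _1\) norms; the layer widths, random weights and tolerance are arbitrary illustrative choices, assuming inputs in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Small random sigmoid network: 5 inputs, hidden widths 7 and 6, 4 units before the affine output.
widths = [5, 7, 6, 4]
A = [rng.normal(size=(widths[v + 1], widths[v])) for v in range(len(widths) - 1)]
b = [rng.normal(size=widths[v + 1]) for v in range(len(widths) - 1)]
A_L = rng.normal(size=widths[-1])     # final affine layer producing the scalar eta
b_L = rng.normal()

def eta(x, A, b, A_L, b_L):
    h = x
    for Av, bv in zip(A, b):
        h = sigmoid(bv + Av @ h)
    return b_L + A_L @ h

# a_{v'}: maximum row-wise L1 norm of A_{v'}, as in the statement of Lemma 11.
a = [np.max(np.sum(np.abs(Av), axis=1)) for Av in A] + [np.sum(np.abs(A_L))]

x = rng.uniform(0, 1, size=widths[0])          # inputs restricted to [0, 1]
delta = 1e-6
base = eta(x, A, b, A_L, b_L)

# Check |d eta / d A_v[i][j]| <= prod_{v' > v} a_{v'} for every weight of every hidden layer.
for v in range(len(A)):
    bound = np.prod(a[v + 1:])
    for i in range(A[v].shape[0]):
        for j in range(A[v].shape[1]):
            Ap = [Av.copy() for Av in A]
            Ap[v][i, j] += delta
            deriv = (eta(x, Ap, b, A_L, b_L) - base) / delta
            assert abs(deriv) <= bound + 1e-6, (v, i, j, deriv, bound)

print("all finite-difference weight derivatives respect the bound prod_{v'>v} a_{v'}")
```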
Lemma 12
Let, \(\widetilde{\mathcal {F}}_n=\{\sqrt{\ell }: \ell _{\varvec{\theta }_{n}}(y,\varvec{x}), \varvec{\theta }_{n} \in \mathcal {F}_n\}\) where \(\ell _{\varvec{\theta }_{n}}(y,\varvec{x})\) is given by
and \(\mathcal {F}_n\) is given by
Then, with \(H_{[]}(u,\widetilde{\mathcal {F}}_n,||.||_2)\) as in Definition 3,
Proof
In this proof, we suppress the dependence on n. By lemma 4.1 in Pollard (1990),
For \(\varvec{\theta }_1, \varvec{\theta }_2 \in \mathcal {F}\), let \(\widetilde{\ell }(u)=\sqrt{\ell _{u\varvec{\theta }_1+(1-u)\varvec{\theta }_2}(\varvec{x},y)}\). Following Equation (52) in Bhattacharya and Maiti (2021),
where the upper bound is \(F(\varvec{x},y)=(CK)^L\). This is because \(|\partial \widetilde{\ell }/\partial \theta _j|\), the derivative of \(\sqrt{\ell }\) with respect to \(\theta _j\), is bounded above by \(|\partial \eta _{\varvec{\theta }}(\varvec{x})/\partial \theta _j|\), as shown below.
Thus, using \(e^{\eta _{\varvec{\theta }}(\varvec{x})}/(1+e^{\eta _{\varvec{\theta }}(\varvec{x})})\le 1\) and Lemma 11, we get
In view of (B6) and theorem 2.7.11 in van der Vaart and Wellner (1996), we have
where \(N_{[]}\) and \(H_{[]}\) denote the bracketing number and bracketing entropy as in Definition 3. Using Lemma 5 with \(M=K^{L+1}C^{L+2}\), we get
Therefore,
The proof follows by noting that \(\log (\sqrt{2}\varepsilon ) \ge \log \varepsilon \). \(\square \)
Proposition 13 establishes a bound on the log-likelihood ratio when the neural network lies outside the Hellinger neighborhood of the true density function. Proposition 14 shows that the prior gives negligible probability outside the sieve. Proposition 15 shows that the prior gives sufficiently large probability to KL-neighborhoods of the true density function. Propositions 13, 14 and 15 taken together will be used to establish the posterior consistency of the true posterior.
Proposition 13
Let \(n\epsilon _n^2\rightarrow \infty \). Suppose \(K_n\log n =o(n^b\epsilon _n^2)\), for some \(0<b<1\), \(L_n\sim \log n\) and \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\) where \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) and \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\). Then for every \(\varepsilon >0\),
Proof
It suffices to show
The expression on the left above is bounded above by
Using lemma 12 with \(\varepsilon =\varepsilon \epsilon _n\) and \(C_n=e^{n^b\epsilon _n^2/K_n}\),
where \(H_{[]}(u,\widetilde{\mathcal {F}}_n,||.||_2)\) is as in Definition 3. The first inequality in the third step follows because \(L_n\sim \log n\), \(K_n\log n=o(n^b\epsilon _n^2)\), \(K_n\log C_n =n^b \epsilon _n^2\) and \( -\log \epsilon _n^2\le \log n\). The second inequality in the third step is by \((n^b \log n)/n=o(1)\).
By theorem 1 in Wing Hung and Xiaotong (1995), for some constant \(C>0\), we have
Using proposition 14 with \(\varepsilon =2\varepsilon \), we have
Therefore, using Lemma 6 with \(\varepsilon =2\varepsilon ^2\epsilon _n^2\) and \(\tilde{\varepsilon }={\varepsilon }^2 \epsilon _n^2\), we have
Combining (B8) and (B9), (B7) follows. \(\square \)
Proposition 14
Let \(n\epsilon _n^2\rightarrow \infty \). Let \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\) where \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) and \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\). Suppose for some \(0<b<1\), \(K_n\log n=o(n^b\epsilon _n^2)\), then for \(C_n=e^{n^b \epsilon _n^2/K_n}\) and \(\mathcal {F}_n\) as in (33), for any \(\varepsilon >0\),
Proof
Let \(\mathcal {F}_{jn}=\{\theta _{jn}: |\theta _{jn}|\le C_n\}\), then \(\mathcal {F}_n=\cap _{j=1}^{K_n} \mathcal {F}_{jn}\implies \mathcal {F}_n^c= \cup _{j=1}^{K_n}\mathcal {F}_{jn}^c\). This implies \(\int _{\varvec{\theta }_{n} \in \mathcal {F}_n^c}p(\varvec{\theta }_{n})d\varvec{\theta }_{n}\le \sum _{j=1}^{K_n}\int _{\mathcal {F}_{jn}^c}(e^{-\frac{(\theta _{jn}-\mu _{jn})^2}{2\zeta _{jn}^2}}/\sqrt{2\pi \zeta _{jn}^2})d\theta _{jn}\). Thus,
Note that \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\) implies \(||\varvec{\mu }_n||_\infty =o(\sqrt{n}\epsilon _n)\). Also, \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) implies that for some \(M>0\) and \(d\ge 1\),
where the last convergence holds since \(K_n\log n=o(n^b \epsilon _n^2)\). This further implies \(R_n=(n^b \epsilon _n^2)/(K_n\log n)-(d+1) \rightarrow \infty \). Thus, using Mill’s ratio, we get:
where the last asymptotic inequality holds because
In the above step, the first asymptotic equivalence is by (B10), and the second inequality holds since \(K_n\le n\). The last inequality is by \(R_n \rightarrow \infty \) and \((\log n)/n\rightarrow 0\). \(\square \)
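For reference, the Mill’s ratio bound invoked above (and again in the proof of Proposition 17) is the standard Gaussian tail inequality, applied after standardizing each coordinate:

$$\begin{aligned} 1-\Phi (t)=P(Z>t)\le \frac{\phi (t)}{t}=\frac{1}{t\sqrt{2\pi }}e^{-t^2/2}, \quad Z\sim N(0,1),\ t>0. \end{aligned}$$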
Proposition 15
Let \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\) with \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\), \(||\varvec{\zeta }^*_n||_\infty =O(1)\). Let \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1\le \varepsilon \epsilon _n^2/4\), \(n\epsilon _n^2 \rightarrow \infty \). Define,
If \(K_n\log n =o(n\epsilon _n^2)\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\), \(\log (\sum _{v=0}^{L_n} k_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})=O(\log n)\), \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\),
Proof
Let \(\eta _{\varvec{\theta }^*_n}(\varvec{x})=\varvec{b}_L^*+\varvec{A}_L^* \psi (\varvec{b}_{L-1}^*+\varvec{A}_{L-1}^* \psi ( \cdots \psi (\varvec{b}_1^*+\varvec{A}_1^*\psi (\varvec{b}_0^*+\varvec{A}_0^* \varvec{x})))\) be the neural network such that
Such a neural network exists since \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1\le \varepsilon \epsilon _n^2/4\). Next define \(\mathcal {M}_{\varepsilon \epsilon _n^2}\) as:
where \(\tilde{k}_{vn}=k_{vn}+1\). For every \(\varvec{\theta }_{n} \in \mathcal {M}_{\varepsilon \epsilon _n^2}\), by Lemma 8, we have
Combining (B12) and (B13), we get for \(\varvec{\theta }_{n}\in \mathcal {M}_{\varepsilon \epsilon _n^2}\), \(||\eta _{\varvec{\theta }_{n}}-\eta _{0}||_1 \le \varepsilon \epsilon _n^2/2\).
This, in view of Lemma 10, gives \(d_{\textrm{KL}}(\ell _0,\ell _{\varvec{\theta }_{n}}) \le \varepsilon \epsilon _n^2\); hence \(\varvec{\theta }_{n} \in \mathcal {N}_{\varepsilon \epsilon _n^2}\) for every \(\varvec{\theta }_{n} \in \mathcal {M}_{\varepsilon \epsilon _n^2}\). Therefore,
Let \(\delta _n=\varepsilon \epsilon _n^2/(2\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})\), then
where the second last equality holds by the mean value theorem.
Note that \(\widehat{\theta }_{jn} \in [\theta _{jn}^*-1,\theta _{jn}^*+1]\) since \(\delta _n \rightarrow 0\), therefore
where the last inequality follows since \((a+b)^2\le 2(a^2+b^2)\). Also,
since \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\), \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\) and \(||\varvec{\zeta }_n^*||_\infty =O(1)\) and \(n\epsilon _n^2 \rightarrow \infty \). Also,
where the last follows since \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\), \(\log (\sum _{v=0}^{L_n} k_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})=O(\log n)\) and \(1/n\epsilon _n^2=o(1)\) which implies \(-2\log \epsilon _n=o(\log n)\).
where the last inequality follows since \(K_n\log n=o(n\epsilon _n^2)\),
Combining (B15) and (B16) and substituting into (B14), the proof follows. \(\square \)
Proposition 16 establishes that under a suitable choice of the variational family q and the prior p, the KL distance between p and q is suitably bounded. Proposition 17 shows that the integral of the logistic loss between the neural network model and the true model with respect to the variational family q is small. Propositions 15, 16 and 17 taken together will be used to establish that the KL-distance between the true posterior and the variational posterior is suitably bounded.
Proposition 16
For \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) and \(||\varvec{\zeta }^*_n||_\infty =O(1)\), let \(q(\varvec{\theta }_{n})=MVN(\varvec{\theta }^*_n,I_{K_n}/n^{2+2d})\) and \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\). Let \(K_n\log n\) \(=o(n\epsilon _n^2)\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\), \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\) and \(n\epsilon _n^2 \rightarrow \infty \), then for any \(\nu >0\),
Proof
where the second last inequality uses \(\varvec{\zeta }^*_n=1/\varvec{\zeta }_n\). The last equality follows since \(\log ||\varvec{\zeta }_n||_{\infty }=O(\log n)\), \(||\varvec{\zeta }_n^*||_\infty =O(1)\), \(K_n\log n=o(n\epsilon _n^2)\), \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\) and \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\). \(\square \)
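The computation above rests on the standard closed form of the KL divergence between multivariate normals with diagonal covariance matrices (stated here for convenience): for \(q=MVN(\varvec{m}_1,\text {diag}(\varvec{s}_1^2))\) and \(p=MVN(\varvec{m}_2,\text {diag}(\varvec{s}_2^2))\),

$$\begin{aligned} d_{\textrm{KL}}(q,p)=\sum _{j=1}^{K_n}\left[ \log \frac{s_{2j}}{s_{1j}}+\frac{s_{1j}^2+(m_{1j}-m_{2j})^2}{2s_{2j}^2}-\frac{1}{2}\right] , \end{aligned}$$

with \(\varvec{m}_1=\varvec{\theta }^*_n\), \(\varvec{s}_1^2=n^{-(2+2d)}\varvec{1}_{K_n}\), \(\varvec{m}_2=\varvec{\mu }_n\) and \(\varvec{s}_2^2=\varvec{\zeta }_n^2\) here.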
Proposition 17
Let \(q(\varvec{\theta }_{n}) \sim MVN(\varvec{\theta }^*_n,I_{K_n}/n^{2+2d})\) where \(d>d^*>0\) and \(\sum _{v=0}^{L_n} k_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}=O(n^{d^*})\). Define
Let \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1\le \varepsilon \epsilon _n^2/4\) where \(n\epsilon _n^2 \rightarrow \infty \). If \(K_n\log n=o(n\epsilon _n^2)\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\),
Proof
Since \(h(\varvec{\theta }_{n})\) is a KL-distance, \(h(\varvec{\theta }_{n})>0\). We establish an upper bound:
where the first inequality is a consequence of Lemma 10 and the last inequality follows since \(||\eta _{\varvec{\theta }_{n}^*}-\eta _0||_1=o(\epsilon _n^2)\).
Let \(S=\{\varvec{\theta }_{n}:\cap _{j=1}^{K_n}|\theta _{jn}-\theta _{jn}^*|\le \varepsilon \epsilon _n^2/(\sum _{v=0}^L \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}) \}\), then
Let \(S^c=\cup _{j=1}^{K_n}S_j^c\), \(S_j=\{|\theta _{jn}-\theta _{jn}^*|\le u_n\}\), \(u_n=\varepsilon \epsilon _n^2/(\sum _{v=0}^L \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})\). We first compute \(Q(S^c)\) as follows:
Using (B19) in the last term of (B18), we get
where the second step follows by Mill’s ratio, \(K_n=o(n\epsilon _n^2)\) and \(\sum _{v=0}^{L_n} k_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}=O(n^d)\), which implies \(n^{1+d}u_n \rightarrow \infty \). The third step holds because
since \((\sum _{v=0}^L \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n} )^2 \log n=O(n^{2d^*} \log n)=o(n^{2d})\).
For the second term in (B18), let \({S'}=\{|\varvec{b}_L[s]-\varvec{b}_L^*[s]|>u_n\}\)
\(\tilde{S}^c\) is the union of all \(S_j^c\), \(j=1, \ldots , K_n\) except the one corresponding to \(\varvec{b}_{L}[s]\).
Also, \(E_{q(\varvec{b}_L[s])}|\varvec{b}_L[s]-\varvec{b}_L^*[s]|=\sqrt{2/\pi }(1/n^{1+d})\). Thus
where the first equality in the above step follows by observing that \(Q(\tilde{S}^c)\) behaves analogously to \(Q(S^c)\), which was computed in (B19), and the second equality in the above step follows due to Mill’s ratio and \(\sum _{v=0}^L \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}=O(n^{d^*})\), which implies \(n^{1+d} u_n \rightarrow \infty \). The third inequality in the above step is a consequence of the fact that \(K_n\le n^{1+d}\).
Combining (B20), (B23) and (B24), we get
Note that the third term in (B18) can be handled similarly to the second term, and it can be shown that
where the last equality in the second step follows by \(K_n=o(n \epsilon _n^2)\) and the argument in (B21) by which \(e^{-(n^{1+d}u_n-2\log n)}=o(1)\).
Combining (B20) and (B26) with (B18), the proof follows. \(\square \)
Using Propositions 15, 16 and 17, the following Proposition 18 establishes that the KL-distance between the true posterior and the variational posterior is suitably bounded.
Proposition 18
Let \(p(\varvec{\theta }_{n})=MVN(\varvec{\mu }_n,\text {diag}(\varvec{\zeta }_n^2))\), \(||\varvec{\zeta }_n||_\infty =O(n)\) and \(||\varvec{\zeta }^*_n||_\infty =O(1)\).
1. Let \(L_n=L\), \(p_n=p\) independent of n. If \(K_n\log n=o(n)\) and \(||\varvec{\mu }_n||_2^2=o(n)\), then
$$\begin{aligned} d_{\textrm{KL}}(\pi ^*,\pi (.|\varvec{y}_{n},\varvec{X}_{n}))=o_{P_0^n}(n) \end{aligned}$$ (B27)
2. Let \(K_n\log n=o(n\epsilon _n^2)\), \(L_n \sim \log n\) and \(||\varvec{\mu }_n||_2^2=o(n\epsilon _n^2)\). If there exists a neural network such that \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1=o(n\epsilon _n^2)\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\) and \(\log (\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})=O(\log n)\), then
$$\begin{aligned} d_{\textrm{KL}}(\pi ^*,\pi (.|\varvec{y}_{n},\varvec{X}_{n}))=o_{P_0^n}(n\epsilon _n^2) \end{aligned}$$ (B28)
Proof
For any \(q \in \mathcal {Q}_n\),
Since \(\pi ^*\) minimizes the KL-distance to \(\pi (.|\varvec{y}_{n},\varvec{X}_{n})\) in the family \(\mathcal {Q}_n\), for any \(\kappa >0\)
\(\square \)
Proof of part 1
Note, \(K_n\log n=o(n)\), \(||\mu _n||_2^2 =o(n)\), \(||\varvec{\zeta }_n||_\infty =O(n)\) and \(||\varvec{\zeta }^*_n||_\infty =O(1)\). Let \(q(\varvec{\theta }_{n})=MVN(\varvec{\theta }^*_n, \varvec{I}_{K_n}/\sqrt{n})\), where \(\varvec{\theta }^*_n\) is defined next. For \(N\ge 1\), let \(\eta _{\varvec{\theta }^*_N}\) be a finite neural network approximation satisfying \(||\eta _{\varvec{\theta }^*_N}-\eta _0||_1 \le \varepsilon /4\). Since \(\eta _0\) is a continuous function defined on the compact set \([0,1]^p\), the existence of such a neural network is guaranteed by Theorem 2.1 in Hornik et al. (1989). Let \(\varvec{\theta }_{n}^*\) coincide with \(\varvec{\theta }_N^*\) on the coefficients of this finite network and be zero on all remaining coefficients.
Step 1 (a): Using proposition 16, with \(\epsilon _n=1\), we get for any \(\nu >0\),
where the above step follows since \(||\varvec{\theta }^*_n||_2^2=||\varvec{\theta }^*_N||_2^2=o(n)\).
Step 1 (b): Next, note that
Since \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1 \le \varepsilon /4\), using proposition 17 with \(\epsilon _n=1\) and \(\varepsilon =\varepsilon \), \(\int d_{\textrm{KL}}(\ell _0,\ell _{\varvec{\theta }_{n}})q(\varvec{\theta }_{n})d\varvec{\theta }_{n} \le \varepsilon \) which follows by noting that \(||\varvec{\theta }^*_n||_2^2=||\varvec{\theta }^*_N||_2^2=o(n)\) and \(\log (\sum _{v=0}^{L} \tilde{k}_{vN}\prod _{v'=v+1}^{L} a^*_{v'N})=O(\log n)\).
Therefore, by Lemma 9,
Step 1 (c): Since \(||\eta _0-\eta _{\varvec{\theta }^*_n}||_1\le \varepsilon /4\), using proposition 15 with \(\epsilon _n=1\) and \(\nu =\varepsilon \),
which follows by \(||\varvec{\theta }^*_n||_2^2=||\varvec{\theta }^*_N||_2^2=o(n)\) and \(\log (\sum _{v=0}^{L} \tilde{k}_{vn}\prod _{v'=v+1}^{L} a^*_{v'n})=O(\log n)\). Therefore, using Lemma 7, we get
Step 1 (d): From (B30) and (B29) we get
where the last inequality is a consequence of (B31), (B33) and (B34).
Since \(\varepsilon \) is arbitrary, taking \(\varepsilon \rightarrow 0\) completes the proof. \(\square \)
Proof of part 2
Note, \(K_n\log n=o(n\epsilon _n^2)\), \(||\mu _n||_2^2 =o(n\epsilon _n^2)\), \(\log ||\varvec{\zeta }_n||_\infty =O(\log n)\) and \(||\varvec{\zeta }^*_n||_\infty =O(1)\). Let \(q(\varvec{\theta }_{n})=MVN(\varvec{\theta }^*_n, \varvec{I}_{K_n}/n^{2+2d}), d>d^*\) where \(\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n}=O(n^{d^*})\), \(d^*>0\). We next define \(\varvec{\theta }_{n}^*\) as follows:
Let \(\eta _{\varvec{\theta }^*_n}\) be the neural network satisfying
The existence of such a neural network is guaranteed since \(||\eta _{\varvec{\theta }^*_n}-\eta _0||_1=o(\epsilon _n^2)\).
Step 2 (a): Since \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\), by proposition 16,
Step 2 (b): By proposition 17, since \(||\eta _{\varvec{\theta }^*_n}-\eta _0||_1\le \varepsilon \epsilon _n^2/4\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\) and \((\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})\log n=o(n\epsilon _n^2)\), we have
Therefore, by Lemma 9,
Step 2 (c): By proposition 15, since \(||\eta _{\varvec{\theta }^*_n}-\eta _0||_1\le \varepsilon \epsilon _n^2/4\), \(||\varvec{\theta }^*_n||_2^2=o(n\epsilon _n^2)\) and \(\log (\sum _{v=0}^{L_n} \tilde{k}_{vn}\prod _{v'=v+1}^{L_n} a^*_{v'n})=O(\log n)\), we have
Therefore, using Lemma 7, we get
Step 2 (d): From (B30) and (B29) we get
where the last inequality is a consequence of (B36), (B37) and (B38).
Since \(\varepsilon \) is arbitrary, taking \(\varepsilon \rightarrow 0\) completes the proof. \(\square \)
Appendix C Consistency of the variational posterior
Proof of Theorem 1
We assume Relation (32) holds with \(A_n\) and \(B_n\) the same as in (31).
By assumptions (A1) and (A2), the prior parameters satisfy
Note \(K_n \sim n^a\), \(0<a<1\), which implies \(K_n\log n=o(n)\). By proposition 18, part 1,
By step 1 (c) in the proof of proposition 18
Since \(K_n \sim n^a\), we have \(K_n\log n=o(n^b)\) for \(a<b<1\). Using proposition 13 with \(\epsilon _n=1\),
Thus, using (C40), (C41) and (C42) in (32), we get
\(\square \)
Proof of Theorem 2
We assume Relation (32) holds with \(A_n\) and \(B_n\) the same as in (31).
Let \(K_n\sim n^a\) and \(\epsilon _n^2\sim n^{-\delta }\), \(0<\delta <1-a\). This implies \(K_n\log n=o(n\epsilon _n^2)\).
By assumptions (A1) and (A4), the prior parameters satisfy
Also by assumption (A3),
By proposition 18, part 2,
By step 2 (c) in the proof of proposition 18
Since \(K_n \sim n^a\), we have \(K_n\log n=o(n^b \epsilon _n^2)\) for \(a+\delta<b<1\). Using proposition 13, it follows that
Thus, using (C43), (C44) and (C45) in (32), we get
\(\square \)
Proof of Corollary 1
Let \(\hat{\ell }_n(y,\varvec{x})=\int \ell _{\varvec{\theta }_{n}}(y,\varvec{x}) \pi ^*(\varvec{\theta }_{n})d\varvec{\theta }_{n}\).
Taking \(\varepsilon \rightarrow 0\), we get \(d_{\text {H}}(\hat{\ell }_n,\ell _0)=o_{P_0^n}(1)\). Let
then, note that \(\hat{\eta }(\varvec{x})=\log ( \hat{\ell }_n(1,\varvec{x})/\hat{\ell }_n(0,\varvec{x}))\). Further,
This implies
In the above equation, the sixth and the seventh step hold because \(\sqrt{1-x}\le 1-x/2\) and \(|p_1-p_2|\le |\sqrt{p_1}+\sqrt{p_2}||\sqrt{p_1}-\sqrt{p_2}|\le 2|\sqrt{p_1}-\sqrt{p_2}|\) respectively. The fifth step holds because
By (C47) and the Cauchy–Schwarz inequality,
The proof follows in view of (35). \(\square \)
Proof of Corollary 2
We assume Relation (32) holds with \(A_n\) and \(B_n\) the same as in (31).
Let \(K_n\sim n^a\) and \(\epsilon _n^2\sim n^{-\delta }\), \(0<\delta <1-a\). This implies \(K_n\log n=o(n\epsilon _n^2)\).
Also, \(K_n\log n=o(n^b \epsilon _n^2)\), \(a+\delta<b<1\). This implies \(K_n\log n =o(n^b (\epsilon _n^2)^\kappa )\), \(0\le \kappa \le 1\). Thus, using proposition 13 with \(\epsilon _n=\epsilon _n^{\kappa }\), we get
This together with (C43), (C44) and (32) implies \(\pi ^*(\mathcal {U}_{\varepsilon \epsilon _n^\kappa }^c)=o_{P_0^n}(\epsilon _n^{2-2\kappa })\).
Let \(\hat{\ell }_n(y,\varvec{x})=\int \ell _{\varvec{\theta }_{n}}(y,\varvec{x}) \pi ^*(\varvec{\theta }_{n})d\varvec{\theta }_{n}\), then
Dividing by \(\epsilon _n^\kappa \) on both sides we get
By (C48), for every \(0\le \kappa \le 2/3\),
The proof follows in view of (35). \(\square \)
Appendix D Consistency of the true posterior
From (11), note that
Theorem 19
Suppose conditions of Theorem 1 hold. Then,
1. $$\begin{aligned} P_0^n\left( \pi (\mathcal {U}_{\varepsilon }^c|\varvec{y}_{n},\varvec{X}_{n})\le 2e^{-n\varepsilon ^2/2}\right) \rightarrow 1, n \rightarrow \infty \end{aligned}$$
2. $$\begin{aligned} P_0^n(|R(\hat{C})-R(C^\textrm{Bayes})|\le 8\sqrt{2}\varepsilon )\rightarrow 1,n \rightarrow \infty \end{aligned}$$
Proof
By assumptions (A1) and (A2), the prior parameters satisfy
Note \(K_n \sim n^a\), \(0<a<1\) which implies \(K_n\log n=o(n)\). Thus, the conditions of proposition 15 hold with \(\epsilon _n=1\).
\(n \rightarrow \infty \), which follows from (B34) (see step 1 (c) in the proof of proposition 18). Since \(K_n\log n=o(n^b)\), \(a<b<1\), proposition 13 holds with \(\epsilon _n=1\).
where the last equality follows from (B7) with \(\epsilon _n=1\) in the proof of proposition 13. Using (D51) and (D52) with (D50), we get
Take \(\nu =\varepsilon ^2/2\) to complete the proof of part 1. For part 2, mimicking the steps in the proof of corollary 1,
where the second last inequality is a consequence of part 1 of Theorem 19. The remaining part of the proof follows by (C48) and (35). \(\square \)
Theorem 20
Suppose conditions of Theorem 2 hold. Then,
1. $$\begin{aligned} P_0^n\left( \pi (\mathcal {U}_{\varepsilon \epsilon _n}^c|\varvec{y}_{n},\varvec{X}_{n})\le 2e^{-n\epsilon _n^2\varepsilon ^2/2}\right) \rightarrow 1, n \rightarrow \infty \end{aligned}$$
2. $$\begin{aligned} P_0^n(|R(\hat{C})-R(C^\textrm{Bayes})|\le 8\sqrt{2}\varepsilon \epsilon _n)\rightarrow 1,n \rightarrow \infty \end{aligned}$$
Proof
By assumptions (A1) and (A4), the prior parameters satisfy
Also by assumption (A3),
Note \(K_n \sim n^a\), \(0<a<1\) and \(\epsilon _n\sim n^{-\delta }\), \(0<\delta <1-a\); hence \(K_n\log n=o(n\epsilon _n^2)\). Thus, the conditions of proposition 15 hold.
for \( n \rightarrow \infty \), where the above convergence follows from (B38) in step 2 (c) of the proof of proposition 18. Also, \(K_n\log n=o(n^b \epsilon _n^2)\) for \(a+\delta<b<1\); thus the conditions of proposition 13 hold.
where the last equality follows from (B7) in the proof of proposition 13.
Using (D53) and (D54) with (D50), we get \(P_0^n\left( \pi (\mathcal {U}_{\varepsilon \epsilon _n}^c|\varvec{y}_{n},\varvec{X}_{n})\ge 2e^{-n\epsilon _n^2(\varepsilon ^2-\nu )}\right) \rightarrow 0, n \rightarrow \infty \). Take \(\nu =\varepsilon ^2/2\) to complete the proof of part 1. For part 2, mimicking the steps in the proof of corollary 2,
where the second last inequality is a consequence of part 1 of Theorem 20 and the last equality follows since \(\epsilon _n \sim n^{-\delta }\). Dividing by \(\epsilon _n\) on both sides, we get
The remaining part of the proof follows by (C48) and (35). \(\square \)
Appendix E Tables for real data
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bhattacharya, S., Liu, Z. & Maiti, T. Comprehensive study of variational Bayes classification for dense deep neural networks. Stat Comput 34, 17 (2024). https://doi.org/10.1007/s11222-023-10338-9