Abstract
In this work we investigate why Batch Normalization (BN) improves the generalization performance of deep networks. We argue that one major reason, distinguishing it from data-independent normalization methods, is the randomness of the batch statistics. This randomness appears in the parameters rather than in the activations and admits an interpretation as a practical form of Bayesian learning. We apply this idea to other (deterministic) normalization techniques that are oblivious to the batch size. We show that their generalization performance can be improved significantly by Bayesian learning of the same form. We obtain test performance comparable to BN and, at the same time, better validation losses suitable for subsequent output uncertainty estimation through the approximate Bayesian posterior.
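To make the central observation concrete, here is a minimal, hypothetical NumPy sketch (not the paper's method): it contrasts batch normalization, whose mean and standard deviation fluctuate with the sampled mini-batch, with a data-independent normalization that uses fixed statistics but injects noise of the same sampling distribution directly into its shift and scale parameters. All function names and the exact noise model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(x, eps=1e-5):
    # Standard BN over the batch axis: the mean and std depend on the sampled
    # batch, so the effective shift and scale are random quantities.
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)
    return (x - mu) / (sigma + eps)

def fixed_norm_with_param_noise(x, mu_pop, sigma_pop, m, eps=1e-5):
    # Data-independent normalization with fixed population statistics, plus
    # noise injected into the shift/scale that mimics the sampling distribution
    # of batch statistics for a batch of size m: a Gaussian-distributed mean and
    # a (scaled) chi-distributed standard deviation. This is an illustrative
    # stand-in for sampling the normalization parameters in a Bayesian fashion.
    noisy_mu = mu_pop + sigma_pop * rng.standard_normal(mu_pop.shape) / np.sqrt(m)
    chi = np.sqrt(rng.chisquare(m - 1, size=sigma_pop.shape) / (m - 1))
    noisy_sigma = sigma_pop * chi
    return (x - noisy_mu) / (noisy_sigma + eps)

# Toy batch: 32 samples, 4 features, drawn from a shifted/scaled Gaussian.
x = rng.standard_normal((32, 4)) * 2.0 + 1.0
print(batch_norm(x).mean(axis=0))                                  # ~0 by construction
print(fixed_norm_with_param_noise(x, x.mean(0), x.std(0), m=32).mean(axis=0))  # ~0 up to parameter noise
```

Both layers produce nearly standardized activations, but only the second localizes the randomness in the shift and scale parameters, mirroring the paper's observation that the stochasticity introduced by BN lives in the parameters rather than in the activations.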
Notes
1. Using well-known results for the distribution of the sample mean and variance of normally distributed variables. The inverse chi distribution is the distribution of \(1/S\) when \(S^2\) has a chi-squared distribution [13].
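For reference, the standard sampling results the note relies on can be written as follows for i.i.d. Gaussian samples of size \(m\) (notation ours):
\[
X_1,\dots,X_m \sim \mathcal{N}(\mu,\sigma^2) \ \text{i.i.d.}, \qquad
\bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i, \qquad
S^2 = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(X_i-\bar{X}\bigr)^2,
\]
\[
\bar{X} \sim \mathcal{N}\!\Bigl(\mu,\tfrac{\sigma^2}{m}\Bigr), \qquad
\frac{(m-1)S^2}{\sigma^2} \sim \chi^2_{m-1}
\;\Longrightarrow\;
\frac{\sqrt{m-1}\,S}{\sigma} \sim \chi_{m-1},
\]
so that \(1/S\) is, up to the scale factor \(\sqrt{m-1}/\sigma\), distributed according to the inverse chi distribution with \(m-1\) degrees of freedom.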
References
Arpit, D., Zhou, Y., Kota, B.U., Govindaraju, V.: Normalization propagation: a parametric technique for removing internal covariate shift in deep networks. In: ICML, pp. 1168–1176 (2016)
Atanov, A., Ashukha, A., Molchanov, D., Neklyudov, K., Vetrov, D.: Uncertainty estimation via stochastic batch normalization. In: ICLR Workshop Track (2018)
Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D.: Weight uncertainty in neural networks. In: ICML, pp. 1613–1622 (2015)
Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). In: ICLR (2016)
Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: ICML, pp. 1050–1059 (2016)
Gal, Y., Ghahramani, Z.: A theoretically grounded application of dropout in recurrent neural networks. In: NIPS, pp. 1027–1035 (2016)
Gast, J., Roth, S.: Lightweight probabilistic deep networks. In: CVPR, June 2018
Gitman, I., Ginsburg, B.: Comparison of batch normalization and weight normalization algorithms for the large-scale image classification. CoRR abs/1709.08145 (2017)
Graves, A.: Practical variational inference for neural networks. In: NIPS, pp. 2348–2356 (2011)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML, vol. 37, pp. 448–456 (2015)
Kingma, D.P., Salimans, T., Welling, M.: Variational dropout and the local reparameterization trick. In: NIPS, pp. 2575–2583 (2015)
Lee, P.: Bayesian Statistics: An Introduction. Wiley, Hoboken (2012)
Lei Ba, J., Kiros, J.R., Hinton, G.E.: Layer normalization. ArXiv e-prints, July 2016
Li, X., Chen, S., Hu, X., Yang, J.: Understanding the disharmony between dropout and batch normalization by variance shift. CoRR abs/1801.05134 (2018)
Luenberger, D.G., Ye, Y.: Linear and Nonlinear Programming. Springer, New York (2015)
Maška, M., et al.: A benchmark for comparison of cell tracking algorithms. Bioinformatics 30(11), 1609–1617 (2014)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Salimans, T., Kingma, D.P.: Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In: NIPS (2016)
Santurkar, S., Tsipras, D., Ilyas, A., Madry, A.: How does batch normalization help optimization? (no, it is not about internal covariate shift). CoRR abs/1805.11604 (2018)
Schulman, J., Heess, N., Weber, T., Abbeel, P.: Gradient estimation using stochastic computation graphs. In: NIPS, pp. 3528–3536 (2015)
Shekhovtsov, A., Flach, B.: Normalization of neural networks using analytic variance propagation. In: Computer Vision Winter Workshop, pp. 45–53 (2018)
Springenberg, J., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net. In: ICLR (Workshop Track) (2015)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. JMLR 15, 1929–1958 (2014)
Teye, M., Azizpour, H., Smith, K.: Bayesian uncertainty estimation for batch normalized deep networks. In: ICML (2018)
Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC, pp. 87.1–87.12, September 2016
Acknowledgments
A.S. has been supported by Czech Science Foundation grant 18-25383S and Toyota Motor Europe. B.F. gratefully acknowledges support by the Czech OP VVV project “Research Center for Informatics” (CZ.02.1.01/0.0/0.0/16_019/0000765).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Shekhovtsov, A., Flach, B. (2019). Stochastic Normalizations as Bayesian Learning. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science, vol 11362. Springer, Cham. https://doi.org/10.1007/978-3-030-20890-5_30
DOI: https://doi.org/10.1007/978-3-030-20890-5_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20889-9
Online ISBN: 978-3-030-20890-5
eBook Packages: Computer Science, Computer Science (R0)