Abstract
Batch Normalization was introduced as a novel solution to aid the training of fully-connected feed-forward deep neural networks. It normalizes each training batch in order to alleviate the problem caused by internal covariate shift. The original method claimed that Batch Normalization must be performed before the ReLU activation during training for optimal results. However, a second approach has since gained ground, which stresses the importance of performing BN after the ReLU activation in order to maximize performance. In fact, in the source code of PyTorch, common architectures such as VGG16, ResNet and DenseNet place the Batch Normalization layer after the ReLU activation layer. Our work is the first to demystify this debate and offer a comprehensive answer as to the proper order of Batch Normalization in neural network training. We demonstrate that for convolutional neural networks (CNNs) without skip connections, it is optimal to apply the ReLU activation before Batch Normalization, as a result of higher gradient flow. In residual networks with skip connections, the order affects neither the performance nor the gradient flow between the layers.
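To make the two orderings concrete, the following minimal PyTorch sketch builds a single convolutional block both ways (Conv→BN→ReLU, as in the original Batch Normalization paper, versus Conv→ReLU→BN, the alternative examined here). Channel sizes and the input shape are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; not taken from the paper.
IN_CH, OUT_CH = 3, 64

# Ordering advocated by the original BN paper: Conv -> BN -> ReLU.
bn_before_relu = nn.Sequential(
    nn.Conv2d(IN_CH, OUT_CH, kernel_size=3, padding=1),
    nn.BatchNorm2d(OUT_CH),
    nn.ReLU(inplace=True),
)

# Alternative ordering examined in this work: Conv -> ReLU -> BN.
relu_before_bn = nn.Sequential(
    nn.Conv2d(IN_CH, OUT_CH, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.BatchNorm2d(OUT_CH),
)

# Dummy batch to confirm both blocks produce tensors of the same shape.
x = torch.randn(8, IN_CH, 32, 32)
print(bn_before_relu(x).shape, relu_before_bn(x).shape)
```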
References
Chollet, F.: Keras (2015)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning, chap. 8: Optimization for Training Deep Models. MIT Press, Cambridge (2016)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient BackProp. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 9–48. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_3
Paszke, A., Gross, S., Chintala, S., Chanan, G.: PyTorch. Computer software, version 0.3 (2017)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)