
Expert Systems with Applications

Volume 124, 15 June 2019, Pages 271-281

Enhancing batch normalized convolutional networks using displaced rectifier linear units: A systematic comparative study

https://doi.org/10.1016/j.eswa.2019.01.066

Highlights

  • Enhanced nonlinearities may improve the performance of expert systems.

  • Proposal of the activation function DReLU.

  • DReLU presents the best training speed in all cases.

  • DReLU enhances ReLU performance in all scenarios.

  • DReLU provides the best test accuracy in almost all experiments.

Abstract

A substantial number of expert and intelligent systems rely on deep learning methods to solve problems in areas such as economics, physics, and medicine. Improving the activation functions used by such methods can directly and positively impact the overall performance and quality of these systems at no cost whatsoever. In this sense, enhancing the design of such fundamental theoretical building blocks is of great significance, as it immediately impacts a broad range of current and future real-world deep learning based applications. Therefore, in this paper, we turn our attention to the interworking between activation functions and batch normalization, which is currently a practically mandatory technique for training deep networks. We propose the activation function Displaced Rectifier Linear Unit (DReLU) by conjecturing that extending the identity function of ReLU into the third quadrant enhances compatibility with batch normalization. Moreover, we used statistical tests to compare the impact of distinct activation functions (ReLU, LReLU, PReLU, ELU, and DReLU) on the learning speed and test accuracy of standardized state-of-the-art VGG and Residual Network models. These Convolutional Neural Networks were trained on CIFAR-100 and CIFAR-10, the most commonly used deep learning computer vision datasets. The results showed that DReLU sped up learning in all models and datasets. In addition, statistically significant performance assessments (p < 0.05) showed that DReLU enhanced the test accuracy obtained with ReLU in all scenarios. Furthermore, DReLU showed better test accuracy than any other tested activation function in all experiments with one exception, in which case it presented the second best performance. Therefore, this work demonstrates that it is possible to increase performance by replacing ReLU with an enhanced activation function.

Introduction

The recent advances in deep learning research have produced more accurate image, speech, and language recognition systems and generated new state-of-the-art machine learning applications in a broad range of areas such as mathematics, physics, healthcare, genomics, finance, business, and agriculture.

For example, Nezhad, Sadati, Yang, and Zhu (2019) proposed an expert system for prostate cancer treatment recommendations based on deep learning. In another case, Ullah, Hussain, ul Haq Qazi, and Aboalsamh (2018) used deep learning to design an automated intelligent system for epilepsy detection. Moreover, Adem (2018) built an intelligent expert system for diabetic retinopathy detection, also using deep learning models.

Expert systems grounded in deep learning have also been proposed in areas such as finance, either to forecast stock market crises (Chatzis, Siakoulis, Petropoulos, Stavroulakis, & Vlachogiannakis, 2018) or to perform analysis and prediction (Chong, Han, & Park, 2017). Naturally, deep learning intelligent systems have been deployed in computer science as well (Do, Prasad, Maag, & Alsadoon, 2019; Park, Oh, & Kim, 2017; Yousefi-Azar & Hamey, 2017).

Hence, considering the large and rapidly increasing number of expert and intelligent systems that are or will be based on deep learning, proposing novel fundamental techniques that horizontally and simultaneously increase the performance of such systems is of vital importance, since advances of this kind can positively impact all of these systems at once.

Although advances have been made, accuracy enhancements have usually demanded considerably deeper or more complex models, which tend to increase the required computational resources (processing time and memory usage).

Instead of increasing the depth or complexity of deep models, a less computationally expensive alternative approach to enhance deep learning performance across the board is to design more efficient activation functions. Even when computational resources are not an issue, employing enhanced activation functions still contributes to speeding up learning and achieving higher accuracy.

Indeed, by allowing the training of deep neural networks, the discovery of Rectified Linear Units (ReLU) (Glorot, Bordes, & Bengio, 2011; Krizhevsky, Sutskever, & Hinton, 2012; Nair & Hinton, 2010) was one of the main factors that contributed to the advent of deep learning. ReLU allowed achieving higher accuracy in less time by avoiding the vanishing gradient problem (Hochreiter, 1991). Before ReLU, activation functions such as the Sigmoid and the Hyperbolic Tangent were unable to train deep neural networks because they lack the identity mapping for positive inputs.

However, ReLU presents drawbacks. For example, some researchers argue that its zero slope for negative inputs prevents learning in that region (He, Zhang, Ren, & Sun, 2016b; Maas, Hannun, & Ng, 2013). Therefore, other activation functions such as the Leaky Rectifier Linear Unit (LReLU) (Maas et al., 2013), the Parametric Rectifier Linear Unit (PReLU) (He et al., 2016b), and the Exponential Linear Unit (ELU) (Clevert, Unterthiner, & Hochreiter, 2015) were proposed. Unfortunately, there is no consensus on how these proposed nonlinearities compare to ReLU, which therefore remains the most used activation function in deep learning.
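
For reference, the sketch below gives minimal NumPy versions of the four baseline nonlinearities mentioned above. This is an illustration, not the implementation used in the paper; the defaults a = 0.01 and alpha = 1.0 are common choices rather than values reported here.

    import numpy as np

    def relu(x):
        # ReLU: identity for positive inputs, zero otherwise.
        return np.maximum(x, 0.0)

    def lrelu(x, a=0.01):
        # Leaky ReLU: small fixed slope a for negative inputs.
        return np.where(x > 0, x, a * x)

    def prelu(x, a):
        # PReLU: same form as Leaky ReLU, but the slope a is learned during training.
        return np.where(x > 0, x, a * x)

    def elu(x, alpha=1.0):
        # ELU: smooth exponential saturation for negative inputs.
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))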

Similar to activation functions, batch normalization (Ioffe & Szegedy, 2015) currently plays a fundamental role in training deep architectures. This technique normalizes the inputs of each layer, which is equivalent to normalizing the outputs of the previous layer of the deep model. However, before being used as inputs to the subsequent layer, the normalized data are typically fed into activation functions (nonlinearities), which necessarily skew the otherwise normalized distributions. In fact, ReLU only produces non-negative activations, which is harmful to the previously normalized data. After ReLU, the mean values of the outputs are no longer zero, but necessarily positive. Therefore, ReLU skews the normalized distribution.
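
As a small numeric illustration of this skew (not an experiment from the paper), the sketch below applies ReLU to approximately standard normal pre-activations, such as those produced by batch normalization, and shows that the resulting mean is strictly positive.

    import numpy as np

    rng = np.random.default_rng(0)
    pre_activations = rng.normal(size=100_000)      # roughly zero mean, unit variance
    post_relu = np.maximum(pre_activations, 0.0)    # ReLU output

    print(pre_activations.mean())   # approximately 0
    print(post_relu.mean())         # approximately 0.40, since E[max(Z, 0)] = 1/sqrt(2*pi) for standard normal Z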

Aiming to mitigate the mentioned problem, we concentrate our attention on the interaction between activation functions and batch normalization. We conjecture that nonlinearities that are more compatible with batch normalization present higher performance. Then, considering that the identity transformation preserves any statistical distribution, we assume that extending the identity function from the first quadrant into the third implies less damage to the normalization procedure.

Hence, we investigate and propose the activation function Displaced Rectifier Linear Unit (DReLU), which partially prolongs the identity function beyond the origin. Thus, DReLU is essentially a ReLU diagonally displaced into the third quadrant. Unlike all the previously mentioned activation functions, the inflection of DReLU does not occur at the origin, but in the third quadrant.
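
A minimal sketch consistent with this description is given below: a ReLU displaced diagonally by a quantity delta, so the identity extends past the origin and the inflection lies at (-delta, -delta) in the third quadrant. The value delta = 0.05 is purely illustrative; the displacement actually used in the paper is not shown in this preview.

    import numpy as np

    def drelu(x, delta=0.05):
        # Identity for x >= -delta; constant -delta below the displaced inflection point.
        return np.maximum(x, -delta)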

Considering their widespread adoption and practical importance, we used Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012; LeCun, Bottou, Bengio, & Haffner, 1998) in our experiments. Moreover, as particular examples of CNN architectures, we used the previous ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Visual Geometry Group (VGG) (Simonyan & Zisserman, 2014) and Residual Networks (ResNets) (He, Zhang, Ren, & Sun, 2016a; He, Zhang, Ren, & Sun, 2016c).

These architectures have distinctive designs and depths, which promotes generality in the conclusions of this work. In this regard, we evaluated how replacing the activation function impacts the performance of well-established and widely used standard state-of-the-art models. Finally, we employed the two computer vision datasets most broadly used by the deep learning research community: CIFAR-100 and CIFAR-10 (Krizhevsky, 2009).

Performance assessments were carried out using statistical tests with a 5% significance level. At least ten runs of each experiment were performed; when the mentioned significance was not achieved, ten additional runs were executed.
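
The exact statistical procedure is not detailed in this preview (the cited Conover reference suggests multiple-comparisons methods). Purely as a hedged illustration of comparing ten runs per configuration at the 5% level, the sketch below uses a nonparametric Mann-Whitney U test from SciPy on hypothetical accuracy values; neither the test choice nor the numbers come from the paper.

    from scipy.stats import mannwhitneyu

    acc_relu  = [72.1, 72.4, 71.9, 72.3, 72.0, 72.2, 71.8, 72.5, 72.1, 72.0]   # hypothetical values
    acc_drelu = [72.9, 73.1, 72.8, 73.0, 72.7, 73.2, 72.9, 73.0, 72.8, 73.1]   # hypothetical values

    stat, p_value = mannwhitneyu(acc_drelu, acc_relu, alternative="two-sided")
    print(p_value < 0.05)   # True indicates a statistically significant difference at the 5% level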

Section snippets

Background

Currently, all major activation functions adopt the identity transformation for positive inputs, some particular function for negative inputs, and an inflection at the origin. In the following subsections, we describe the compared activation functions.
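
Schematically (a summary of the sentence above, not a formula reproduced from the paper), these nonlinearities share the piecewise form

    f(x) =
    \begin{cases}
    x, & x > 0 \\
    g(x), & x \le 0
    \end{cases}

where the choice of g for negative inputs is what distinguishes ReLU, LReLU, PReLU, and ELU.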

Displaced rectifier linear units

In machine learning, normalizing the distribution of the input data decreases the training time and improves test accuracy (Tax & Duin, 2002). Consequently, normalization also improves neural network performance (LeCun, Bottou, Orr, & Müller, 2012). A standard approach to normalizing input data distributions is the mean-standard technique: the input data are transformed to have zero mean and a standard deviation of one.
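
The sketch below shows the mean-standard technique in its usual per-feature form (computing statistics per feature is a common convention assumed here, not a detail taken from this preview).

    import numpy as np

    def standardize(X):
        # X: array of shape (num_samples, num_features).
        # Rescale each feature to zero mean and unit standard deviation.
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        return (X - mean) / std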

However, if instead of working with shallow machine learning models, we are

Experiments

The experiments were executed on a machine configured with an Intel(R) Core(TM) i7-4790K CPU, 16 GB of RAM, a 2 TB HD, and a GeForce GTX 980 Ti graphics card. The operating system was Ubuntu 14.04 LTS with CUDA 7.5, cuDNN 5.0, and the Torch 7 deep learning library.

Results and discussion

In the following subsections, we analyze the tested scenarios. In each case, we first discuss the learning speed of the activation functions based on the test accuracy obtained for the partially trained models. Subsequently, we comment on the test accuracy performance of the activation functions, which corresponds to the respective model's test accuracy evaluated after 100 epochs. Naturally, we consider that an activation function presents better test accuracy if it showed the higher test accuracy for

Conclusion

In this paper, we have proposed a novel activation function for deep learning architectures, referred to as DReLU. The results showed that DReLU presented better learning speed than all the alternative activation functions, including ReLU, in all models and datasets. Moreover, the experiments showed that DReLU was more accurate than ReLU in all situations. In addition, DReLU also outperformed the test accuracy results of all the other investigated activation functions (LReLU, PReLU, and ELU) in all scenarios

References (42)

  • K. Simonyan et al. (2014). Very deep convolutional networks for large-scale image recognition. CoRR.
  • N. Srivastava et al. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.
  • Zagoruyko, S., & Komodakis, N. (2016). Wide residual networks....
  • Zagoruyko, S., & Komodakis, N. (2017). Diracnets: Training very deep neural networks without...
  • S.-i. Amari (1998). Natural gradient works efficiently in learning. Neural Computation.
  • D.-A. Clevert et al. (2015). Fast and accurate deep network learning by exponential linear units (ELUs). CoRR.
  • W. J. Conover et al. (1979). On multiple-comparisons procedures. Proceedings of the Joint Statistical Meetings, Houston, Texas, August.
  • Cunningham, R. J., Harding, P. J., & Loram, I. D. (2017). The application of deep convolutional neural networks to...
  • X. Glorot et al. (2011). Deep sparse rectifier neural networks. AISTATS '11: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics.
  • K. He et al. (2016). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • K. He et al. (2016). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision.