Enhancing batch normalized convolutional networks using displaced rectifier linear units: A systematic comparative study
Introduction
The recent advances in deep learning research have produced more accurate image, speech, and language recognition systems and generated new state-of-the-art machine learning applications in a broad range of areas such as mathematics, physics, healthcare, genomics, finance, business, and agriculture.
For example, Nezhad, Sadati, Yang, and Zhu (2019) proposed an expert system for prostate cancer treatment recommendations based on deep learning. In another case, Ullah, Hussain, ul Haq Qazi, and Aboalsamh (2018) used deep learning to design an automated intelligent system for epilepsy detection. Moreover, Adem (2018) built an expert intelligent system for diabetic retinopathy detection also using deep learning models.
Deep learning based expert systems have also been proposed in areas such as finance, to forecast stock market crises (Chatzis, Siakoulis, Petropoulos, Stavroulakis, & Vlachogiannakis, 2018) or to perform market analysis and prediction (Chong, Han, & Park, 2017). Naturally, deep learning intelligent systems have been deployed in computer science as well (Do, Prasad, Maag, & Alsadoon, 2019; Park, Oh, & Kim, 2017; Yousefi-Azar & Hamey, 2017).
Hence, considering the large and fast-growing number of expert and intelligent systems that are, or will be, based on deep learning, proposing novel fundamental techniques that increase the performance of such systems across the board is of vital importance: an advance at this level can improve the performance of all such systems at once.
Although advances have been made, accuracy enhancements have usually demanded considerably deeper or more complex models, which tend to increase the required computational resources (processing time and memory usage).
Instead of increasing the depth or complexity of deep models, a less computationally expensive approach to enhancing deep learning performance across the board is to design more efficient activation functions. Even when computational resources are not an issue, employing enhanced activation functions still contributes to faster learning and higher accuracy.
Indeed, by allowing the training of deep neural networks, the discovery of Rectified Linear Units (ReLU) (Glorot, Bordes, & Bengio, 2011; Krizhevsky, Sutskever, & Hinton, 2012; Nair & Hinton, 2010) was one of the main factors that contributed to the advent of deep learning. ReLU made it possible to achieve higher accuracy in less time by avoiding the vanishing gradient problem (Hochreiter, 1991). Before ReLU, saturating activation functions such as the Sigmoid and the Hyperbolic Tangent were unable to train deep neural networks: lacking the identity mapping for positive inputs, they let gradients vanish as they propagate through many layers.
However, ReLU presents drawbacks. For example, some researchers argue that its zero slope for negative inputs prevents learning from negative activations (He, Zhang, Ren, & Sun, 2016b; Maas, Hannun, & Ng, 2013). Therefore, other activation functions such as the Leaky Rectifier Linear Unit (LReLU) (Maas et al., 2013), the Parametric Rectifier Linear Unit (PReLU) (He et al., 2016b), and the Exponential Linear Unit (ELU) (Clevert, Unterthiner, & Hochreiter, 2015) were proposed. Unfortunately, there is no consensus on how these nonlinearities compare to ReLU, which therefore remains the most used activation function in deep learning.
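The alternative rectifiers mentioned above can be sketched with their standard textbook definitions. The NumPy sketch below is illustrative only: the default parameter values are assumptions (not the paper's settings), and PReLU's negative slope is in practice a parameter learned during training, passed in explicitly here.

```python
import numpy as np

def lrelu(x, alpha=0.01):
    # Leaky ReLU: small fixed slope alpha for negative inputs.
    # alpha = 0.01 is a common illustrative choice, an assumption here.
    return np.where(x > 0, x, alpha * x)

def prelu(x, a):
    # Parametric ReLU: the negative slope `a` is learned during training;
    # for this sketch it is supplied as an ordinary argument.
    return np.where(x > 0, x, a * x)

def elu(x, alpha=1.0):
    # ELU: smooth exponential saturation toward -alpha for negative inputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(lrelu(x).tolist())               # [-0.02, -0.005, 0.0, 1.5]
print(prelu(x, 0.25).tolist())         # [-0.5, -0.125, 0.0, 1.5]
print(np.round(elu(x), 4).tolist())    # [-0.8647, -0.3935, 0.0, 1.5]
```

All three keep the identity for positive inputs and differ only in how they treat negative inputs, which is precisely where they diverge from ReLU.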
Similar to activation functions, batch normalization (Ioffe & Szegedy, 2015) currently plays a fundamental role in training deep architectures. This technique normalizes the inputs of each layer, which is equivalent to normalizing the outputs of the previous layer. However, before being used as inputs to the subsequent layer, the normalized data are typically fed into activation functions (nonlinearities), which necessarily distort the normalized distributions. In fact, ReLU produces only non-negative activations, which is harmful to the previously normalized data: the mean of the outputs after ReLU is no longer zero but necessarily positive. ReLU therefore skews the normalized distribution.
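This skew is easy to demonstrate numerically. The minimal sketch below (plain NumPy, not the paper's Torch code) normalizes synthetic activations to zero mean and unit standard deviation and then applies ReLU; for a standard normal input, the post-ReLU mean is approximately 1/sqrt(2*pi) ≈ 0.4 rather than zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=100_000)

# Batch-normalization-style step: zero mean, unit standard deviation.
x_norm = (x - x.mean()) / x.std()

# ReLU zeroes every negative activation, so the output mean
# becomes necessarily positive.
relu_out = np.maximum(x_norm, 0.0)

print(abs(x_norm.mean()) < 1e-9)  # True: normalized input is centered
print(relu_out.mean() > 0.0)      # True: ReLU skewed the distribution
```

The same effect occurs after every batch normalization layer followed by ReLU, which motivates looking for nonlinearities less damaging to the normalized statistics.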
Aiming to mitigate this problem, we concentrate on the interaction between activation functions and batch normalization. We conjecture that nonlinearities that are more compatible with batch normalization achieve higher performance. Furthermore, since an identity transformation preserves any statistical distribution, we assume that extending the identity function from the first quadrant into the third implies less damage to the normalization procedure.
Hence, we investigate and propose the Displaced Rectifier Linear Unit (DReLU) activation function, which partially prolongs the identity function beyond the origin. DReLU is essentially a ReLU displaced diagonally into the third quadrant. Unlike all the previously mentioned activation functions, the inflection of DReLU does not happen at the origin but in the third quadrant.
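A diagonal displacement by some δ > 0 moves the ReLU inflection point from the origin to (-δ, -δ), so the function can be sketched as max(x, -δ). In the NumPy sketch below, δ = 0.05 is an illustrative assumption, not necessarily the value used in the paper's experiments.

```python
import numpy as np

def relu(x):
    # Standard ReLU: identity for positive inputs, zero otherwise.
    return np.maximum(x, 0.0)

def drelu(x, delta=0.05):
    # DReLU sketch: the ReLU inflection is displaced diagonally to
    # (-delta, -delta), prolonging the identity past the origin into
    # the third quadrant. delta = 0.05 is an illustrative assumption.
    return np.maximum(x, -delta)

x = np.array([-1.0, -0.05, -0.01, 0.0, 2.0])
print(relu(x).tolist())   # [0.0, 0.0, 0.0, 0.0, 2.0]
print(drelu(x).tolist())  # [-0.05, -0.05, -0.01, 0.0, 2.0]
```

Note that for inputs in (-δ, 0], DReLU still behaves as the identity, whereas ReLU already clamps them to zero; this is the sense in which DReLU preserves more of a normalized distribution.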
Considering their widespread adoption and practical importance, we used Convolutional Neural Networks (CNN) (Krizhevsky, Sutskever, & Hinton, 2012; LeCun, Bottou, Bengio, & Haffner, 1998) in our experiments. Moreover, as particular examples of CNN architectures, we used previous ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners: the Visual Geometry Group network (VGG) (Simonyan & Zisserman, 2014) and Residual Networks (ResNets) (He, Zhang, Ren, & Sun, 2016a; 2016c).
These architectures have distinctive designs and depths, which promotes the generality of the conclusions of this work. In this regard, we evaluated how replacing the activation function impacts the performance of well-established, widely used state-of-the-art models. Finally, we employed two of the computer vision datasets most broadly used by the deep learning research community: CIFAR-100 and CIFAR-10 (Krizhevsky, 2009).
Performance assessments were carried out using statistical tests with a 5% significance level. Each experiment was run at least ten times; when the mentioned significance was not achieved, ten additional runs were performed.
Background
Currently, all major activation functions adopt the identity transformation for positive inputs, some particular function for negative inputs, and an inflection at the origin. In the following subsections, we describe the compared activation functions.
Displaced rectifier linear units
In machine learning, normalizing the distribution of the input data decreases the training time and improves test accuracy (Tax & Duin, 2002). Consequently, normalization also improves neural network performance (LeCun, Bottou, Orr, & Müller, 2012). A standard approach to normalizing input data distributions is the mean-standard technique: the input data are transformed to have zero mean and unit standard deviation.
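The mean-standard technique amounts to subtracting each feature's mean and dividing by its standard deviation, as in this minimal NumPy sketch (the toy design matrix is invented for illustration):

```python
import numpy as np

# Toy design matrix: rows are samples, columns are features with
# very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Mean-standard normalization per feature (column):
# zero mean, unit standard deviation.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_norm.mean(axis=0), 0.0))  # True
print(np.allclose(X_norm.std(axis=0), 1.0))   # True
```

Batch normalization applies this same transformation inside the network, per layer and per mini-batch, rather than once on the raw inputs.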
However, if instead of working with shallow machine learning models, we are
Experiments
The experiments were executed on a machine configured with an Intel(R) Core(TM) i7-4790K CPU, 16 GB RAM, 2 TB HD, and a GeForce GTX 980Ti card. The operational system was Ubuntu 14.04 LTS with CUDA 7.5, cuDNN 5.0, and Torch 7 deep learning library.
Results and discussion
In the following subsections, we analyze the tested scenarios. In each case, we first discuss the learning speed of the activation functions based on the test accuracy of the partially trained models. Subsequently, we comment on the test accuracy performance of the activation functions, which corresponds to the respective model's test accuracy evaluated after 100 epochs. Naturally, we consider that an activation function presents better test accuracy if it showed the higher test accuracy for
Conclusion
In this paper, we have proposed a novel activation function for deep learning architectures, referred to as DReLU. The results showed that DReLU presented better learning speed than all the alternative activation functions, including ReLU, in all models and datasets. Moreover, the experiments showed DReLU was more accurate than ReLU in all situations. Besides, DReLU also outperformed the test accuracy results of all other investigated activation functions (LReLU, PReLU, and ELU) in all scenarios
References (42)

Exudate detection for diabetic retinopathy with circular hough transformation and convolutional neural networks. Expert Systems with Applications (2018).
Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Systems with Applications (2017).
Deep learning with adaptive learning rate using laplacian score. Expert Systems with Applications (2016).
Forecasting stock market crisis events using deep and statistical machine learning techniques. Expert Systems with Applications (2018).
Deep learning networks for stock market analysis and prediction: Methodology, data representations, and case studies. Expert Systems with Applications (2017).
Deep learning for aspect-based sentiment analysis: A comparative review. Expert Systems with Applications (2019).
Designing architectures of convolutional neural networks to solve practical problems. Expert Systems with Applications (2018).
Karpathy, A. (2017). Convolutional neural networks for visual recognition. URL:...
A structure-enriched neural network for network embedding. Expert Systems with Applications (2019).
Shah, A., Kadam, E., Shah, H., & Shinde, S. (2016). Deep residual networks with exponential linear unit....