A fusing framework of shortcut convolutional neural networks
Introduction
With the advancements in the ImageNet large-scale visual recognition challenge (ILSVRC), CNNs have quickly attracted more and more attention in the field of image recognition [2]. Indeed, CNNs have been widely used to extract features that improve substantially over hand-crafted features for many problems, including image classification [1], [3], [4], object detection [5], [6], semantic segmentation [7], [8], face recognition and identification [9], [13], [14], and large-scale image classification tasks [15], [16], [17].
In general, a standard CNN alternates several convolutional layers (CLs) with pooling layers (PLs), followed by one or more fully-connected layers (FCLs); these layers are also called hidden layers. To exploit an image's 2-dimensional structure, a CNN is designed with local connections and shared learnable weights, which yield robust translation invariance. It is also easier to train, with far fewer parameters than a standard multilayer perceptron with the same number of hidden units [18], [11], [10]. Traditionally, each layer in a CNN connects only to the next layer in the forward direction. Given the features extracted from the topmost convolutional or pooling layer, the final output of the FCLs is independent of the lower layers. Such a CNN works well in many situations. However, as CNNs go deeper and deeper, they become harder to train because of vanishing and exploding gradients [12].
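The alternating CL/PL/FCL structure described above can be sketched in a few lines of NumPy. This is an illustrative toy example of ours (single channel, one CL/PL stage, random weights), not the implementation evaluated in this paper:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a single-channel image with one kernel."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling over size x size windows."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size
    return fmap[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))
fc_weights = rng.standard_normal((9, 10))  # flattened 3x3 pooled map -> 10 classes

# One CL (+ ReLU) and one PL, followed by a fully-connected readout.
features = max_pool2d(np.maximum(conv2d(image, kernel), 0))
logits = features.reshape(-1) @ fc_weights
```

In the traditional design sketched here, only `features` from the topmost pooling layer reaches the FCL; all lower-layer activations are discarded.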
Many attempts have been made to overcome this difficulty, one of which is to add shortcut connections to CNNs. Shortcut connections are connections that cross one or more layers [19]. For example, He et al. [19] proposed deep residual networks (ResNets), which can be trained successfully even with more than 1,000 layers by using identity shortcut connections to bypass signals over two or three layers. With ResNets, they won first place in image classification, object detection, and object localization in ILSVRC, as well as in Common Objects in Context (COCO) detection and segmentation [19]. Huang et al. [20] presented densely-connected CNNs (DenseNets), which connect each layer to every other layer in a feed-forward fashion: each layer takes the feature maps of all preceding layers as input, and its own feature maps are fed into all subsequent layers. Thus, DenseNets alleviate the vanishing gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.
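The identity shortcut at the heart of ResNets can be sketched as follows; this is our minimal illustration, with random fully-connected weights standing in for the convolutional layers that the original blocks use:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, w1, w2):
    """y = F(x) + x: the identity shortcut bypasses two weight layers."""
    return relu(x @ w1) @ w2 + x

rng = np.random.default_rng(1)
x = rng.standard_normal(16)
w1 = rng.standard_normal((16, 16)) * 0.01
w2 = rng.standard_normal((16, 16)) * 0.01

y = residual_block(x, w1, w2)
```

With near-zero weights the block reduces to the identity mapping, which is why very deep stacks of such blocks remain trainable: gradients can always flow through the shortcut.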
ResNets and DenseNets show that shortcut connections are essential for deep CNNs to obtain excellent performance. However, these shortcut connections combine features with fixed, equal weights to address the difficulty of training deep networks, ignoring the differences between features at different scales [21]. Here, we present another kind of shortcut connection, with learnable weights, to integrate features from different layers for robust classification. The result is a fusing framework of shortcut convolutional neural networks (S-CNNs), which combines low-level fine features with high-level invariant features to produce better representations that leverage both the concrete and the abstract [22], [23]. This fusing framework is a multi-scale process, similar to human vision [24].
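A minimal sketch of the learnable-weight fusion idea follows; this is our illustration, in which the scalars in `alphas` stand in for trainable shortcut weights and the feature sizes are arbitrary:

```python
import numpy as np

def fuse_features(features, alphas):
    """Scale each feature vector by its learnable shortcut weight,
    then concatenate the scaled vectors for the fully-connected layer."""
    return np.concatenate([a * f for a, f in zip(alphas, features)])

rng = np.random.default_rng(0)
fine = rng.standard_normal(32)    # low-level, high-resolution features
coarse = rng.standard_normal(8)   # high-level, invariant features

# Unlike fixed-weight shortcuts, the alphas would be updated by backprop,
# letting the network decide how much each scale contributes.
fused = fuse_features([fine, coarse], alphas=[0.7, 1.3])
```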
There have been several related works on weighted shortcut connections. For example, Sermanet and LeCun applied a multi-scale CNN to traffic sign classification [25] and won first place in the German traffic sign recognition benchmark (GTSRB) competition. In this multi-scale CNN, both the first and the second pooling layers are fed directly to the fully-connected layer through trainable shortcut connections. Moreover, Sun et al. proposed the DeepID network for face verification [26], in which only the penultimate pooling layer has trainable shortcut connections. Additionally, Szegedy et al. introduced GoogLeNet with Inception modules in a multi-scale processing framework [16], where an Inception module concatenates several parallel branches into a single output. Besides, He et al. [27] designed a highway network for digit classification, allowing earlier representations to flow unimpeded to later layers through parameterized shortcut connections known as "information highways"; the parameters of these shortcut connections are learned to control the amount of information allowed on the "highways".
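The highway gating rule can be sketched directly from its formula y = T(x)·H(x) + (1 − T(x))·x; the weights, sizes, and gate bias below are our illustrative choices, not values from any cited work:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, wh, wt, bt):
    """y = T(x) * H(x) + (1 - T(x)) * x, with a learned transform gate T."""
    t = sigmoid(x @ wt + bt)                     # how much to transform
    return t * np.tanh(x @ wh) + (1.0 - t) * x   # carry the rest through

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
wh = rng.standard_normal((8, 8))
wt = rng.standard_normal((8, 8))

# A strongly negative gate bias closes the gate, so the layer simply
# passes x through unchanged: the "information highway" behavior.
y_carry = highway_layer(x, wh, wt, bt=-50.0 * np.ones(8))
```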
Section snippets
Key contributions
The main difference between our fusing framework and the multi-scale CNN and DeepID is that ours is a general scheme for fusing multi-scale features, whereas the latter two fuse features of only two scales. Although GoogLeNet concatenates features within an Inception module, our fusing framework fuses multi-scale features from many layers. Unlike ResNets, whose fixed-weight shortcut connections span 2–3 layers, our fusing framework allows learnable weights for shortcut connections that bypass arbitrary
Framework description
The motivation behind our proposed framework is twofold. First, we systematically compare and analyze the results of different shortcut connections. Second, we provide useful information for choosing appropriate shortcut styles for four popular tasks: gender classification, texture classification, digit recognition, and object recognition. The proposed framework is shown in Fig. 1. Overall, it is an alternating structure of r CLs and r PLs, followed
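A sketch of how shortcut indicators might select which pooling layers feed the FCL; this is our illustration, and `indicators` and `alphas` are hypothetical names for the binary selectors and the learnable shortcut weights:

```python
import numpy as np

def s_cnn_fusion(pool_outputs, indicators, alphas):
    """Feed only the pooling layers whose shortcut indicator is 1 to the
    FCL, each selected branch scaled by its learnable shortcut weight."""
    chosen = [a * p for ind, a, p in zip(indicators, alphas, pool_outputs) if ind]
    return np.concatenate(chosen)

rng = np.random.default_rng(0)
# r = 3 pooling layers with progressively smaller (flattened) feature maps.
pools = [rng.standard_normal(n) for n in (64, 16, 4)]

# Indicator (1, 0, 1): fuse the first and last pooling layers, skip the middle.
fused = s_cnn_fusion(pools, indicators=(1, 0, 1), alphas=(0.5, 1.0, 2.0))
```

Toggling the indicators changes the fusing structure without changing the underlying CL/PL stack, which is what makes the framework convenient to search over.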
Learning algorithm
For the l-th sample in a training set, an S-CNN in the fusing framework computes the activations of all CLs and PLs, the FCL, and the actual output as follows:
Let the required output and the actual (real) output of the network be given for each sample. The
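Since the exact objective is not reproduced in this snippet, the following sketch assumes a squared-error loss between the actual and required outputs and a linear output layer; both the loss choice and all names are our assumptions:

```python
import numpy as np

def mse_loss(actual, required):
    """Squared-error objective between the actual and required outputs."""
    return 0.5 * np.sum((actual - required) ** 2)

def fcl_gradient(features, weights, required):
    """Gradient of the loss w.r.t. the FCL weights for one sample,
    assuming a linear output layer: actual = weights @ features."""
    actual = weights @ features
    return np.outer(actual - required, features)
```

For L = ½‖Wf − r‖², the gradient is ∂L/∂W = (Wf − r) fᵀ, which is exactly the outer product computed above; stochastic training would subtract a small multiple of it per sample.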
Experimental results
CNNs perform very well and achieve higher accuracy than other models on the four tasks of gender classification, texture classification, digit recognition, and object recognition, and each task has popular benchmark datasets for experiments. Therefore, to compare S-CNNs with standard CNNs, we conducted experiments on these four tasks. We implemented a stochastic variant of Algorithm 1 based on the GPU-accelerated CNN library Caffe [29], initializing the weights by the "Xavier" scheme
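The "Xavier" scheme can be sketched as below in its Glorot-uniform form; note that Caffe's default "xavier" filler scales by fan-in only, so this is one common variant and not necessarily the exact one used in these experiments:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier initialization: uniform samples in [-limit, limit]
    with limit = sqrt(6 / (fan_in + fan_out)), keeping activation
    variance roughly constant across layers."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

w = xavier_uniform(256, 128, np.random.default_rng(0))
```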
Conclusions
We have presented a fusing framework of shortcut convolutional neural networks (S-CNNs). An S-CNN fuses multi-scale features through learnable shortcut connections in a standard CNN. Through shortcut indicators, we can conveniently choose a fusing structure for S-CNNs. We compared S-CNNs with standard CNNs on four different tasks: gender classification, texture classification, digit recognition, and object recognition. Based on extensive experiments, we show that S-CNNs produce higher accuracy than
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported by the National Natural Science Foundation of China (61806013, 61876010, 61906005), Open project of key laboratory of oil and gas resources research, Chinese academy of sciences (KLOR2018-9), Project of Interdisciplinary Research Institute of Beijing University of Technology (2021020101) and International Research Cooperation Seed Fund of Beijing University of Technology (2021A01).
References (42)
- et al., Feature-level fusion approaches based on multimodal EEG data for depression recognition, Information Fusion (2020)
- et al., IFCNN: A general image fusion framework based on convolutional neural network, Information Fusion (2020)
- et al., Multi-focus image fusion with a deep convolutional neural network, Information Fusion (2017)
- et al., Infrared salient object detection based on global guided lightweight non-local deep features, Infrared Physics & Technology (2021)
- et al., The FERET database and evaluation procedure for face-recognition algorithms, Image and Vision Computing (1998)
- et al., ImageNet classification with deep convolutional neural networks
- et al., Visualizing and understanding convolutional networks
- et al., DeCAF: A deep convolutional activation feature for generic visual recognition
- et al., Rich feature hierarchies for accurate object detection and semantic segmentation
- et al., Faster R-CNN: towards real-time object detection with region proposal networks