Information Sciences

Volume 579, November 2021, Pages 685-699

A fusing framework of shortcut convolutional neural networks

https://doi.org/10.1016/j.ins.2021.08.030

Highlights

  • We integrate features from different layers to present a fusing framework.

  • The framework is applied to gender classification, texture classification, digit recognition, and object recognition.

  • These classification and recognition tasks can be helpful in real life.

  • We introduce real-world data that researchers can use with the fused framework.

  • Our fused framework is successful in learning task-specific features in computer vision.

Abstract

Convolutional neural networks (CNNs) have proven to be very successful in learning task-specific computer vision features. To integrate features from different layers in standard CNNs, we present a fusing framework of shortcut convolutional neural networks (S-CNNs). This framework can fuse arbitrary-scale features by adding weighted shortcut connections to the standard CNNs. Besides the framework, we propose a shortcut indicator (SI), a binary string that stands for a specific S-CNN shortcut style. Additionally, we design a learning algorithm for the proposed S-CNNs. Comprehensive experiments are conducted to compare their performance with standard CNNs on multiple benchmark datasets for different visual tasks. Empirical results show that if we choose an appropriate fusing style of shortcut connections with learnable weights, S-CNNs can perform better than standard CNNs regarding accuracy and stability across different activation functions, pooling schemes, initializations, and occlusions. Moreover, S-CNNs are competitive with ResNets and can outperform GoogLeNet, DenseNets, Multi-scale CNN, and DeepID.

Introduction

With the advances driven by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), CNNs have quickly attracted increasing attention in the field of image recognition [2]. In fact, CNNs have been widely used to extract excellent features with great improvement over hand-crafted features for many problems, including image classification [1], [3], [4], object detection [5], [6], semantic segmentation [8], [7], face recognition and identification [9], [13], [14], and large-scale image classification [15], [16], [17].

In general, a standard CNN alternates several convolutional layers (CLs) with pooling layers (PLs), followed by one or more fully-connected layers (FCLs). These layers are also called hidden layers. To exploit an image's 2-dimensional structure, a CNN is designed with local connections and learnable weights, providing robust invariance to translation. Meanwhile, it is easier to train, with far fewer parameters than a standard multilayer perceptron with the same number of hidden units [18], [11], [10]. Traditionally, each layer in a CNN is only allowed to connect to its next forward layer. Given the features extracted from the topmost convolutional or pooling layer, the final output of the FCLs is independent of the lower layers. Such a CNN can work well in many situations. However, as the CNN goes deeper and deeper, it becomes more challenging to train because of vanishing and exploding gradients [12].
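
To make this structure concrete, the following is a minimal PyTorch sketch of such an alternating network (the layer counts and sizes are illustrative assumptions for a 28×28 single-channel input, not the configurations used in the paper):

    import torch
    import torch.nn as nn

    class StandardCNN(nn.Module):
        """Plain CNN: each layer is connected only to its next forward layer."""
        def __init__(self, num_classes=10):
            super().__init__()
            self.c1 = nn.Conv2d(1, 16, kernel_size=5, padding=2)    # CL1
            self.p1 = nn.MaxPool2d(2)                                # PL1
            self.c2 = nn.Conv2d(16, 32, kernel_size=5, padding=2)    # CL2
            self.p2 = nn.MaxPool2d(2)                                # PL2
            self.fc = nn.Linear(32 * 7 * 7, num_classes)             # FCL

        def forward(self, x):
            x = self.p1(torch.relu(self.c1(x)))    # only forward connections
            x = self.p2(torch.relu(self.c2(x)))
            return self.fc(x.flatten(1))           # the FCL sees only the topmost features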

Many attempts have been made to overcome this kind of difficulty, one of which is to add shortcut connections to CNNs. Shortcut connections are connections that cross one or more layers [19]. For example, He et al. [19] proposed deep residual networks (ResNets), which can be successfully trained even with more than 1,000 layers, using identity shortcut connections to bypass signals over two or three layers. With ResNets, they won first place in image classification, object detection, and object localization in ILSVRC, as well as in Common Objects in Context (COCO) detection and segmentation [19]. Huang et al. [20] presented densely-connected CNNs (DenseNets), which allow connections from each layer to every other layer in a feed-forward fashion. For each layer in a DenseNet, the feature maps of all preceding layers are used as inputs, and its own feature maps are used as inputs to all subsequent layers. Thus, DenseNets can alleviate the vanishing gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.
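
For reference, a minimal sketch of the identity shortcut used in ResNets, where the input bypasses two convolutional layers and is added back with a fixed (unit) weight; this illustrates the cited idea and is not code from either paper:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Identity shortcut: output = F(x) + x, bypassing two convolutional layers."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

        def forward(self, x):
            out = torch.relu(self.conv1(x))
            out = self.conv2(out)
            return torch.relu(out + x)   # fixed-weight addition; no learnable gate on the shortcut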

In ResNets and DenseNets, it has been shown that shortcut connections are essential for deep CNNs to obtain excellent performance. However, these shortcut connections combine features with equal, fixed weights to address the difficulty of training deep networks, ignoring the differences between features at different scales [21]. Here, we present another kind of shortcut connection with learnable weights to integrate features from different layers for robust classification. Accordingly, we obtain a fusing framework of shortcut convolutional neural networks (S-CNNs), which can combine low-level fine features with high-level invariant features to produce better representations, leveraging both the concrete and the abstract [22], [23]. This fusing framework is a multi-scale process, similar to human vision [24].
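
A hedged sketch of this kind of weighted fusion, assuming PyTorch: each selected layer's feature map is flattened, scaled by a learnable scalar weight, and concatenated with the topmost features before the fully-connected layer (the names and the per-branch scalar parameterization are illustrative assumptions, not the paper's exact implementation):

    import torch
    import torch.nn as nn

    class WeightedShortcutFusion(nn.Module):
        """Scale each fused branch by a learnable weight, then concatenate."""
        def __init__(self, num_branches):
            super().__init__()
            self.a = nn.Parameter(torch.ones(num_branches))   # one learnable weight per branch

        def forward(self, feature_maps):
            # feature_maps: list of tensors with shape (batch, C_k, H_k, W_k)
            flat = [a_k * f.flatten(1) for a_k, f in zip(self.a, feature_maps)]
            return torch.cat(flat, dim=1)   # fused vector fed to the fully-connected layer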

There have been some related works on weighted shortcut connections. For example, Sermanet and LeCun applied a multi-scale CNN to the task of traffic sign classification [25] and won first place in the German traffic sign recognition benchmark (GTSRB) competition. In this multi-scale CNN, both the first and the second pooling layers are directly fed to the fully-connected layer through trainable shortcut connections. Moreover, Sun et al. proposed the DeepID network for face verification [26], which allows only the penultimate pooling layer to have trainable shortcut connections. Additionally, Szegedy et al. introduced GoogLeNet with Inception modules in a multi-scale processing framework [16], where an Inception module concatenates several parallel branches into a single output. Besides, He et al. [27] designed a highway network for digit classification, allowing earlier representations to flow unimpeded to later layers through parameterized shortcut connections known as "information highways". The parameters of these shortcut connections are learned to control the amount of information allowed on the "highways".
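
For comparison, a minimal sketch of such a parameterized "information highway", where a learned gate decides how much of the input is carried through unchanged; again an illustration of the cited idea, with assumed names and sizes:

    import torch
    import torch.nn as nn

    class HighwayLayer(nn.Module):
        """y = T(x) * H(x) + (1 - T(x)) * x, with a learned transform gate T."""
        def __init__(self, dim):
            super().__init__()
            self.transform = nn.Linear(dim, dim)   # H(x)
            self.gate = nn.Linear(dim, dim)        # T(x)

        def forward(self, x):
            h = torch.relu(self.transform(x))      # candidate representation
            t = torch.sigmoid(self.gate(x))        # gate values in (0, 1)
            return t * h + (1.0 - t) * x           # carry the remainder of x straight through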

Section snippets

Key contributions

The main difference between our fusing framework and the multi-scale CNN and DeepID lies in that ours is general for fusing multi-scale features, whereas the latter two fuse features of only two scales. Although GoogLeNet concatenates features in the Inception module, our fusing framework fuses multi-scale features from many layers. Unlike ResNets, whose fixed-weight shortcut connections span 2–3 layers, our fusing framework allows learnable weights for shortcut connections that bypass arbitrary …

Framework description

The motivation behind our proposed framework is twofold. First, we systematically compare and analyze the results of different shortcut connections. Second, we provide useful information for choosing appropriate shortcut styles for four popular tasks: gender classification, texture classification, digit recognition, and object recognition. The proposed framework is shown in Fig. 1. Overall, it is an alternating structure of r CLs and r PLs, followed …
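
The abstract introduces the shortcut indicator (SI) as a binary string standing for a specific shortcut style. One plausible reading is a string with one bit per hidden layer, where a 1 means that layer is connected to the fully-connected layer through a weighted shortcut; the small Python sketch below follows that assumption (the exact encoding is defined in the full paper):

    def select_fused_layers(hidden_features, shortcut_indicator):
        """Pick the hidden-layer outputs whose indicator bit is '1'.

        hidden_features    : list of the 2r hidden-layer outputs h_1, ..., h_2r
        shortcut_indicator : binary string such as "0101...", one bit per hidden layer
        """
        assert len(shortcut_indicator) == len(hidden_features)
        fused = [h for h, bit in zip(hidden_features, shortcut_indicator) if bit == "1"]
        if shortcut_indicator[-1] != "1":        # the topmost features are always kept
            fused.append(hidden_features[-1])
        return fused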

Learning algorithm

For the $l$-th sample in a training set $S=\{(x^{l},y^{l})\}$, $1 \le l \le N$, an S-CNN in the fusing framework computes the activations of all CLs and PLs, the FCL, and the actual output as follows:

\[
\begin{aligned}
h_{2k-1,j}^{l} &= f\big(u_{2k-1,j}^{l}\big) = f\Big(\sum_i h_{2k-2,i}^{l} * W_{ij}^{2k-1} + b_j^{2k-1}\Big), && 1 \le k \le r,\\
h_{2k,j}^{l} &= \mathrm{pooling}\big(h_{2k-1,j}^{l}\big), && 1 \le k \le r,\\
h_{\mathrm{con}}^{l} &= \big[a_1 h_1^{l},\, a_2 h_2^{l},\, \ldots,\, a_{2k-1} h_{2k-1}^{l},\, a_{2k} h_{2k}^{l},\, \ldots,\, h_{2r}^{l}\big],\\
h_{2r+1}^{l} &= f\big(u_{2r+1}^{l}\big) = f\big(h_{\mathrm{con}}^{l}\, w^{2r+1} + b^{2r+1}\big),\\
o^{l} &= \mathrm{softmax}\big(u_{2r+2}^{l}\big) = \mathrm{softmax}\big(W^{2r+2}\, h_{2r+1}^{l} + b^{2r+2}\big).
\end{aligned}
\]

Let $y^{l}=\big(y_1^{l},y_2^{l},\ldots,y_C^{l}\big)^{T}$ be the required output, with $y_{j_l}^{l}=1$ for the true class $j_l$ and $y_{k}^{l}=0$ for $k \neq j_l$, and let $o^{l}=\big(o_1^{l},o_2^{l},\ldots,o_C^{l}\big)^{T}$ be the actual output. The …
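
A hedged sketch of this forward computation for r = 2, assuming PyTorch, ReLU activations, and max pooling; the symbols a_k, W, and b mirror the equations above, while the layer sizes (28×28 single-channel input) are illustrative assumptions rather than the paper's settings:

    import torch
    import torch.nn as nn

    class SCNNForward(nn.Module):
        """S-CNN forward pass with r = 2: conv/pool pairs, weighted fusion, FCL, softmax."""
        def __init__(self, num_classes=10):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 16, 5, padding=2)    # W^1, b^1
            self.conv2 = nn.Conv2d(16, 32, 5, padding=2)   # W^3, b^3
            self.pool = nn.MaxPool2d(2)
            self.a = nn.Parameter(torch.ones(3))           # fusing weights a_1, a_2, a_3
            fused_dim = 16*28*28 + 16*14*14 + 32*14*14 + 32*7*7
            self.fc = nn.Linear(fused_dim, 128)            # w^{2r+1}, b^{2r+1}
            self.out = nn.Linear(128, num_classes)         # W^{2r+2}, b^{2r+2}

        def forward(self, x):
            h1 = torch.relu(self.conv1(x))                 # h_1 = f(u_1)
            h2 = self.pool(h1)                             # h_2 = pooling(h_1)
            h3 = torch.relu(self.conv2(h2))                # h_3 = f(u_3)
            h4 = self.pool(h3)                             # h_4 = pooling(h_3)
            h_con = torch.cat([self.a[0] * h1.flatten(1),  # h_con = [a_1 h_1, a_2 h_2, a_3 h_3, h_4]
                               self.a[1] * h2.flatten(1),
                               self.a[2] * h3.flatten(1),
                               h4.flatten(1)], dim=1)
            h5 = torch.relu(self.fc(h_con))                # h_{2r+1} = f(h_con w^{2r+1} + b^{2r+1})
            return torch.softmax(self.out(h5), dim=1)      # o = softmax(W^{2r+2} h_{2r+1} + b^{2r+2})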

Experimental results

CNNs perform very well and achieve higher results than other models on the four tasks of gender classification, texture classification, digit recognition, and object recognition, and each task has popular benchmark datasets for experiments. Therefore, to compare S-CNNs with standard CNNs, we conducted experiments on these four tasks. We have implemented a stochastic variant of Algorithm 1 based on the GPU-accelerated CNN library Caffe [29], initializing the weights by the "Xavier" scheme …
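
The "Xavier" scheme draws each weight from a zero-mean distribution whose spread depends on the layer's fan-in (and, in the Glorot variant, its fan-out); a hedged NumPy sketch of the uniform variant, with illustrative shapes:

    import numpy as np

    def xavier_uniform(fan_in, fan_out, rng=None):
        """Xavier/Glorot uniform init: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
        rng = rng or np.random.default_rng()
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit, size=(fan_out, fan_in))

    W = xavier_uniform(fan_in=800, fan_out=500)   # e.g. the weights of one fully-connected layer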

Conclusions

We have presented a fusing framework of shortcut convolutional neural networks (S-CNNs). An S-CNN fuses multi-scale features through learnable shortcut connections in a standard CNN. Using shortcut indicators, we can conveniently choose a fusing structure for S-CNNs and compare them with standard CNNs on four different tasks: gender classification, texture classification, digit recognition, and object recognition. Based on extensive experiments, we show that S-CNNs produce higher accuracy than …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (61806013, 61876010, 61906005), the Open Project of the Key Laboratory of Oil and Gas Resources Research, Chinese Academy of Sciences (KLOR2018-9), the Project of the Interdisciplinary Research Institute of Beijing University of Technology (2021020101), and the International Research Cooperation Seed Fund of Beijing University of Technology (2021A01).

References (42)

  • J. Long et al., Fully convolutional networks for semantic segmentation.

  • C. Ding et al., Robust face recognition via multimodal deep face representation, IEEE Transactions on Multimedia (2015).

  • S. Tu et al., ModPSO-CNN: an evolutionary convolution neural network with application to visual recognition, Soft Computing (2020).

  • M. Simijanska et al., Multi-level information fusion for learning a blood pressure predictive model using sensor data, Information Fusion (2020).

  • Y. Qin et al., Exploring of alternative representations of facial images for face recognition, International Journal of Machine Learning and Cybernetics (2020).

  • X. Wei et al., Selective multi-descriptor fusion for face identification, International Journal of Machine Learning and Cybernetics (2019).

  • K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proceedings of the ...

  • C. Szegedy et al., Going deeper with convolutions.

  • Y. Chen et al., Dual path networks.

  • Y. LeCun et al., Gradient-based learning applied to document recognition, Proceedings of the IEEE (1998).

  • K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the 29th International ...