Pattern Recognition (Elsevier)

Volume 91, July 2019, Pages 391-403

Filter-in-Filter: Low Cost CNN Improvement by Sub-filter Parameter Sharing

https://doi.org/10.1016/j.patcog.2019.01.044

Highlights

  • We defined the sub-filters of a filter and visualized them to verify that these sub-filters can recognize multiple meaningful patterns.

  • Filter-in-Filter was proposed to make full use of the sub-filters to enhance the expressibility of the filters in CNNs.

  • Filter-in-Filter does not increase the number of parameters and increases the computational cost only slightly compared with the standard convolution. We verified through extensive experiments that FIF effectively improves the performance of CNNs.

Abstract

Increasing the number of parameters, e.g. by increasing the depth or width of the networks, has generally improved convolutional neural networks (CNNs). In this paper, we propose a scheme to improve CNNs by deriving six sub-filters from each filter; these sub-filters share parameters with the filter and enhance its expressibility. We first defined the sub-filters of a filter, and by visualizing a well-trained CNN, we verified that these sub-filters could recognize multiple meaningful patterns with different visual characteristics, even when the filter containing them was not activated. These findings revealed that a filter has the potential to recognize multiple patterns. Inspired by these findings, we proposed the filter-in-filter (FIF) scheme to enhance the expressibility of a filter by making full use of its sub-filters to recognize multiple meaningful sub-patterns. We verified the effectiveness of FIF on three image classification benchmark datasets, namely Tiny ImageNet, CIFAR-100 and ImageNet. Our experimental results showed that our models achieved consistent improvement over the base CNNs on the benchmark datasets, e.g. AlexNet and VGG16 with FIF achieved approximately 1% improvement on ImageNet. The sub-filters share their parameters and most of their computational cost with the filter containing them; therefore, FIF does not increase the number of parameters and increases the computational cost only slightly.

Introduction

Convolutional neural networks (CNNs) were proposed in 1989 [1] for handwritten digit recognition and achieved considerable success on ImageNet in 2012 [2]. Inspired by this success, scholars have since used CNNs for most object recognition tasks [3], [4], [5] and object detection tasks [6], [7], [8], [9]. Continuous effort has been made to improve the performance of CNNs, e.g. by increasing their depth (the number of layers) [10], [11], [12], [13], increasing their width (the number of filters per layer) [14], [15], or increasing the complexity of the building blocks [16], [17], among other directions [18], [19], [20]. Our proposed scheme belongs to the direction that increases the complexity of the building blocks; however, it is low-cost because it is designed around parameter sharing.

The standard convolution layer uses regular filters of a fixed size, e.g. 3 × 3, 5 × 5 or 7 × 7. However, fixed-size filters are not flexible enough to adapt to the input data. For example, a 5 × 5 filter that recognizes a frontal face may not be able to recognize a face with occlusion (e.g. one wearing a face mask), because the occlusion is noise that disturbs the recognition. Another disadvantage of fixed-size filters is that they may waste parameters. It is well known that the filters in standard convolutional networks are sparse. For example, if a 5 × 5 filter recognizes a horizontal stick, it may have non-zero values in only one row, while the parameters of the other rows will be close to zero and thus wasted. If we could use these wasted parameters to represent, say, a vertical stick, we might improve the CNN. With this motivation, we derive the concept of sub-filters from normal filters to extend the flexibility of the filters.

We can derive sub-filters because several patterns that share common components can be grouped together and represented by one filter. We denote these patterns as sub-patterns, which can be represented by filters that share parameters. First, we explain the concepts of pattern and sub-pattern as follows:

  • 1.

    Pattern: A pattern is a describable regularity in the world or in a man-made design in an image recognition task, which contains several discriminative components.

  • 2.

    Sub-pattern: A sub-pattern is a describable regularity that shares discriminative components with a specific pattern.

Let us give an intuitive example of sub-patterns, as illustrated in Fig. 1, where we assume the pattern of a dog face contains the set of components {left ear, right ear, left eye, right eye, nose}. Representing a pattern as a set of components has been proposed in conventional computer vision algorithms, e.g. the bag-of-words model [21], [22]. Subsets of these components can then represent other dog-face patterns with characteristics different from the original pattern. For example, the subset {left eye, right eye, nose} may represent the pattern of frontal dog faces (sub-pattern 1 in Fig. 1), the subsets {left eye, nose} and {right eye, nose} may represent dog faces with different poses (sub-patterns 2 and 3 in Fig. 1), and the subset {nose} may represent dog faces whose eyes are occluded by long facial hair (sub-pattern 4 in Fig. 1). In conclusion, we define a sub-pattern as a pattern represented by a subset of the components of another pattern.

Because the sub-patterns share components with the original pattern, we derive the filters that recognize those sub-patterns from the original filter; we denote these as sub-filters in this paper. Corresponding to the description of sub-patterns, the sub-filters share parameters with the original filter. Specifically, we define the sub-filters of a filter as follows: (1) first, spatially decompose the filter into a set of components, as illustrated in Fig. 3; (2) then, define the sub-filters of the filter as the subsets of these components. For example, in Fig. 3, sub-filter 1 and sub-filter 2 are 1 × 3 filters in different orientations, each composed of a subset of the parameters of the original filter.
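As an illustration of this decomposition, the following NumPy sketch slices two sub-filters (the middle row and the middle column) out of a toy 3 × 3 filter. The exact component decomposition follows the paper's Fig. 3, so the particular slices chosen here are our assumption. Note that NumPy slices are views into the original array, so the sub-filters literally share parameters with the filter:

```python
import numpy as np

def sub_filters(K):
    """Return two illustrative sub-filters of a 3x3xC filter K:
    its middle row (a 1x3 filter) and its middle column (a 3x1 filter).
    NumPy slices are views, so both reuse K's parameters."""
    row = K[1:2, :, :]  # 1 x 3 x C horizontal sub-filter
    col = K[:, 1:2, :]  # 3 x 1 x C vertical sub-filter
    return row, col

K = np.arange(9, dtype=float).reshape(3, 3, 1)  # toy 3x3 filter, C = 1
row, col = sub_filters(K)
# row.shape == (1, 3, 1), col.shape == (3, 1, 1), and both
# np.shares_memory(row, K) and np.shares_memory(col, K) hold.
```

Because the slices are views, any gradient update to the filter's parameters is simultaneously an update to its sub-filters, which is the sense in which the scheme adds no parameters.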

Because the sub-filters share parameters with a filter, one may ask whether these sub-filters recognize the same pattern as the original filter. We therefore have to verify that a sub-filter recognizes a sub-pattern different from the pattern of the original filter. Each sub-filter can be visualized using standard filter visualization techniques to show the pattern it recognizes. To verify that the sub-filters recognize different patterns, we adopted visualization techniques based on top-9 image patches [23] and gradient images [24] to visualize the sub-filters of a well-trained CNN, as covered in Section 4. From the visualization results, we found that the sub-filters of a filter can help recognize meaningful sub-patterns with different characteristics, which may be useful for improving the performance of CNNs. Another interesting finding is that these sub-filters can be activated even when the filter containing them is not activated. We show one such example in Fig. 2, in which we visualized the sub-filters of one neuron of VGG16. The experiment proceeds as follows: (1) first, we judge whether the filter is activated (it is activated when its output is greater than 0, and not activated otherwise); (2) then, we visualize the sub-filters under the condition that the original filter is not activated. In Fig. 2, the filter is not activated and is marked with the blue blank patch, while two of its sub-filters are activated and help recognize (1) the tree-trunk pattern and (2) the tree pattern (visualized using top-9 image patches and gradient images from guided backpropagation; refer to Section 4).
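The activation check above can be reproduced in a few lines. The toy example below is ours, not from the paper: a horizontal-stick filter whose full response on a uniform patch is suppressed by the ReLU, while its middle-row sub-filter still fires:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Toy 3x3 horizontal-stick detector (channel dimension omitted for brevity).
K = np.array([[-1., -1., -1.],
              [ 1.,  1.,  1.],
              [-1., -1., -1.]])
patch = np.ones((3, 3))  # uniform input patch

full_response = relu(np.sum(patch * K))              # relu(-3) = 0.0: filter not activated
sub_response = relu(np.sum(patch[1, :] * K[1, :]))   # relu(3)  = 3.0: sub-filter activated
```

This mirrors the Fig. 2 observation: the sub-filter carries a usable response precisely in cases where the full filter's output is zeroed out.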

On the basis of the finding that the sub-filters of a filter can detect meaningful sub-patterns, we proposed a scheme to improve CNNs, called filter-in-filter (FIF), which explicitly leverages these sub-filters to enhance the expressibility of each filter. A CNN with FIF has the same number of parameters as the base CNN, while increasing the potential number of filters per layer. The FIF scheme is inherently different from architectures that add filters to a layer and thereby increase the memory and computational cost [14], [25]. Finally, we conducted extensive image classification experiments on three benchmark datasets, namely Tiny ImageNet, CIFAR-100 and ImageNet ILSVRC12, which verified that CNNs with FIF consistently outperform the base CNNs. Through detailed analysis, we showed that a filter under FIF can recognize multiple patterns. As the sub-filters share their parameters and most of their computational cost with the filter containing them, FIF does not increase the number of parameters and increases the computational cost only slightly, while effectively improving the performance of CNNs.

In summary, our contributions are as follows:

  • 1.

    We defined the sub-filters of a filter and proposed a visualization of these sub-filters. From these visualizations, we inferred that the sub-filters of a filter can help recognize various sub-patterns, including patterns with different scales/orientations and patterns from different object categories, which can be useful for improving CNNs.

  • 2.

    Inspired by the visualization results showing the various sub-patterns recognized by the sub-filters, we proposed a new convolution scheme called filter-in-filter (FIF) that explicitly utilizes the responses of these sub-filters to enhance the expressibility of filters. By sharing parameters, FIF improves CNNs at low cost, without increasing the number of parameters and with only a slight increase in computational cost.

  • 3.

    We verified the proposed FIF extensively on three image classification benchmark datasets, namely Tiny ImageNet, CIFAR-100 and ImageNet ILSVRC12, and our models achieved consistently improved results compared with CNNs using the standard convolution.

The rest of this paper is organized as follows. Section 2 introduces related works. Section 3 defines the sub-filters of a filter. In Section 4, we visualize the sub-filters and demonstrate that they recognize various meaningful patterns. Inspired by the visualization results, Section 5 proposes a new convolutional scheme called FIF that uses sub-filters to enhance the expressibility of each filter. Section 6 verifies the proposed FIF through extensive experiments, and Section 7 concludes the paper with discussions.

Section snippets

Related works

Network architecture design. Network architecture design is currently a popular research topic, and there are many empirical directions for improving CNNs. One direction is increasing the depth (the number of layers), e.g. the 19-layer VGGNet [10], 22-layer GoogLeNet [11] and 152-layer ResNet [12]. Increasing the depth causes the gradient-vanishing problem during training, which can be relieved by introducing skip-connection structures [12], [13], [26].

Definition of sub-filter

Filters are the basic units of CNNs, and recognize discriminative patterns by aggregating responses from their previous layer in a convolutional manner using the learned weights. Formally, a filter is parameterized by a three-dimensional tensor K ∈ ℝ^{U×V×C}, where U × V is the spatial size of the filter and C is the number of channels. Similarly, the responses of the previous layer are organized as a three-dimensional tensor denoted by X ∈ ℝ^{W×H×C}, where W × H is the spatial size of the feature maps.
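In code, the response of such a filter at each spatial position is an inner product between K and the corresponding U × V × C window of X (cross-correlation, as CNN frameworks implement "convolution"). A minimal NumPy sketch of this definition:

```python
import numpy as np

def conv_response(X, K):
    """Valid 'convolution' (cross-correlation, as in CNN frameworks) of an
    input X in R^{W x H x C} with a filter K in R^{U x V x C}."""
    W, H, C = X.shape
    U, V, _ = K.shape
    out = np.empty((W - U + 1, H - V + 1))
    for w in range(W - U + 1):
        for h in range(H - V + 1):
            # Inner product of K with the U x V x C window at (w, h).
            out[w, h] = np.sum(X[w:w + U, h:h + V, :] * K)
    return out

X = np.ones((5, 5, 2))     # W = H = 5, C = 2
K = np.ones((3, 3, 2))     # U = V = 3
out = conv_response(X, K)  # every entry equals 3 * 3 * 2 = 18
```

A sub-filter's response reuses part of the same inner product, which is why it costs almost nothing beyond the full filter's computation.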

Visualization of sub-filter

After defining the sub-filters from a filter, in this section, we visualize the patterns recognized by these sub-filters in a well-trained CNN model. The CNN model follows the architecture of VGG-16 [10], which contains 13 convolutional layers organized in 5 groups separated by 4 max-pooling layers; a layer is indexed by the group ID and the layer ID in the group, e.g. Conv32 denotes the second convolutional layer in the third group. In addition to multi-faceted patterns recognized by a filter

Filter-in-filter

In this section, we propose a new convolutional scheme called filter-in-filter (FIF) to explicitly use the sub-filters in a filter.
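The body of this section is not included in the snippet, so the sketch below is only one plausible realization of the idea: it combines the full-filter and sub-filter responses by a maximum before the nonlinearity, which is our assumption and not necessarily the paper's exact FIF formulation. The sub-filters are slices of the same kernel K, so no extra parameters are introduced:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fif_response(patch, K):
    """Sketch of a filter-in-filter response at one 3x3 spatial position.
    The two sub-filters are the middle row and middle column of K (they
    share K's parameters); combining by max is an assumption."""
    full = np.sum(patch * K)
    row = np.sum(patch[1, :] * K[1, :])  # 1x3 sub-filter response
    col = np.sum(patch[:, 1] * K[:, 1])  # 3x1 sub-filter response
    return relu(max(full, row, col))

# Horizontal-stick filter on a uniform patch: the full response is negative
# (relu gives 0.0), but the middle-row sub-filter fires, so FIF outputs 3.0.
K = np.array([[-1., -1., -1.],
              [ 1.,  1.,  1.],
              [-1., -1., -1.]])
patch = np.ones((3, 3))
```

Under this assumed combination rule, a position that would be silenced by the full filter can still contribute through a sub-filter, which is the behavior the visualization section motivates.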

Experiments

In this section, we evaluate FIF through image classification experiments conducted on three widely used benchmark datasets, namely Tiny ImageNet, CIFAR-100 [38] and ImageNet ILSVRC2012 [39]. We then analyze the activations of the sub-filters in the FIF scheme. Our implementation is derived from the publicly available C++ Caffe toolbox [40].

Conclusion

In this study, we first defined the sub-filters of a filter and then visualized these sub-filters. The visualization results showed that these sub-filters of a filter could recognize various patterns of different scales and orientations and even patterns from completely different categories. Note that these sub-filters could be activated even when the full filter was not activated. Inspired by the fact that the sub-filters could recognize more patterns, we proposed a new convolutional scheme

Acknowledgment

This work was supported by the NSFC (61573387, 61876104), and Guangzhou science and technology project (No. 201604046018). This project was partially supported by the Key Areas Research and Development Program of Guangdong Grant 2018B010109007.

Guotian Xie received the BS degree in Automation Engineering in 2013 from Sun Yat-sen University, China. He received his PhD degree in Computer Science and Technology from Sun Yat-sen University in 2018. His research interests are deep learning and pattern recognition, especially deep learning applied to computer vision, deep network visualization, and structure optimization of deep networks.

References (43)

  • S. Ren et al., Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition (2014)
  • C. Szegedy et al., Going deeper with convolutions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • K. He et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • G. Huang et al., Densely connected convolutional networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • S. Zagoruyko et al., Wide residual networks, British Machine Vision Conference (BMVC) (2016)
  • T. Zhang et al., Interleaved group convolutions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • M. Lin et al., Network in network, International Conference on Learning Representations (ICLR) (2014)
  • C. Szegedy et al., Rethinking the inception architecture for computer vision, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • J. Gu et al., Recent advances in convolutional neural networks, Pattern Recognit. (2017)
  • X. Zhang et al., Accelerating very deep convolutional networks for classification and detection, IEEE Trans. Pattern Anal. Mach. Intell. (2016)

Kuiyuan Yang received the B.E. and Ph.D. degrees in automation from the University of Science and Technology of China, Hefei, China, in 2007 and 2012, respectively. He is currently with DeepMotion, Beijing, China. His current research interests include computer vision, deep learning and autonomous driving. He was the recipient of the Best Paper Award at the International Multimedia Modelling Conference 2010.

Jianhuang Lai received the Ph.D. degree in mathematics in 1999 from Sun Yat-Sen University, China. He joined Sun Yat-Sen University in 1989 as an assistant professor, where he is currently a Professor of the School of Data and Computer Science. His current research interests are in the areas of digital image processing, pattern recognition, multimedia communication, multiple target tracking, and wavelet and its applications. He has published over 100 scientific papers in international journals and conferences on image processing and pattern recognition, e.g., IEEE TPAMI, IEEE TNN, IEEE TKDE, IEEE TIP, IEEE TSMC (Part B), IEEE TCSVT, Pattern Recognition, ICCV, CVPR, ECCV, and ICDM. He serves as a Vice Director of the Image and Graphics Association of China. He is a senior member of the IEEE.
