Filter-in-Filter: Low Cost CNN Improvement by Sub-filter Parameter Sharing
Introduction
Convolutional neural networks (CNNs) were proposed in 1989 [1] for handwritten digit recognition and achieved considerable success on ImageNet in 2012 [2]. Inspired by this success, researchers have since applied CNNs to most object recognition tasks [3], [4], [5] and object detection tasks [6], [7], [8], [9]. Continuous effort has been made to improve the performance of CNNs, e.g. by increasing their depth (the number of layers) [10], [11], [12], [13], increasing their width (the number of filters per layer) [14], [15], increasing the complexity of the building blocks [16], [17], and other approaches [18], [19], [20]. The scheme proposed in this study belongs to the category that increases the complexity of the building blocks; however, it remains low-cost because it is designed around parameter sharing.
The standard convolution layer uses regular filters of fixed size, e.g. 3 × 3, 5 × 5 or 7 × 7. However, fixed-size filters are not flexible enough to adapt to the input data. For example, a 5 × 5 filter that recognizes a frontal face may fail to recognize a face with occlusion (e.g. one wearing a face mask), because the occlusion acts as noise that disturbs the recognition. Another disadvantage of fixed-size filters is that they may waste parameters. It is well known that the filters in standard convolutional networks are sparse. For example, if a 5 × 5 filter recognizes a horizontal stick, it may have non-zero values in only one row, while the parameters of the other rows remain close to zero and are wasted. If these wasted parameters could instead represent a vertical stick, the CNN could be improved. With this motivation, we derive the concept of sub-filters from normal filters to extend the flexibility of the filters.
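As a toy illustration of this sparsity argument (ours, not from the paper), the following NumPy snippet builds a 5 × 5 "horizontal stick" filter whose 20 off-row parameters are effectively wasted:

```python
import numpy as np

# A 5x5 filter that detects a horizontal stick: only the middle row
# carries significant weights; the remaining 20 parameters are near
# zero and "wasted" in the sense discussed above.
horizontal = np.zeros((5, 5))
horizontal[2, :] = 1.0

patch_h = np.zeros((5, 5)); patch_h[2, :] = 1.0   # horizontal stick input
patch_v = np.zeros((5, 5)); patch_v[:, 2] = 1.0   # vertical stick input

print(np.sum(horizontal * patch_h))  # strong response: 5.0
print(np.sum(horizontal * patch_v))  # weak response: 1.0
```

If the near-zero rows could be reused to also encode a vertical stick, one set of 25 parameters would respond to both orientations; this is the intuition behind sub-filters.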
Sub-filters can be derived because several patterns that share something in common can be grouped together and represented within one filter. We denote these shared regularities as sub-patterns, which can be represented by filters that share parameters. First, we explain the concepts of pattern and sub-pattern as follows:
- 1.
Pattern: A pattern is a describable regularity in the world or in a man-made design that is relevant to the image recognition task and contains several discriminative components.
- 2.
Sub-pattern: A sub-pattern is a describable regularity that shares discriminative components with a specific pattern.
Let us give an intuitive example of sub-patterns, as illustrated in Fig. 1. We assume the pattern of a dog face contains a set of components {left ear, right ear, left eye, right eye, nose}. Representing a pattern as a set of components has been used in conventional computer vision algorithms, e.g. the bag-of-words model [21], [22]. Subsets of these components can then represent other dog-face patterns whose characteristics differ from the original pattern. For example, the subset {left eye, right eye, nose} may represent the pattern of frontal dog faces (sub-pattern 1 in Fig. 1), the subsets {left eye, nose} and {right eye, nose} may represent dog faces in different poses (sub-patterns 2 and 3 in Fig. 1), and the subset {nose} may recognize dog faces whose eyes are occluded by long facial hair (sub-pattern 4 in Fig. 1). In conclusion, we derive the concept of a sub-pattern as follows: a sub-pattern is represented by a subset of the components of a pattern.
Because the sub-patterns share something in common with the original pattern, we derive the filters recognizing these sub-patterns from the original filter; we denote them as sub-filters in this paper. Corresponding to the description of sub-patterns, these sub-filters share parameters with the original filters. Specifically, we define the sub-filters of a filter as follows: (1) first, spatially decompose the filter into a set of components, as illustrated in Fig. 3; (2) then, the sub-filters of the filter are defined as subsets of these components, e.g. in Fig. 3, sub-filters 1 and 2 are 1 × 3 filters in different orientations, each composed of a subset of the parameters of the original filter.
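A minimal sketch of such a decomposition (our illustration; the paper's exact decomposition is given in Fig. 3) extracts 1 × 3 and 3 × 1 sub-filters as slices of a 3 × 3 filter, so the sub-filters literally share the filter's parameters:

```python
import numpy as np

def sub_filters(f):
    """Given a 3x3 filter, return two example sub-filters that reuse
    subsets of its parameters: the centre row (a 1x3 filter) and the
    centre column (a 3x1 filter). The slices are NumPy views, so no
    new parameters are introduced."""
    row = f[1:2, :]   # 1x3 sub-filter sharing f's middle-row weights
    col = f[:, 1:2]   # 3x1 sub-filter sharing f's middle-column weights
    return row, col

f = np.arange(9, dtype=float).reshape(3, 3)
row, col = sub_filters(f)
print(row.shape, col.shape)  # (1, 3) (3, 1)
```

Because the slices are views, updating the original filter's weights during training automatically updates every sub-filter, which is the literal meaning of parameter sharing here.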
Because the sub-filters share parameters with a filter, one may ask whether these sub-filters recognize the same pattern as the original filter. We therefore have to verify that each sub-filter recognizes a sub-pattern different from the pattern of the original filter. In fact, each sub-filter can be visualized using standard filter-visualization techniques to show the pattern it recognizes. To verify that the sub-filters recognize different patterns, we adopted visualization techniques based on top-9 image patches [23] and gradient images [24] to visualize the sub-filters of a well-trained CNN; this is covered in Section 4. From the visualization results, we found that the sub-filters of a filter can help to recognize meaningful sub-patterns with different characteristics, which may be useful for improving the performance of CNNs. Another interesting finding is that these sub-filters can be activated even when the filters containing them are not activated. We show one such example in Fig. 2, in which we visualize the sub-filters of one neuron of VGG16. The experiment proceeds as follows: (1) first, we judge whether the filter is activated (it is activated when its output is larger than 0, and not activated otherwise); (2) then, we visualize the sub-filters under the condition that the original filter is not activated. In Fig. 2, the filter is not activated and is marked with the blue blank patch, while two of its sub-filters are activated and help to recognize (1) the tree-trunk pattern and (2) the tree pattern (visualized using top-9 image patches and gradient images from guided backpropagation; refer to Section 4).
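The activation criterion above can be reproduced on a toy example. Assuming the >0 criterion, the 3 × 3 filter below is not activated by a uniform input patch, while its middle-row sub-filter is:

```python
import numpy as np

# Full 3x3 filter: positive middle row, negative top/bottom rows.
f = np.array([[-1., -1., -1.],
              [ 1.,  1.,  1.],
              [-1., -1., -1.]])
row_sub = f[1:2, :]            # 1x3 sub-filter sharing the middle row

# A uniform-brightness patch: the full filter's positive and negative
# weights cancel to a negative response (not activated under the >0
# criterion), yet the row sub-filter fires.
patch = np.ones((3, 3))
full_resp = np.sum(f * patch)             # -3.0 -> not activated
sub_resp = np.sum(row_sub * patch[1:2])   #  3.0 -> activated
```

This mirrors the situation in Fig. 2: the response of a sub-filter is a partial sum of the full filter's response, so it can be positive even when the full sum is not.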
On the basis of the finding that the sub-filters of a filter can detect meaningful sub-patterns, we propose a scheme to improve CNNs, called filter-in-filter (FIF), which explicitly leverages these sub-filters to enhance the expressibility of each filter. A CNN with FIF has the same number of parameters as the base CNN while increasing the potential number of filters per layer. The FIF scheme is inherently different from architectures that add filters to a layer and thereby increase the memory and computational costs [14], [25]. Finally, we conducted extensive image classification experiments on three benchmark datasets, namely Tiny ImageNet, CIFAR-100 and ImageNet ILSVRC12, which verified that CNNs with FIF consistently outperform the base CNNs. A detailed analysis showed that a FIF filter can recognize multiple distinct patterns. Because the sub-filters share their parameters and most of their computational cost with the filter containing them, FIF does not increase the number of parameters and increases the computational cost only slightly, while effectively improving the performance of CNNs.
In summary, our contributions are as follows:
- 1.
We defined the sub-filters of a filter and proposed a visualization of these sub-filters. From the visualizations, we inferred that the sub-filters of a filter can help to recognize various sub-patterns, including patterns of different scales/orientations and patterns from different object categories, which could be useful for improving CNNs.
- 2.
Inspired by the visualization results showing the various sub-patterns recognized by the sub-filters, we proposed a new convolution scheme called filter-in-filter (FIF) that explicitly utilizes the responses of these sub-filters to enhance the expressibility of filters. By sharing parameters, FIF improves CNNs at low cost: it does not increase the number of parameters and increases the computational cost only slightly.
- 3.
We verified the proposed FIF extensively on three image classification benchmark datasets, namely Tiny ImageNet, CIFAR-100 and ImageNet ILSVRC12, and our models achieved consistently improved results compared to CNNs with standard convolution.
The rest of this paper is organized as follows: In Section 2, we introduce some related works. Section 3 defines the sub-filters of a filter. In Section 4, we visualize the sub-filters and demonstrate that various meaningful patterns are recognized by these sub-filters. Inspired by the visualization results, in Section 5, we further propose a new convolutional scheme called FIF to use sub-filters to enhance the expressibility of each filter. In Section 6, we verify the proposed FIF through extensive experiments, and conclude the paper with discussions in Section 7.
Related works
Network architecture design. Network architecture design is currently a popular research topic, and many empirical directions have been explored to improve CNNs. One direction is increasing the depth (the number of layers), e.g. the 19-layer VGGNet [10], the 21-layer GoogLeNet [11] and the 152-layer ResNet [12]. Increasing the depth causes the problem of vanishing gradients during training, which can be relieved by introducing skip-connection structures [12], [13], [26].
Definition of sub-filter
Filters are the basic units of CNNs; they recognize discriminative patterns by aggregating responses from the previous layer in a convolutional manner using learned weights. Formally, a filter is parameterized by a three-dimensional tensor of size U × V × C, where U × V is the spatial size of the filter and C is the number of channels. Similarly, the responses of the previous layer are organized as a three-dimensional tensor of size W × H × C, where W × H is the spatial size of the feature maps.
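Using this notation, a naive sketch of convolving one U × V × C filter over the W × H × C responses (valid correlation with unit stride; an illustration of the shapes involved, not the paper's implementation) is:

```python
import numpy as np

U, V, C = 3, 3, 64      # filter spatial size and number of channels
W, H = 32, 32           # spatial size of the previous layer's responses

F = np.random.randn(U, V, C)    # one filter
X = np.random.randn(W, H, C)    # input feature maps

# Each output value aggregates one U x V x C window of X with the
# learned weights F (valid correlation, stride 1).
out = np.empty((W - U + 1, H - V + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = np.sum(F * X[i:i + U, j:j + V, :])
print(out.shape)  # (30, 30)
```

A sub-filter simply restricts the spatial extent of F in this inner sum, so its response is a partial sum of the full filter's response at the same location.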
Visualization of sub-filter
After defining the sub-filters of a filter, in this section we visualize the patterns recognized by these sub-filters in a well-trained CNN model. The CNN model follows the architecture of VGG-16 [10], which contains 13 convolutional layers organized in 5 groups separated by 4 max-pooling layers; a layer is indexed by its group ID and its layer ID within the group, e.g. Conv3-2 denotes the second convolutional layer in the third group. In addition to multi-faceted patterns recognized by a filter
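The top-9 patch visualization referenced above can be sketched as follows; the receptive-field parameter `rf` and the stride-free cropping are simplifying assumptions of ours, not the paper's exact procedure:

```python
import numpy as np

def top_k_patches(image, act_map, rf, k=9):
    """Crop the k image patches whose locations produce the largest
    activations in act_map (the top-9 visualization idea). rf is the
    receptive-field size; stride and padding are ignored here."""
    # Indices of the k largest activations, in descending order.
    flat = np.argsort(act_map.ravel())[::-1][:k]
    coords = np.unravel_index(flat, act_map.shape)
    return [image[i:i + rf, j:j + rf] for i, j in zip(*coords)]

rng = np.random.default_rng(0)
image = rng.random((36, 36))
act = rng.random((30, 30))   # activation map of some (sub-)filter
patches = top_k_patches(image, act, rf=7, k=9)
print(len(patches), patches[0].shape)  # 9 (7, 7)
```

Applying this to a sub-filter's activation map, rather than the full filter's, yields the per-sub-filter patch grids discussed in this section.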
Filter-in-filter
In this section, we propose a new convolutional scheme called filter-in-filter (FIF) to explicitly use the sub-filters in a filter.
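The following is a hypothetical sketch of an FIF-style unit for a single 3 × 3 filter; the centre-row/centre-column decomposition and the max-combination are our illustrative assumptions, while the paper's exact formulation is given in this section:

```python
import numpy as np

def fif_response(f, window):
    """Illustrative FIF-style unit (an assumption, not the paper's
    exact scheme): compute the response of the full 3x3 filter and of
    two parameter-sharing sub-filters, then combine them by a max."""
    full = np.sum(f * window)                   # full 3x3 response
    row = np.sum(f[1:2, :] * window[1:2, :])    # 1x3 sub-filter response
    col = np.sum(f[:, 1:2] * window[:, 1:2])    # 3x1 sub-filter response
    return max(full, row, col)

f = np.array([[-1., -1., -1.],
              [ 1.,  1.,  1.],
              [-1., -1., -1.]])
window = np.ones((3, 3))
print(fif_response(f, window))   # 3.0: the row sub-filter wins
```

Note that the sub-filter responses are partial sums of the full response, so they add almost no extra multiplications; this is consistent with the paper's claim that parameter and computation sharing keep the scheme low-cost.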
Experiments
In this section, we evaluate FIF through image classification experiments conducted on three widely used benchmark datasets, namely Tiny ImageNet, CIFAR-100 [38] and ImageNet ILSVRC2012 [39]. We then analyze the activations of the sub-filters in the FIF scheme. Our implementation is derived from the publicly available C++ Caffe toolbox [40].
Conclusion
In this study, we first defined the sub-filters of a filter and then visualized these sub-filters. The visualization results showed that these sub-filters of a filter could recognize various patterns of different scales and orientations and even patterns from completely different categories. Note that these sub-filters could be activated even when the full filter was not activated. Inspired by the fact that the sub-filters could recognize more patterns, we proposed a new convolutional scheme
Acknowledgment
This work was supported by the NSFC (61573387, 61876104), and Guangzhou science and technology project (No. 201604046018). This project was partially supported by the Key Areas Research and Development Program of Guangdong Grant 2018B010109007.
References

- et al., Learning scale-variant and scale-invariant features for deep image classification, Pattern Recognit. (2017)
- et al., How deep learning extracts and learns leaf features for plant classification, Pattern Recognit. (2017)
- et al., Facial expression recognition with convolutional neural networks: coping with few data and the training sample order, Pattern Recognit. (2017)
- et al., Multi-scale volumes for deep object detection and localization, Pattern Recognit. (2017)
- et al., Learning with rethinking: recurrently improving convolutional neural networks through feedback, Pattern Recognit. (2018)
- et al., Image classification by visual bag-of-words refinement and reduction, Neurocomputing (2016)
- et al., Backpropagation applied to handwritten zip code recognition, Neural Comput. (1989)
- et al., ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems (2012)
- et al., Object detection networks on convolutional feature maps, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
- et al., S-CNN: subcategory-aware convolutional networks for object detection, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
- Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell.
- Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations (ICLR)
- Going deeper with convolutions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Densely connected convolutional networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Wide residual networks, British Machine Vision Conference (BMVC)
- Interleaved group convolutions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Network in network, International Conference on Learning Representations (ICLR)
- Rethinking the inception architecture for computer vision, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Recent advances in convolutional neural networks, Pattern Recognit.
- Accelerating very deep convolutional networks for classification and detection, IEEE Trans. Pattern Anal. Mach. Intell.
Guotian Xie received the BS degree in Automation Engineering in 2013 from Sun Yat-sen University, China. He received his PhD degree in Computer Science and Technology from Sun Yat-sen University in 2018. His research interests are deep learning and pattern recognition, especially deep learning applied to computer vision, deep-network visualization, and structure optimization of deep networks.
Kuiyuan Yang received the B.E. and Ph.D. degrees in automation from the University of Science and Technology of China, Hefei, China, in 2007 and 2012, respectively. He is currently with DeepMotion, Beijing, China. His current research interests include computer vision, deep learning and autonomous driving. He was the recipient of the Best Paper Award at the International Multimedia Modelling Conference 2010.
Jianhuang Lai received the Ph.D. degree in mathematics in 1999 from Sun Yat-Sen University, China. He joined Sun Yat-Sen University in 1989 as an assistant professor, where he is currently a Professor of the School of Data and Computer Science. His current research interests are in the areas of digital image processing, pattern recognition, multimedia communication, multiple target tracking, and wavelet and its applications. He has published over 100 scientific papers in the international journals and conferences on image processing and pattern recognition, e.g., IEEE TPAMI, IEEE TNN, IEEE TKDE, IEEE TIP, IEEE TSMC (Part B), IEEE TCSVT, Pattern Recognition, ICCV, CVPR, ECCV, and ICDM. He serves as a Vice Director of the Image and Graphics Association of China. He is a senior member of the IEEE.