Combating spatial redundancy with spectral norm attention in convolutional learners
Introduction
Inter-pixel spatial redundancy is an intrinsic characteristic of digital images [1]. It means that neighboring pixels are not statistically independent: the value of any given pixel can be predicted from the values of its neighbors, since they are highly correlated. Fig. 1 demonstrates that inter-pixel spatial redundancy is inherent and varies in degree across imaging modalities. By performing singular value decomposition (SVD) on the two images in column (a), we can calculate the explained variance ratios (EVR) of their singular values. EVR measures the percentage of a matrix's information carried by each singular value. The information carried by individual pixels is relatively small. We can group pixels into two categories: informative pixels and redundant pixels. Pixels in the directions of singular values with non-zero EVR are informative, while the others are redundant. The qualitative schematics in columns (b) and (c) show that a low-rank approximation using only the top-30 singular values can almost recover the original images, and that the top-1 singular value (i.e., the spectral norm) contributes the most information. From column (d), we find that the number of singular values with non-zero EVR is less than 30 for the natural image (red dot in the first row) and less than 10 for the medical image (red dot in the second row). Considering the dimensions of the two images, it is clear that the medical image is more redundant. The statistics in column (e) also support this claim: for varying numbers of singular values, the average maximal explained variance of medical images is larger than that of natural images.
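To make the SVD-based analysis concrete, the following NumPy sketch computes explained variance ratios and a rank-k approximation for a toy image; the synthetic gradient image stands in for the images in Fig. 1 and is an illustrative assumption, not the paper's actual data.

```python
import numpy as np

# Toy "image": a smooth gradient plus mild noise stands in for a natural
# photo (real images would be loaded and converted to grayscale first).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 64)
img = np.outer(x, x) + 0.01 * rng.standard_normal((64, 64))

# SVD of the pixel matrix; singular values come back in descending order.
U, s, Vt = np.linalg.svd(img, full_matrices=False)

# Explained variance ratio (EVR): the share of the matrix's total energy
# carried by each singular value.
evr = s**2 / np.sum(s**2)

# Rank-k approximation: keep only the top-k singular directions.
k = 30
img_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

rel_err = np.linalg.norm(img - img_k) / np.linalg.norm(img)
print(f"top-1 EVR: {evr[0]:.3f}, rank-{k} relative error: {rel_err:.4f}")
```

Because the toy image is nearly rank-one, the top-1 EVR dominates and the rank-30 reconstruction is close to exact, mirroring the behavior described for Fig. 1.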
Given the above investigation, it is necessary to account for inter-pixel spatial redundancy during image processing. Inter-pixel spatial redundancy leads to low efficiency and repetitive features during inference of vision learners [2], [3]. The redundant features contribute little to vision learners and may even cause over-fitting, thus diminishing generalization quality [4]. Hence, when highly informative features are drowned in considerable redundant features, it is challenging to extract the discriminative features needed to increase learning accuracy [5]. An intuitive strategy for combating spatial redundancy is to apply redundancy removal approaches, such as PCA [6] and DCT [7], to pre-process images before they are input into networks. Recently, to encourage learning useful features from images with spatial redundancy, MAE [8] presented a simple strategy that reduces redundancy by masking a very high portion of random patches. Although pre-processing methods can effectively address data redundancy, they require task-specific customization and may give rise to unanticipated performance degradation. Hence, encouraged by pioneering work [3] that reduces spatial redundancy by reparameterizing convolutional layers, we consider improving vision learners themselves to combat spatial redundancy, a task-agnostic and robust solution.
Among vision learners, vision transformers (ViTs) [9] can capture intra-image long-range dependencies and exhibit impressive performance, which is why MAE [8] uses ViTs as its encoder. The self-attention mechanism in ViTs favors learning global discriminative features [10], thus addressing spatial redundancy to some extent. In contrast to ViTs, convolutional neural networks (CNNs) provide locality via a limited receptive field [11], establishing a useful prior for image processing. However, CNNs naturally capture local patterns rather than global context, so they cannot readily screen out redundant features. Given this comparison between ViTs and CNNs, we are motivated to introduce a global attention mechanism to improve CNNs, analogous to self-attention in ViTs. Attention-based models offer an adaptive aggregation mechanism, where the aggregation scheme itself is input-dependent or spatially dynamic [10]. Through the proposed global attention mechanism, we aim to identify informative and redundant features and then generate differentiated attention weights so that each contributes appropriately to the final discriminative decision.
With this intention in mind, we first briefly review attention mechanisms. Both global and local attention mechanisms [12], [13], [14], [15], [16] can advance the state-of-the-art performance of CNNs by selectively emphasizing salient features and suppressing less useful ones. A representative local attention method is squeeze-and-excitation (SE) [17], which learns channel-wise attention for each convolution block and shows encouraging performance across various CNNs. On the other hand, global attention [18], [19], [20], [21] has emerged as a recent advance due to its advantage of capturing long-range interdependencies. However, because of quadratic memory and computational requirements on high-resolution images, global attention is often intractable, and local attention remains the de facto choice in CNNs for better vision backbones. Yet local attention cannot identify informative and redundant features. To this end, we propose a novel attention mechanism for CNNs, termed spectral norm attention (SNA), which leverages globally expressive power while retaining the locality and shift-invariance of convolutional learners. The global SNA aims to help convolutional learners effectively learn informative features from images with many redundant pixels without incurring heavy computation.
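As a point of reference for the local-attention baseline discussed above, here is a minimal NumPy sketch of the squeeze-and-excitation idea; the function name, the reduction ratio, and the random weights are illustrative assumptions, not the original SE implementation.

```python
import numpy as np

def squeeze_excite(feat, w1, w2):
    """Minimal squeeze-and-excitation (SE) channel attention sketch.

    feat: feature map of shape (C, H, W); w1, w2: weights of the two
    fully connected layers forming the bottleneck (shapes (C//r, C)
    and (C, C//r) for reduction ratio r).
    """
    # Squeeze: global average pooling collapses each channel to a scalar.
    z = feat.mean(axis=(1, 2))                      # (C,)
    # Excite: bottleneck MLP with ReLU, then sigmoid gating.
    h = np.maximum(w1 @ z, 0.0)                     # (C//r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ h)))          # (C,), each in (0, 1)
    # Scale: reweight each channel by its learned importance.
    return feat * gate[:, None, None]

rng = np.random.default_rng(0)
C, r = 8, 4
feat = rng.standard_normal((C, 16, 16))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
out = squeeze_excite(feat, w1, w2)
print(out.shape)
```

Note that the gate is purely channel-wise: every spatial location in a channel is scaled by the same scalar, which is exactly why such local attention cannot separate informative from redundant features within a feature map.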
Specifically, given a matrix obtained by feature aggregation on an image, SNA performs SVD on this matrix to obtain its spectral norm. The highly informative features lie in the direction of the spectral norm, while features in the directions of the other singular values are redundant. We intend to enhance informative features along the spectral-norm direction and penalize redundant features in the other directions. After screening out redundant features of the matrix, we employ the rank-one approximation in the spectral-norm direction as attention weights to boost the discriminative contribution of informative features. In addition, to solve for the spectral norm efficiently, we apply the power iteration algorithm to approximate it. Compared to vanilla CNNs, the computational overhead of integrating SNA is slight, almost negligible. Concretely, we make the following contributions:
- A longstanding challenge for CNNs is learning discriminative features from digital images with spatial redundancy. We propose a novel and efficient attention block to address this challenge, named spectral norm attention (SNA), which assigns attention scores to features according to their global informativeness.
- SNA performs SVD on an aggregated feature matrix for each image within a mini-batch to obtain its spectral norm. We further generate distinguishing attention weights for informative and redundant features by utilizing the rank-one approximation in the spectral-norm direction. By emphasizing the highly informative features in the direction of the spectral norm, SNA helps convolutional learners improve generalization.
- SNA can be seamlessly integrated into off-the-shelf CNNs to improve performance without a heavy cost in model efficiency. We demonstrate the effectiveness of SNA through experiments on five image datasets. Experimental results also show that SNA gains more on data with heavy spatial redundancy, such as medical images.
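The core computation described above, power iteration to approximate the top singular pair and a rank-one approximation used as attention weights, can be sketched as follows in NumPy. This is an illustrative approximation of the idea only; the sigmoid gating and the function name are assumptions, and the actual SNA block's normalization and integration details are given in Section 3.

```python
import numpy as np

def spectral_norm_attention(feat, n_iter=200):
    """Sketch of the SNA idea: power iteration approximates the top
    singular triple (u1, s1, v1) of a feature matrix, and the rank-one
    approximation s1 * u1 v1^T serves as attention weights favoring
    features that lie in the spectral-norm direction."""
    v = np.random.default_rng(0).standard_normal(feat.shape[1])
    v /= np.linalg.norm(v)
    # Alternating power iteration converges to the top singular vectors.
    for _ in range(n_iter):
        u = feat @ v
        u /= np.linalg.norm(u)
        v = feat.T @ u
        v /= np.linalg.norm(v)
    sigma = u @ feat @ v                 # spectral norm estimate
    rank_one = sigma * np.outer(u, v)    # rank-one approximation
    # Sigmoid of the rank-one map as multiplicative attention weights
    # (an illustrative gating choice, not the paper's exact formulation).
    attn = 1.0 / (1.0 + np.exp(-rank_one))
    return feat * attn, sigma

rng = np.random.default_rng(1)
feat = rng.standard_normal((32, 32))
out, sigma = spectral_norm_attention(feat)
print(sigma, np.linalg.norm(feat, 2))
```

Power iteration costs only a few matrix-vector products per step, which is why the overhead over a vanilla convolution block remains nearly negligible.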
The remainder of this paper is organized as follows: Section 2 reviews related work, Section 3 describes our method in detail, Section 4 presents experimental analysis, Section 5 discusses the results, and Section 6 concludes.
Vision learners
Their inherent inductive biases have made CNNs the current state of the art in computer vision, and they are therefore widely used in various image recognition tasks [11]. Recently, ViTs have emerged as competitive vision learners to CNNs by using multi-head self-attention without requiring image-specific biases [9], [22]. The self-attention mechanism allows ViTs to capture global dependencies but also becomes intractable with higher-resolution inputs due to a quadratic complexity with respect to
Methodology
This section will elaborate on our proposed SNA method, including SNA block, spectral norm solution, and vision learners.
Experimental settings
- Datasets. We conduct experiments on five image datasets: CIFAR-100 [49], ImageNet [50], [51], COCO2017 [52], VOC2012 [53] (using the validation set for evaluation), and VIN-CXR [54], which localizes and classifies 14 types of thoracic abnormalities. We convert the data format of VOC2012 to that of COCO2017, including generating mask labels, so that we can run experiments on VOC2012 with Mask R-CNN [55]. The input images of the CIFAR-100 dataset are randomly cropped to and
Discussions
Our SNA is a dedicated global attention block for convolutional learners with locality and shift-invariance to combat spatial redundancy. SNA globally identifies and weights informative and redundant features for each intermediate feature map by SVD. Based on the above experimental reports, we would like to discuss the pros and cons of our SNA.
From the viewpoint of eliminating inter-pixel redundancy, SNA is a unique exploration of combating spatial redundancy during vision learning for
Conclusions
Attention mechanisms have demonstrated impressive progress in improving the performance of convolutional learners. In this work, we study how to exploit informative features by introducing a novel attention mechanism based on redundancy removal. We utilize the spectral norm of a feature matrix to calculate attention scores that assign different weights to informative and redundant features. We extensively evaluate our SNA on the tasks of image classification and
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (59)
- et al., PCA and LDA in DCT domain, Pattern Recogn. Lett. (2005)
- et al., SVD-based redundancy removal in 1-D CNNs for acoustic scene classification, Pattern Recogn. Lett. (2020)
- et al., A robust meaningful image encryption scheme based on block compressive sensing and SVD embedding, Signal Processing (2020)
- Image compression technique, IEEE Potentials (2001)
- et al., Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Adv. Neural Inform. Process. Syst. (2020)
- et al., Spatially adaptive inference with stochastic feature sampling and interpolation
- et al., Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weight-based approach, J. Mach. Learn. Res. (2005)
- et al., Unsupervised feature selection via nonnegative spectral analysis and redundancy control, IEEE Trans. Image Process. (2015)
- Z. Pan, A.G. Rust, H. Bolouri, Image redundancy reduction for neural network classification using discrete cosine...
- K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, arXiv preprint...
- Learned queries for efficient local attention
- Attention augmented convolutional networks
- CBAM: convolutional block attention module
- A²-Nets: double attention networks, Adv. Neural Inform. Process. Syst.
- Dual attention network for scene segmentation
- Exploring self-attention for image recognition
- Squeeze-and-excitation networks
- Non-local neural networks
- Stand-alone self-attention in vision models, Adv. Neural Inform. Process. Syst.
- Bottleneck transformers for visual recognition
- CrossViT: cross-attention multi-scale vision transformer for image classification
- TESA: tensor element self-attention via matricization
- Swin Transformer: hierarchical vision transformer using shifted windows
Jiansheng Fang is a Ph.D. candidate in the School of Computer Science and Technology at Harbin Institute of Technology and a researcher at CVTE. His main research interests include computer vision, medical image processing, and image retrieval.
Dan Zeng received B.E. and Ph.D. in computer science and technology from Sichuan University in 2013 and 2018. From 2018 to 2020, she worked as a post-doc research fellow in the Data Management and Biometrics Group at the University of Twente, the Netherlands. She is currently a research assistant professor at the Southern University of Science and Technology. Her main research interests include image processing, biometrics, and deep learning.
Xiao Yan obtained his Ph.D. in 2020 from the Chinese University of Hong Kong and is currently a research assistant professor in the Department of Computer Science and Engineering at the Southern University of Science and Technology. His research interests include large-scale machine learning, algorithms and systems for database, and especially large-scale vector search.
Yubing Zhang is a research manager and technical expert at the Machine Vision Institute of CVTE Research. He won the 20th China Invention Patent Excellence Award and the 2011 U.S. College Student Mathematical Modeling Outstanding Award (top 1%). He also received the Second Prize for the CVTE Invention Patent Quality Gold Award and the CVTE Founder’s Innovation Award. His research interests include face recognition, digital human, and metaverse.
Hongbo Liu is the Director of Machine Vision Institute at CVTE Research. Before joining CVTE, he worked as the PM of the System Architecture Department at Hisilicon. His research interests include machine learning-based ISP image processing, algorithms of video codec, and hardware-software architecture for computer vision systems.
Bo Tang received his Ph.D. in computer science from The Hong Kong Polytechnic University in 2017. He is currently an assistant professor at the Southern University of Science and Technology. He won ACM SIGMOD China Rising star 2021. His research interests include query optimization and data-intensive system.
Ming Yang is the CTO with CVTE (002841.SZ) and serves as the Director of CVTE Research. Before joining CVTE, he received a B.E. (2009) and a Ph.D. (2014) in Computer Science from Sun Yat-sen University. His research interests include machine learning and computer vision.
Jiang Liu obtained his Ph.D. in 2004 from the Department of Computer Science of the National University of Singapore and is currently a full professor in the Department of Computer Science and Engineering at the Southern University of Science and Technology. His main research interests include medical image processing and artificial intelligence.