Combating spatial redundancy with spectral norm attention in convolutional learners
Introduction
Inter-pixel spatial redundancy is an intrinsic characteristic of digital images [1]. It means that neighboring pixels are not statistically independent: the value of any given pixel can be predicted from the values of its neighbors, since they are highly correlated. Fig. 1 demonstrates that inter-pixel spatial redundancy is inherent and varies in degree across imaging modalities. By performing singular value decomposition (SVD) on the two images in column (a), we can calculate the explained variance ratios (EVR) of their singular values. EVR measures the percentage of a matrix's information carried by each singular value. The information carried by individual pixels is relatively small. We can group pixels into two categories: informative pixels and redundant pixels. Pixels in the directions of singular values with non-zero EVR are informative, while the others are redundant. The qualitative schematics in columns (b) and (c) show that a low-rank approximation using only the top-30 singular values can almost recover the original images, and that the top-1 singular value (i.e., the spectral norm) contributes the most information. From column (d), we find that the number of singular values with non-zero EVR is less than 30 for the natural image (red dot in the first row) and less than 10 for the medical image (red dot in the second row). Considering the dimensions of the two images, it is clear that the medical image is more redundant. The statistics in column (e) also support this claim: for varying numbers of singular values, the average maximal explained variance of medical images is larger than that of natural images.
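To make the SVD-based analysis concrete, the following NumPy sketch computes explained variance ratios and a rank-k approximation for a toy image; the synthetic gradient image stands in for the images in Fig. 1 and is an illustrative assumption, not the paper's actual data.

```python
import numpy as np

# Toy "image": a smooth gradient plus mild noise stands in for a natural
# photo (real images would be loaded and converted to grayscale first).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 64)
img = np.outer(x, x) + 0.01 * rng.standard_normal((64, 64))

# SVD of the pixel matrix; singular values come back in descending order.
U, s, Vt = np.linalg.svd(img, full_matrices=False)

# Explained variance ratio (EVR): the share of the matrix's total energy
# carried by each singular value.
evr = s**2 / np.sum(s**2)

# Rank-k approximation: keep only the top-k singular directions.
k = 30
img_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

rel_err = np.linalg.norm(img - img_k) / np.linalg.norm(img)
print(f"top-1 EVR: {evr[0]:.3f}, rank-{k} relative error: {rel_err:.4f}")
```

Because the toy image is nearly rank-one, the top-1 EVR dominates and the rank-30 reconstruction is close to exact, mirroring the behavior described for Fig. 1.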
Given the above investigation, it is necessary to account for inter-pixel spatial redundancy during image processing. Inter-pixel spatial redundancy leads to low efficiency and repetitive features during inference of vision learners [2], [3]. The redundant features contribute little to vision learners and may even cause over-fitting, thus diminishing generalization quality [4]. Hence, when highly informative features are drowned in considerable redundant features, it is challenging to extract the discriminative features needed to increase learning accuracy [5]. An intuitive strategy for combating spatial redundancy is to apply redundancy removal approaches, such as PCA [6] and DCT [7], to pre-process images before they are input into networks. Recently, to encourage learning useful features from images with spatial redundancy, MAE [8] presented a simple strategy that reduces redundancy by masking a very high portion of random patches. Although pre-processing methods can effectively address data redundancy, they require task-specific customization and may give rise to unanticipated performance degradation. Hence, encouraged by pioneering work [3] that reduces spatial redundancy by reparameterizing convolutional layers, we consider improving vision learners themselves to combat spatial redundancy, a task-agnostic and robust solution.
Among vision learners, vision transformers (ViTs) [9] can capture intra-image long-range dependencies and exhibit impressive performance, which is why MAE [8] uses ViTs as its encoder. The self-attention mechanism in ViTs favors learning global discriminative features [10], thus addressing spatial redundancy to some extent. In contrast to ViTs, convolutional neural networks (CNNs) provide locality via a limited receptive field [11], establishing a useful prior for image processing. However, CNNs naturally capture local patterns rather than global context, so they cannot readily screen out redundant features. Given this comparison between ViTs and CNNs, we are motivated to introduce a global attention mechanism to improve CNNs, analogous to self-attention in ViTs. Attention-based models offer an adaptive aggregation mechanism, where the aggregation scheme itself is input-dependent or spatially dynamic [10]. Through the proposed global attention mechanism, we aim to identify informative and redundant features and then generate differentiated attention weights so that each contributes appropriately to the final discriminative decision.
With this intention in mind, we first briefly review attention mechanisms. Both global and local attention mechanisms [12], [13], [14], [15], [16] can advance the state-of-the-art performance of CNNs by selectively emphasizing salient features and suppressing less useful ones. A representative local attention method is squeeze-and-excitation (SE) [17], which learns channel-wise attention for each convolution block and shows encouraging performance across various CNNs. On the other hand, global attention [18], [19], [20], [21] has emerged as a recent advance due to its advantage of capturing long-range interdependencies. However, because of quadratic memory and computational requirements on high-resolution images, global attention is often intractable, and local attention remains the de facto choice in CNNs for better vision backbones. Yet local attention cannot identify informative and redundant features. To this end, we propose a novel attention mechanism for CNNs, termed spectral norm attention (SNA), which leverages globally expressive power while retaining the locality and shift-invariance of convolutional learners. The global SNA aims to help convolutional learners effectively learn informative features from images with many redundant pixels without incurring heavy computation.
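As a point of reference for the local-attention baseline discussed above, here is a minimal NumPy sketch of the squeeze-and-excitation idea; the function name, the reduction ratio, and the random weights are illustrative assumptions, not the original SE implementation.

```python
import numpy as np

def squeeze_excite(feat, w1, w2):
    """Minimal squeeze-and-excitation (SE) channel attention sketch.

    feat: feature map of shape (C, H, W); w1, w2: weights of the two
    fully connected layers forming the bottleneck (shapes (C//r, C)
    and (C, C//r) for reduction ratio r).
    """
    # Squeeze: global average pooling collapses each channel to a scalar.
    z = feat.mean(axis=(1, 2))                      # (C,)
    # Excite: bottleneck MLP with ReLU, then sigmoid gating.
    h = np.maximum(w1 @ z, 0.0)                     # (C//r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ h)))          # (C,), each in (0, 1)
    # Scale: reweight each channel by its learned importance.
    return feat * gate[:, None, None]

rng = np.random.default_rng(0)
C, r = 8, 4
feat = rng.standard_normal((C, 16, 16))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
out = squeeze_excite(feat, w1, w2)
print(out.shape)
```

Note that the gate is purely channel-wise: every spatial location in a channel is scaled by the same scalar, which is exactly why such local attention cannot separate informative from redundant features within a feature map.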
Specifically, given a matrix obtained by feature aggregation on an image, SNA performs SVD on this matrix to obtain its spectral norm. The highly informative features lie in the direction of the spectral norm, while features in the directions of the other singular values are redundant. We intend to enhance informative features along the spectral-norm direction and penalize redundant features in the other directions. After screening out redundant features of the matrix, we employ the rank-one approximation in the spectral-norm direction as attention weights to boost the discriminative contribution of informative features. In addition, to solve for the spectral norm efficiently, we apply the power iteration algorithm to approximate it. Compared to vanilla CNNs, the computational overhead of integrating SNA is slight, almost negligible. Concretely, we make the following contributions:
- A longstanding challenge for CNNs is learning discriminative features from digital images with spatial redundancy. We propose a novel and efficient attention block to address this challenge, named spectral norm attention (SNA), which assigns attention scores to features according to their global informativeness.
- SNA performs SVD on an aggregated feature matrix for each image within a mini-batch to obtain its spectral norm. We further generate distinguishing attention weights for informative and redundant features by utilizing the rank-one approximation in the spectral-norm direction. By emphasizing the highly informative features in the direction of the spectral norm, SNA helps convolutional learners improve generalization.
- SNA can be seamlessly integrated into off-the-shelf CNNs to improve performance without a heavy cost in model efficiency. We demonstrate the effectiveness of SNA through experiments on five image datasets. Experimental results also show that SNA gains more on data with heavy spatial redundancy, such as medical images.
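The core computation described above, power iteration to approximate the top singular pair and a rank-one approximation used as attention weights, can be sketched as follows in NumPy. This is an illustrative approximation of the idea only; the sigmoid gating and the function name are assumptions, and the actual SNA block's normalization and integration details are given in Section 3.

```python
import numpy as np

def spectral_norm_attention(feat, n_iter=200):
    """Sketch of the SNA idea: power iteration approximates the top
    singular triple (u1, s1, v1) of a feature matrix, and the rank-one
    approximation s1 * u1 v1^T serves as attention weights favoring
    features that lie in the spectral-norm direction."""
    v = np.random.default_rng(0).standard_normal(feat.shape[1])
    v /= np.linalg.norm(v)
    # Alternating power iteration converges to the top singular vectors.
    for _ in range(n_iter):
        u = feat @ v
        u /= np.linalg.norm(u)
        v = feat.T @ u
        v /= np.linalg.norm(v)
    sigma = u @ feat @ v                 # spectral norm estimate
    rank_one = sigma * np.outer(u, v)    # rank-one approximation
    # Sigmoid of the rank-one map as multiplicative attention weights
    # (an illustrative gating choice, not the paper's exact formulation).
    attn = 1.0 / (1.0 + np.exp(-rank_one))
    return feat * attn, sigma

rng = np.random.default_rng(1)
feat = rng.standard_normal((32, 32))
out, sigma = spectral_norm_attention(feat)
print(sigma, np.linalg.norm(feat, 2))
```

Power iteration costs only a few matrix-vector products per step, which is why the overhead over a vanilla convolution block remains nearly negligible.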
The remainder of this paper is organized as follows: Section 2 reviews related work, Section 3 describes our method in detail, Section 4 presents experimental analysis, Section 5 discusses the results, and Section 6 concludes.
Vision learners
Their inherent inductive biases have made CNNs the current state of the art in computer vision, and they are therefore widely used in various image recognition tasks [11]. Recently, ViTs have emerged as competitive vision learners to CNNs by using multi-head self-attention without requiring image-specific biases [9], [22]. The self-attention mechanism allows ViTs to capture global dependencies but also becomes intractable with higher-resolution inputs due to a quadratic complexity with respect to
Methodology
This section will elaborate on our proposed SNA method, including SNA block, spectral norm solution, and vision learners.
Experimental settings
- Datasets. We conduct experiments on five image datasets: CIFAR-100 [49], ImageNet [50], [51], COCO2017 [52], VOC2012 [53] (using the validation set for evaluation), and VIN-CXR [54], which localizes and classifies 14 types of thoracic abnormalities. We convert the data format of VOC2012 to that of COCO2017, including generating mask labels, so that we can run experiments on VOC2012 with Mask R-CNN [55]. The input images of the CIFAR-100 dataset are randomly cropped to and
Discussions
Our SNA is a dedicated global attention block for convolutional learners with locality and shift-invariance to combat spatial redundancy. SNA globally identifies and weights informative and redundant features for each intermediate feature map by SVD. Based on the above experimental reports, we would like to discuss the pros and cons of our SNA.
From the viewpoint of eliminating inter-pixel redundancy, SNA is a unique exploration of combating spatial redundancy during vision learning for
Conclusions
Attention mechanisms have demonstrated impressive progress in improving the performance of convolutional learners. In this work, we study how to exploit informative features by introducing a novel attention mechanism based on redundancy removal. We utilize the spectral norm of a feature matrix to calculate attention scores that assign different weights to informative and redundant features. We extensively evaluate our SNA on the tasks of image classification and
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (59)
- et al., PCA and LDA in DCT domain, Pattern Recogn. Lett. (2005)
- et al., SVD-based redundancy removal in 1-D CNNs for acoustic scene classification, Pattern Recogn. Lett. (2020)
- et al., A robust meaningful image encryption scheme based on block compressive sensing and SVD embedding, Signal Processing (2020)
- Image compression technique, IEEE Potentials (2001)
- et al., Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Adv. Neural Inform. Process. Syst. (2020)
- et al., Spatially adaptive inference with stochastic feature sampling and interpolation
- et al., Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weight-based approach, J. Mach. Learn. Res. (2005)
- et al., Unsupervised feature selection via nonnegative spectral analysis and redundancy control, IEEE Trans. Image Process. (2015)
- Z. Pan, A.G. Rust, H. Bolouri, Image redundancy reduction for neural network classification using discrete cosine...
- K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, arXiv preprint...
- Learned queries for efficient local attention
- Attention augmented convolutional networks
- CBAM: convolutional block attention module
- A²-Nets: double attention networks, Adv. Neural Inform. Process. Syst.
- Dual attention network for scene segmentation
- Exploring self-attention for image recognition
- Squeeze-and-excitation networks
- Non-local neural networks
- Stand-alone self-attention in vision models, Adv. Neural Inform. Process. Syst.
- Bottleneck transformers for visual recognition
- CrossViT: cross-attention multi-scale vision transformer for image classification
- TESA: tensor element self-attention via matricization
- Swin Transformer: hierarchical vision transformer using shifted windows
Jiansheng Fang is a Ph.D. candidate in the School of Computer Science and Technology at Harbin Institute of Technology and a researcher at CVTE. His main research interests include computer vision, medical image processing, and image retrieval.
Dan Zeng received B.E. and Ph.D. in computer science and technology from Sichuan University in 2013 and 2018. From 2018 to 2020, she worked as a post-doc research fellow in the Data Management and Biometrics Group at the University of Twente, the Netherlands. She is currently a research assistant professor at the Southern University of Science and Technology. Her main research interests include image processing, biometrics, and deep learning.
Xiao Yan obtained his Ph.D. in 2020 from the Chinese University of Hong Kong and is currently a research assistant professor in the Department of Computer Science and Engineering at the Southern University of Science and Technology. His research interests include large-scale machine learning, algorithms and systems for database, and especially large-scale vector search.
Yubing Zhang is a research manager and technical expert at the Machine Vision Institute of CVTE Research. He won the 20th China Invention Patent Excellence Award and the 2011 U.S. College Student Mathematical Modeling Outstanding Award (top 1%). He also received the Second Prize for the CVTE Invention Patent Quality Gold Award and the CVTE Founder’s Innovation Award. His research interests include face recognition, digital human, and metaverse.
Hongbo Liu is the Director of Machine Vision Institute at CVTE Research. Before joining CVTE, he worked as the PM of the System Architecture Department at Hisilicon. His research interests include machine learning-based ISP image processing, algorithms of video codec, and hardware-software architecture for computer vision systems.
Bo Tang received his Ph.D. in computer science from The Hong Kong Polytechnic University in 2017. He is currently an assistant professor at the Southern University of Science and Technology. He won ACM SIGMOD China Rising star 2021. His research interests include query optimization and data-intensive system.
Ming Yang is the CTO with CVTE (002841.SZ) and serves as the Director of CVTE Research. Before joining CVTE, he received a B.E. (2009) and a Ph.D. (2014) in Computer Science from Sun Yat-sen University. His research interests include machine learning and computer vision.
Jiang Liu obtained his Ph.D. in 2004 from the Department of Computer Science of the National University of Singapore and is currently a full professor in the Department of Computer Science and Engineering at the Southern University of Science and Technology. His main research interests include medical image processing and artificial intelligence.