High-Order-Interaction for weakly supervised Fine-Grained Visual Categorization

doi:10.1016/j.neucom.2021.08.108

Neurocomputing

Volume 464, 13 November 2021, Pages 27-36

https://doi.org/10.1016/j.neucom.2021.08.108

Abstract

Fine-Grained Visual Categorization (FGVC) is a challenging task due to large intra-subcategory and small inter-subcategory variances. Recent studies tackle this task in a weakly supervised manner, without using part annotations from experts. Among these, methods based on bilinear pooling form one of the main categories for computing the interaction between deep features and have proven highly effective. However, these methods mainly focus on the correlation within one specific layer and largely ignore the high-order interactions between multiple layers. In this study, we argue that considering the high-order interaction between features from multiple layers can help to learn more distinguishing fine-grained features. To this end, we propose a High-Order-Interaction (HOI) method for FGVC. In HOI, an efficient cross-layer trilinear pooling is introduced to calculate the third-order interaction between three different layers. The third-order interactions of different layer combinations are then fused to form the final representation. HOI produces more discriminative representations and can be readily integrated with two popular techniques, the attention mechanism and the triplet loss, for additional improvement. Extensive experiments on four FGVC datasets show that our method outperforms bilinear-based methods and achieves state-of-the-art performance.

Introduction

In the past few years, the performance of general image classification on large-scale datasets (e.g., ImageNet [1] and Places [2]) has improved significantly along with the development of deep neural networks (DNNs) [3], [4], [5]. Compared with general image classification, Fine-Grained Visual Categorization (FGVC) is more challenging [6], [7], [8]. FGVC needs to classify images into subcategories that have small inter-class variations, such as bird species [9], vehicle models [10], and aircraft types [11]. Fig. 1(a) illustrates the challenge of large intra-subcategory variance and small variance among similar subcategories through examples of bird images. As can be seen, different types of gull look very similar, while the same type of gull can vary greatly in appearance. Human experts usually need to find the representative regions to classify an image into the correct subcategory. To mimic this behavior, a straightforward approach is to train the CNN with representative part annotations. However, annotating such regions requires expertise and is time-consuming. To deal with these issues, recent studies explore weakly supervised FGVC [12], [13], [14], [15], in which only subcategory-level annotations are provided for training CNN models.

To construct an accurate weakly supervised FGVC method, two main concerns need to be addressed: 1) how to locate the informative regions, and 2) how to learn more discriminative features. Regarding the first concern, many methods based on region localization [16], [17], [18], [19] have been developed. These methods usually consist of two steps: they first localize the object parts and then extract the corresponding features from those parts for final classification. Despite their success, the lack of expert labeling makes it difficult for these methods to define the informative parts and to decide the optimal number of parts. Furthermore, the two-stage scheme may increase the training complexity and computation cost. Alternatively, several attention-based methods [20], [21], [22] have been proposed to find the informative regions. In these methods, a specially designed attention module computes attention weights that highlight important regions. Regarding the second concern, bilinear-based methods [23], [24], [25] have been widely used. These methods concentrate on increasing feature discrimination by modeling the underlying relationship within the feature regions through bilinear pooling. However, most of them only use the features from the last convolution layer to compute the final feature representation. In fact, after several max pooling operations, the last convolution layer has discarded many local details, leaving the features insufficient to overcome the small inter-subcategory variations and significant intra-subcategory variations. To address this problem, there have been some attempts [26], [22] to use more layers to further improve discrimination. The most representative method is the Hierarchical Bilinear Pooling (HBP) proposed in [26]. HBP considers three-layer feature interaction by stacking several two-layer bilinear pooling features, and its experimental results demonstrated that using more layers can significantly improve accuracy. However, it still only considers the second-order interaction and fails to exploit the direct interaction between more than two layers.
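To make the difference between single-layer bilinear pooling and stacked two-layer interactions concrete, the following is a minimal NumPy sketch. It is not the implementation of [23] or [26]: the feature maps are random, they are assumed to share one spatial size, and the projection matrices `u`, `v`, `w`, the projected dimension of 16, and all function names are hypothetical choices made for illustration. The signed square root and L2 normalisation follow common practice for bilinear CNN features.

```python
import numpy as np

def bilinear_pool(x):
    """Single-layer bilinear pooling of a feature map x with shape (C, H, W)."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)                   # one C-dim descriptor per location
    gram = flat @ flat.T / (h * w)               # (C, C) average of outer products
    vec = gram.reshape(-1)
    vec = np.sign(vec) * np.sqrt(np.abs(vec))    # signed square root
    return vec / (np.linalg.norm(vec) + 1e-12)   # L2 normalisation

def project(x, p):
    """Project a (C, H, W) map to (D, H*W) with a (C, D) matrix p (hypothetical)."""
    return p.T @ x.reshape(x.shape[0], -1)

def pairwise_interaction(px, py):
    """Second-order interaction of two projected layers, sum-pooled over locations."""
    return (px * py).sum(axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))   # three hypothetical layers on a shared 4x4 grid
y = rng.standard_normal((6, 4, 4))
s = rng.standard_normal((5, 4, 4))
u, v, w = (rng.standard_normal((c, 16)) for c in (8, 6, 5))

single = bilinear_pool(x)            # shape (64,): interactions within one layer
px, py, ps = project(x, u), project(y, v), project(s, w)
stacked = np.concatenate([           # shape (48,): three stacked two-layer terms
    pairwise_interaction(px, py),
    pairwise_interaction(px, ps),
    pairwise_interaction(py, ps),
])
```

Every entry of `stacked` is still a product of exactly two layers, which is the second-order limitation described above.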

To deal with the shortcomings of these methods, we argue that considering the direct correlations between multiple layers can help distinguish the small variations among different subcategories. To this end, we propose a simple but effective method, called High-Order-Interaction (HOI), for the Fine-Grained Visual Categorization task. First, we utilize an attention mechanism to construct a multi-scale feature pyramid that contains semantic information from different layers with various receptive fields. Then, we introduce a cross-layer trilinear pooling operation to compute the high-order interaction among three different layers. The proposed cross-layer trilinear pooling only involves several dot product operations, making it efficient and easy to implement. Next, we concatenate the high-order interactions of different layer combinations to obtain the final feature representation. HOI enables the model to discover and focus on discriminative regions in the feature map (as shown in Fig. 1(b)), so that the final representation is more robust to inter- and intra-subcategory variance. Finally, we train the whole model with the cross-entropy loss and the triplet loss.
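The pooling and fusion steps described above can be sketched as follows. This excerpt does not give the exact formulation, so the sketch rests on stated assumptions: each pyramid level has already been projected to a common channel dimension and resized to a common spatial size, the third-order interaction is taken as an element-wise product of three maps summed over locations, and the fusion is a plain concatenation over all three-layer combinations. The function names, the four levels, and the sizes are hypothetical.

```python
from itertools import combinations

import numpy as np

def trilinear_pool(a, b, c):
    """Third-order interaction of three projected maps, each of shape (D, H, W)."""
    prod = a * b * c                                  # element-wise product per location
    return prod.reshape(a.shape[0], -1).sum(axis=1)   # (D,) sum over spatial locations

def high_order_representation(maps):
    """Concatenate trilinear features over every 3-layer combination of `maps`."""
    feats = [
        trilinear_pool(maps[i], maps[j], maps[k])
        for i, j, k in combinations(range(len(maps)), 3)
    ]
    rep = np.concatenate(feats)
    return rep / (np.linalg.norm(rep) + 1e-12)        # L2 normalisation (assumed)

rng = np.random.default_rng(0)
# Four hypothetical pyramid levels, assumed already projected to D=16 channels
# and resized to a shared 7x7 grid.
maps = [rng.standard_normal((16, 7, 7)) for _ in range(4)]
rep = high_order_representation(maps)   # 4 combinations x 16 dims = 64 values
```

Unlike a stack of pairwise terms, each entry of `rep` multiplies responses from three layers at the same location, so it is only large where all three layers respond together.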

To summarize, the main contributions of this study are as follows:

• We propose an easy and efficient High-Order-Interaction (HOI) method for computing the long-range cross-layer correlations between multiple layers. The visualization of the learned feature maps shows that the proposed HOI helps to learn more discriminative representations for Fine-Grained Visual Categorization.
• We develop a two-level framework that integrates the attention mechanism, High-Order-Interaction, and multi-task loss functions, which boosts the final performance.
• Extensive experiments on four fine-grained categorization benchmark datasets (i.e., CUB-200-2011, Stanford Cars, FGVC Aircraft, and Oxford Flower 102) demonstrate the effectiveness of our method, which finds discriminative regions and reaches state-of-the-art performance.
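The multi-task objective mentioned above combines a cross-entropy loss with a triplet loss. The sketch below shows one common way to write such a combined objective for a single sample. The weighting factor `weight`, the margin `margin`, the Euclidean distance, and the function names are assumptions for illustration; this excerpt does not state the values or the exact form the authors use.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw class scores (numerically stable)."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    d_pos = np.linalg.norm(anchor - positive)   # same subcategory: should be small
    d_neg = np.linalg.norm(anchor - negative)   # other subcategory: should be large
    return float(max(d_pos - d_neg + margin, 0.0))

def total_loss(logits, label, anchor, positive, negative, weight=1.0, margin=1.0):
    """Weighted sum of both terms; `weight` is a hypothetical balancing factor."""
    return cross_entropy(logits, label) + weight * triplet_loss(
        anchor, positive, negative, margin
    )

anchor = np.array([0.0, 0.0])
near = np.array([0.0, 1.0])    # distance 1 from the anchor
far = np.array([3.0, 4.0])     # distance 5 from the anchor
easy = triplet_loss(anchor, near, far)   # margin already satisfied
hard = triplet_loss(anchor, far, near)   # positive is farther than the negative
```

The triplet term is zero once the negative is at least `margin` farther from the anchor than the positive, so it only penalises embeddings in which similar-looking subcategories are still too close together.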

Section snippets

Related works

In this section, we give a brief review of bilinear-based and attention-based methods of Fine-Grained Visual Categorization, which are most relevant to our method.

Method

The overall structure of the proposed model is illustrated in Fig. 2. It is built upon an ImageNet pre-trained ResNet50 model [4] and consists of two main levels: (1) the Feature Attention Pyramid and (2) High-Order-Interaction. In the following sections, we describe our approach in detail from four aspects: a preliminary introduction to squeeze-and-excitation attention [30], the feature attention pyramid, the proposed High-Order-Interaction, and the training loss function.
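As background for the attention component, squeeze-and-excitation attention re-weights the channels of a feature map using a gate computed from globally pooled statistics. The following is a minimal NumPy sketch of that general mechanism, not of the authors' module: the weights `w1` and `w2` are random stand-ins for learned parameters, biases are omitted for brevity, and the reduction ratio of 4 is an arbitrary choice.

```python
import numpy as np

def squeeze_excitation(x, w1, w2):
    """Channel attention on x of shape (C, H, W).

    w1: (C // r, C) reduction weights, w2: (C, C // r) expansion weights.
    Returns the re-weighted map and the per-channel gates.
    """
    squeezed = x.mean(axis=(1, 2))                  # squeeze: global average pool, (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)         # bottleneck with ReLU, (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # excitation: sigmoid gates, (C,)
    return x * gates[:, None, None], gates          # rescale every channel

rng = np.random.default_rng(0)
channels, ratio = 8, 4
x = rng.standard_normal((channels, 5, 5))
w1 = rng.standard_normal((channels // ratio, channels))
w2 = rng.standard_normal((channels, channels // ratio))
out, gates = squeeze_excitation(x, w1, w2)          # out has the same shape as x
```

Because each gate lies strictly between 0 and 1, the module can only attenuate channels relative to one another; it does not change the spatial layout of the feature map.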

Datasets

To evaluate the performance of our proposed method, we conduct experiments on four widely used Fine-Grained Visual Categorization datasets, i.e., Caltech-UCSD Birds-200-2011 (CUB-200-2011) [9], Stanford Cars [10], FGVC Aircraft [11], and Oxford Flower 102 [38]. The number of categories and data split for training and testing of each dataset are listed in Table 1.

Caltech-UCSD Birds-200-2011. Caltech-UCSD Birds-200-2011 (CUB-200-2011) is considered one of the most competitive datasets in the

Conclusion

In this paper, we propose a new weakly supervised fine-grained categorization method that does not require any bounding box or part annotations. In our method, a high-order interaction is used to compute the correlation between multiple different layers. The High-Order-Interaction is implemented through a simple dot product between the features from three layers. In addition, we develop a two-level framework that can benefit from the feature attention pyramid and the proposed

CRediT authorship contribution statement

Junzheng Wang: Conceptualization, Methodology, Writing - original draft. Nanyu Li: Conceptualization, Methodology, Writing - original draft. Zhiming Luo: Writing - review & editing, Funding acquisition. Zhun Zhong: Writing - review & editing. Shaozi Li: Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grants 61876159, 61806172 and U1705286; the China Postdoctoral Science Foundation under Grant 2019M652257; and the Open Fund Project of Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (Grant No. MJUKF-IPIC202006).

Junzheng Wang received the B.S. degree in Network Engineering from Shandong Technology and Business University, China, in 2018. He is currently working toward the M.S. degree in computer science at Xiamen University. His research interests include Fine-Grained Visual Categorization and computer vision.

References (62)

H. Huang et al., Self-adaptive manifold discriminant analysis for feature extraction from hyperspectral imagery, Pattern Recogn. (2020)
Q. Sun et al., Hyperlayer bilinear pooling with application to fine-grained categorization and image retrieval, Neurocomputing (2018)
J. Zhao et al., Attribute hierarchy based multi-task learning for fine-grained image classification, Neurocomputing (2020)
M. Luo et al., Stochastic region pooling: Make attention more expressive, Neurocomputing (2020)
Y. Zhu et al., Ta-cnn: Two-way attention models in deep convolutional neural network for plant recognition, Neurocomputing (2019)
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in:...
B. Zhou et al., Places: A 10 million image database for scene recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
K. Simonyan et al., Very deep convolutional networks for large-scale image recognition, Proc. ICLR (2015)
K. He et al., Deep residual learning for image recognition, Proc. CVPR (2016)
T. Chen, W. Wu, Y. Gao, L. Dong, X. Luo, L. Lin, Fine-grained representation learning and recognition by exploiting...
X. He et al., Only learn one sample: Fine-grained visual categorization with one sample training, Proc. ACM MM (2018)
Z. Wang, S. Wang, P. Zhang, H. Li, W. Zhong, J. Li, Weakly supervised fine-grained image classification via...
X. He et al., A new benchmark and approach for fine-grained cross-media retrieval, Proc. ACM MM (2019)
C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011 Dataset, Tech. Rep....
J. Krause et al., 3d object representations for fine-grained categorization, Proc. ICCVW (2013)
S. Maji, E. Rahtu, J. Kannala, M. Blaschko, A. Vedaldi, Fine-grained visual classification of aircraft, arXiv preprint...
J. Fu et al., Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition, Proc. CVPR (2017)
Z. Yang et al., Learning to navigate for fine-grained classification, Proc. ECCV (2018)
H. Zheng et al., Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition, Proc. CVPR (2019)
Y. Zhang et al., Part-aware fine-grained object categorization using weakly supervised part detection network, IEEE Trans. Multimedia (2020)
X. He et al., Fast fine-grained image classification via weakly supervised discriminative localization, IEEE Trans. Circuits Syst. Video Technol. (2018)
M. Simon, E. Rodner, Neural activation constellations: Unsupervised part model discovery with convolutional networks,...
Y. Peng et al., Object-part attention model for fine-grained image classification, IEEE Trans. Image Process. (2018)
H. Zheng et al., Learning rich part hierarchies with progressive attention networks for fine-grained image recognition, IEEE Trans. Image Process. (2019)
J. Fu et al., Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition, Proc. CVPR (2017)
M. Sun et al., Multi-attention multi-class constraint for fine-grained image recognition, Proc. ECCV (2018)
W. Luo, X. Yang, X. Mo, Y. Lu, L.S. Davis, J. Li, J. Yang, S.-N. Lim, Cross-x learning for fine-grained visual...
T.-Y. Lin et al., Bilinear cnn models for fine-grained visual recognition, Proc. CVPR (2015)
K. Yue et al., Compact generalized non-local network, Proc. NeurIPS (2018)
T.-Y. Lin et al., Improved bilinear pooling with cnns, Proc. BMVC (2017)
C. Yu et al., Hierarchical bilinear pooling for fine-grained visual recognition, Proc. ECCV (2018)

Nanyu Li received the B.S. degree in communication engineering from Jilin University Zhuhai College, China, in 2017. He is currently working toward the M.S. degree in computer science at Kunming University of Science and Technology and is also a visiting master's student at the Wangxuan Institute of Computer Technology, Peking University. His research interests include Fine-Grained Visual Categorization, optimization, and computer vision.

Zhiming Luo received the B.S. degree in cognitive science from Xiamen University, Xiamen, China, in 2011 and the Ph.D. degree in computer science from Xiamen University and the University of Sherbrooke, Sherbrooke, QC, Canada. He is currently a postdoctoral researcher in the Artificial Intelligence Department at Xiamen University. His research interests include traffic surveillance video analytics, computer vision, and machine learning.

Zhun Zhong received the Ph.D. degree from Xiamen University, China, in 2019 and the M.S. degree from China University of Petroleum, China, in 2015. He is currently a postdoc at the University of Trento, Italy. He was also a joint Ph.D. student at the University of Technology Sydney. His research interests include person re-identification, data augmentation, and domain adaptation.

Shaozi Li received the B.S. degree from Hunan University, Changsha, China; the M.S. degree from Xi’an Jiaotong University, Xi’an, China; and the Ph.D. degree from the National University of Defense Technology, Changsha. He is the Chair and Professor of the Department of Artificial Intelligence, Xiamen University, Xiamen, China, the Vice Director of the Technical Committee on Collaborative Computing of CCF, and the Vice Director of the Fujian Association of Artificial Intelligence. He has directed and completed more than 20 research projects, including several national 863 Programs, the National Natural Science Foundation of China, and the Ph.D. Programs Foundation of the Ministry of Education of China. Furthermore, he has authored nearly 300 papers in journals and international conferences. His research interests include artificial intelligence and its applications, moving object detection and recognition, machine learning, computer vision, and multimedia information retrieval. Dr. Li is a Senior Member of ACM and the China Computer Federation.

¹: Joint first authors
