Elsevier

Neurocomputing

Volume 464, 13 November 2021, Pages 27-36

High-Order-Interaction for weakly supervised Fine-Grained Visual Categorization

https://doi.org/10.1016/j.neucom.2021.08.108

Abstract

Fine-Grained Visual Categorization (FGVC) is a challenging task due to large intra-subcategory and small inter-subcategory variances. Recent studies tackle this task in a weakly supervised manner, without using part annotations from experts. Among these, methods based on bilinear pooling form one of the main categories: they compute the interaction between deep features and have proven highly effective. However, these methods mainly focus on the correlation within one specific layer and largely ignore the high-order interactions between multiple layers. In this study, we argue that considering the high-order interaction between features from multiple layers can help to learn more discriminative fine-grained features. To this end, we propose a High-Order-Interaction (HOI) method for FGVC. In HOI, an efficient cross-layer trilinear pooling is introduced to calculate the third-order interaction between three different layers. The third-order interactions of different layer combinations are then fused to form the final representation. HOI produces more discriminative representations and can be readily integrated with two popular techniques, the attention mechanism and the triplet loss, to obtain further, complementary improvements. Extensive experiments on four FGVC datasets show that our method clearly outperforms bilinear-based methods and achieves state-of-the-art performance.

Introduction

In the past few years, the performance of general image classification on large-scale datasets (e.g., ImageNet [1] and Places [2]) has improved significantly with the development of deep neural networks (DNNs) [3], [4], [5]. Compared with general image classification, Fine-Grained Visual Categorization (FGVC) is more challenging [6], [7], [8]. FGVC needs to classify images into subcategories that have small inter-class variations, such as bird species [9], vehicle models [10] and aircraft types [11]. Fig. 1(a) illustrates the challenge of large intra-subcategory variance and small inter-subcategory variance through examples of bird images. As can be seen, different types of Gull look very similar, while images of the same sub-type of Gull can vary greatly. Human experts usually need to find the representative regions to classify an image into the correct subcategory. To mimic this behavior, a straightforward way is to train the CNN with representative part annotations. However, annotating such regions requires expertise and is time-consuming. To deal with these issues, recent studies focus on weakly supervised FGVC [12], [13], [14], [15], in which only subcategory-level annotations are provided for training CNN models.

To construct an accurate weakly supervised FGVC method, two main concerns need to be addressed: 1) how to locate the informative regions, and 2) how to learn more discriminative features. Regarding the first concern, many methods based on region localization [16], [17], [18], [19] have been developed. These methods usually consist of two steps: they first localize the object parts and then extract the corresponding features from those parts for final classification. Despite their success, the lack of expert labeling makes it difficult for these methods to define the informative parts and to decide the optimal number of parts. Furthermore, the two-stage scheme may increase the training complexity and computation cost. Alternatively, several attention-based methods [20], [21], [22] have been proposed to find the informative regions. In these methods, a specially designed attention module computes attention weights that highlight important regions. Regarding the second concern, bilinear-based methods [23], [24], [25] have been widely used. These methods concentrate on increasing feature discrimination by modeling the underlying relationships within the feature regions through bilinear pooling. However, most of them use only the feature from the last convolution layer to compute the final feature representation. In fact, after several max pooling operations, the last convolution layer has discarded many local details, leaving the feature insufficient to overcome the small inter-subcategory variations and significant intra-subcategory variations. To address this problem, some attempts [26], [22] use more layers to further improve discrimination. The most representative method is the Hierarchical Bilinear Pooling (HBP) proposed in [26]. HBP models the three-layer feature interaction by stacking several two-layer bilinear pooling features, and its experimental results demonstrated that using more layers can significantly improve accuracy. However, it still considers only second-order interactions and fails to exploit the direct interaction between more than two layers.
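For readers unfamiliar with the second-order pooling that these methods build on, the following is a minimal NumPy sketch of standard bilinear pooling over two feature maps. It is an illustration only, not code from any of the cited works: the function name, the tensor sizes and the random inputs are assumptions, and the signed square root and L2 normalization follow the common post-processing used with bilinear features.

```python
import numpy as np

def bilinear_pool(x, y, eps=1e-12):
    """Second-order pooling of two feature maps of shape (C, H, W) with equal H and W."""
    c1, h, w = x.shape
    c2 = y.shape[0]
    xf = x.reshape(c1, h * w)
    yf = y.reshape(c2, h * w)
    # Sum over all spatial positions of the outer product of the two local descriptors.
    b = xf @ yf.T                             # shape (C1, C2)
    z = b.reshape(-1)
    z = np.sign(z) * np.sqrt(np.abs(z))       # signed square root
    return z / (np.linalg.norm(z) + eps)      # L2 normalization

# Hypothetical input: one 8-channel feature map of size 7 x 7 pooled with itself.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 7, 7))
desc = bilinear_pool(feat, feat)
print(desc.shape)
```

Passing the same map twice gives the single-layer case discussed above; passing maps from two different layers gives a cross-layer second-order interaction of the kind HBP stacks.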

To deal with the shortcomings of these methods, we argue that considering the direct correlations between multiple layers can help distinguish the small variations among different subcategories. To this end, we propose a simple but effective method, called High-Order-Interaction (HOI), for the Fine-Grained Visual Categorization task. First, we use an attention mechanism to construct a multi-scale feature pyramid that contains semantic information from different layers with various receptive fields. Then, we introduce a cross-layer trilinear pooling operation to compute the high-order interaction among three different layers. The proposed cross-layer trilinear pooling involves only several dot product operations, making it efficient and easy to implement. Finally, we concatenate the high-order interactions of different layer combinations to obtain the final feature representation. HOI enables the model to discover and focus on discriminative regions in the feature map (as shown in Fig. 1(b)), so that the final representation is more robust to inter- and intra-subcategory variance. The whole model is trained with the cross-entropy loss and the triplet loss.
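The pipeline described above (project the layers, compute a third-order interaction for each triple of layers, concatenate) can be sketched as follows. The exact formulation is not given in this excerpt, so the sketch assumes one plausible factorized form: each layer is projected to a shared dimension, the three projections are multiplied element-wise, and the result is sum-pooled over spatial positions. The number of pyramid levels, the channel counts, the projection size D and the random weights are all hypothetical, and the feature maps are assumed to have already been resized to a common spatial resolution.

```python
import itertools
import numpy as np

def project(x, weight):
    """1x1-convolution-style projection: (C, H, W) -> (D, H*W)."""
    c, h, w = x.shape
    return weight @ x.reshape(c, h * w)

def trilinear_pool(px, py, pz, eps=1e-12):
    """Assumed third-order interaction of three projected maps, each of shape (D, N)."""
    z = (px * py * pz).sum(axis=1)            # element-wise triple product, summed over positions
    z = np.sign(z) * np.sqrt(np.abs(z))       # signed square root
    return z / (np.linalg.norm(z) + eps)      # L2 normalization

rng = np.random.default_rng(0)
channels = [16, 32, 64, 64]                   # hypothetical pyramid levels
D, H, W = 128, 7, 7                           # hypothetical projection and spatial sizes
feats = [rng.standard_normal((c, H, W)) for c in channels]
projs = [project(f, 0.1 * rng.standard_normal((D, f.shape[0]))) for f in feats]

# One third-order descriptor per combination of three layers, then concatenation.
parts = [trilinear_pool(projs[i], projs[j], projs[k])
         for i, j, k in itertools.combinations(range(len(projs)), 3)]
representation = np.concatenate(parts)
print(representation.shape)
```

With four levels there are four triples, so the concatenated descriptor has 4 x D entries; the cost per triple is linear in D rather than cubic, which is the usual motivation for a factorized form.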

To summarize, the main contributions of this study are as follows:

  • We propose an easy and efficient High-Order-Interaction (HOI) method for computing the long-range correlations between multiple layers. The visualization of the learned feature maps shows that the proposed HOI helps to learn more discriminative representations for Fine-Grained Visual Categorization.

  • We develop a framework that mainly consists of two levels by integrating the attention mechanism, High-Order-Interaction, and multi-task loss functions, which can boost the final performance.

  • Extensive experiments on four fine-grained categorization benchmark datasets (i.e., CUB-200-2011, Stanford Cars, FGVC Aircraft and Oxford Flower 102) demonstrate the effectiveness of our method, which finds discriminative regions and reaches state-of-the-art performance.
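The multi-task objective named in the second contribution pairs a classification loss with a metric-learning loss. As a minimal sketch, the standard cross-entropy and triplet losses can be combined as below. The weighting factor, the margin, the Euclidean distance and the toy inputs are assumptions made for illustration; the excerpt does not state how the two terms are balanced.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed with a numerically stable log-softmax."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def total_loss(logits, label, anchor, positive, negative, weight=1.0, margin=1.0):
    """Weighted sum of the two terms; `weight` is a hypothetical balancing factor."""
    return cross_entropy(logits, label) + weight * triplet_loss(anchor, positive, negative, margin)

# Toy example: the positive is close to the anchor and the negative is far away,
# so the triplet term is inactive and only the classification term contributes.
logits = np.array([2.0, 0.5, -1.0])
a, p, n = np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([3.0, 0.0])
loss = total_loss(logits, 0, a, p, n)
```

Swapping the positive and the negative in the toy example violates the margin and makes the triplet term positive, which is the case that term is meant to penalize.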

Related works

In this section, we briefly review the bilinear-based and attention-based methods for Fine-Grained Visual Categorization that are most relevant to our method.

Method

An overview of the proposed model is illustrated in Fig. 2. It is built upon an ImageNet pre-trained ResNet50 model [4] and consists of two main levels: (1) the Feature Attention Pyramid, and (2) High-Order-Interaction. In the following sections, we describe our approach in detail from four aspects: a preliminary introduction to squeeze-and-excitation attention [30], the feature attention pyramid, the proposed High-Order-Interaction, and the training loss function.
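As background for the first of these aspects, a squeeze-and-excitation block reweights the channels of a feature map using globally pooled statistics. The NumPy sketch below shows the usual form of such a block (global average pooling, a bottleneck of two fully connected layers, and a sigmoid gate). Biases are omitted for brevity, and the channel count, the reduction ratio r and the random weights are assumptions; how the block is wired into the feature attention pyramid is not described in this excerpt.

```python
import numpy as np

def se_attention(x, w1, w2):
    """Channel attention. x: (C, H, W); w1: (C // r, C); w2: (C, C // r)."""
    s = x.mean(axis=(1, 2))                   # squeeze: global average pooling -> (C,)
    h = np.maximum(w1 @ s, 0.0)               # excitation, step 1: bottleneck layer + ReLU
    a = 1.0 / (1.0 + np.exp(-(w2 @ h)))       # excitation, step 2: sigmoid gate in (0, 1)
    return x * a[:, None, None], a            # rescale each channel by its weight

rng = np.random.default_rng(0)
C, r = 16, 4                                  # hypothetical channel count and reduction ratio
x = rng.standard_normal((C, 7, 7))
w1 = 0.1 * rng.standard_normal((C // r, C))
w2 = 0.1 * rng.standard_normal((C, C // r))
y, weights = se_attention(x, w1, w2)
print(y.shape, weights.shape)
```

The output keeps the input shape, so a block of this kind can be applied to a backbone stage without changing the layers that follow it.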

Datasets

To evaluate the performance of our proposed method, we conduct experiments on four widely used Fine-Grained Visual Categorization datasets, i.e., Caltech-UCSD Birds-200-2011 (CUB-200-2011) [9], Stanford Cars [10], FGVC Aircraft [11], and Oxford Flower 102 [38]. The number of categories and data split for training and testing of each dataset are listed in Table 1.

Caltech-UCSD Birds-200-2011. Caltech-UCSD Birds-200-2011 (CUB-200-2011) is considered the most competitive datasets in the

Conclusion

In this paper, we propose a new weakly supervised fine-grained categorization method that does not require any bounding box or part annotations. In our method, a high-order interaction is used to compute the correlation between multiple different layers. The High-Order-Interaction is implemented through a simple dot product between the features from three layers. In addition, we develop a two-level framework that can benefit from the feature attention pyramid and the proposed

CRediT authorship contribution statement

Junzheng Wang: Conceptualization, Methodology, Writing - original draft. Nanyu Li: Conceptualization, Methodology, Writing - original draft. Zhiming Luo: Writing - review & editing, Funding acquisition. Zhun Zhong: Writing - review & editing. Shaozi Li: Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grants 61876159, 61806172 and U1705286; the China Postdoctoral Science Foundation under Grant 2019M652257; and the Open Fund Project of Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (Grant No. MJUKF-IPIC202006).

Junzheng Wang received the B.S. degree in Network Engineering from Shandong Technology and Business University, China, in 2018. He is currently working toward the M.S. degree in computer science at Xiamen University. His research interests include Fine-Grained Visual Categorization and computer vision.

References (62)

  • X. He et al., Only learn one sample: Fine-grained visual categorization with one sample training, Proc. ACM MM (2018)
  • Z. Wang, S. Wang, P. Zhang, H. Li, W. Zhong, J. Li, Weakly supervised fine-grained image classification via...
  • X. He et al., A new benchmark and approach for fine-grained cross-media retrieval, Proc. ACM MM (2019)
  • C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011 Dataset, Tech. Rep....
  • J. Krause et al., 3D object representations for fine-grained categorization, Proc. ICCVW (2013)
  • S. Maji, E. Rahtu, J. Kannala, M. Blaschko, A. Vedaldi, Fine-grained visual classification of aircraft, arXiv preprint...
  • J. Fu et al., Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition, Proc. CVPR (2017)
  • Z. Yang et al., Learning to navigate for fine-grained classification, Proc. ECCV (2018)
  • H. Zheng et al., Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition, Proc. CVPR (2019)
  • Y. Zhang et al., Part-aware fine-grained object categorization using weakly supervised part detection network, IEEE Trans. Multimedia (2020)
  • X. He et al., Fast fine-grained image classification via weakly supervised discriminative localization, IEEE Trans. Circuits Syst. Video Technol. (2018)
  • M. Simon, E. Rodner, Neural activation constellations: Unsupervised part model discovery with convolutional networks,...
  • Y. Peng et al., Object-part attention model for fine-grained image classification, IEEE Trans. Image Process. (2018)
  • H. Zheng et al., Learning rich part hierarchies with progressive attention networks for fine-grained image recognition, IEEE Trans. Image Process. (2019)
  • J. Fu et al., Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition, Proc. CVPR (2017)
  • M. Sun et al., Multi-attention multi-class constraint for fine-grained image recognition, Proc. ECCV (2018)
  • W. Luo, X. Yang, X. Mo, Y. Lu, L.S. Davis, J. Li, J. Yang, S.-N. Lim, Cross-x learning for fine-grained visual...
  • T.-Y. Lin et al., Bilinear CNN models for fine-grained visual recognition, Proc. CVPR (2015)
  • K. Yue et al., Compact generalized non-local network, Proc. NeurIPS (2018)
  • T.-Y. Lin et al., Improved bilinear pooling with CNNs, Proc. BMVC (2017)
  • C. Yu et al., Hierarchical bilinear pooling for fine-grained visual recognition, Proc. ECCV (2018)


    Nanyu Li received the B.S. degree in communication engineering from Jilin University Zhuhai College, China, in 2017. He is currently working toward the M.S. degree in computer science at Kunming University of Science and Technology, and he is also a visiting master's student at the Wangxuan Institute of Computer Technology at Peking University. His research interests include Fine-Grained Visual Categorization, optimization and computer vision.

    Zhiming Luo received the B.S. degree in cognitive science from Xiamen University, Xiamen, China, in 2011 and the Ph.D. degree in computer science from Xiamen University and the University of Sherbrooke, Sherbrooke, QC, Canada. He is currently a postdoctoral researcher at the Artificial Intelligence Department of Xiamen University. His research interests include traffic surveillance video analytics, computer vision, and machine learning.

    Zhun Zhong received the Ph.D. degree from Xiamen University, China, in 2019 and the M.S. degree from China University of Petroleum, China, in 2015. He is currently a postdoc at the University of Trento, Italy. He was also a joint Ph.D. student at the University of Technology Sydney. His research interests include person re-identification, data augmentation, and domain adaptation.

    Shaozi Li received the B.S. degree from Hunan University, Changsha, China; the M.S. degree from Xi’an Jiaotong University, Xi’an, China; and the Ph.D. degree from the National University of Defense Technology, Changsha. He is the Chair and Professor of the Department of Artificial Intelligence, Xiamen University, Xiamen, China, the Vice Director of the Technical Committee on Collaborative Computing of CCF, and the Vice Director of the Fujian Association of Artificial Intelligence. He has directed and completed more than 20 research projects, including several national 863 Programs, the National Natural Science Foundation of China, and the Ph.D. Programs Foundation of the Ministry of Education of China. Furthermore, he has authored nearly 300 papers in journals and international conferences. His research interests include artificial intelligence and its applications, moving object detection and recognition, machine learning, computer vision, and multimedia information retrieval. Dr. Li is a Senior Member of ACM and the China Computer Federation.

    1

    Joint first authors
