Full length article
Dual attention interactive fine-grained classification network based on data augmentation

https://doi.org/10.1016/j.jvcir.2022.103632

Highlights

  • Our method conveniently captures rich global image attributes.

  • We use the global attention of the dual attention mechanism to lock onto the target area.

  • We employ interactive channel fusion to distinguish similar images quickly.

Abstract

The key to fine-grained image classification is finding discriminative regions. Most existing methods use only simple baseline networks or low-recognition attention modules to discover object differences, which limits the model's ability to find discriminative regions hidden in images. This article proposes an effective method to solve this problem. The first step is a novel layered training method that enhances the feature extraction ability of the baseline model. The second step focuses on key regions of the image using improved long short-term memory (LSTM) and multi-head attention. In the third step, based on the feature map obtained by the dual attention network, spatial mapping is performed by a multi-layer perceptron (MLP), and element-by-element multiplication across channels then yields a finer-grained feature map. Finally, our method achieves good performance on the CUB-200-2011, FGVC Aircraft, Stanford Cars, and MedMNIST v2 datasets.

Introduction

Fine-grained visual classification aims to identify very similar images, often subclasses under a single large class. This is a more challenging problem than traditional classification due to the inherently small visual variation between sub-categories.

Many early classification works, such as StdDev [1] and PS-CNN [2], rely on additional annotations to accurately find discriminative regions. However, such annotation is too cumbersome and often prone to errors, resulting in poor results. Therefore, in recent years, most fine-grained research has turned to weakly supervised training models, which rely only on the given classification labels for end-to-end learning [3], [4], [5], [6], [7], [8]. In reality, it is difficult to distinguish the common kingfisher from the great spotted kingfisher; they are very similar in size. However, the ear feathers of the common kingfisher are orange-yellow, while those of the great spotted kingfisher are blue. Likewise, within the same species, the male's beak is black while the female's lower beak is red. Subtle differences therefore tend to be concentrated in local regions of objects. So far, most feasible solutions, such as MACNN [9] and Cross-X [10], rely on neural networks to extract image feature maps, locate local feature regions, and then perform finer learning based on those regions. In this way, subtle differences between two similar objects can be better detected. It is also essential to identify the most discriminative local regions among the many localized candidates. For example, in bird classification, the most critical local regions are usually the head, claws, and beak. However, for some very similar birds, the only area that differs may be the head. This requires learning multiple local regions more deeply and focusing on the correct discriminative regions.

Many recent fine-grained classification methods tend to amplify localized feature regions, e.g., WS-DAN [11] and MMAL-Net [12]. While this allows better learning of discriminative image features, it does not consider how features from different enlarged parts are fused together synergistically. To sum up, fine-grained classification should not only learn the discriminative regions of the image but also increase the feature extraction ability of the baseline network, localize the target region well, and strengthen the connections among local features. In this way, more hidden discriminative features can be discovered. Therefore, as illustrated in Fig. 1, we propose a dual attention interaction network based on data augmentation and evaluate the performance of the model on three commonly used fine-grained datasets: CUB-200-2011 [13], FGVC Aircraft [14], and Stanford Cars [15]. In addition, experiments were performed on a biomedical dataset, MedMNIST v2 [16]. Extensive experimental results show that our model achieves high performance compared with state-of-the-art methods. The three innovations of this article are as follows:

  • We propose a cross-layer channel data augmentation framework. This differs from the recent SnapMix [17], which performs feature fusion directly at the attribute level of the original data. Our data augmentation is based on training the residual network layer by layer: the features of each layer are fused with the features of the layers before it. Hierarchical training lets each neural network layer play its best role. High-level convolutions contain more semantic features, while bottom-level convolutions contain more field-of-view information, which avoids feature loss.

  • We propose a local attention region framework that uses a dual attention network to localize more local feature regions. The LSTM [18] module cyclically attends to the corresponding global features in turn, and the resulting attention feature maps are weighted and fused to aggregate details into representative complementary vectors. Similarly, the multi-head attention [19] module aggregates the extracted attention feature maps: linear transformations of the Q, K, and V channels are equivalent to multiple independent attention computations, whose outputs are then combined by weighted feature processing. This effectively increases the diversity of key image regions.

  • We propose a channel interaction architecture. First, the attention features extracted by the dual attention network pass through an MLP, which filters some noise, performs spatial feature mapping and dimensionality reduction, and generates two finer-grained features. Then, based on the features extracted by dual attention and the features generated by the MLP, element-by-element channel multiplication is performed step by step. This effectively amplifies the learned features, making it easier to find rich discriminative features.
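The channel interaction idea in the last point can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the MLP weights, dimensions, and the single multiplication step are hypothetical stand-ins showing how an MLP-refined feature map is multiplied element by element with the original attention features to amplify shared activations.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron: spatial mapping with a ReLU bottleneck,
    # then projection back to the original descriptor length.
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

C, D = 8, 16                                  # channels, flattened spatial size (H*W)
feat = rng.standard_normal((C, D))            # dual-attention feature map, one row per channel

# Hypothetical MLP weights; a real model would learn these by backpropagation.
w1 = rng.standard_normal((D, D // 2)); b1 = np.zeros(D // 2)
w2 = rng.standard_normal((D // 2, D)); b2 = np.zeros(D)

refined = mlp(feat, w1, b1, w2, b2)           # finer-grained features from the MLP
interacted = feat * refined                   # element-wise channel interaction
assert interacted.shape == (C, D)
```

Wherever `feat` and `refined` activate in the same locations, the product grows; mismatched activations shrink, which is one way to read the paper's claim that interaction "amplifies the learned features".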

The rest of this article is organized as follows. Section 2 briefly introduces related work, Section 3 provides a detailed overview of the proposed method, and Section 4 presents performance analysis, ablation experiments, and attention visualization on three datasets. Section 5 discusses the complete work, and Section 6 summarizes the conclusions.

Section snippets

Related work

Fine-grained image classification is also called sub-category image classification; its purpose is to divide coarse-grained categories into more detailed sub-categories. Since two objects of the same type often differ only in subtle aspects such as ear shape and coat color, the task of fine-grained image classification is undoubtedly more complex and challenging, even under strong supervision. Therefore, in recent years, weak supervision has been increasingly used for fine-grained

Model overview

As shown in Fig. 2, we use data-augmented dual attention channel interaction to perform fine-grained classification of images with their corresponding labels. Based on the ResNet101 network, four residual blocks are trained hierarchically to enhance the feature extraction ability of the baseline network and thereby enrich the global features. The dual attention mechanism finds more local attention points and can filter out less important regional features. The processing of interactive channels
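The hierarchical, cross-layer training described above can be sketched in a few lines of numpy. This is a toy approximation under stated assumptions: the four "residual blocks" are simple linear maps with skip connections (not real ResNet101 blocks), and the fusion rule of merging each block's output with all features before it is shown as a running sum.

```python
import numpy as np

rng = np.random.default_rng(1)

def block(x, w):
    # Stand-in for a residual block: linear map + ReLU, plus a skip connection.
    return np.maximum(x @ w, 0.0) + x

D = 32
x = rng.standard_normal(D)                                        # backbone input features
weights = [rng.standard_normal((D, D)) * 0.1 for _ in range(4)]   # four toy "residual blocks"

fused = x
outputs = []
for w in weights:
    out = block(fused, w)      # each block sees features fused from all earlier layers
    fused = fused + out        # cross-layer fusion: merge with the accumulated features
    outputs.append(out)

assert len(outputs) == 4 and fused.shape == (D,)
```

The point of the loop is that high-level blocks never see raw high-level features alone: each stage receives the accumulated mix of earlier (field-of-view) and later (semantic) information, which is the feature-loss-avoidance argument made in the first contribution.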

Experimental results and analysis

The experimental results are reported here, and our model is compared with the latest methods. Experimental analyses are then performed on three popular fine-grained datasets, plus a biomedical dataset. The analysis highlights the innovation and performance advantages of our model.

Conclusions

This paper proposes a dual-attention interaction network based on data augmentation. Our network first acquires rich global features, then encourages the multi-attention model to locate feature regions, and finally finds more features by amplifying those regions. The whole process classifies fine-grained images step by step in an orderly manner. In addition, it can also be used as a lightweight localization module, which relies only on class-label annotations during training, and

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by National Natural Science Foundation of China (Nos. 62276073, 61966004, 61866004), Guangxi Natural Science Foundation (No. 2019GXNSFDA245018), the Innovation Project of Guangxi Graduate Education (No. YCBZ2022060), Guangxi "Bagui Scholar" Teams for Innovation and Research Project, Guangxi Talent Highland Project of Big Data Intelligence and Application, and Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.

References (58)

  • Wei, H., et al., The synergy of double attention: Combine sentence-level and word-level attention for image captioning, Comput. Vis. Image Underst. (2020)
  • Zhang, J., et al., Stable self-attention adversarial learning for semi-supervised semantic image segmentation, J. Vis. Commun. Image Represent. (2021)
  • Yang, Z., et al., SWS-DAN: Subtler WS-DAN for fine-grained image classification, J. Vis. Commun. Image Represent. (2021)
  • L. Xie, Q. Tian, R. Hong, S. Yan, B. Zhang, Hierarchical part matching for fine-grained visual categorization. in:...
  • S. Huang, Z. Xu, D. Tao, Y. Zhang, Part-stacked cnn for fine-grained visual categorization. in: Proceedings of the IEEE...
  • Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, L. Wang, Learning to navigate for fine-grained classification. in: Proceedings...
  • T.-Y. Lin, A. RoyChowdhury, S. Maji, Bilinear cnn models for fine-grained visual recognition. in: Proceedings of the...
  • Li, Z., et al., A semi-supervised learning approach based on adaptive weighted fusion for automatic image annotation, ACM Trans. Multimed. Comput. Commun. Appl. (2021)
  • Zhou, T., et al., Classify multi-label images via improved CNN model with adversarial network, Multimedia Tools Appl. (2020)
  • H. Zheng, J. Fu, T. Mei, J. Luo, Learning multi-attention convolutional neural network for fine-grained image...
  • W. Luo, X. Yang, X. Mo, Y. Lu, L.S. Davis, J. Li, J. Yang, S.-N. Lim, Cross-x learning for fine-grained visual...
  • Hu, T., et al., See better before looking closer: Weakly supervised data augmentation network for fine-grained visual classification (2019)
  • Zhang, F., et al., Multi-branch and multi-scale attention learning for fine-grained visual categorization (2020)
  • Wah, C., et al., The Caltech-UCSD Birds-200-2011 Dataset, Technical Report 2010-001 (2011)
  • Maji, S., et al., Fine-grained visual classification of aircraft (2013)
  • M. Liu, C. Yu, H. Ling, J. Lei, Hierarchical joint cnn-based models for fine-grained cars recognition. in: Proceedings...
  • Yang, J., et al., MedMNIST v2: A large-scale lightweight benchmark for 2D and 3D biomedical image classification (2021)
  • Huang, S., et al., SnapMix: Semantically proportional mixing for augmenting fine-grained data (2020)
  • Xingjian, S., et al., Convolutional LSTM network: A machine learning approach for precipitation nowcasting
  • Vaswani, A., et al., Attention is all you need
  • Simonyan, K., et al., Very deep convolutional networks for large-scale image recognition (2014)
  • K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition. in: Proceedings of the IEEE Conference...
  • N. Zhang, J. Donahue, R. Girshick, T. Darrell, Part-based r-cnns for fine-grained category detection. in: Proceedings...
  • Branson, S., et al., Bird species categorization using pose normalized deep convolutional nets (2014)
  • J. Fu, H. Zheng, T. Mei, Look closer to see better: Recurrent attention convolutional neural network for fine-grained...
  • LeCun, Y., et al., Gradient-based learning applied to document recognition, Proc. IEEE (1998)
  • F. Zhang, M. Li, G. Zhai, Y. Liu, Multi-branch and multi-scale attention learning for fine-grained visual...
  • P. Shroff, T. Chen, Y. Wei, Z. Wang, Focus longer to see better: Recursively refined attention for fine-grained image...
  • Y. Wang, V.I. Morariu, L.S. Davis, Learning a discriminative filter bank within a cnn for fine-grained recognition. in:...
    This paper has been recommended for acceptance by Dr Zicheng Liu.
