Abstract:
In recent years, the vision transformer (ViT) has achieved remarkable breakthroughs in fine-grained visual classification (FGVC) because its self-attention mechanism excels at extracting distinctive features from different pixels. However, the pure ViT falls short in capturing the multi-scale, local, and low-layer features that are crucial for FGVC. To compensate for these shortcomings, a new hybrid network called HVCNet is designed, which fuses the advantages of the ViT and convolutional neural networks (CNN). Three modifications are made to the original ViT: 1) a multi-scale image-to-tokens (MIT) module is used instead of directly tokenizing the raw input image, enabling the network to capture features at different scales; 2) the feed-forward network in the ViT encoder is replaced with a mixed convolution feed-forward (MCF) module, which enhances the network's ability to capture local and multi-scale features; 3) a multi-layer feature selection (MFS) module is designed to prevent the deep-layer tokens in the ViT from ignoring local and low-layer features. The experimental results indicate that the proposed method surpasses state-of-the-art methods on publicly available datasets.
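The abstract does not spell out the MIT module's internal design, so the following PyTorch sketch only illustrates the general idea of multi-scale image-to-tokens stems: parallel convolutional branches with different kernel sizes (but the same stride) tokenize the image at several receptive-field scales before the token sequences are fused. The class name, kernel sizes, channel split, and concatenation-based fusion are all illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of a multi-scale image-to-tokens stem (MIT-style).
# Kernel sizes, channel split, and concat fusion are assumptions.
import torch
import torch.nn as nn

class MultiScaleImageToTokens(nn.Module):
    def __init__(self, in_chans=3, embed_dim=768, patch=16,
                 kernel_sizes=(4, 8, 16)):
        super().__init__()
        assert embed_dim % len(kernel_sizes) == 0
        dim_per_branch = embed_dim // len(kernel_sizes)
        # Parallel conv branches: each sees the image at a different
        # receptive-field scale but emits the same token grid (stride = patch).
        self.branches = nn.ModuleList([
            nn.Conv2d(in_chans, dim_per_branch, kernel_size=k,
                      stride=patch, padding=max(0, (k - patch) // 2))
            for k in kernel_sizes
        ])
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, 3, H, W)
        feats = [b(x) for b in self.branches]  # each: (B, C/b, H/16, W/16)
        x = torch.cat(feats, dim=1)            # fuse scales along channels
        x = x.flatten(2).transpose(1, 2)       # (B, N, embed_dim) tokens
        return self.norm(x)

# Usage: tokens = MultiScaleImageToTokens()(torch.randn(2, 3, 224, 224))
# yields a (2, 196, 768) token sequence, a drop-in replacement for the
# plain 16x16 patch embedding of a standard ViT.
```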
Published in: IEEE Signal Processing Letters (Volume: 31)