Convolution-Embedded Vision Transformer With Elastic Positional Encoding for Pansharpening


Abstract:

Transformers, especially the vision transformer (ViT), are attracting increasing attention in various computer vision (CV) tasks. However, two pressing problems exist for the ViT: 1) because it attends to an image at the patch level, the ViT performs well at capturing global representations but is limited in extracting local features, which is an inherent strength of the convolutional neural network (CNN); and 2) the learnable positional encoding improves performance but limits the cross-resolution ability of the network; specifically, the pretrained model can only generate images of the same size as those used during training. To overcome these two problems, we propose a novel convolution-embedded ViT with elastic positional encoding in this article. On one hand, we propose a joint CNN and self-attention (CSA) network to collaboratively extract local and global features. On the other hand, we integrate an elastic CNN-based positional encoder into the framework to remove the ViT's rigid cross-resolution limitation and improve performance. Extensive experiments were conducted on IKONOS and WorldView-2 data with 4- and 8-band multispectral (MS) images, respectively. The visual and numerical results show the competitive performance of the proposed method.
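To see why a convolution-based positional encoder is "elastic" while a learnable encoding table is not, consider the following minimal NumPy sketch. It is an illustrative assumption, not the authors' exact design: position information is produced by sliding a small depthwise 3x3 kernel over the feature map, so the same learned kernel applies to feature maps of any height and width, whereas a learnable positional table would be fixed to the training resolution.

```python
import numpy as np

def conv_positional_encoding(x, kernel):
    """Sketch of an elastic, convolution-based positional encoding.

    x:      feature map of shape (C, H, W)
    kernel: depthwise 3x3 kernels of shape (C, 3, 3)

    Because the kernel slides over whatever H x W it is given, the
    encoding adapts to arbitrary input resolutions -- unlike a learnable
    positional table whose size is fixed at training time.
    (Hypothetical illustration, not the paper's exact encoder.)
    """
    C, H, W = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros_like(x)
    for c in range(C):          # depthwise: one kernel per channel
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(padded[c, i:i + 3, j:j + 3] * kernel[c])
    return x + out              # features plus position signal

# The same kernel serves two different resolutions:
rng = np.random.default_rng(0)
k = rng.standard_normal((4, 3, 3)) * 0.1
small = conv_positional_encoding(rng.standard_normal((4, 8, 8)), k)
large = conv_positional_encoding(rng.standard_normal((4, 32, 32)), k)
```

The usage at the bottom is the point of the sketch: `small` and `large` are encoded by the identical kernel `k`, so a pretrained model using such an encoder is not tied to one image size.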
Article Sequence Number: 5413809
Date of Publication: 07 December 2022

