Abstract:
Vision transformers have shown that the self-attention mechanism performs well in the computer vision field. However, because such transformers operate on data sampled from fixed areas, their ability to efficiently learn the important features in images is limited. To compensate, we propose two modules based on the deformable operation: deformable patch embedding and deformable pooling. Deformable patch embedding consists of a hybrid structure of standard and deformable convolutions, and adaptively samples features from an image. The deformable pooling module has a structure similar to the embedding module; it not only samples data flexibly after self-attention but also allows the transformer to learn spatial information at various scales. The experimental results show that the transformer with the proposed modules converges faster and outperforms various vision transformers on image classification (ImageNet-1K) and object detection (MS-COCO).
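The abstract describes deformable patch embedding as a hybrid of a standard convolution and a deformable convolution that adaptively samples image features. Below is a minimal sketch of one plausible reading of that design, assuming PyTorch and torchvision's DeformConv2d: a standard convolution predicts the sampling offsets and a deformable convolution projects the adaptively sampled patch into a token. Module names, channel sizes, and the exact way the two convolutions are combined are illustrative assumptions, not the authors' implementation; the deformable pooling module would follow an analogous pattern after self-attention.

```python
# Hedged sketch of a deformable patch-embedding layer (not the paper's code).
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformablePatchEmbed(nn.Module):
    """Embed an image into patch tokens sampled at adaptively offset locations."""

    def __init__(self, in_chans=3, embed_dim=96, patch_size=4):
        super().__init__()
        # Standard convolution predicts per-location sampling offsets:
        # 2 values (x, y) for each of the patch_size * patch_size kernel taps.
        self.offset_conv = nn.Conv2d(
            in_chans, 2 * patch_size * patch_size,
            kernel_size=patch_size, stride=patch_size)
        # Deformable convolution samples the input at the offset locations
        # and projects each patch to the embedding dimension.
        self.deform_conv = DeformConv2d(
            in_chans, embed_dim,
            kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                       # x: (B, C, H, W)
        offsets = self.offset_conv(x)           # (B, 2*k*k, H/p, W/p)
        x = self.deform_conv(x, offsets)        # (B, embed_dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)        # (B, num_patches, embed_dim)
        return self.norm(x)


# Example: a 224x224 image becomes 56x56 = 3136 tokens of dimension 96.
tokens = DeformablePatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96])
```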
Published in: 2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)
Date of Conference: 29 November 2022 - 02 December 2022
Date Added to IEEE Xplore: 24 November 2022