Impact Statement:
The proposed GhostViT achieves excellent performance in terms of both accuracy and efficiency. It can be used in various perception applications, such as autonomous driving and robot navigation. GhostViT can also reduce the carbon footprint of such systems because it offers strong performance at a low computational cost. The proposed approach itself does not lead to negative social impacts; however, its potential applications might. For example, the deployment of autonomous driving could lead to driver unemployment, and related robotic applications could be used for military purposes, thus harming humans.
Abstract:
Vision Transformers (ViTs) have recently achieved promising results in various computer vision tasks. However, ViTs have high computation costs and a large number of parameters due to the stacked multihead self-attention (MHSA) and expanded feed-forward network (FFN) modules. Since the complexity of Transformer-based models is quadratic in the length of the input token sequence, most current efforts focus on reducing the number of tokens in ViTs to improve model efficiency. Unlike previous studies, we argue that diverse redundant features help ViTs understand the data comprehensively. In this article, we propose the ghost vision Transformer (GhostViT), which achieves both computation and storage efficiency. The key concept of GhostViT is to generate more diverse features using cheap operations in the MHSA and FFN modules. We experimentally demonstrate that our GhostViT can significantly reduce both the parameters and floating point operations (FLOPs) of ViTs while achieving similar or better accuracy. For example, about 14% of the parameters and 17% of the FLOPs of the DeiT-tiny model are removed without any accuracy loss on the ImageNet-1K dataset.
Published in: IEEE Transactions on Artificial Intelligence (Volume: 5, Issue: 6, June 2024)
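The abstract's central idea, generating additional "ghost" features with cheap operations rather than full projections, follows the GhostNet recipe. Below is a minimal sketch of how this could look inside a ViT feed-forward block; the module name GhostFFN, the 50/50 split between intrinsic and ghost features, and the depthwise kernel size are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the "cheap operations" idea behind GhostViT (not the paper's code).
# Half of the expanded FFN features come from a full Linear projection; the other half are
# "ghost" features derived from them by a cheap depthwise 1-D convolution over the tokens.
import torch
import torch.nn as nn


class GhostFFN(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.primary_dim = hidden_dim // 2               # intrinsic features (expensive Linear)
        self.ghost_dim = hidden_dim - self.primary_dim   # ghost features (cheap operation)
        self.fc1 = nn.Linear(dim, self.primary_dim)
        # Cheap operation: per-channel (depthwise) conv across the token dimension,
        # far fewer parameters and FLOPs than a second full Linear projection.
        self.cheap = nn.Conv1d(self.primary_dim, self.ghost_dim,
                               kernel_size=3, padding=1, groups=self.primary_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        primary = self.act(self.fc1(x))                  # (B, N, primary_dim)
        ghost = self.cheap(primary.transpose(1, 2))      # conv over tokens: (B, ghost_dim, N)
        ghost = self.act(ghost.transpose(1, 2))          # (B, N, ghost_dim)
        hidden = torch.cat([primary, ghost], dim=-1)     # (B, N, hidden_dim)
        return self.fc2(hidden)


if __name__ == "__main__":
    x = torch.randn(2, 197, 192)          # e.g., DeiT-tiny token embeddings
    ffn = GhostFFN(dim=192, hidden_dim=768)
    print(ffn(x).shape)                   # torch.Size([2, 197, 192])
```

Under this assumed split, the first projection and the depthwise convolution together cost roughly half the parameters of a standard FFN expansion layer, which is consistent with the kind of parameter and FLOP savings the abstract reports.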