ABSTRACT
The Transformer architecture, with the self-attention mechanism at its core, has become an important tool in computer vision. For example, the Vision Transformer (ViT), when pre-trained on large-scale datasets, converges quickly to a good optimum in image classification and segmentation tasks, and even on tasks with limited training data, many Transformer-based models achieve strong results once pre-trained weights are loaded. However, work exploring the relationship between pre-training and the self-attention mechanism remains limited. To better understand this relationship, this paper takes ViT as its object of study and investigates the basic properties of the self-attention mechanism as well as the theoretical reasons for the desirable properties it acquires after pre-training on large-scale datasets. The contributions of this paper are twofold. (1) We prove theoretically that the attention module of a well-pretrained ViT can filter out part of the information when computing the correlations between different regions of an image and operate on the effective features, thereby avoiding interference from noise; moreover, the softmax function keeps the attention matrix high-rank, which allows image features to be extracted well. (2) We analyze the eigenvalues of the self-attention matrix and their influence on the learning process. We find that the self-attention mechanism can actively adjust the distribution range of the eigenvalues of the attention matrix according to the correlation between different patches and, by assigning different weights when extracting features, make full use of the information in the overall network and ensure convergence of the learning process.
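To make the quantities discussed above concrete, the following sketch (an illustration only, not the paper's method or experimental setup) builds the single-head attention matrix A = softmax(QK^T / sqrt(d)) for randomly generated ViT-sized inputs and reports its numerical rank and the spread of its eigenvalue magnitudes. All shapes, projection matrices, and data in it are assumptions made for the example.

```python
# Minimal sketch of the quantities named in the abstract: the attention matrix
# A = softmax(Q K^T / sqrt(d)) of one head, its numerical rank, and the spread
# of its eigenvalues. Token count, head dimension, and inputs are illustrative
# assumptions, not the authors' setup.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 197, 64                      # e.g. 196 patch tokens + 1 [CLS] token, head dim 64
X = rng.standard_normal((n, d))     # stand-in for token embeddings
W_q = rng.standard_normal((d, d)) / np.sqrt(d)  # hypothetical query projection
W_k = rng.standard_normal((d, d)) / np.sqrt(d)  # hypothetical key projection

Q, K = X @ W_q, X @ W_k
A = softmax(Q @ K.T / np.sqrt(d))   # row-stochastic attention matrix, shape (n, n)

# "High-rank status": count the numerically non-zero singular values of A.
rank = np.linalg.matrix_rank(A)

# Eigenvalue spread: A is generally non-symmetric, so its eigenvalues are
# complex; the range of their magnitudes shows how unevenly A weights tokens.
eig = np.linalg.eigvals(A)
print(f"numerical rank of A: {rank} / {n}")
print(f"|eigenvalue| range: [{np.abs(eig).min():.3e}, {np.abs(eig).max():.3e}]")
```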