ABSTRACT
The Transformer architecture, with the self-attention mechanism at its core, has become an important tool in computer vision. For example, the Vision Transformer (ViT), when pre-trained on large-scale datasets, converges quickly to a good optimum in image classification and segmentation tasks, and even on tasks with limited training data, many Transformer-based models achieve strong results once pre-trained weights are loaded. However, work exploring the relationship between pre-training and the self-attention mechanism remains limited. To better understand this relationship, this paper takes ViT as its object of study and investigates the basic properties of the self-attention mechanism as well as the theoretical reasons for the desirable properties it acquires after pre-training on large-scale datasets. The contributions of this paper are twofold. (1) We prove theoretically that the attention module of a well-pretrained ViT can filter out part of the information when computing the correlations between different regions of an image and operate on the effective features, thereby avoiding interference from noise; moreover, the softmax function keeps the attention matrix high-rank, which allows image features to be extracted well. (2) We analyze the eigenvalues of the self-attention matrix and their influence on the learning process. We find that the self-attention mechanism can actively adjust the distribution range of the eigenvalues of the attention matrix according to the correlation between different patches and, by assigning different weights when extracting features, make full use of the information in the overall network and ensure convergence of the learning process.
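To make the quantities discussed above concrete, the following sketch (an illustration only, not the paper's method or experimental setup) builds the single-head attention matrix A = softmax(QK^T / sqrt(d)) for randomly generated ViT-sized inputs and reports its numerical rank and the spread of its eigenvalue magnitudes. All shapes, projection matrices, and data in it are assumptions made for the example.

```python
# Minimal sketch of the quantities named in the abstract: the attention matrix
# A = softmax(Q K^T / sqrt(d)) of one head, its numerical rank, and the spread
# of its eigenvalues. Token count, head dimension, and inputs are illustrative
# assumptions, not the authors' setup.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 197, 64                      # e.g. 196 patch tokens + 1 [CLS] token, head dim 64
X = rng.standard_normal((n, d))     # stand-in for token embeddings
W_q = rng.standard_normal((d, d)) / np.sqrt(d)  # hypothetical query projection
W_k = rng.standard_normal((d, d)) / np.sqrt(d)  # hypothetical key projection

Q, K = X @ W_q, X @ W_k
A = softmax(Q @ K.T / np.sqrt(d))   # row-stochastic attention matrix, shape (n, n)

# "High-rank status": count the numerically non-zero singular values of A.
rank = np.linalg.matrix_rank(A)

# Eigenvalue spread: A is generally non-symmetric, so its eigenvalues are
# complex; the range of their magnitudes shows how unevenly A weights tokens.
eig = np.linalg.eigvals(A)
print(f"numerical rank of A: {rank} / {n}")
print(f"|eigenvalue| range: [{np.abs(eig).min():.3e}, {np.abs(eig).max():.3e}]")
```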