DOI: 10.1145/3617695.3617708

What Does Pre-Train Bring to Vision Transformer

Published: 02 November 2023

ABSTRACT

The Transformer architecture, with the self-attention mechanism at its core, has come to play an important role in computer vision. For example, a Vision Transformer (ViT) pre-trained on large-scale datasets converges quickly to a good optimum in image classification and segmentation tasks, and even in tasks with limited training data, many Transformer-based models achieve strong results after importing pre-trained weights. However, work exploring the relationship between pre-training and the self-attention mechanism remains very limited. To better understand this relationship, this paper takes ViT as the research object and studies the basic properties of the self-attention mechanism and the theoretical basis for the good properties obtained after pre-training on large-scale datasets. The contributions of this paper are twofold. (1) We prove theoretically that the attention module of a well-pretrained ViT can filter out some information when computing the correlation between different parts of an image and compute on the basis of effective features, thereby avoiding interference from noise; the softmax function keeps the attention matrix high-rank, which allows image features to be extracted well. (2) We analyze the eigenvalues of the self-attention matrix and their impact on the learning process. We find that a self-attention model can actively adjust the distribution of the eigenvalues of the attention matrix according to the correlation between different patches, and, by assigning different weights when extracting features, make full use of the information in the whole network and ensure convergence of the learning process.
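
To make the two quantities discussed above concrete, the following is a minimal NumPy sketch (not code from the paper): a single-head, ViT-style attention matrix A = softmax(QK^T / sqrt(d)) computed on randomly initialized patch embeddings, followed by its numerical rank and the magnitudes of its eigenvalues. The token count, embedding size, and random projection weights are illustrative assumptions; a pre-trained ViT would supply learned projections instead.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Toy setup (assumed, not from the paper): 16 patch tokens of dimension 64.
n_tokens, d_model = 16, 64
X = rng.normal(size=(n_tokens, d_model))                      # patch embeddings
W_q = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
W_k = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))

Q, K = X @ W_q, X @ W_k
scores = Q @ K.T / np.sqrt(d_model)                           # pre-softmax patch correlations
A = softmax(scores, axis=-1)                                  # row-stochastic attention matrix

# Numerical rank of A: every row of the softmax output is strictly positive,
# which in practice keeps the attention matrix close to full rank.
print("numerical rank of A:", np.linalg.matrix_rank(A))

# Eigenvalue spectrum of A (A is not symmetric, so eigenvalues may be complex);
# the spread of their magnitudes reflects how unevenly attention weight is
# distributed across patches.
eigvals = np.linalg.eigvals(A)
print("eigenvalue magnitudes:", np.round(np.sort(np.abs(eigvals))[::-1], 3))
```

In this sketch the attention matrix stays numerically full-rank and its eigenvalue magnitudes are spread over a range; these are the two properties (rank preserved by softmax, eigenvalue distribution governing how information is weighted) that the abstract ties to the effect of pre-training.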


      Published in

        BDIOT '23: Proceedings of the 2023 7th International Conference on Big Data and Internet of Things
        August 2023, 232 pages
        ISBN: 9798400708015
        DOI: 10.1145/3617695

        Copyright © 2023 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article
        • Research
        • Refereed limited

        Acceptance Rates

        Overall Acceptance Rate: 75 of 136 submissions, 55%