Abstract
Convolutional Neural Networks (CNNs) have dominated computer vision for years, owing to their ability to capture locality and translation invariance. Recently, many vision transformer architectures have been proposed, and they show promising performance. A key component of vision transformers is the fully-connected self-attention, which is more powerful than CNNs at modelling long-range dependencies. However, because the current dense self-attention uses all image patches (tokens) to compute the attention matrix, it may neglect the locality of image patches and involve noisy tokens (e.g., cluttered background and occlusion), leading to slow training and potential degradation of performance. To address these problems, we propose k-NN attention for boosting vision transformers. Specifically, instead of involving all tokens in the attention matrix calculation, we select only the top-k most similar tokens from the keys for each query to compute the attention map. The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations, as nearby tokens tend to be more similar than distant ones. In addition, k-NN attention allows the exploration of long-range correlations while filtering out irrelevant tokens by choosing the most similar tokens from the entire image. Despite its simplicity, we verify, both theoretically and empirically, that k-NN attention is effective in speeding up training and filtering noise out of the input tokens. Extensive experiments with 11 different vision transformer architectures verify that the proposed k-NN attention can be combined with any existing transformer architecture to improve its prediction performance. The code is available at https://github.com/damo-cv/KVT.
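To make the described top-k selection concrete, below is a minimal PyTorch sketch of k-NN attention as summarized in the abstract: compute the full query-key scores, keep only the k largest scores per query, and softmax over those before aggregating the values. The function name, tensor layout, and the masked-softmax implementation are illustrative assumptions, not the authors' released code (see the linked repository for that).

```python
import torch

def knn_attention(q, k, v, top_k):
    """Sketch: each query attends only to its top-k most similar keys.

    q, k, v: (batch, heads, tokens, head_dim) tensors (layout assumed).
    top_k:   number of keys kept per query; the remaining scores are masked out.
    """
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale                     # full scores (B, H, N, N)
    kth_best = scores.topk(top_k, dim=-1).values[..., -1:]         # k-th largest score per query
    scores = scores.masked_fill(scores < kth_best, float("-inf"))  # drop the rest (ties may keep a few extra)
    attn = scores.softmax(dim=-1)                                  # softmax over the kept keys only
    return attn @ v                                                # (B, H, N, head_dim)

# Toy usage: 2 images, 4 heads, 197 tokens, 64-dim heads, keep 100 keys per query.
q = k = v = torch.randn(2, 4, 197, 64)
out = knn_attention(q, k, v, top_k=100)                            # shape (2, 4, 197, 64)
```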
P. Wang and X. Wang: The first two authors contributed equally.
Notes
- 1. Theorem 4.1 in [33] gives an upper bound on the regret (the gap in loss value between the current-step parameters and the optimal parameters). One can telescope it into the average regret to analyze Adam's convergence.
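For readers who want the one-line argument behind this note, here is a brief sketch; the regret notation (per-step losses f_t, iterates θ_t, fixed comparator θ*) is assumed from the standard online-learning setup used in [33], and the O(√T) bound is the one stated in its Theorem 4.1.

```latex
% Regret after T steps, then divide by T to pass to the average regret:
R(T) = \sum_{t=1}^{T} \big[ f_t(\theta_t) - f_t(\theta^\ast) \big] \;\le\; O\!\big(\sqrt{T}\big)
\quad\Longrightarrow\quad
\frac{R(T)}{T} \;\le\; O\!\Big(\tfrac{1}{\sqrt{T}}\Big) \;\xrightarrow[\;T \to \infty\;]{}\; 0.
% The average gap to the optimum vanishes, which is the sense of convergence meant in the note.
```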
References
Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150 (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Caron, M., et al.: Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294 (2021)
Chang, S., Wang, P., Wang, F., Li, H., Feng, J.: Augmented transformer with adaptive graph for temporal action proposal generation. arXiv preprint arXiv:2103.16024 (2021)
Chefer, H., Gur, S., Wolf, L.: Transformer interpretability beyond attention visualization. In: CVPR (2021)
Chen, C.F., Fan, Q., Panda, R.: Crossvit: cross-attention multi-scale vision transformer for image classification. arXiv preprint arXiv:2103.14899 (2021)
Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H.: Transformer tracking. In: CVPR (2021)
Chen, Z., Xie, L., Niu, J., Liu, X., Wei, L., Tian, Q.: Visformer: the vision-friendly transformer. arXiv preprint arXiv:2104.12533 (2021)
Child, R., Gray, S., Radford, A., Sutskever, I.: Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 (2019)
Chu, X., et al.: Twins: revisiting spatial attention design in vision transformers. arXiv preprint arXiv:2104.13840 (2021)
Chu, X., Zhang, B., Tian, Z., Wei, X., Xia, H.: Do we really need explicit position encodings for vision transformers? arXiv preprint arXiv:2102.10882 (2021)
Cordonnier, J.B., Loukas, A., Jaggi, M.: On the relationship between self-attention and convolutional layers. In: ICLR (2019)
Correia, G.M., Niculae, V., Martins, A.F.: Adaptively sparse transformers. In: EMNLP, pp. 2174–2184 (2019)
d’Ascoli, S., Touvron, H., Leavitt, M., Morcos, A., Biroli, G., Sagun, L.: Convit: improving vision transformers with soft convolutional inductive biases. arXiv preprint arXiv:2103.10697 (2021)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
El-Nouby, A., et al.: Xcit: cross-covariance image transformers. arXiv preprint arXiv:2106.09681 (2021)
Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C.: Multiscale vision transformers. arXiv preprint arXiv:2104.11227 (2021)
Fang, J., Xie, L., Wang, X., Zhang, X., Liu, W., Tian, Q.: Msg-transformer: exchanging local spatial information by manipulating messenger tokens. arXiv preprint arXiv:2105.15168 (2021)
Gao, P., Lu, J., Li, H., Mottaghi, R., Kembhavi, A.: Container: context aggregation network. arXiv preprint arXiv:2106.01401 (2021)
Gong, C., Wang, D., Li, M., Chandra, V., Liu, Q.: Improve vision transformers training by suppressing over-smoothing. arXiv preprint arXiv:2104.12753 (2021)
Graham, B., et al.: Levit: a vision transformer in convnet’s clothing for faster inference. arXiv preprint arXiv:2104.01136 (2021)
Guo, J., Han, K., Wu, H., Xu, C., Tang, Y., Xu, C., Wang, Y.: Cmt: convolutional neural networks meet vision transformers. arXiv preprint arXiv:2107.06263 (2021)
Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. arXiv preprint arXiv:2103.00112 (2021)
Han, L., Wang, P., Yin, Z., Wang, F., Li, H.: Exploiting better feature aggregation for video object detection. In: ACM MM (2020)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
He, S., Luo, H., Wang, P., Wang, F., Li, H., Jiang, W.: Transreid: transformer-based object re-identification. arXiv preprint arXiv:2102.04378 (2021)
Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J.: Rethinking spatial dimensions of vision transformers. arXiv preprint arXiv:2103.16302 (2021)
Ho, J., Kalchbrenner, N., Weissenborn, D., Salimans, T.: Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180 (2019)
Huang, Z., Ben, Y., Luo, G., Cheng, P., Yu, G., Fu, B.: Shuffle transformer: rethinking spatial shuffle for vision transformer. arXiv preprint arXiv:2106.03650 (2021)
Jiang, Z., et al.: Token labeling: training a 85.5% top-1 accuracy vision transformer with 56m parameters on imagenet. arXiv preprint arXiv:2104.10858 (2021)
Jonnalagadda, A., Wang, W., Eckstein, M.P.: Foveater: foveated transformer for image classification. arXiv preprint arXiv:2105.14173 (2021)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Kitaev, N., Kaiser, Ł., Levskaya, A.: Reformer: the efficient transformer. arXiv preprint arXiv:2001.04451 (2020)
Li, W., Liu, H., Ding, R., Liu, M., Wang, P.: Lifting transformer for 3D human pose estimation in video. arXiv preprint arXiv:2103.14304 (2021)
Li, X., Hou, Y., Wang, P., Gao, Z., Xu, M., Li, W.: Transformer guided geometry model for flow-based unsupervised visual odometry. Neural Comput. Appl. 33(13), 8031–8042 (2021). https://doi.org/10.1007/s00521-020-05545-8
Li, X., Hou, Y., Wang, P., Gao, Z., Xu, M., Li, W.: Trear: transformer-based RGB-D egocentric action recognition. arXiv preprint arXiv:2101.03904 (2021)
Li, Y., Zhang, K., Cao, J., Timofte, R., Van Gool, L.: Localvit: bringing locality to vision transformers. arXiv preprint arXiv:2104.05707 (2021)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, P.J., et al.: Generating Wikipedia by summarizing long sequences. In: ICLR (2018)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)
Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: EMNLP (2015)
Pai, S.: Convolutional neural networks arise from Ising models and restricted Boltzmann machines
Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: Dynamicvit: efficient vision transformers with dynamic token sparsification. arXiv preprint arXiv:2106.02034 (2021)
Roy, A., Saffar, M., Vaswani, A., Grangier, D.: Efficient content-based sparse attention with routing transformers. Trans. Assoc. Comput. Linguist. 9, 53–68 (2021)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
Tan, M., Le, Q.: Efficientnet: rethinking model scaling for convolutional neural networks. In: ICML (2019)
Tay, Y., Bahri, D., Yang, L., Metzler, D., Juan, D.C.: Sparse sinkhorn attention. In: ICML (2020)
Tay, Y., et al.: Omninet: omnidirectional representations from transformers. arXiv preprint arXiv:2103.01075 (2021)
Tay, Y., Dehghani, M., Bahri, D., Metzler, D.: Efficient transformers: a survey. arXiv preprint arXiv:2009.06732 (2020)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877 (2020)
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. arXiv preprint arXiv:2103.17239 (2021)
Vaswani, A., et al.: Attention is all you need. In: NIPS (2017)
Wang, P., et al.: Scaled relu matters for training vision transformers. arXiv preprint arXiv:2109.03810 (2021)
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122 (2021)
Wang, W., Yao, L., Chen, L., Cai, D., He, X., Liu, W.: Crossformer: a versatile vision transformer based on cross-scale attention. arXiv preprint arXiv:2108.00154 (2021)
Wang, Y., Huang, R., Song, S., Huang, Z., Huang, G.: Not all images are worth 16x16 words: dynamic vision transformers with adaptive sequence length. arXiv preprint arXiv:2105.15075 (2021)
Wang, Y., et al.: End-to-end video instance segmentation with transformers. In: CVPR (2021)
Wu, H., et al.: Cvt: introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808 (2021)
Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., Girshick, R.: Early convolutions help transformers see better. arXiv preprint arXiv:2106.14881 (2021)
Xie, J., Zeng, R., Wang, Q., Zhou, Z., Li, P.: So-vit: mind visual tokens for vision transformer. arXiv preprint arXiv:2104.10935 (2021)
Xu, T., Chen, W., Wang, P., Wang, F., Li, H., Jin, R.: Cdtrans: cross-domain transformer for unsupervised domain adaptation. arXiv preprint arXiv:2109.06165 (2021)
Xu, W., Xu, Y., Chang, T., Tu, Z.: Co-scale conv-attentional image transformers. arXiv preprint arXiv:2104.06399 (2021)
Xu, Y., et al.: Evo-vit: slow-fast token evolution for dynamic vision transformer. arXiv preprint arXiv:2108.01390 (2021)
Yang, J., et al.: Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641 (2021)
Yin, Z., et al.: Transfgu: a top-down approach to fine-grained unsupervised semantic segmentation. In: European Conference on Computer Vision (2022)
Yu, Q., Xia, Y., Bai, Y., Lu, Y., Yuille, A., Shen, W.: Glance-and-gaze vision transformer. arXiv preprint arXiv:2106.02277 (2021)
Yu, Z., Li, X., Wang, P., Zhao, G.: Transrppg: remote photoplethysmography transformer for 3D mask face presentation attack detection. arXiv preprint arXiv:2104.07419 (2021)
Yuan, K., Guo, S., Liu, Z., Zhou, A., Yu, F., Wu, W.: Incorporating convolution designs into visual transformers. arXiv preprint arXiv:2103.11816 (2021)
Yuan, L., et al.: Tokens-to-token vit: training vision transformers from scratch on imagenet. arXiv preprint arXiv:2101.11986 (2021)
Yuan, L., Hou, Q., Jiang, Z., Feng, J., Yan, S.: Volo: vision outlooker for visual recognition. arXiv preprint arXiv:2106.13112 (2021)
Zaheer, M., et al.: Big bird: transformers for longer sequences. arXiv preprint arXiv:2007.14062 (2020)
Zhang, P., et al.: Multi-scale vision longformer: a new vision transformer for high-resolution image encoding. arXiv preprint arXiv:2103.15358 (2021)
Zhang, Z., Zhang, H., Zhao, L., Chen, T., Pfister, T.: Aggregating nested transformers. arXiv preprint arXiv:2105.12723 (2021)
Zhao, G., Lin, J., Zhang, Z., Ren, X., Su, Q., Sun, X.: Explicit sparse transformer: concentrated attention through explicit selection. arXiv preprint arXiv:1912.11637 (2019)
Zhou, B., et al.: Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vis. 127(3), 302–321 (2018). https://doi.org/10.1007/s11263-018-1140-0
Zhou, D., et al.: Deepvit: towards deeper vision transformer. arXiv preprint arXiv:2103.11886 (2021)
Zhou, D., et al.: Refiner: refining self-attention for vision transformers. arXiv preprint arXiv:2106.03714 (2021)
Zhou, J., Wang, P., Wang, F., Liu, Q., Li, H., Jin, R.: Elsa: enhanced local self-attention for vision transformer. arXiv preprint arXiv:2112.12786 (2021)
Zou, C., et al.: End-to-end human object interaction detection with hoi transformer. In: CVPR (2021)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, P. et al. (2022). KVT: k-NN Attention for Boosting Vision Transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13684. Springer, Cham. https://doi.org/10.1007/978-3-031-20053-3_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20052-6
Online ISBN: 978-3-031-20053-3