Abstract
Convolutional Neural Networks (CNNs) have dominated computer vision for years, owing to their ability to capture locality and translation invariance. Recently, many vision transformer architectures have been proposed, and they show promising performance. A key component of vision transformers is the fully-connected self-attention, which is more powerful than CNNs at modelling long-range dependencies. However, because the current dense self-attention uses all image patches (tokens) to compute the attention matrix, it may neglect the locality of image patches and involve noisy tokens (e.g., cluttered background and occlusion), leading to slow training and potential degradation of performance. To address these problems, we propose k-NN attention for boosting vision transformers. Specifically, instead of involving all tokens in the attention matrix calculation, we select only the top-k most similar tokens from the keys for each query to compute the attention map. The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations, as nearby tokens tend to be more similar than distant ones. In addition, k-NN attention allows the exploration of long-range correlations while filtering out irrelevant tokens by choosing the most similar tokens from the entire image. Despite its simplicity, we verify, both theoretically and empirically, that k-NN attention is effective in speeding up training and filtering noise out of the input tokens. Extensive experiments with 11 different vision transformer architectures verify that the proposed k-NN attention can be combined with any existing transformer architecture to improve its prediction performance. The code is available at https://github.com/damo-cv/KVT.
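To make the described top-k selection concrete, below is a minimal PyTorch sketch of k-NN attention as summarized in the abstract: compute the full query-key scores, keep only the k largest scores per query, and softmax over those before aggregating the values. The function name, tensor layout, and the masked-softmax implementation are illustrative assumptions, not the authors' released code (see the linked repository for that).

```python
import torch

def knn_attention(q, k, v, top_k):
    """Sketch: each query attends only to its top-k most similar keys.

    q, k, v: (batch, heads, tokens, head_dim) tensors (layout assumed).
    top_k:   number of keys kept per query; the remaining scores are masked out.
    """
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale                     # full scores (B, H, N, N)
    kth_best = scores.topk(top_k, dim=-1).values[..., -1:]         # k-th largest score per query
    scores = scores.masked_fill(scores < kth_best, float("-inf"))  # drop the rest (ties may keep a few extra)
    attn = scores.softmax(dim=-1)                                  # softmax over the kept keys only
    return attn @ v                                                # (B, H, N, head_dim)

# Toy usage: 2 images, 4 heads, 197 tokens, 64-dim heads, keep 100 keys per query.
q = k = v = torch.randn(2, 4, 197, 64)
out = knn_attention(q, k, v, top_k=100)                            # shape (2, 4, 197, 64)
```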
P. Wang and X. Wang: The first two authors contributed equally.
Notes
- 1. Theorem 4.1 in [33] gives an upper bound on the regret (the gap in loss value between the current-step parameters and the optimal parameters). One can telescope it into the average regret to analyze Adam's convergence.
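For readers who want the one-line argument behind this note, here is a brief sketch; the regret notation (per-step losses f_t, iterates θ_t, fixed comparator θ*) is assumed from the standard online-learning setup used in [33], and the O(√T) bound is the one stated in its Theorem 4.1.

```latex
% Regret after T steps, then divide by T to pass to the average regret:
R(T) = \sum_{t=1}^{T} \big[ f_t(\theta_t) - f_t(\theta^\ast) \big] \;\le\; O\!\big(\sqrt{T}\big)
\quad\Longrightarrow\quad
\frac{R(T)}{T} \;\le\; O\!\Big(\tfrac{1}{\sqrt{T}}\Big) \;\xrightarrow[\;T \to \infty\;]{}\; 0.
% The average gap to the optimum vanishes, which is the sense of convergence meant in the note.
```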
References
Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150 (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Caron, M., et al.: Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294 (2021)
Chang, S., Wang, P., Wang, F., Li, H., Feng, J.: Augmented transformer with adaptive graph for temporal action proposal generation. arXiv preprint arXiv:2103.16024 (2021)
Chefer, H., Gur, S., Wolf, L.: Transformer interpretability beyond attention visualization. In: CVPR (2021)
Chen, C.F., Fan, Q., Panda, R.: Crossvit: cross-attention multi-scale vision transformer for image classification. arXiv preprint arXiv:2103.14899 (2021)
Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., Lu, H.: Transformer tracking. In: CVPR (2021)
Chen, Z., Xie, L., Niu, J., Liu, X., Wei, L., Tian, Q.: Visformer: the vision-friendly transformer. arXiv preprint arXiv:2104.12533 (2021)
Child, R., Gray, S., Radford, A., Sutskever, I.: Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 (2019)
Chu, X., et al.: Twins: revisiting spatial attention design in vision transformers. arXiv preprint arXiv:2104.13840 (2021)
Chu, X., Zhang, B., Tian, Z., Wei, X., Xia, H.: Do we really need explicit position encodings for vision transformers? arXiv preprint arXiv:2102.10882 (2021)
Cordonnier, J.B., Loukas, A., Jaggi, M.: On the relationship between self-attention and convolutional layers. In: ICLR (2019)
Correia, G.M., Niculae, V., Martins, A.F.: Adaptively sparse transformers. In: EMNLP, pp. 2174–2184 (2019)
d’Ascoli, S., Touvron, H., Leavitt, M., Morcos, A., Biroli, G., Sagun, L.: Convit: improving vision transformers with soft convolutional inductive biases. arXiv preprint arXiv:2103.10697 (2021)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
El-Nouby, A., et al.: Xcit: cross-covariance image transformers. arXiv preprint arXiv:2106.09681 (2021)
Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C.: Multiscale vision transformers. arXiv preprint arXiv:2104.11227 (2021)
Fang, J., Xie, L., Wang, X., Zhang, X., Liu, W., Tian, Q.: Msg-transformer: exchanging local spatial information by manipulating messenger tokens. arXiv preprint arXiv:2105.15168 (2021)
Gao, P., Lu, J., Li, H., Mottaghi, R., Kembhavi, A.: Container: context aggregation network. arXiv preprint arXiv:2106.01401 (2021)
Gong, C., Wang, D., Li, M., Chandra, V., Liu, Q.: Improve vision transformers training by suppressing over-smoothing. arXiv preprint arXiv:2104.12753 (2021)
Graham, B., et al.: Levit: a vision transformer in convnet’s clothing for faster inference. arXiv preprint arXiv:2104.01136 (2021)
Guo, J., Han, K., Wu, H., Xu, C., Tang, Y., Xu, C., Wang, Y.: Cmt: convolutional neural networks meet vision transformers. arXiv preprint arXiv:2107.06263 (2021)
Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. arXiv preprint arXiv:2103.00112 (2021)
Han, L., Wang, P., Yin, Z., Wang, F., Li, H.: Exploiting better feature aggregation for video object detection. In: ACM MM (2020)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
He, S., Luo, H., Wang, P., Wang, F., Li, H., Jiang, W.: Transreid: transformer-based object re-identification. arXiv preprint arXiv:2102.04378 (2021)
Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J.: Rethinking spatial dimensions of vision transformers. arXiv preprint arXiv:2103.16302 (2021)
Ho, J., Kalchbrenner, N., Weissenborn, D., Salimans, T.: Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180 (2019)
Huang, Z., Ben, Y., Luo, G., Cheng, P., Yu, G., Fu, B.: Shuffle transformer: rethinking spatial shuffle for vision transformer. arXiv preprint arXiv:2106.03650 (2021)
Jiang, Z., et al.: Token labeling: training a 85.5% top-1 accuracy vision transformer with 56m parameters on imagenet. arXiv preprint arXiv:2104.10858 (2021)
Jonnalagadda, A., Wang, W., Eckstein, M.P.: Foveater: foveated transformer for image classification. arXiv preprint arXiv:2105.14173 (2021)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Kitaev, N., Kaiser, Ł., Levskaya, A.: Reformer: the efficient transformer. arXiv preprint arXiv:2001.04451 (2020)
Li, W., Liu, H., Ding, R., Liu, M., Wang, P.: Lifting transformer for 3D human pose estimation in video. arXiv preprint arXiv:2103.14304 (2021)
Li, X., Hou, Y., Wang, P., Gao, Z., Xu, M., Li, W.: Transformer guided geometry model for flow-based unsupervised visual odometry. Neural Comput. Appl. 33(13), 8031–8042 (2021). https://doi.org/10.1007/s00521-020-05545-8
Li, X., Hou, Y., Wang, P., Gao, Z., Xu, M., Li, W.: Trear: transformer-based RGB-D egocentric action recognition. arXiv preprint arXiv:2101.03904 (2021)
Li, Y., Zhang, K., Cao, J., Timofte, R., Van Gool, L.: Localvit: bringing locality to vision transformers. arXiv preprint arXiv:2104.05707 (2021)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, P.J., et al.: Generating Wikipedia by summarizing long sequences. In: ICLR (2018)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)
Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: EMNLP (2015)
Pai, S.: Convolutional neural networks arise from Ising models and restricted Boltzmann machines
Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: Dynamicvit: efficient vision transformers with dynamic token sparsification. arXiv preprint arXiv:2106.02034 (2021)
Roy, A., Saffar, M., Vaswani, A., Grangier, D.: Efficient content-based sparse attention with routing transformers. Trans. Assoc. Comput. Linguist. 9, 53–68 (2021)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
Tan, M., Le, Q.: Efficientnet: rethinking model scaling for convolutional neural networks. In: ICML (2019)
Tay, Y., Bahri, D., Yang, L., Metzler, D., Juan, D.C.: Sparse sinkhorn attention. In: ICML (2020)
Tay, Y., et al.: Omninet: omnidirectional representations from transformers. arXiv preprint arXiv:2103.01075 (2021)
Tay, Y., Dehghani, M., Bahri, D., Metzler, D.: Efficient transformers: a survey. arXiv preprint arXiv:2009.06732 (2020)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877 (2020)
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. arXiv preprint arXiv:2103.17239 (2021)
Vaswani, A., et al.: Attention is all you need. In: NIPS (2017)
Wang, P., et al.: Scaled relu matters for training vision transformers. arXiv preprint arXiv:2109.03810 (2021)
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122 (2021)
Wang, W., Yao, L., Chen, L., Cai, D., He, X., Liu, W.: Crossformer: a versatile vision transformer based on cross-scale attention. arXiv preprint arXiv:2108.00154 (2021)
Wang, Y., Huang, R., Song, S., Huang, Z., Huang, G.: Not all images are worth 16x16 words: dynamic vision transformers with adaptive sequence length. arXiv preprint arXiv:2105.15075 (2021)
Wang, Y., et al.: End-to-end video instance segmentation with transformers. In: CVPR (2021)
Wu, H., et al.: Cvt: introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808 (2021)
Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., Girshick, R.: Early convolutions help transformers see better. arXiv preprint arXiv:2106.14881 (2021)
Xie, J., Zeng, R., Wang, Q., Zhou, Z., Li, P.: So-vit: mind visual tokens for vision transformer. arXiv preprint arXiv:2104.10935 (2021)
Xu, T., Chen, W., Wang, P., Wang, F., Li, H., Jin, R.: Cdtrans: cross-domain transformer for unsupervised domain adaptation. arXiv preprint arXiv:2109.06165 (2021)
Xu, W., Xu, Y., Chang, T., Tu, Z.: Co-scale conv-attentional image transformers. arXiv preprint arXiv:2104.06399 (2021)
Xu, Y., et al.: Evo-vit: slow-fast token evolution for dynamic vision transformer. arXiv preprint arXiv:2108.01390 (2021)
Yang, J., et al.: Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641 (2021)
Yin, Z., et al.: Transfgu: a top-down approach to fine-grained unsupervised semantic segmentation. In: European Conference on Computer Vision (2022)
Yu, Q., Xia, Y., Bai, Y., Lu, Y., Yuille, A., Shen, W.: Glance-and-gaze vision transformer. arXiv preprint arXiv:2106.02277 (2021)
Yu, Z., Li, X., Wang, P., Zhao, G.: Transrppg: remote photoplethysmography transformer for 3D mask face presentation attack detection. arXiv preprint arXiv:2104.07419 (2021)
Yuan, K., Guo, S., Liu, Z., Zhou, A., Yu, F., Wu, W.: Incorporating convolution designs into visual transformers. arXiv preprint arXiv:2103.11816 (2021)
Yuan, L., et al.: Tokens-to-token vit: training vision transformers from scratch on imagenet. arXiv preprint arXiv:2101.11986 (2021)
Yuan, L., Hou, Q., Jiang, Z., Feng, J., Yan, S.: Volo: vision outlooker for visual recognition. arXiv preprint arXiv:2106.13112 (2021)
Zaheer, M., et al.: Big bird: transformers for longer sequences. arXiv preprint arXiv:2007.14062 (2020)
Zhang, P., et al.: Multi-scale vision longformer: a new vision transformer for high-resolution image encoding. arXiv preprint arXiv:2103.15358 (2021)
Zhang, Z., Zhang, H., Zhao, L., Chen, T., Pfister, T.: Aggregating nested transformers. arXiv preprint arXiv:2105.12723 (2021)
Zhao, G., Lin, J., Zhang, Z., Ren, X., Su, Q., Sun, X.: Explicit sparse transformer: concentrated attention through explicit selection. arXiv preprint arXiv:1912.11637 (2019)
Zhou, B., et al.: Semantic understanding of scenes through the ADE20K dataset. Int. J. Comput. Vis. 127(3), 302–321 (2018). https://doi.org/10.1007/s11263-018-1140-0
Zhou, D., et al.: Deepvit: towards deeper vision transformer. arXiv preprint arXiv:2103.11886 (2021)
Zhou, D., et al.: Refiner: refining self-attention for vision transformers. arXiv preprint arXiv:2106.03714 (2021)
Zhou, J., Wang, P., Wang, F., Liu, Q., Li, H., Jin, R.: Elsa: enhanced local self-attention for vision transformer. arXiv preprint arXiv:2112.12786 (2021)
Zou, C., et al.: End-to-end human object interaction detection with hoi transformer. In: CVPR (2021)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, P. et al. (2022). KVT: k-NN Attention for Boosting Vision Transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13684. Springer, Cham. https://doi.org/10.1007/978-3-031-20053-3_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20052-6
Online ISBN: 978-3-031-20053-3