AttnZero: Efficient Attention Discovery for Vision Transformers

Li, Lujun; Wei, Zimian; Dong, Peijie; Luo, Wenhan; Xue, Wei; Liu, Qifeng; Guo, Yike

doi:10.1007/978-3-031-72652-1_2

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15063)

Included in the following conference series:

European Conference on Computer Vision

Abstract

In this paper, we present AttnZero, the first framework for automatically discovering efficient attention modules tailored for Vision Transformers (ViTs). Standard self-attention in ViTs has computational complexity quadratic in the number of tokens, whereas linear attention approximates it at linear complexity. However, existing hand-crafted linear attention designs suffer from performance degradation. To address this issue, AttnZero constructs search spaces and employs evolutionary algorithms to discover promising linear attention formulations. Specifically, the search space consists of six kinds of computation graphs and advanced activation, normalization, and binary operators. To enhance generality, we use the results of candidate attentions applied to multiple advanced ViTs as the multi-objective for the evolutionary search. To expedite the search, we use program checking and rejection protocols to quickly filter out unpromising candidates. Additionally, we develop Attn-Bench-101, which provides precomputed performance for 2,000 attentions in the search spaces, enabling us to summarize attention design insights. Experimental results demonstrate that the discovered AttnZero module generalizes well to different tasks and consistently improves performance across various ViTs. For instance, the tiny models of DeiT, PVT, Swin, and CSwin trained with AttnZero on ImageNet reach 74.9%, 78.1%, 82.1%, and 82.9% top-1 accuracy, respectively. Code is available at https://github.com/lliai/AttnZero.
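To make the quadratic-versus-linear contrast in the abstract concrete, the following is a minimal sketch in plain Python. It is not the discovered AttnZero module: the feature map elu(x) + 1 is an assumed, commonly used hand-crafted choice, and every function name here (`feature_map`, `linear_attention`, `make_tokens`, and so on) is hypothetical. The sketch only illustrates why kernel-style attention can avoid the N x N score matrix: the product can be regrouped so that keys and values are summarized once.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*a)]


def feature_map(x):
    """Positive feature map elu(x) + 1 (illustrative assumption, not AttnZero's)."""
    return [[v + 1.0 if v > 0 else math.exp(v) for v in row] for row in x]


def softmax_attention(q, k, v):
    """Standard attention: builds an N x N score matrix, so cost grows as N^2."""
    d = len(q[0])
    weights = []
    for row in matmul(q, transpose(k)):
        scaled = [s / math.sqrt(d) for s in row]
        m = max(scaled)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, v)


def linear_attention_quadratic(q, k, v):
    """Kernel attention computed the slow way: (phi(Q) phi(K)^T) V."""
    fq, fk = feature_map(q), feature_map(k)
    scores = matmul(fq, transpose(fk))  # N x N matrix
    weights = [[s / sum(row) for s in row] for row in scores]
    return matmul(weights, v)


def linear_attention(q, k, v):
    """Same result regrouped as phi(Q) (phi(K)^T V); no N x N matrix is formed."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)           # d x d_v summary of keys and values
    k_sum = [sum(col) for col in zip(*fk)]  # length-d normalizer
    out = []
    for row in fq:
        denom = sum(a * b for a, b in zip(row, k_sum))
        numer = matmul([row], kv)[0]
        out.append([n / denom for n in numer])
    return out


def make_tokens(n, d, seed):
    """Random N x d token matrix for experimentation (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
```

Calling `linear_attention` and `linear_attention_quadratic` on the same inputs from `make_tokens` should agree up to floating-point rounding, while `softmax_attention` generally gives different values. That difference is the approximation gap a search over attention formulations would aim to narrow.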

Acknowledgements

This research was supported by the Theme-based Research Scheme (T45-205/21-N) from the Hong Kong RGC and by the Generative AI Research and Development Centre from InnoHK.

Author information

Authors and Affiliations

The Hong Kong University of Science and Technology, Hong Kong, China
Lujun Li, Wenhan Luo, Wei Xue, Qifeng Liu & Yike Guo
National University of Defense Technology, Changsha, China
Zimian Wei
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Peijie Dong

Corresponding authors

Correspondence to Zimian Wei, Qifeng Liu or Yike Guo.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2622 KB)

About this paper

Cite this paper

Li, L. et al. (2025). AttnZero: Efficient Attention Discovery for Vision Transformers. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15063. Springer, Cham. https://doi.org/10.1007/978-3-031-72652-1_2

DOI: https://doi.org/10.1007/978-3-031-72652-1_2
Published: 30 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72651-4
Online ISBN: 978-3-031-72652-1
eBook Packages: Computer Science, Computer Science (R0)
