ABSTRACT
Vision Transformer (ViT) and its variants have recently demonstrated promising performance on a variety of computer vision tasks. Nevertheless, task-irrelevant information in patch tokens, such as background nuisance and noise, can degrade the performance of ViT-based models. In this paper, we develop the Sufficient Vision Transformer (Suf-ViT) as a new solution to this issue. We propose Sufficiency-Blocks (S-Blocks), applied across the depth of Suf-ViT, to disentangle and discard task-irrelevant information accurately. In addition, to boost the training of Suf-ViT, we formulate a Sufficient-Reduction Loss (SRLoss) that leverages the concept of Mutual Information (MI), enabling Suf-ViT to extract more reliable sufficient representations by removing task-irrelevant information. Extensive experiments on benchmark datasets such as ImageNet, ImageNet-C, and CIFAR-10 indicate that our method achieves state-of-the-art or competitive performance against baseline methods. Code is available at: https://github.com/zhicheng2T0/Sufficient-Vision-Transformer.git
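The abstract states that SRLoss leverages mutual information to discard task-irrelevant information, but does not spell out the estimator. A standard way to make an MI objective trainable is the Donsker-Varadhan lower bound, I(X;Z) >= E_joint[T(x,z)] - log E_marginal[exp(T(x,z))], where T is a learned critic evaluated on paired (joint) versus shuffled (marginal) samples. The sketch below is an illustrative assumption, not the paper's implementation: the function `dv_lower_bound` and the hand-picked critic are hypothetical, and a real model would parameterize T with a network and optimize it.

```python
import math
import random

def dv_lower_bound(critic, joint_pairs, marginal_pairs):
    """Donsker-Varadhan lower bound on mutual information:
    I(X;Z) >= E_joint[T(x,z)] - log E_marginal[exp(T(x,z))]."""
    joint_term = sum(critic(x, z) for x, z in joint_pairs) / len(joint_pairs)
    exp_term = sum(math.exp(critic(x, z)) for x, z in marginal_pairs) / len(marginal_pairs)
    return joint_term - math.log(exp_term)

# Toy data: z is a noisy copy of x, so X and Z are strongly dependent.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(2000)]
zs = [x + random.gauss(0, 0.1) for x in xs]

joint = list(zip(xs, zs))                      # samples from p(x, z)
marginal = list(zip(xs, random.sample(zs, len(zs))))  # shuffled: p(x)p(z)

# A hand-picked critic that scores correlated pairs highly
# (a learned network would play this role in practice).
critic = lambda x, z: -(x - z) ** 2

mi_estimate = dv_lower_bound(critic, joint, marginal)
print(mi_estimate > 0.0)  # dependent pairs give a positive lower bound
```

In a Suf-ViT-style setup, such a bound could be maximized between representations and labels (to keep sufficient information) while an analogous term between representations and nuisance factors is minimized; the source abstract only confirms the MI-based formulation at this level of detail.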
Index Terms
- Sufficient Vision Transformer