Research Article
DOI: 10.1145/3581783.3612060

Scale-space Tokenization for Improving the Robustness of Vision Transformers

Published: 27 October 2023

Abstract

The performance of the Vision Transformer (ViT) and its variants has surpassed that of traditional Convolutional Neural Networks (CNNs) on most vision tasks in terms of in-distribution accuracy. However, ViTs still leave significant room for improvement in their robustness to input perturbations, and robustness is a critical consideration when deploying ViTs in real-world scenarios. Moreover, some ViT variants improve in-distribution accuracy and computational efficiency at the cost of the model's robustness and generalization. In this study, inspired by prior findings on the effectiveness of shape bias for improving robustness and on the importance of multi-scale analysis, we propose a simple yet effective method, scale-space tokenization, that improves the robustness of ViT while maintaining in-distribution accuracy. On top of this method we build the Scale-space-based Robust Vision Transformer (SRVT). Our method consists of scale-space patch embedding and scale-space positional encoding. Scale-space patch embedding turns the input into a sequence of variable-scale images and increases the model's shape bias, enhancing its robustness. Scale-space positional encoding implicitly boosts the model's invariance to input perturbations by incorporating scale-aware position information into a 3D sinusoidal positional encoding. We conduct experiments on image recognition benchmarks (CIFAR-10/100 and ImageNet-1k) covering in-distribution accuracy as well as adversarial and out-of-distribution robustness. The results demonstrate our method's effectiveness in improving robustness without compromising in-distribution accuracy. In particular, our approach achieves stronger adversarial robustness on the ImageNet-1k benchmark than state-of-the-art robust ViTs.
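
The two components named in the abstract lend themselves to a short sketch. Below is a minimal PyTorch sketch, under our own assumptions, of what scale-space tokenization could look like: a Gaussian scale space of the input image is patch-embedded into one long token sequence, and a 3D sinusoidal encoding over (scale, y, x) supplies scale-aware positions. All names and parameters here (ScaleSpacePatchEmbed, the sigma levels, the even three-way channel split, the shared projection across scales) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of scale-space tokenization (assumptions, not the paper's code):
# a Gaussian scale space is patchified into one token sequence, and a 3-D
# sinusoidal encoding over (scale, y, x) provides scale-aware positions.
import torch
import torch.nn.functional as F


def gaussian_blur(x: torch.Tensor, sigma: float) -> torch.Tensor:
    """Separable depthwise Gaussian blur; x has shape (B, C, H, W)."""
    radius = max(1, int(3 * sigma))
    coords = torch.arange(-radius, radius + 1, dtype=x.dtype, device=x.device)
    k = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    k = k / k.sum()
    c = x.shape[1]
    kx = k.view(1, 1, 1, -1).repeat(c, 1, 1, 1)   # horizontal pass
    ky = k.view(1, 1, -1, 1).repeat(c, 1, 1, 1)   # vertical pass
    x = F.conv2d(x, kx, padding=(0, radius), groups=c)
    return F.conv2d(x, ky, padding=(radius, 0), groups=c)


class ScaleSpacePatchEmbed(torch.nn.Module):
    """Patch-embed several blur levels of the image instead of a single image."""

    def __init__(self, in_chans=3, embed_dim=192, patch_size=16,
                 sigmas=(0.0, 1.0, 2.0)):          # sigma levels are an assumption
        super().__init__()
        self.sigmas = sigmas
        self.proj = torch.nn.Conv2d(in_chans, embed_dim,
                                    kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                          # x: (B, C, H, W)
        tokens = []
        for sigma in self.sigmas:
            xs = x if sigma == 0 else gaussian_blur(x, sigma)
            t = self.proj(xs)                      # (B, D, H/P, W/P)
            tokens.append(t.flatten(2).transpose(1, 2))
        return torch.cat(tokens, dim=1)            # (B, S*N, D)


def sincos_1d(pos: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard 1-D sinusoidal encoding of integer positions into `dim` channels."""
    omega = 1.0 / (10000 ** (torch.arange(dim // 2).float() / (dim // 2)))
    angles = pos.float()[:, None] * omega[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=1)     # (len, dim)


def scale_space_pos_encoding(num_scales, grid_h, grid_w, embed_dim):
    """3-D sinusoidal encoding over (scale, y, x); the even split is an assumption."""
    assert embed_dim % 6 == 0, "each axis needs an even share of channels"
    d = embed_dim // 3
    s, y, x = torch.meshgrid(torch.arange(num_scales), torch.arange(grid_h),
                             torch.arange(grid_w), indexing="ij")
    return torch.cat([sincos_1d(s.flatten(), d),
                      sincos_1d(y.flatten(), d),
                      sincos_1d(x.flatten(), d)], dim=1)      # (S*N, D)


if __name__ == "__main__":
    embed = ScaleSpacePatchEmbed()
    imgs = torch.randn(2, 3, 224, 224)
    tok = embed(imgs)                              # (2, 3*14*14, 192)
    tok = tok + scale_space_pos_encoding(3, 14, 14, 192)
    print(tok.shape)                               # torch.Size([2, 588, 192])
```

Sharing one projection across all blur levels, as sketched here, keeps the parameter count equal to a vanilla ViT patch embedding; whether the paper shares or separates the projection per scale cannot be determined from the abstract alone.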



    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023, 9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783


    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. image classification
    2. positional encoding
    3. robustness
    4. scale-space theory
    5. vision transformer


    Funding Sources

    • JSPS KAKENHI

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023, Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
