Research Article
DOI: 10.1145/3581783.3612060

Scale-space Tokenization for Improving the Robustness of Vision Transformers

Published: 27 October 2023

Abstract

The performance of the Vision Transformer (ViT) and its variants has surpassed that of traditional Convolutional Neural Networks (CNNs) on most vision tasks in terms of in-distribution accuracy. However, ViTs still leave significant room for improvement in their robustness to input perturbations, and robustness is a critical consideration when deploying ViTs in real-world scenarios. Moreover, some ViT variants improve in-distribution accuracy and computational efficiency at the cost of the model's robustness and generalization. In this study, inspired by prior findings on the effectiveness of shape bias for improving robustness and on the importance of multi-scale analysis, we propose a simple yet effective method, scale-space tokenization, that improves the robustness of ViT while maintaining in-distribution accuracy. On top of this method we build the Scale-space-based Robust Vision Transformer (SRVT). Our method consists of scale-space patch embedding and scale-space positional encoding. Scale-space patch embedding turns the input into a sequence of variable-scale images and increases the model's shape bias, enhancing its robustness. Scale-space positional encoding implicitly boosts the model's invariance to input perturbations by incorporating scale-aware position information into a 3D sinusoidal positional encoding. We conduct experiments on image recognition benchmarks (CIFAR-10/100 and ImageNet-1k) covering in-distribution accuracy as well as adversarial and out-of-distribution robustness. The results demonstrate our method's effectiveness in improving robustness without compromising in-distribution accuracy. In particular, our approach achieves stronger adversarial robustness on the ImageNet-1k benchmark than state-of-the-art robust ViTs.
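
The two components named in the abstract lend themselves to a short sketch. Below is a minimal PyTorch sketch, under our own assumptions, of what scale-space tokenization could look like: a Gaussian scale space of the input image is patch-embedded into one long token sequence, and a 3D sinusoidal encoding over (scale, y, x) supplies scale-aware positions. All names and parameters here (ScaleSpacePatchEmbed, the sigma levels, the even three-way channel split, the shared projection across scales) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of scale-space tokenization (assumptions, not the paper's code):
# a Gaussian scale space is patchified into one token sequence, and a 3-D
# sinusoidal encoding over (scale, y, x) provides scale-aware positions.
import torch
import torch.nn.functional as F


def gaussian_blur(x: torch.Tensor, sigma: float) -> torch.Tensor:
    """Separable depthwise Gaussian blur; x has shape (B, C, H, W)."""
    radius = max(1, int(3 * sigma))
    coords = torch.arange(-radius, radius + 1, dtype=x.dtype, device=x.device)
    k = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    k = k / k.sum()
    c = x.shape[1]
    kx = k.view(1, 1, 1, -1).repeat(c, 1, 1, 1)   # horizontal pass
    ky = k.view(1, 1, -1, 1).repeat(c, 1, 1, 1)   # vertical pass
    x = F.conv2d(x, kx, padding=(0, radius), groups=c)
    return F.conv2d(x, ky, padding=(radius, 0), groups=c)


class ScaleSpacePatchEmbed(torch.nn.Module):
    """Patch-embed several blur levels of the image instead of a single image."""

    def __init__(self, in_chans=3, embed_dim=192, patch_size=16,
                 sigmas=(0.0, 1.0, 2.0)):          # sigma levels are an assumption
        super().__init__()
        self.sigmas = sigmas
        self.proj = torch.nn.Conv2d(in_chans, embed_dim,
                                    kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                          # x: (B, C, H, W)
        tokens = []
        for sigma in self.sigmas:
            xs = x if sigma == 0 else gaussian_blur(x, sigma)
            t = self.proj(xs)                      # (B, D, H/P, W/P)
            tokens.append(t.flatten(2).transpose(1, 2))
        return torch.cat(tokens, dim=1)            # (B, S*N, D)


def sincos_1d(pos: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard 1-D sinusoidal encoding of integer positions into `dim` channels."""
    omega = 1.0 / (10000 ** (torch.arange(dim // 2).float() / (dim // 2)))
    angles = pos.float()[:, None] * omega[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=1)     # (len, dim)


def scale_space_pos_encoding(num_scales, grid_h, grid_w, embed_dim):
    """3-D sinusoidal encoding over (scale, y, x); the even split is an assumption."""
    assert embed_dim % 6 == 0, "each axis needs an even share of channels"
    d = embed_dim // 3
    s, y, x = torch.meshgrid(torch.arange(num_scales), torch.arange(grid_h),
                             torch.arange(grid_w), indexing="ij")
    return torch.cat([sincos_1d(s.flatten(), d),
                      sincos_1d(y.flatten(), d),
                      sincos_1d(x.flatten(), d)], dim=1)      # (S*N, D)


if __name__ == "__main__":
    embed = ScaleSpacePatchEmbed()
    imgs = torch.randn(2, 3, 224, 224)
    tok = embed(imgs)                              # (2, 3*14*14, 192)
    tok = tok + scale_space_pos_encoding(3, 14, 14, 192)
    print(tok.shape)                               # torch.Size([2, 588, 192])
```

Sharing one projection across all blur levels, as sketched here, keeps the parameter count equal to a vanilla ViT patch embedding; whether the paper shares or separates the projection per scale cannot be determined from the abstract alone.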



    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023, 9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783


    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. image classification
    2. positional encoding
    3. robustness
    4. scale-space theory
    5. vision transformer


    Funding Sources

    • JSPS KAKENHI

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023, Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
