Abstract
Vision transformers (ViTs) have made significant progress in the past few years. Recent research has revealed that ViTs are vulnerable to transfer-based attacks, in which an attacker uses a local surrogate model to generate adversarial examples and then transfers these malicious examples to attack the target black-box ViT directly. Under this threat, deploying ViTs in security-critical tasks is challenging, so exploring the robustness of ViTs against transfer-based attacks has become a pressing need. However, existing transfer-based attack methods do not fully consider the unique structure of ViTs: they indiscriminately attack the intermediate output tokens, so the perturbations concentrate on model-specific information within the tokens, which limits the transferability of the generated adversarial examples. To address these limitations, we propose the Token Importance Attack (TIA), a novel ViT-oriented transfer-based attack method. Specifically, we introduce a Randomly Shuffle Patches (RSP) strategy to expand the diversity of the input space. By applying RSP, we generate multiple shuffled images from a single image and thereby obtain multiple token gradients. TIA then ensembles the token gradients of these shuffled images into a guide map that focuses the perturbation on the model-independent information in the tokens rather than on model-specific information. Benefiting from these two components, TIA avoids overfitting to the surrogate model and thus enhances the transferability of the crafted adversarial examples. Extensive experiments on common datasets with different ViTs and CNNs demonstrate the effectiveness of TIA.
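The RSP strategy described above can be illustrated with a minimal sketch: split an image into the non-overlapping patches a ViT tokenizes, permute them, and reassemble. This is an assumption-laden illustration, not the paper's implementation; the function name `random_shuffle_patches` and the use of NumPy are ours, and the full attack would additionally compute token gradients on each shuffled view and ensemble them into a guide map.

```python
import numpy as np

def random_shuffle_patches(image, patch_size, rng=None):
    """Split an HxWxC image into non-overlapping patches and
    return a copy with the patches randomly permuted."""
    rng = np.random.default_rng(rng)
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # Rearrange into a (gh, gw, p, p, c) grid of patches.
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    flat = patches.reshape(gh * gw, patch_size, patch_size, c)
    # Randomly permute the patch order.
    flat = flat[rng.permutation(gh * gw)]
    # Reassemble the shuffled grid back into an image.
    patches = flat.reshape(gh, gw, patch_size, patch_size, c)
    return patches.transpose(0, 2, 1, 3, 4).reshape(h, w, c)

# Several shuffled views of one image; in the full attack, token
# gradients from each view would be ensembled into a guide map.
img = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
views = [random_shuffle_patches(img, patch_size=2, rng=s) for s in range(3)]
```

Shuffling only permutes patch positions, so every view contains exactly the same pixel values as the original image; this is what lets the ensembled gradients emphasize content that survives across views.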
Acknowledgment
This work is supported in part by the National Natural Science Foundation of China under Grants 62162067 and 62101480 (Research and Application of Object Detection Based on Artificial Intelligence), and in part by the Yunnan Province expert workstations under Grant 202205AF150145.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Fu, T., Li, F., Zhang, J., Zhu, L., Wang, Y., Zhou, W. (2024). TIA: Token Importance Transferable Attack on Vision Transformers. In: Ge, C., Yung, M. (eds) Information Security and Cryptology. Inscrypt 2023. Lecture Notes in Computer Science, vol 14527. Springer, Singapore. https://doi.org/10.1007/978-981-97-0945-8_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0944-1
Online ISBN: 978-981-97-0945-8