
Robustness Tokens: Towards Adversarial Robustness of Transformers

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Recently, large pre-trained foundation models have become widely adopted by machine learning practitioners for a multitude of tasks. Given that such models are publicly available, relying on them as backbones for downstream tasks might result in high vulnerability to adversarial attacks crafted with the same public model. In this work, we propose Robustness Tokens, a novel approach specific to the transformer architecture that fine-tunes a few additional private tokens with low computational requirements instead of tuning model parameters as done in traditional adversarial training. We show that Robustness Tokens make Vision Transformer models significantly more robust to white-box adversarial attacks while also retaining the original downstream performance.
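For readers who want a concrete picture of the approach sketched in the abstract, below is a minimal, hypothetical PyTorch illustration of the core idea: the pre-trained transformer stays frozen and only a handful of extra private "robustness" tokens, appended to the input token sequence, are trained. The toy encoder, the number of tokens, and the feature-matching objective are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' implementation): freeze a pre-trained
# transformer and learn only a few extra tokens appended to the input.
import torch
import torch.nn as nn

class TokenAugmentedEncoder(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, num_robust_tokens: int = 4):
        super().__init__()
        self.backbone = backbone                      # frozen pre-trained encoder
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        # the only trainable parameters: a handful of private tokens
        self.robust_tokens = nn.Parameter(torch.zeros(1, num_robust_tokens, embed_dim))
        nn.init.trunc_normal_(self.robust_tokens, std=0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, seq_len, embed_dim), i.e. already-embedded image patches
        b = patch_tokens.size(0)
        extra = self.robust_tokens.expand(b, -1, -1)
        x = torch.cat([patch_tokens, extra], dim=1)   # append the private tokens
        out = self.backbone(x)
        return out[:, : patch_tokens.size(1)]         # discard the extra tokens at output

if __name__ == "__main__":
    # A generic transformer encoder stands in for a ViT backbone here.
    dim = 64
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, dropout=0.0, batch_first=True)
    backbone = nn.TransformerEncoder(layer, num_layers=2)
    model = TokenAugmentedEncoder(backbone, embed_dim=dim, num_robust_tokens=4)

    # Only the robustness tokens receive gradients.
    opt = torch.optim.Adam([model.robust_tokens], lr=1e-3)
    clean = torch.randn(8, 16, dim)                   # stand-in for embedded patches
    target = backbone(clean).detach()                 # e.g. keep clean features unchanged
    feats = model(clean)
    loss = nn.functional.mse_loss(feats, target)      # illustrative objective only
    loss.backward()
    opt.step()
```

Because only the token embeddings receive gradients, the memory and compute cost of tuning is a small fraction of full adversarial training, which is the low-computational-requirement argument made in the abstract.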



Author information


Corresponding author

Correspondence to Brian Pulfer.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3536 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Pulfer, B., Belousov, Y., Voloshynovskiy, S. (2025). Robustness Tokens: Towards Adversarial Robustness of Transformers. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15117. Springer, Cham. https://doi.org/10.1007/978-3-031-73202-7_7


  • DOI: https://doi.org/10.1007/978-3-031-73202-7_7

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73201-0

  • Online ISBN: 978-3-031-73202-7

  • eBook Packages: Computer Science, Computer Science (R0)
