
Lost in Translation: Modern Neural Networks Still Struggle with Small Realistic Image Transformations

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15127)


Abstract

Deep neural networks that achieve remarkable performance in image classification have previously been shown to be easily fooled by tiny transformations such as a one-pixel translation of the input image. Two approaches have been proposed in recent years to address this problem. The first uses huge datasets together with data augmentation, in the hope that a highly varied training set will teach the network to be invariant. The second uses architectural modifications based on sampling theory to deal explicitly with image translations. In this paper, we show that these approaches still fall short of robustly handling ‘natural’ image translations that simulate a subtle change in camera orientation. Our findings reveal that a mere one-pixel translation can result in a significant change in the predicted image representation for approximately 40% of the test images in state-of-the-art models (e.g. OpenCLIP trained on LAION-2B, or DINOv2), while models that are explicitly constructed to be robust to cyclic translations can still be fooled by realistic (non-cyclic) one-pixel translations 11% of the time. We present Robust Inference by Crop Selection: a simple method that can be proven to achieve any desired level of consistency, at a modest cost in the model’s accuracy. Importantly, we demonstrate that employing this method reduces the ability to fool state-of-the-art models with a one-pixel translation to less than 5%, while incurring only a 1% drop in classification accuracy. Additionally, we show that our method can easily be adjusted to deal with circular shifts as well; in that case we achieve 100% robustness to integer shifts with state-of-the-art accuracy, and with no need for any further training. Code is available at: https://github.com/ofirshifman/RICS.
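As an illustration of the consistency evaluation described in the abstract, the following is a minimal sketch, not the authors' released code, of how one might test whether a classifier's top-1 prediction survives a one-pixel, non-cyclic translation implemented as a shifted crop window. The 256-pixel input size, the crop offsets, and the classifier interface are assumptions made here for illustration.

```python
import torch

def consistency_under_one_pixel_shift(model, images):
    """images: (N, 3, 256, 256) float tensor, already normalized for `model`.

    Returns the fraction of images whose top-1 prediction is unchanged when
    the 224x224 crop window is shifted right by one pixel (a non-cyclic,
    'realistic' shift: pixels enter on one side and leave on the other,
    nothing wraps around).
    """
    crop_a = images[:, :, 16:240, 16:240]   # centered 224x224 crop
    crop_b = images[:, :, 16:240, 17:241]   # same window, one pixel to the right
    with torch.no_grad():
        pred_a = model(crop_a).argmax(dim=1)
        pred_b = model(crop_b).argmax(dim=1)
    return (pred_a == pred_b).float().mean().item()
```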


Notes

  1. Typically, ImageNet images are resized to \(256 \times 256\) and then a \(224 \times 224\) crop is selected (see the preprocessing sketch after these notes). Without the need for the consistency evaluation process, we could use our method to select \(224 \times 224\) crops directly from \(256 \times 256\) images, which would likely result in no accuracy loss compared to standard classifiers.

  2. The main additional computation required by our method during inference is convolving the image with a \(140 \times 140\) Mexican-hat kernel. Using separable convolutions can make this computation negligible compared to the forward pass of DINOv2 (a separable-filter sketch follows these notes).
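For context on Note 1, this is the common ImageNet evaluation recipe it refers to, written with torchvision; it is a standard pipeline, not code from the paper, and the method described in the note would replace the fixed center crop with its own crop-selection step.

```python
from torchvision import transforms

# Resize to 256x256, then take a fixed 224x224 center crop, as in Note 1.
standard_eval = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```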
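Note 2 observes that the large Mexican-hat convolution becomes cheap when implemented with separable convolutions. Below is a minimal sketch of one standard way to do so, assuming a difference-of-Gaussians approximation to the Mexican-hat filter; the kernel length and the Gaussian widths here are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def gaussian_1d(size, sigma):
    """Normalized 1-D Gaussian kernel of odd length `size`."""
    x = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-x ** 2 / (2 * sigma ** 2))
    return g / g.sum()

def separable_blur(img, k1d):
    """Blur (N, 1, H, W) with a 2-D Gaussian applied as two 1-D convolutions."""
    k = k1d.view(1, 1, 1, -1)
    pad = k1d.numel() // 2
    img = F.conv2d(img, k, padding=(0, pad))                   # horizontal pass
    img = F.conv2d(img, k.transpose(2, 3), padding=(pad, 0))   # vertical pass
    return img

def mexican_hat_response(gray, size=141, sigma=20.0):
    """Difference-of-Gaussians approximation to a large Mexican-hat filter.

    `gray` is (N, 1, H, W). Cost is O(size) per pixel instead of O(size^2).
    The 1.6 ratio between the two sigmas is the classic choice for
    approximating a Laplacian-of-Gaussian; the kernel length is chosen odd
    so that padding preserves the spatial size.
    """
    narrow = separable_blur(gray, gaussian_1d(size, sigma))
    wide = separable_blur(gray, gaussian_1d(size, 1.6 * sigma))
    return narrow - wide
```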


Acknowledgments

We thank the Gatsby Foundation for their support.

Author information

Correspondence to Ofir Shifman.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 12938 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Shifman, O., Weiss, Y. (2025). Lost in Translation: Modern Neural Networks Still Struggle with Small Realistic Image Transformations. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15127. Springer, Cham. https://doi.org/10.1007/978-3-031-72890-7_14


  • DOI: https://doi.org/10.1007/978-3-031-72890-7_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72889-1

  • Online ISBN: 978-3-031-72890-7

  • eBook Packages: Computer Science, Computer Science (R0)
