Abstract
Deep neural networks that achieve remarkable performance in image classification have previously been shown to be easily fooled by tiny transformations such as a one-pixel translation of the input image. Two approaches have been proposed in recent years to address this problem. The first suggests using huge datasets together with data augmentation, in the hope that a highly varied training set will teach the network to be invariant. The second suggests architectural modifications based on sampling theory that deal explicitly with image translations. In this paper, we show that these approaches still fall short in robustly handling ‘natural’ image translations that simulate a subtle change in camera orientation. Our findings reveal that a mere one-pixel translation can result in a significant change in the predicted image representation for approximately 40% of the test images in state-of-the-art models (e.g., OpenCLIP trained on LAION-2B, or DINOv2), while models that are explicitly constructed to be robust to cyclic translations can still be fooled by realistic (non-cyclic) one-pixel translations 11% of the time. We present Robust Inference by Crop Selection (RICS): a simple method that can be proven to achieve any desired level of consistency, albeit with a modest tradeoff in the model’s accuracy. Importantly, we demonstrate that employing this method reduces the ability to fool state-of-the-art models with a one-pixel translation to less than 5%, while suffering only a 1% drop in classification accuracy. Additionally, we show that our method can be easily adjusted to deal with circular shifts as well, in which case we achieve 100% robustness to integer shifts with state-of-the-art accuracy and with no need for further training. Code is available at: https://github.com/ofirshifman/RICS.
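To make the crop-selection idea concrete, here is a minimal sketch in Python. It anchors the crop at the peak of a smoothed (‘Mexican hat’) response map, so that translating the input translates the peak with it and the downstream model sees (nearly) the same crop. The saliency measure, \(\sigma\), and crop size here are illustrative assumptions, not the exact RICS procedure from the paper.

```python
# A minimal sketch of content-aware crop selection for translation
# consistency. This is NOT the paper's exact RICS implementation; the
# saliency measure, sigma, and crop size are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_laplace

def select_consistent_crop(image: np.ndarray, crop: int = 224, sigma: float = 20.0) -> np.ndarray:
    """Crop a window anchored at the peak of a smoothed response map.

    When the input is translated by a few pixels, the response peak
    translates with it, so the selected window covers the same image
    content and the downstream model sees an (almost) identical crop.
    Assumes the image is strictly larger than `crop` in both dimensions.
    """
    gray = image.mean(axis=-1) if image.ndim == 3 else image
    # Laplacian-of-Gaussian ("Mexican hat") response; its argmax moves
    # rigidly with the image content under translation.
    response = -gaussian_laplace(gray.astype(np.float64), sigma=sigma)
    h, w = gray.shape
    half = crop // 2
    # Restrict to anchor positions whose crop fits inside the image.
    valid = response[half : h - half, half : w - half]
    top, left = np.unravel_index(np.argmax(valid), valid.shape)
    return image[top : top + crop, left : left + crop]
```

For a \(256 \times 256\) input this chooses one of the \(33 \times 33\) possible \(224 \times 224\) crops; because the anchor is determined by image content rather than image coordinates, a small shift of the camera leaves the selected crop's content essentially unchanged.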
Notes
1. Typically, ImageNet images are resized to \(256 \times 256\) and then a \(224 \times 224\) crop is selected. Without the need for the consistency evaluation process, we could use our method to select \(224 \times 224\) crops directly from \(256 \times 256\) images, which would likely result in no accuracy loss compared to standard classifiers.
2. The main additional computation required by our method during inference is convolving the image with a \(140 \times 140\) Mexican-hat kernel. Using separable convolutions can make this computation negligible compared to the forward pass of DINOv2 (a separable implementation is sketched after this list).
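To see why separability makes the \(140 \times 140\) convolution in note 2 cheap, recall that the 2-D Laplacian-of-Gaussian (‘Mexican hat’) is not rank-1 separable, but it does split into two separable passes, \(\mathrm{LoG}(x,y) = g''(x)\,g(y) + g(x)\,g''(y)\), which reduces the per-pixel cost of a \(K \times K\) kernel from \(O(K^2)\) to \(O(K)\). The sketch below illustrates this under assumed parameters; the kernel radius and \(\sigma\) are not the paper's exact values.

```python
# Sketch: separable computation of a "Mexican hat" (Laplacian-of-Gaussian)
# filter. The 2-D LoG splits into two separable passes,
#   LoG(x, y) = g''(x) g(y) + g(x) g''(y),
# so a KxK convolution costs O(K) per pixel instead of O(K^2).
# Radius and sigma below are illustrative, not the paper's exact values.
import numpy as np
from scipy.ndimage import convolve1d

def mexican_hat_separable(image: np.ndarray, sigma: float = 23.0, radius: int = 70) -> np.ndarray:
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    g /= g.sum()
    # Second derivative of the 1-D Gaussian (up to an overall scale).
    g2 = (x**2 / sigma**4 - 1.0 / sigma**2) * g
    img = image.astype(np.float64)
    # Two 1-D passes per term replace a single dense 2-D convolution.
    out = convolve1d(convolve1d(img, g2, axis=0), g, axis=1)
    out += convolve1d(convolve1d(img, g, axis=0), g2, axis=1)
    return out
```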
References
Awais, M., et al.: Foundational models defining a new era in vision: a survey and outlook. arXiv preprint arXiv:2307.13721 (2023)
Azulay, A., Weiss, Y.: Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177 (2018)
Bommasani, R., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chaman, A., Dokmanic, I.: Truly shift-invariant convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3773–3783 (2021)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Engstrom, L., Tsipras, D., Schmidt, L., Madry, A.: A rotation and a translation suffice: fooling CNNs with simple transformations (2017)
Fang, Y., et al.: EVA: exploring the limits of masked visual representation learning at scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19358–19369 (2023)
Fonaryov, M., Lindenbaum, M.: On the minimal recognizable image patch. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 6734–6741 (2021). https://doi.org/10.1109/ICPR48806.2021.9412064
Fukushima, K., Miyake, S., Ito, T.: Neocognitron: a neural network model for a mechanism of visual pattern recognition. IEEE Trans. Syst. Man Cybern. 5, 826–834 (1983)
Geirhos, R., et al.: Shortcut learning in deep neural networks. Nat. Mach. Intell. 2(11), 665–673 (2020)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation (2014)
Gunasekar, S.: Generalization to translation shifts: a study in architectures and augmentations. arXiv preprint arXiv:2207.02349 (2022)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hendrycks, D., et al.: The many faces of robustness: a critical analysis of out-of-distribution generalization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340–8349 (2021)
Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261 (2019)
Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136 (2016)
Ilharco, G., et al.: OpenCLIP (2021). https://doi.org/10.5281/zenodo.5143773
LeCun, Y., et al.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989)
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Madan, S., Sasaki, T., Pfister, H., Li, T.M., Boix, X.: Adversarial examples within the training distribution: a widespread challenge. arXiv preprint arXiv:2106.16198 (2021)
Michaeli, H., Michaeli, T., Soudry, D.: Alias-free convnets: fractional shift invariance via polynomial activations (2023)
Naseer, M.M., Ranasinghe, K., Khan, S.H., Hayat, M., Shahbaz Khan, F., Yang, M.H.: Intriguing properties of vision transformers. Adv. Neural. Inf. Process. Syst. 34, 23296–23308 (2021)
Oquab, M., et al.: DINOv2: learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Rahman, M.A., Yeh, R.A.: Truly scale-equivariant deep nets with Fourier layers. arXiv preprint arXiv:2311.02922 (2023)
Rebuffi, S.A., Gowal, S., Calian, D.A., Stimberg, F., Wiles, O., Mann, T.A.: Data augmentation can improve robustness. Adv. Neural. Inf. Process. Syst. 34, 29935–29948 (2021)
Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do ImageNet classifiers generalize to ImageNet? In: International Conference on Machine Learning, pp. 5389–5400. PMLR (2019)
Reddy, V.K., Siramoju, K.K., Sircar, P.: Object detection by 2-D continuous wavelet transform. In: 2014 International Conference on Computational Science and Computational Intelligence, vol. 1, pp. 162–167. IEEE (2014)
Rojas-Gomez, R.A., Lim, T.Y., Do, M.N., Yeh, R.A.: Making vision transformers truly shift-equivariant. arXiv preprint arXiv:2305.16316 (2023)
Rojas-Gomez, R.A., Lim, T.Y., Schwing, A., Do, M., Yeh, R.A.: Learnable polyphase sampling for shift invariant and equivariant convolutional networks. Adv. Neural. Inf. Process. Syst. 35, 35755–35768 (2022)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
Sarvaiya, J.N., Patnaik, S.: Automatic image registration using Mexican hat wavelet, invariant moment, and Radon transform. IJACSA Int. J. Adv. Comput. Sci. Appl. (2011). Special Issue on Image Processing and Analysis
Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. Adv. Neural. Inf. Process. Syst. 35, 25278–25294 (2022)
Srivastava, S., Ben-Yosef, G., Boix, X.: Minimal images in deep neural networks: fragile object recognition in natural images. arXiv preprint arXiv:1902.03227 (2019)
Wortsman, M., et al.: Robust fine-tuning of zero-shot models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7959–7971 (2022)
Wu, B., et al.: Visual transformers: token-based image representation and processing for computer vision (2020)
Wu, H., et al.: CvT: introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22–31 (2021)
Yuan, L., et al.: Florence: a new foundation model for computer vision. arXiv preprint arXiv:2111.11432 (2021)
Zhang, R.: Making convolutional networks shift-invariant again (2019)
Zou, X., Xiao, F., Yu, Z., Li, Y., Lee, Y.J.: Delving deeper into anti-aliasing in convnets. Int. J. Comput. Vis. 131(1), 67–81 (2023)
Acknowledgments
We thank the Gatsby Foundation for their support.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shifman, O., Weiss, Y. (2025). Lost in Translation: Modern Neural Networks Still Struggle with Small Realistic Image Transformations. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15127. Springer, Cham. https://doi.org/10.1007/978-3-031-72890-7_14
DOI: https://doi.org/10.1007/978-3-031-72890-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72889-1
Online ISBN: 978-3-031-72890-7