Abstract
Deep neural networks that achieve remarkable performance in image classification have previously been shown to be easily fooled by tiny transformations such as a one-pixel translation of the input image. Two approaches have been proposed in recent years to address this problem. The first suggests using huge datasets together with data augmentation, in the hope that a highly varied training set will teach the network to be invariant. The second suggests architectural modifications based on sampling theory that deal explicitly with image translations. In this paper, we show that these approaches still fall short in robustly handling ‘natural’ image translations that simulate a subtle change in camera orientation. Our findings reveal that a mere one-pixel translation can result in a significant change in the predicted image representation for approximately 40% of the test images in state-of-the-art models (e.g., OpenCLIP trained on LAION-2B, or DINOv2), while models that are explicitly constructed to be robust to cyclic translations can still be fooled by realistic (non-cyclic) one-pixel translations 11% of the time. We present Robust Inference by Crop Selection (RICS): a simple method that can be proven to achieve any desired level of consistency, albeit with a modest tradeoff in the model’s accuracy. Importantly, we demonstrate that employing this method reduces the ability to fool state-of-the-art models with a one-pixel translation to less than 5%, while suffering only a 1% drop in classification accuracy. Additionally, we show that our method can be easily adjusted to deal with circular shifts as well, in which case we achieve 100% robustness to integer shifts with state-of-the-art accuracy and with no need for further training. Code is available at: https://github.com/ofirshifman/RICS.
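To make the crop-selection idea concrete, here is a minimal sketch in Python. It anchors the crop at the peak of a smoothed (‘Mexican hat’) response map, so that translating the input translates the peak with it and the downstream model sees (nearly) the same crop. The saliency measure, \(\sigma\), and crop size here are illustrative assumptions, not the exact RICS procedure from the paper.

```python
# A minimal sketch of content-aware crop selection for translation
# consistency. This is NOT the paper's exact RICS implementation; the
# saliency measure, sigma, and crop size are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_laplace

def select_consistent_crop(image: np.ndarray, crop: int = 224, sigma: float = 20.0) -> np.ndarray:
    """Crop a window anchored at the peak of a smoothed response map.

    When the input is translated by a few pixels, the response peak
    translates with it, so the selected window covers the same image
    content and the downstream model sees an (almost) identical crop.
    Assumes the image is strictly larger than `crop` in both dimensions.
    """
    gray = image.mean(axis=-1) if image.ndim == 3 else image
    # Laplacian-of-Gaussian ("Mexican hat") response; its argmax moves
    # rigidly with the image content under translation.
    response = -gaussian_laplace(gray.astype(np.float64), sigma=sigma)
    h, w = gray.shape
    half = crop // 2
    # Restrict to anchor positions whose crop fits inside the image.
    valid = response[half : h - half, half : w - half]
    top, left = np.unravel_index(np.argmax(valid), valid.shape)
    return image[top : top + crop, left : left + crop]
```

For a \(256 \times 256\) input this chooses one of the \(33 \times 33\) possible \(224 \times 224\) crops; because the anchor is determined by image content rather than image coordinates, a small shift of the camera leaves the selected crop's content essentially unchanged.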
Notes
1. Typically, ImageNet images are resized to \(256 \times 256\) and then a \(224 \times 224\) crop is selected. Without the need for the consistency evaluation process, we could use our method to select \(224 \times 224\) crops directly from \(256 \times 256\) images, which would likely result in no accuracy loss compared to standard classifiers.
2. The main additional computation required by our method during inference is convolving the image with a \(140 \times 140\) Mexican-hat kernel. Using separable convolutions can make this computation negligible compared to the forward pass of DINOv2 (a separable implementation is sketched after this list).
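To see why separability makes the \(140 \times 140\) convolution in note 2 cheap, recall that the 2-D Laplacian-of-Gaussian (‘Mexican hat’) is not rank-1 separable, but it does split into two separable passes, \(\mathrm{LoG}(x,y) = g''(x)\,g(y) + g(x)\,g''(y)\), which reduces the per-pixel cost of a \(K \times K\) kernel from \(O(K^2)\) to \(O(K)\). The sketch below illustrates this under assumed parameters; the kernel radius and \(\sigma\) are not the paper's exact values.

```python
# Sketch: separable computation of a "Mexican hat" (Laplacian-of-Gaussian)
# filter. The 2-D LoG splits into two separable passes,
#   LoG(x, y) = g''(x) g(y) + g(x) g''(y),
# so a KxK convolution costs O(K) per pixel instead of O(K^2).
# Radius and sigma below are illustrative, not the paper's exact values.
import numpy as np
from scipy.ndimage import convolve1d

def mexican_hat_separable(image: np.ndarray, sigma: float = 23.0, radius: int = 70) -> np.ndarray:
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    g /= g.sum()
    # Second derivative of the 1-D Gaussian (up to an overall scale).
    g2 = (x**2 / sigma**4 - 1.0 / sigma**2) * g
    img = image.astype(np.float64)
    # Two 1-D passes per term replace a single dense 2-D convolution.
    out = convolve1d(convolve1d(img, g2, axis=0), g, axis=1)
    out += convolve1d(convolve1d(img, g, axis=0), g2, axis=1)
    return out
```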
References
Awais, M., et al.: Foundational models defining a new era in vision: a survey and outlook. arXiv preprint arXiv:2307.13721 (2023)
Azulay, A., Weiss, Y.: Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177 (2018)
Bommasani, R., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chaman, A., Dokmanic, I.: Truly shift-invariant convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3773–3783 (2021)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Engstrom, L., Tsipras, D., Schmidt, L., Madry, A.: A rotation and a translation suffice: fooling CNNs with simple transformations (2017)
Fang, Y., et al.: EVA: exploring the limits of masked visual representation learning at scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19358–19369 (2023)
Fonaryov, M., Lindenbaum, M.: On the minimal recognizable image patch. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 6734–6741 (2021). https://doi.org/10.1109/ICPR48806.2021.9412064
Fukushima, K., Miyake, S., Ito, T.: Neocognitron: a neural network model for a mechanism of visual pattern recognition. IEEE Trans. Syst. Man Cybern. 5, 826–834 (1983)
Geirhos, R., et al.: Shortcut learning in deep neural networks. Nat. Mach. Intell. 2(11), 665–673 (2020)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation (2014)
Gunasekar, S.: Generalization to translation shifts: a study in architectures and augmentations. arXiv preprint arXiv:2207.02349 (2022)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hendrycks, D., et al.: The many faces of robustness: a critical analysis of out-of-distribution generalization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340–8349 (2021)
Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261 (2019)
Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136 (2016)
Ilharco, G., et al.: OpenCLIP (2021). https://doi.org/10.5281/zenodo.5143773
LeCun, Y., et al.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989)
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Madan, S., Sasaki, T., Pfister, H., Li, T.M., Boix, X.: Adversarial examples within the training distribution: a widespread challenge. arXiv preprint arXiv:2106.16198 (2021)
Michaeli, H., Michaeli, T., Soudry, D.: Alias-free convnets: fractional shift invariance via polynomial activations (2023)
Naseer, M.M., Ranasinghe, K., Khan, S.H., Hayat, M., Shahbaz Khan, F., Yang, M.H.: Intriguing properties of vision transformers. Adv. Neural. Inf. Process. Syst. 34, 23296–23308 (2021)
Oquab, M., et al.: DINOv2: learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Rahman, M.A., Yeh, R.A.: Truly scale-equivariant deep nets with Fourier layers. arXiv preprint arXiv:2311.02922 (2023)
Rebuffi, S.A., Gowal, S., Calian, D.A., Stimberg, F., Wiles, O., Mann, T.A.: Data augmentation can improve robustness. Adv. Neural. Inf. Process. Syst. 34, 29935–29948 (2021)
Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do ImageNet classifiers generalize to ImageNet? In: International Conference on Machine Learning, pp. 5389–5400. PMLR (2019)
Reddy, V.K., Siramoju, K.K., Sircar, P.: Object detection by 2-D continuous wavelet transform. In: 2014 International Conference on Computational Science and Computational Intelligence, vol. 1, pp. 162–167. IEEE (2014)
Rojas-Gomez, R.A., Lim, T.Y., Do, M.N., Yeh, R.A.: Making vision transformers truly shift-equivariant. arXiv preprint arXiv:2305.16316 (2023)
Rojas-Gomez, R.A., Lim, T.Y., Schwing, A., Do, M., Yeh, R.A.: Learnable polyphase sampling for shift invariant and equivariant convolutional networks. Adv. Neural. Inf. Process. Syst. 35, 35755–35768 (2022)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
Sarvaiya, J.N., Patnaik, S.: Automatic image registration using Mexican hat wavelet, invariant moment, and Radon transform. IJACSA Int. J. Adv. Comput. Sci. Appl. (2011). Special Issue on Image Processing and Analysis
Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. Adv. Neural. Inf. Process. Syst. 35, 25278–25294 (2022)
Srivastava, S., Ben-Yosef, G., Boix, X.: Minimal images in deep neural networks: fragile object recognition in natural images. arXiv preprint arXiv:1902.03227 (2019)
Wortsman, M., et al.: Robust fine-tuning of zero-shot models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7959–7971 (2022)
Wu, B., et al.: Visual transformers: token-based image representation and processing for computer vision (2020)
Wu, H., et al.: CvT: introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22–31 (2021)
Yuan, L., et al.: Florence: a new foundation model for computer vision. arXiv preprint arXiv:2111.11432 (2021)
Zhang, R.: Making convolutional networks shift-invariant again (2019)
Zou, X., Xiao, F., Yu, Z., Li, Y., Lee, Y.J.: Delving deeper into anti-aliasing in convnets. Int. J. Comput. Vis. 131(1), 67–81 (2023)
Acknowledgments
We thank the Gatsby Foundation for their support.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shifman, O., Weiss, Y. (2025). Lost in Translation: Modern Neural Networks Still Struggle with Small Realistic Image Transformations. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15127. Springer, Cham. https://doi.org/10.1007/978-3-031-72890-7_14
DOI: https://doi.org/10.1007/978-3-031-72890-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72889-1
Online ISBN: 978-3-031-72890-7