
SSIM: self-supervised learning method based on spatially selected shifts and irregular image masking

Published in The Journal of Supercomputing.

Abstract

Self-supervised learning has gained popularity because it reduces the cost of labeling large-scale datasets while improving model generalization and representation quality. Among self-supervised techniques, masked learning is a prominent approach. However, current masking methods typically apply regular small-block masks after data augmentation, which are not truly random and can degrade the local correlation between image patches. This paper proposes SSIM, a novel self-supervised learning method based on spatially selected shifts and irregular image masking. The method generates irregular masks by threshold binarization, randomly masks the input image, and then applies spatially selective shifting while aggregating input position information. This approach avoids fixed mask shapes while preserving and enhancing the local correlation between image patches. We benchmark the method on the DINO model, applying irregular random masking and spatially selective shifting. Experiments on the ImageNet10 dataset show improvements in linear-probing and k-NN accuracy of 7.6% and 5.7%, respectively. The results demonstrate that SSIM outperforms existing mask-based self-supervised learning methods. The code for this paper is open source at https://github.com/wangzy2024/SSIM
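As a rough illustration of the two operations named in the abstract, the sketch below generates an irregular binary mask by thresholding spatially smoothed noise (threshold binarization) and applies a simple four-direction spatial shift over channel groups. The function names, the box-blur smoothing, the 40% mask ratio, and the channel-group shifting scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def irregular_mask(h, w, mask_ratio=0.4, blur=3, seed=0):
    """Build an irregular binary mask by thresholding smoothed noise.

    Low-frequency noise is thresholded so the masked region forms
    irregular connected blobs rather than regular square blocks.
    """
    rng = np.random.default_rng(seed)
    noise = rng.random((h, w))
    # Box-blur the noise to introduce spatial correlation (blob-like shapes).
    kernel = np.ones((blur, blur)) / (blur * blur)
    padded = np.pad(noise, blur // 2, mode="edge")
    smooth = np.zeros_like(noise)
    for i in range(h):
        for j in range(w):
            smooth[i, j] = (padded[i:i + blur, j:j + blur] * kernel).sum()
    # Threshold binarization: mask the lowest `mask_ratio` fraction of values.
    thresh = np.quantile(smooth, mask_ratio)
    return (smooth < thresh).astype(np.uint8)  # 1 = masked

def apply_mask(image, mask):
    """Zero out masked pixels; image is (H, W, C), mask is (H, W)."""
    return image * (1 - mask)[..., None]

def spatial_shift(x):
    """Shift four channel groups by one pixel in four directions.

    A simple stand-in for spatially selective shifting: each group sees
    a neighbouring position, mixing information across the spatial grid.
    """
    out = x.copy()
    c = x.shape[-1] // 4
    out[..., 0 * c:1 * c] = np.roll(x[..., 0 * c:1 * c], 1, axis=0)   # down
    out[..., 1 * c:2 * c] = np.roll(x[..., 1 * c:2 * c], -1, axis=0)  # up
    out[..., 2 * c:3 * c] = np.roll(x[..., 2 * c:3 * c], 1, axis=1)   # right
    out[..., 3 * c:4 * c] = np.roll(x[..., 3 * c:4 * c], -1, axis=1)  # left
    return out
```

Because the mask comes from smoothed noise rather than a fixed grid of blocks, unmasked pixels still border masked ones along irregular boundaries, which is the property the paper argues helps preserve local correlation.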


[Figures 1–8 and Algorithms 1–2 appear in the full article.]


Data availability

The underlying data for this paper can be obtained from the following sources: ImageNet100 (https://image-net.org/), CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html), and Fashion-MNIST (https://www.kaggle.com).


Acknowledgements

This research was supported by the National Natural Science Foundation of China under Grant 62376017 and by the Fundamental Research Funds for the Central Universities (buctrc202221).

Author information

Corresponding author

Correspondence to Lingfeng Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Shao, Y., Wang, Z. & Wang, L. SSIM: self-supervised learning method based on spatially selected shifts and irregular image masking. J Supercomput 81, 490 (2025). https://doi.org/10.1007/s11227-025-07013-3

