Abstract
Self-supervised learning has gained popularity because it reduces the cost of labeling large-scale datasets while improving model generalization and representation quality. Among self-supervised techniques, masked learning is a prominent approach. However, current masking methods typically apply regular small-block masks after data augmentation; such masks are not truly random and can degrade the local correlation between image patches. This paper proposes a novel self-supervised learning method based on spatially selected shifts and irregular image masks (SSIM). The method generates irregular masks by threshold binarization, randomly masks the input image, and then performs spatially selective shifting together with an operation that aggregates input position information. This design avoids fixed mask shapes while preserving and enhancing the local correlation between image patches. We benchmark the method on the DINO model, applying irregular random masking and spatially selective shifting. Experiments on the ImageNet100 dataset show improvements in linear-probing and k-NN accuracy of 7.6% and 5.7%, respectively, demonstrating that SSIM outperforms existing mask-based self-supervised learning methods. The code is open source at https://github.com/wangzy2024/SSIM.
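To make the masking step concrete, the sketch below shows one plausible way to produce an irregular binary mask by threshold-binarizing smoothed random noise and applying it to an image. This is a minimal illustration under assumed parameters (`threshold`, `smooth`, the box-blur smoothing step), not the authors' implementation; the function names `irregular_mask` and `apply_mask` are hypothetical.

```python
import numpy as np

def irregular_mask(h, w, threshold=0.6, smooth=3, seed=0):
    """Build an irregular {0,1} mask of shape (h, w) via threshold binarization."""
    rng = np.random.default_rng(seed)
    noise = rng.random((h, w))
    # Box-blur the noise so binarization yields irregular blobs rather than
    # isolated pixels (a simple stand-in for any smoothing choice).
    kernel = np.ones((smooth, smooth)) / (smooth * smooth)
    padded = np.pad(noise, smooth // 2, mode="edge")
    blurred = np.empty_like(noise)
    for i in range(h):
        for j in range(w):
            blurred[i, j] = (padded[i:i + smooth, j:j + smooth] * kernel).sum()
    return (blurred > threshold).astype(np.float32)  # 1 = masked, 0 = visible

def apply_mask(image, mask):
    """Zero out masked pixels; image is (H, W, C), mask is (H, W)."""
    return image * (1.0 - mask)[..., None]
```

Unlike regular block masking, the masked region here has no fixed shape, which is the property the abstract attributes to threshold binarization.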
Data availability
The underlying data on which this paper relies can be obtained from the following sources: ImageNet100: https://image-net.org/; CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html; Fashion-MNIST: https://www.kaggle.com
Acknowledgements
This research was supported by the National Natural Science Foundation of China under Grant 62376017 and by the Fundamental Research Funds for the Central Universities under Grant buctrc202221.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Cite this article
Shao, Y., Wang, Z. & Wang, L. SSIM: self-supervised learning method based on spatially selected shifts and irregular image masking. J Supercomput 81, 490 (2025). https://doi.org/10.1007/s11227-025-07013-3