Abstract
Self-supervised learning has gained popularity because it reduces the cost of labeling large-scale datasets while improving model generalization and representation quality. Among self-supervised techniques, masked learning is a prominent approach. However, current masking methods typically apply regular small-block masks after data augmentation; such masks are not truly random and can degrade the local correlation between image patches. This paper proposes a novel self-supervised learning method based on spatially selected shifts and irregular image masks (SSIM). The method generates irregular masks by threshold binarization, randomly masks the input image, and then performs spatially selective shifting together with an operation that aggregates input position information. This design avoids fixed mask shapes while preserving and enhancing the local correlation between image patches. We benchmark the method on the DINO model, applying irregular random masking and spatially selective shifting. Experiments on the ImageNet100 dataset show improvements in linear-probing and k-NN accuracy of 7.6% and 5.7%, respectively, demonstrating that SSIM outperforms existing mask-based self-supervised learning methods. The code is open source at https://github.com/wangzy2024/SSIM.
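To make the masking step concrete, the sketch below shows one plausible way to produce an irregular binary mask by threshold-binarizing smoothed random noise and applying it to an image. This is a minimal illustration under assumed parameters (`threshold`, `smooth`, the box-blur smoothing step), not the authors' implementation; the function names `irregular_mask` and `apply_mask` are hypothetical.

```python
import numpy as np

def irregular_mask(h, w, threshold=0.6, smooth=3, seed=0):
    """Build an irregular {0,1} mask of shape (h, w) via threshold binarization."""
    rng = np.random.default_rng(seed)
    noise = rng.random((h, w))
    # Box-blur the noise so binarization yields irregular blobs rather than
    # isolated pixels (a simple stand-in for any smoothing choice).
    kernel = np.ones((smooth, smooth)) / (smooth * smooth)
    padded = np.pad(noise, smooth // 2, mode="edge")
    blurred = np.empty_like(noise)
    for i in range(h):
        for j in range(w):
            blurred[i, j] = (padded[i:i + smooth, j:j + smooth] * kernel).sum()
    return (blurred > threshold).astype(np.float32)  # 1 = masked, 0 = visible

def apply_mask(image, mask):
    """Zero out masked pixels; image is (H, W, C), mask is (H, W)."""
    return image * (1.0 - mask)[..., None]
```

Unlike regular block masking, the masked region here has no fixed shape, which is the property the abstract attributes to threshold binarization.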
Data availability
The underlying data on which this paper relies can be obtained from the following sources: ImageNet100: https://image-net.org/; CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html; Fashion-MNIST: https://www.kaggle.com
Acknowledgements
This research was supported by the National Natural Science Foundation of China under Grant 62376017 and by the Fundamental Research Funds for the Central Universities under Grant buctrc202221.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Cite this article
Shao, Y., Wang, Z. & Wang, L. SSIM: self-supervised learning method based on spatially selected shifts and irregular image masking. J Supercomput 81, 490 (2025). https://doi.org/10.1007/s11227-025-07013-3