Abstract
Image classification, a core subdomain of computer vision, categorizes images into predefined classes. The proliferation of handheld devices and image sensors has made huge amounts of unlabeled image data available. Supervised learning is therefore unsuitable for categorizing these images, since it requires labels; unsupervised clustering is also of limited use, as its accuracy is unreliable when the data are not labeled in advance. Self-supervised learning techniques can overcome this problem. In this work, we present a novel Swin Transformer based Contrastive Self-Supervised Learning method (Swin-TCSSL). A paired sample is formed by applying transformations to the given input image, and this pair is passed to the Swin-T transformer, which produces feature vectors. Maximizing the mutual information between these feature vectors yields robust clusters, and the cluster labels are propagated back to the Swin Transformer block until appropriate clusters are obtained. Contrastive learning then follows, finally producing the classified output. The experimental results show that the proposed system is invariant to occlusion, viewpoint variation, and illumination effects. The proposed Swin-TCSSL achieves state-of-the-art results on five benchmark datasets: CIFAR-10, Snapshot Serengeti, Stanford Dogs, Animals with Attributes, and ImageNet. As evident from the rigorous experiments, Swin-TCSSL sets a new state of the art with an average accuracy of 97.63%, higher than existing systems.
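To make the pipeline described above concrete, the following is a minimal PyTorch sketch of one such training step, not the authors' implementation: two augmented views of each image are encoded by a Swin-T backbone, the mutual-information clustering stage is approximated with an IIC-style objective over soft cluster assignments, and the views are pulled together with a SimCLR-style contrastive (NT-Xent) loss. The projection size, cluster count, temperature, and augmentation recipe are illustrative assumptions.

```python
# Illustrative sketch of a Swin-based contrastive self-supervised step.
# Assumptions: torchvision >= 0.13 (for swin_t), 128-d projection head,
# K = 10 clusters, temperature 0.5 -- none of these come from the paper.
import torch
import torch.nn.functional as F
import torchvision.transforms as T
from torchvision.models import swin_t

# Two random transformations of the same image form the positive pair.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.ToTensor(),
])

backbone = swin_t(weights=None)        # Swin-T feature extractor
backbone.head = torch.nn.Identity()    # expose the 768-d feature vector
projector = torch.nn.Sequential(       # projection head (assumed size)
    torch.nn.Linear(768, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128))
cluster_head = torch.nn.Linear(768, 10)  # soft cluster assignments (K = 10)

def nt_xent(z1, z2, tau=0.5):
    """SimCLR-style contrastive loss over a batch of paired embeddings."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)     # (2N, d)
    sim = z @ z.t() / tau                           # cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float('-inf'))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)            # positive = other view

def mi_cluster_loss(p1, p2, eps=1e-8):
    """IIC-style stand-in for the MI step: maximize mutual information
    between the two views' soft cluster assignments (each (N, K))."""
    joint = (p1.unsqueeze(2) * p2.unsqueeze(1)).mean(dim=0)  # (K, K)
    joint = ((joint + joint.t()) / 2).clamp_min(eps)
    pi = joint.sum(dim=1, keepdim=True)              # marginals
    pj = joint.sum(dim=0, keepdim=True)
    return -(joint * (joint.log() - pi.log() - pj.log())).sum()

def train_step(images, optimizer):
    v1 = torch.stack([augment(img) for img in images])  # view 1
    v2 = torch.stack([augment(img) for img in images])  # view 2
    h1, h2 = backbone(v1), backbone(v2)
    loss = (nt_xent(projector(h1), projector(h2))
            + mi_cluster_loss(F.softmax(cluster_head(h1), dim=1),
                              F.softmax(cluster_head(h2), dim=1)))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

Maximizing the mutual information between the two views' cluster assignments encourages assignments that are stable under the applied transformations, which is consistent with the invariance to occlusion, viewpoint, and illumination claimed in the abstract.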
Data availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest or competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Agilandeeswari, L., Meena, S.D. SWIN transformer based contrastive self-supervised learning for animal detection and classification. Multimed Tools Appl 82, 10445–10470 (2023). https://doi.org/10.1007/s11042-022-13629-x