
SWIN transformer based contrastive self-supervised learning for animal detection and classification


Abstract

Image classification, a subdomain of computer vision, is the task of categorizing images into predefined classes. The proliferation of handheld devices and image sensors has made enormous amounts of unlabeled image data available. Supervised learning algorithms are unsuitable for categorizing such data because they require labels, and unsupervised clustering is also of limited use, since its accuracy is unreliable when the data are not labeled in advance. Self-supervised learning techniques can overcome this problem. In this work, we present a novel Swin Transformer based Contrastive Self-Supervised Learning framework (Swin-TCSSL). A paired sample is formed by applying transformations to the given input image, and both views are passed through the Swin-T backbone, which produces a feature vector for each. Mutual information between these feature vectors is maximized to form robust clusters, and the resulting cluster labels are propagated back to the Swin Transformer block until appropriate clusters are obtained. Contrastive learning then refines the representation and produces the final classification. Experimental results show that the proposed system is robust to occlusion, viewpoint variation, and illumination effects. Swin-TCSSL achieves state-of-the-art results on five benchmark datasets: CIFAR-10, Snapshot Serengeti, Stanford Dogs, Animals with Attributes, and ImageNet. Across these rigorous experiments, Swin-TCSSL reaches an average accuracy of 97.63%, higher than existing state-of-the-art systems.
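To make the paired-view contrastive step concrete, the sketch below shows one training iteration with a Swin-T backbone. It is a minimal illustration under stated assumptions, not the authors' implementation: the timm model name, the projection-head sizes, the temperature, and the augmentation choices are assumptions of this sketch, and the InfoNCE objective stands in as a standard lower bound on the mutual information between the two views.

```python
# Minimal sketch of one paired-view contrastive step (not the paper's code).
# Assumes PyTorch, torchvision, and timm are installed; all hyperparameters
# below are illustrative.
import torch
import torch.nn.functional as F
import timm
from torchvision import transforms

# Two random transformations of the same image form the paired sample.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

# Swin-T backbone returning a pooled feature vector (num_classes=0),
# followed by a small projection head for the contrastive space.
encoder = timm.create_model("swin_tiny_patch4_window7_224", num_classes=0)
projector = torch.nn.Sequential(
    torch.nn.Linear(encoder.num_features, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 128),
)

def info_nce(z1, z2, tau=0.1):
    """InfoNCE loss over a batch of paired views; maximizing agreement
    between the two views lower-bounds their mutual information."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
    sim = z @ z.t() / tau                              # cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device),
                     float("-inf"))                    # drop self-pairs
    # The positive for sample i is its other view (index i+n or i-n).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def train_step(v1, v2, optimizer):
    """One optimisation step on a batch of augmented view pairs (v1, v2)."""
    z1, z2 = projector(encoder(v1)), projector(encoder(v2))
    loss = info_nce(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(projector.parameters()), lr=1e-4)
```

In the full Swin-TCSSL pipeline, this contrastive objective is combined with the mutual-information-based clustering stage described above, whose cluster assignments are propagated back as pseudo-labels; that stage is omitted from this sketch.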


Data availability

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.


Author information


Corresponding author

Correspondence to L. Agilandeeswari.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest or competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Agilandeeswari, L., Meena, S.D. SWIN transformer based contrastive self-supervised learning for animal detection and classification. Multimed Tools Appl 82, 10445–10470 (2023). https://doi.org/10.1007/s11042-022-13629-x

