
Improved transferability of self-supervised learning models through batch normalization finetuning

Published in: Applied Intelligence

Abstract

The abundance of unlabelled data and advances in Self-Supervised Learning (SSL) have made it the preferred choice in many transfer learning scenarios. Due to the rapid and ongoing development of SSL approaches, practitioners are now faced with an overwhelming number of models trained for a specific task or domain, calling for a method to estimate transfer performance on novel tasks or domains. Typically, the role of such an estimator is played by linear probing, which trains a linear classifier on top of the frozen feature extractor. In this work we address a shortcoming of linear probing: it is not very strongly correlated with the performance of models finetuned end-to-end (the latter often being the final objective in transfer learning) and, in some cases, catastrophically misestimates a model's potential. We propose a way to obtain a significantly better proxy task by unfreezing and jointly finetuning the batch normalization layers together with the classification head. At the cost of additionally training only 0.16% of the model parameters (for ResNet-50), we obtain a proxy task that (i) has a stronger correlation with end-to-end finetuned performance, (ii) improves on linear probing in the many- and few-shot learning regimes and (iii) in some cases outperforms both linear probing and end-to-end finetuning, reaching state-of-the-art performance on a pathology dataset. Finally, we analyze and discuss the changes that batch normalization training introduces in the feature distributions, which may explain the improved performance. The code is available at https://github.com/vpulab/bn_finetuning.
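The proxy task described in the abstract reduces to a parameter-selection rule: freeze the backbone, then mark only the batch normalization affine parameters and the classification head as trainable. The sketch below mocks a model as a (parameter name, size) map to show that rule in isolation; the names (`bn*`, `fc.*`) and sizes are illustrative assumptions, not the authors' implementation or the real ResNet-50 parameter list.

```python
# Sketch of the BN-finetuning proxy task: given named parameters,
# keep only batch-norm affine parameters and the classifier head trainable.
# Names and sizes below are hypothetical, not the real ResNet-50 layout.

def select_trainable(named_params):
    """Return the subset of parameters to finetune: BN affines + head."""
    def is_bn(name):
        # Module-path parts, excluding the final "weight"/"bias" leaf.
        return any(part.startswith("bn") for part in name.split(".")[:-1])
    return {n: s for n, s in named_params.items()
            if is_bn(n) or n.startswith("fc.")}

# Toy (name -> parameter count) map standing in for a frozen backbone.
params = {
    "conv1.weight": 9408,
    "bn1.weight": 64,
    "bn1.bias": 64,
    "layer1.0.conv1.weight": 4096,
    "layer1.0.bn1.weight": 64,
    "layer1.0.bn1.bias": 64,
    "fc.weight": 204800,
    "fc.bias": 100,
}

trainable = select_trainable(params)
fraction = sum(trainable.values()) / sum(params.values())
print(sorted(trainable))  # BN affines + head only
```

In a real framework the same rule would disable gradients for every parameter and re-enable them only for the BN `weight`/`bias` tensors and the head before building the optimizer; the toy fraction above is dominated by the head and is not meant to reproduce the 0.16% figure.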



Data availability and access

The data used in this study are publicly available on the Internet.

Notes

  1. https://github.com/facebookresearch/vissl/

  2. Empirically, we found this setup to perform slightly better.


Acknowledgements

This work was supported by the Ministerio de Ciencia e Innovación de España under projects TED2021-131643A-I00 (SEGA-CV) and PID2021-125051OB-I00 (HVD).

Author information

Authors and Affiliations

Authors

Contributions

Kirill Sirotkin: conceptualization, coding, writing, original draft preparation, investigation. Marcos Escudero-Viñolo, Pablo Carballeira, Álvaro García-Martín: methodology, investigation, validation, review, editing.

Corresponding author

Correspondence to Kirill Sirotkin.

Ethics declarations

Competing interests

The authors declare no conflicts of interest.

Ethical and informed consent for data used

This article does not contain any studies with human participants or animals. All datasets used in this article are publicly available on the Internet.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A  Many-shot learning

Tables 4, 5, 6, 7, 8, 9, 10 and 11 provide the complete results for the experiments in the many-shot regime described in Section 4.1 of the main paper. As the tables show, BN finetuning consistently outperforms standard linear probing across the studied datasets, with the exception of DTD (see Table 9), where linear evaluation reaches slightly higher accuracy for the DeepCluster-v2, BYOL, SeLa-v2, MoCo-v2, SwAV and Supervised models. Moreover, Tables 4 and 6 further illustrate the advantages of BN finetuning on the datasets for which the improvement with respect to linear probing is most significant (see Fig. 3 and Section 4.1).

Table 4 Mean classification accuracy after transferring ImageNet-pretrained SSL models to Aircraft dataset using (i) linear probing, (ii) linear probing with finetuning BN layers and (iii) finetuning the whole model end-to-end
Table 5 Mean classification accuracy after transferring ImageNet-pretrained SSL models to Caltech-101 dataset using (i) linear probing, (ii) linear probing with finetuning BN layers and (iii) finetuning the whole model end-to-end
Table 6 Top-1 classification accuracy after transferring ImageNet-pretrained SSL models to Stanford Cars dataset using (i) linear probing, (ii) linear probing with finetuning BN layers and (iii) finetuning the whole model end-to-end
Table 7 Top-1 classification accuracy after transferring ImageNet-pretrained SSL models to CIFAR-10 dataset using (i) linear probing, (ii) linear probing with finetuning BN layers and (iii) finetuning the whole model end-to-end
Table 8 Top-1 classification accuracy after transferring ImageNet-pretrained SSL models to CIFAR-100 dataset using (i) linear probing, (ii) linear probing with finetuning BN layers and (iii) finetuning the whole model end-to-end
Table 9 Top-1 classification accuracy after transferring ImageNet-pretrained SSL models to DTD dataset using (i) linear probing, (ii) linear probing with finetuning BN layers and (iii) finetuning the whole model end-to-end
Table 10 Mean classification accuracy after transferring ImageNet-pretrained SSL models to Flowers dataset using (i) linear probing, (ii) linear probing with finetuning BN layers and (iii) finetuning the whole model end-to-end
Table 11 Mean classification accuracy after transferring ImageNet-pretrained SSL models to Pets dataset using (i) linear probing, (ii) linear probing with finetuning BN layers and (iii) finetuning the whole model end-to-end

Appendix B  BatchNorm parameters

Additionally, we provide the complete results on the distributions of relative changes of BatchNorm affine and normalization parameters before and after BN finetuning for every studied model (see Section 4.3). To improve readability, the results are arranged as an HTML page available at http://www-vpu.eps.uam.es/publications/BN-FT_SSL_transferability/. The results for the models not discussed in the main paper reinforce its conclusions and suggest that the gain in performance and the magnitude of parameter change are correlated.
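As a concrete reading of the statistic behind those distributions, the relative change of a parameter can be computed element-wise as |after - before| / (|before| + eps). The snippet below is a minimal sketch with made-up values, not the paper's analysis code.

```python
# Relative change of a parameter vector before vs. after BN finetuning:
# |after - before| / (|before| + eps), computed element-wise.
def relative_change(before, after, eps=1e-8):
    return [abs(a - b) / (abs(b) + eps) for b, a in zip(before, after)]

gamma_before = [1.00, 0.95, 1.10]  # illustrative BN scale (gamma) values
gamma_after = [1.20, 0.95, 0.88]   # hypothetical values after finetuning
changes = relative_change(gamma_before, gamma_after)
print(changes)
```

Aggregating such per-parameter changes per layer yields the kind of distribution shown on the HTML page; the small eps only guards against division by zero for parameters that start near zero.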

Appendix C  Distributions of feature embeddings

Finally, we provide the complete results on the differences between the distributions of feature embeddings before and after BN finetuning for the Aircraft and Stanford Cars datasets (see Section 4.3). To improve readability, the results are arranged as an HTML page available at http://www-vpu.eps.uam.es/publications/BN-FT_SSL_transferability/. Similarly to the results for BYOL and PCL-v1 depicted in Fig. 5, the results for the other SSL models show the same trend: the learned BN parameters lead to better alignment of the source (upstream) and target (downstream) features.
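One simple way to quantify such source/target alignment is the average per-dimension gap between the mean embeddings of the two feature sets. This metric is our choice for illustration only; the paper's exact comparison of feature distributions may differ.

```python
# Average absolute gap between per-dimension means of two feature sets.
# A smaller gap indicates better-aligned source/target feature distributions.
def mean_feature_gap(source, target):
    dims = len(source[0])
    gap = 0.0
    for d in range(dims):
        mu_s = sum(f[d] for f in source) / len(source)
        mu_t = sum(f[d] for f in target) / len(target)
        gap += abs(mu_s - mu_t)
    return gap / dims

src = [[0.0, 1.0], [2.0, 3.0]]  # toy upstream embeddings
tgt = [[2.0, 1.0], [2.0, 3.0]]  # toy downstream embeddings
print(mean_feature_gap(src, tgt))
```

Under this metric, a drop in the gap after BN finetuning would be consistent with the alignment trend reported above.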

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Sirotkin, K., Escudero-Viñolo, M., Carballeira, P. et al. Improved transferability of self-supervised learning models through batch normalization finetuning. Appl Intell 54, 11281–11294 (2024). https://doi.org/10.1007/s10489-024-05758-7
