Abstract
The abundance of unlabelled data and advances in Self-Supervised Learning (SSL) have made it the preferred choice in many transfer learning scenarios. Due to the rapid and ongoing development of SSL approaches, practitioners are now faced with an overwhelming number of models trained for a specific task/domain, calling for a method to estimate transfer performance on novel tasks/domains. Typically, the role of such an estimator is played by linear probing, which trains a linear classifier on top of the frozen feature extractor. In this work we address a shortcoming of linear probing: it is not very strongly correlated with the performance of models finetuned end-to-end (often the final objective in transfer learning) and, in some cases, catastrophically misestimates a model's potential. We propose a way to obtain a significantly better proxy task by unfreezing and jointly finetuning the batch normalization layers together with the classification head. At the cost of training only an extra 0.16% of model parameters (in the case of ResNet-50), we obtain a proxy task that (i) has a stronger correlation with end-to-end finetuned performance, (ii) improves over linear probing in the many- and few-shot learning regimes, and (iii) in some cases outperforms both linear probing and end-to-end finetuning, reaching state-of-the-art performance on a pathology dataset. Finally, we analyze and discuss the changes that batch normalization training introduces in the feature distributions, which may be the reason for the improved performance. The code is available at https://github.com/vpulab/bn_finetuning.
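To illustrate the proxy task described above, the following is a minimal PyTorch sketch, assuming a torchvision ResNet-50 backbone with SSL-pretrained weights loaded where indicated; it is not the authors' reference implementation (that is available at the repository above). It freezes the backbone, unfreezes only the BatchNorm affine parameters, and trains them jointly with a new classification head.

```python
# Minimal sketch of BN + head finetuning, assuming a torchvision ResNet-50.
# The SSL-pretrained checkpoint name below is hypothetical.
import torch
import torch.nn as nn
from torchvision import models

def build_bn_finetune_model(num_classes: int) -> nn.Module:
    model = models.resnet50()
    # model.load_state_dict(torch.load("ssl_pretrained_resnet50.pth"))  # hypothetical checkpoint

    # Freeze the whole backbone.
    for p in model.parameters():
        p.requires_grad = False

    # Unfreeze only the affine (gamma/beta) parameters of every BatchNorm layer.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            for p in m.parameters():
                p.requires_grad = True

    # Replace the classification head; its parameters are trainable by default.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

model = build_bn_finetune_model(num_classes=100)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable fraction (BN affine + head): {100 * trainable / total:.2f}%")

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.1, momentum=0.9
)
```

Note that the printed fraction includes the new classification head, so it is slightly larger than the BatchNorm-only fraction quoted in the abstract; hyperparameters such as the learning rate are placeholders.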
Data availability and access
The data used in this study are publicly available on the Internet.
Notes
Empirically, we have found this setup to perform slightly better.
Acknowledgements
This work was supported by the Ministerio de Ciencia e Innovación de España under projects TED2021-131643A-I00 (SEGA-CV) and PID2021-125051OB-I00 (HVD).
Author information
Contributions
Kirill Sirotkin: conceptualization, coding, writing, original draft preparation, investigation. Marcos Escudero-Viñolo, Pablo Carballeira, Álvaro García-Martín: methodology, investigation, validation, review, editing.
Ethics declarations
Competing interests
The authors have no conflict of interest.
Ethical and informed consent for data used
This article does not contain any studies with human participants or animals. All datasets used in this article are publicly available on the Internet.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Many-shot learning
Tables 4, 5, 6, 7, 8, 9, 10 and 11 provide the complete results for the experiments in the many-shot regime, described in Section 4.1 of the main paper. As can be seen from the tables, BN finetuning consistently outperforms standard linear probing across the studied datasets, with the exception of DTD (see Table 9), where linear evaluation reaches a slightly higher accuracy for the DeepCluster-v2, BYOL, SeLa-v2, MoCo-v2, SwAV and Supervised models. Moreover, Tables 4 and 6 further illustrate the advantages of BN finetuning on the datasets for which the improvement with respect to linear probing is most significant (see Fig. 3 and Section 4.1).
Appendix B BatchNorm parameters
Additionally, we provide the complete results on the distributions of relative changes of BatchNorm affine and normalization parameters before and after BN finetuning for every studied model (see Section 4.3). To improve readability, the results are arranged as an HTML page available at http://www-vpu.eps.uam.es/publications/BN-FT_SSL_transferability/. The results obtained for the models other than those mentioned in the main paper reinforce the conclusions and demonstrate that the gain in performance and the magnitude of parameter change appear to be correlated.
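The relative changes reported here can be reproduced, up to implementation details, with a short comparison of the two checkpoints. The sketch below is an assumption of how such a measurement could be done (it is not the authors' analysis code): it walks the BatchNorm layers of the pretrained and BN-finetuned ResNet-50 and computes the per-channel relative change of the affine parameters and running statistics.

```python
# Sketch (assumed, not the authors' code) of measuring relative BN parameter changes.
import torch
import torch.nn as nn
from torchvision import models

def bn_relative_changes(before: nn.Module, after: nn.Module, eps: float = 1e-12):
    """Per-channel relative change |after - before| / (|before| + eps) for each BN tensor."""
    changes = {"weight": [], "bias": [], "running_mean": [], "running_var": []}
    for (_, m_b), (_, m_a) in zip(before.named_modules(), after.named_modules()):
        if isinstance(m_b, nn.BatchNorm2d):
            for key in changes:
                b = getattr(m_b, key).detach()
                a = getattr(m_a, key).detach()
                changes[key].append((a - b).abs() / (b.abs() + eps))
    return {k: torch.cat(v) for k, v in changes.items()}

# Hypothetical usage: both models would carry the weights before/after BN finetuning.
pretrained = models.resnet50()
finetuned = models.resnet50()
stats = bn_relative_changes(pretrained, finetuned)
for key, vals in stats.items():
    print(f"{key}: median relative change = {vals.median().item():.4f}")
```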
Appendix C Distributions of feature embeddings
Finally, we provide the complete results on the differences between the distributions of feature embeddings before and after BN finetuning for the Aircraft and Stanford Cars datasets (see Section 4.3). To improve readability, the results are arranged as an HTML page available at http://www-vpu.eps.uam.es/publications/BN-FT_SSL_transferability/. Similarly to the results for BYOL and PCL-v1 depicted in Fig. 5, the results for other SSL models show the same trend: the learned BN parameters lead to a better alignment of source (upstream) and target (downstream) features.
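As a rough, hedged illustration of what "alignment" means here (an assumption about the analysis, not the exact procedure behind Fig. 5), one can compare per-dimension statistics of penultimate-layer embeddings extracted from the upstream and downstream data; a smaller gap after BN finetuning would indicate better-aligned feature distributions. The loader names below are hypothetical.

```python
# Sketch: per-dimension embedding statistics for source vs. target data.
import torch
import torch.nn as nn
from torchvision import models

@torch.no_grad()
def embedding_stats(model: nn.Module, loader):
    """Mean and std of penultimate-layer (pre-fc) features over a data loader."""
    backbone = nn.Sequential(*list(model.children())[:-1])  # drop the fc head
    backbone.eval()
    feats = []
    for images, _ in loader:
        feats.append(backbone(images).flatten(1))  # (batch, 2048) for ResNet-50
    feats = torch.cat(feats)
    return feats.mean(dim=0), feats.std(dim=0)

@torch.no_grad()
def distribution_gap(model, source_loader, target_loader) -> float:
    """Average per-dimension gap between source and target feature statistics."""
    mu_s, sd_s = embedding_stats(model, source_loader)
    mu_t, sd_t = embedding_stats(model, target_loader)
    return ((mu_s - mu_t).abs() + (sd_s - sd_t).abs()).mean().item()

# Hypothetical usage with ImageNet as source and Aircraft as target:
# gap_before = distribution_gap(pretrained_model, imagenet_loader, aircraft_loader)
# gap_after = distribution_gap(bn_finetuned_model, imagenet_loader, aircraft_loader)
```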
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sirotkin, K., Escudero-Viñolo, M., Carballeira, P. et al. Improved transferability of self-supervised learning models through batch normalization finetuning. Appl Intell 54, 11281–11294 (2024). https://doi.org/10.1007/s10489-024-05758-7