Abstract
The abundance of unlabelled data and advances in Self-Supervised Learning (SSL) have made it the preferred choice in many transfer learning scenarios. Due to the rapid and ongoing development of SSL approaches, practitioners are now faced with an overwhelming number of models trained for a specific task/domain, calling for a method to estimate transfer performance on novel tasks/domains. Typically, the role of such an estimator is played by linear probing, which trains a linear classifier on top of the frozen feature extractor. In this work we address a shortcoming of linear probing: it is not very strongly correlated with the performance of models finetuned end-to-end (often the final objective in transfer learning) and, in some cases, catastrophically misestimates a model's potential. We propose a way to obtain a significantly better proxy task by unfreezing and jointly finetuning the batch normalization layers together with the classification head. At the cost of training only an extra 0.16% of model parameters (in the case of ResNet-50), we obtain a proxy task that (i) has a stronger correlation with end-to-end finetuned performance, (ii) improves over linear probing in the many- and few-shot learning regimes, and (iii) in some cases outperforms both linear probing and end-to-end finetuning, reaching state-of-the-art performance on a pathology dataset. Finally, we analyze and discuss the changes that batch normalization training introduces in the feature distributions, which may be the reason for the improved performance. The code is available at https://github.com/vpulab/bn_finetuning.
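To illustrate the proxy task described above, the following is a minimal PyTorch sketch, assuming a torchvision ResNet-50 backbone with SSL-pretrained weights loaded where indicated; it is not the authors' reference implementation (that is available at the repository above). It freezes the backbone, unfreezes only the BatchNorm affine parameters, and trains them jointly with a new classification head.

```python
# Minimal sketch of BN + head finetuning, assuming a torchvision ResNet-50.
# The SSL-pretrained checkpoint name below is hypothetical.
import torch
import torch.nn as nn
from torchvision import models

def build_bn_finetune_model(num_classes: int) -> nn.Module:
    model = models.resnet50()
    # model.load_state_dict(torch.load("ssl_pretrained_resnet50.pth"))  # hypothetical checkpoint

    # Freeze the whole backbone.
    for p in model.parameters():
        p.requires_grad = False

    # Unfreeze only the affine (gamma/beta) parameters of every BatchNorm layer.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            for p in m.parameters():
                p.requires_grad = True

    # Replace the classification head; its parameters are trainable by default.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

model = build_bn_finetune_model(num_classes=100)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable fraction (BN affine + head): {100 * trainable / total:.2f}%")

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.1, momentum=0.9
)
```

Note that the printed fraction includes the new classification head, so it is slightly larger than the BatchNorm-only fraction quoted in the abstract; hyperparameters such as the learning rate are placeholders.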
Data availability and access
The data used in this study are publicly available on the Internet.
Notes
Empirically, we have found this setup to perform slightly better.
Acknowledgements
This work was supported by the Ministerio de Ciencia e Innovación de España under projects TED2021-131643A-I00 (SEGA-CV) and PID2021-125051OB-I00 (HVD).
Author information
Contributions
Kirill Sirotkin: conceptualization, coding, writing, original draft preparation, investigation. Marcos Escudero-Viñolo, Pablo Carballeira, Álvaro García-Martín: methodology, investigation, validation, review, editing.
Ethics declarations
Competing interests
The authors have no conflict of interest.
Ethical and informed consent for data used
This article does not contain any studies with human participants or animals. All datasets used in this article are publicly available on the Internet.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Many-shot learning
Tables 4, 5, 6, 7, 8, 9, 10 and 11 provide the complete results for the experiments in the many-shot regime, described in Section 4.1 of the main paper. As can be seen from the tables, BN finetuning consistently outperforms standard linear probing across the studied datasets, with the exception of DTD (see Table 9), where linear evaluation reaches a slightly higher accuracy for the DeepCluster-v2, BYOL, SeLa-v2, MoCo-v2, SwAV and Supervised models. Moreover, Tables 4 and 6 further illustrate the advantages of BN finetuning on the datasets for which the improvement with respect to linear probing is most significant (see Fig. 3 and Section 4.1).
Appendix B BatchNorm parameters
Additionally, we provide the complete results on the distributions of relative changes of BatchNorm affine and normalization parameters before and after BN finetuning for every studied model (see Section 4.3). To improve readability, the results are arranged as an HTML page available at http://www-vpu.eps.uam.es/publications/BN-FT_SSL_transferability/. The results obtained for the models other than those mentioned in the main paper reinforce the conclusions and demonstrate that the gain in performance and the magnitude of parameter change appear to be correlated.
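The relative changes reported here can be reproduced, up to implementation details, with a short comparison of the two checkpoints. The sketch below is an assumption of how such a measurement could be done (it is not the authors' analysis code): it walks the BatchNorm layers of the pretrained and BN-finetuned ResNet-50 and computes the per-channel relative change of the affine parameters and running statistics.

```python
# Sketch (assumed, not the authors' code) of measuring relative BN parameter changes.
import torch
import torch.nn as nn
from torchvision import models

def bn_relative_changes(before: nn.Module, after: nn.Module, eps: float = 1e-12):
    """Per-channel relative change |after - before| / (|before| + eps) for each BN tensor."""
    changes = {"weight": [], "bias": [], "running_mean": [], "running_var": []}
    for (_, m_b), (_, m_a) in zip(before.named_modules(), after.named_modules()):
        if isinstance(m_b, nn.BatchNorm2d):
            for key in changes:
                b = getattr(m_b, key).detach()
                a = getattr(m_a, key).detach()
                changes[key].append((a - b).abs() / (b.abs() + eps))
    return {k: torch.cat(v) for k, v in changes.items()}

# Hypothetical usage: both models would carry the weights before/after BN finetuning.
pretrained = models.resnet50()
finetuned = models.resnet50()
stats = bn_relative_changes(pretrained, finetuned)
for key, vals in stats.items():
    print(f"{key}: median relative change = {vals.median().item():.4f}")
```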
Appendix C Distributions of feature embeddings
Finally, we provide the complete results on the differences between the distributions of feature embeddings before and after BN finetuning for the Aircraft and Stanford Cars datasets (see Section 4.3). To improve readability, the results are arranged as an HTML page available at http://www-vpu.eps.uam.es/publications/BN-FT_SSL_transferability/. Similarly to the results for BYOL and PCL-v1 depicted in Fig. 5, the results for other SSL models show the same trend: the learned BN parameters lead to a better alignment of source (upstream) and target (downstream) features.
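As a rough, hedged illustration of what "alignment" means here (an assumption about the analysis, not the exact procedure behind Fig. 5), one can compare per-dimension statistics of penultimate-layer embeddings extracted from the upstream and downstream data; a smaller gap after BN finetuning would indicate better-aligned feature distributions. The loader names below are hypothetical.

```python
# Sketch: per-dimension embedding statistics for source vs. target data.
import torch
import torch.nn as nn
from torchvision import models

@torch.no_grad()
def embedding_stats(model: nn.Module, loader):
    """Mean and std of penultimate-layer (pre-fc) features over a data loader."""
    backbone = nn.Sequential(*list(model.children())[:-1])  # drop the fc head
    backbone.eval()
    feats = []
    for images, _ in loader:
        feats.append(backbone(images).flatten(1))  # (batch, 2048) for ResNet-50
    feats = torch.cat(feats)
    return feats.mean(dim=0), feats.std(dim=0)

@torch.no_grad()
def distribution_gap(model, source_loader, target_loader) -> float:
    """Average per-dimension gap between source and target feature statistics."""
    mu_s, sd_s = embedding_stats(model, source_loader)
    mu_t, sd_t = embedding_stats(model, target_loader)
    return ((mu_s - mu_t).abs() + (sd_s - sd_t).abs()).mean().item()

# Hypothetical usage with ImageNet as source and Aircraft as target:
# gap_before = distribution_gap(pretrained_model, imagenet_loader, aircraft_loader)
# gap_after = distribution_gap(bn_finetuned_model, imagenet_loader, aircraft_loader)
```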
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sirotkin, K., Escudero-Viñolo, M., Carballeira, P. et al. Improved transferability of self-supervised learning models through batch normalization finetuning. Appl Intell 54, 11281–11294 (2024). https://doi.org/10.1007/s10489-024-05758-7