Abstract
This paper addresses self-supervised video representation learning focused on motion features, aiming to capture foreground motion while reducing reliance on background cues. Recent successful methods often rely on instance discrimination, which entails heavy computation and can make pretraining inefficient and exhaustive. To address this, we utilize the Mask-Augmentation teChnique (MAC). MAC blends foreground motion into clips using frame-difference-based masks and sets up a pretext task of recognizing the applied transformation. By requiring the model to predict the correct blending multiplier during pretraining, we compel it to encode motion-based features, which transfer successfully to downstream tasks such as action recognition. Moreover, we extend our approach within a joint contrastive framework, integrating additional tasks in the spatial and temporal domains to further strengthen the learned representations. Experimental results show that our method achieves superior performance on the UCF-101, HMDB51, and Diving-48 datasets under low-resource settings, and results competitive with instance discrimination methods under computationally expensive settings.
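To make the pretext task concrete, below is a minimal PyTorch sketch of a MAC-style augmentation. The mask construction, blending rule, and multiplier values here are illustrative assumptions for exposition, not the authors' exact formulation; `frame_difference_mask` and `mac_augment` are hypothetical helpers.

```python
import torch

def frame_difference_mask(clip: torch.Tensor) -> torch.Tensor:
    # clip: (C, T, H, W) in [0, 1]; absolute frame differences, averaged over channels
    diff = (clip[:, 1:] - clip[:, :-1]).abs().mean(dim=0, keepdim=True)  # (1, T-1, H, W)
    diff = torch.cat([diff, diff[:, -1:]], dim=1)  # repeat last map so the mask covers all T frames
    return diff / (diff.amax(dim=(2, 3), keepdim=True) + 1e-6)  # per-frame normalization to [0, 1]

def mac_augment(clip: torch.Tensor, lam: float) -> torch.Tensor:
    # Blend the motion mask into the clip; the pretext task is to recover `lam`.
    mask = frame_difference_mask(clip)  # (1, T, H, W), broadcasts over channels
    return ((1.0 - lam) * clip + lam * mask).clamp(0.0, 1.0)

# Pretext setup: each candidate multiplier defines one class (values illustrative).
multipliers = [0.0, 0.25, 0.5, 0.75]
clip = torch.rand(3, 16, 112, 112)  # dummy clip: C=3, T=16, H=W=112
label = int(torch.randint(len(multipliers), (1,)).item())
augmented = mac_augment(clip, multipliers[label])
# A video backbone is then trained to classify `augmented` into `label`,
# which forces it to attend to the motion regions rather than the background.
```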
![Fig. 1](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11760-024-03644-w/MediaObjects/11760_2024_3644_Fig1_HTML.png)
![Fig. 2](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11760-024-03644-w/MediaObjects/11760_2024_3644_Fig2_HTML.png)
![Fig. 3](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11760-024-03644-w/MediaObjects/11760_2024_3644_Fig3_HTML.png)
Data availability
The data used to support the findings of this study are available from the corresponding author upon request.
Funding
This declaration is not applicable.
Author information
Contributions
This work was performed as part of the Ph.D. studies of the first author at Hacettepe University.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
This declaration is not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Akar, A., Senturk, U.U. & Ikizler-Cinbis, N. Mitigating background bias in self-supervised video representation learning. SIViP 19, 55 (2025). https://doi.org/10.1007/s11760-024-03644-w
DOI: https://doi.org/10.1007/s11760-024-03644-w