Abstract
This paper addresses self-supervised video representation learning focused on motion features, aiming to capture foreground motion while reducing reliance on background cues. Recent successful methods often rely on instance discrimination, which entails heavy computation and can make pretraining inefficient and exhaustive. To address this, we utilize the Mask-Augmentation teChnique (MAC). MAC blends foreground motion into clips using frame-difference-based masks and sets up a pretext task of recognizing the applied transformation. By requiring the model to predict the correct blending multiplier during pretraining, we compel it to encode motion-based features, which transfer successfully to downstream tasks such as action recognition. Moreover, we extend our approach within a joint contrastive framework, integrating additional tasks in the spatial and temporal domains to further strengthen the learned representations. Experimental results show that our method achieves superior performance on the UCF-101, HMDB51, and Diving-48 datasets under low-resource settings, and results competitive with instance discrimination methods under computationally expensive settings.
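To make the pretext task concrete, below is a minimal PyTorch sketch of a MAC-style augmentation. The mask construction, blending rule, and multiplier values here are illustrative assumptions for exposition, not the authors' exact formulation; `frame_difference_mask` and `mac_augment` are hypothetical helpers.

```python
import torch

def frame_difference_mask(clip: torch.Tensor) -> torch.Tensor:
    # clip: (C, T, H, W) in [0, 1]; absolute frame differences, averaged over channels
    diff = (clip[:, 1:] - clip[:, :-1]).abs().mean(dim=0, keepdim=True)  # (1, T-1, H, W)
    diff = torch.cat([diff, diff[:, -1:]], dim=1)  # repeat last map so the mask covers all T frames
    return diff / (diff.amax(dim=(2, 3), keepdim=True) + 1e-6)  # per-frame normalization to [0, 1]

def mac_augment(clip: torch.Tensor, lam: float) -> torch.Tensor:
    # Blend the motion mask into the clip; the pretext task is to recover `lam`.
    mask = frame_difference_mask(clip)  # (1, T, H, W), broadcasts over channels
    return ((1.0 - lam) * clip + lam * mask).clamp(0.0, 1.0)

# Pretext setup: each candidate multiplier defines one class (values illustrative).
multipliers = [0.0, 0.25, 0.5, 0.75]
clip = torch.rand(3, 16, 112, 112)  # dummy clip: C=3, T=16, H=W=112
label = int(torch.randint(len(multipliers), (1,)).item())
augmented = mac_augment(clip, multipliers[label])
# A video backbone is then trained to classify `augmented` into `label`,
# which forces it to attend to the motion regions rather than the background.
```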
![Fig. 1](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11760-024-03644-w/MediaObjects/11760_2024_3644_Fig1_HTML.png)
![Fig. 2](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11760-024-03644-w/MediaObjects/11760_2024_3644_Fig2_HTML.png)
![Fig. 3](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11760-024-03644-w/MediaObjects/11760_2024_3644_Fig3_HTML.png)
Data availability
The data used to support the findings of this study are available from the corresponding author upon request.
Funding
This declaration is not applicable.
Author information
Contributions
This work was performed as part of the Ph.D. studies of the first author at Hacettepe University.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
This declaration is not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Akar, A., Senturk, U.U. & Ikizler-Cinbis, N. Mitigating background bias in self-supervised video representation learning. SIViP 19, 55 (2025). https://doi.org/10.1007/s11760-024-03644-w
DOI: https://doi.org/10.1007/s11760-024-03644-w