
Spatiotemporal consistency enhancement self-supervised representation learning for action recognition

  • Original Paper
  • Signal, Image and Video Processing

Abstract

Self-supervised learning has shown enormous potential for extracting valuable features from abundant unlabeled image data. For video, however, a model needs powerful representation capabilities to exploit the rich spatiotemporal information and fully explore the internal relationships between different instances. This paper presents a novel spatiotemporal consistency enhancement self-supervised representation learning method for action recognition. In contrast to typical contrastive learning methods, which merely use positive–negative pairs to learn invariant features, we design spatiotemporal data augmentations for feature similarity comparison. Specifically, we first extract motion information from the video frames so that the augmented samples depict the same action as the original video. We then blend static frames into these motion features to construct distracting positive samples, which mitigates the effect of action-irrelevant variables on the model's discrimination. In addition, we corrupt the temporal order of the video frames to generate an extra category of negative samples, which the model learns to distinguish from the original frames by their temporal differences. Finally, the learned features are applied to the downstream action recognition task, and experimental results show that the method improves recognition accuracy on the UCF101 and HMDB51 action datasets.
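The abstract outlines three sample-construction steps (motion extraction, static-frame blending for distracting positives, and temporal shuffling for extra negatives) feeding a contrastive objective. Below is a minimal, hypothetical PyTorch sketch of that pipeline; the paper's actual encoder, augmentation details, and loss formulation are not given on this page, so frame differencing as the motion cue, the blend weight `alpha`, the InfoNCE temperature `tau`, and the toy encoder are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the augmentation-and-contrast idea from the abstract.
import torch
import torch.nn.functional as F

def motion_frames(clip):
    """Approximate motion information by temporal frame differencing.

    clip: (B, C, T, H, W) video tensor; returns a tensor of the same
    temporal length by differencing consecutive frames and repeating the last.
    """
    diff = clip[:, :, 1:] - clip[:, :, :-1]           # (B, C, T-1, H, W)
    return torch.cat([diff, diff[:, :, -1:]], dim=2)  # pad back to length T

def distracting_positive(clip, alpha=0.5):
    """Blend a static frame into the motion features so the positive sample
    keeps the same action but gains action-irrelevant (static) content."""
    motion = motion_frames(clip)
    static = clip[:, :, :1].expand_as(clip)           # repeat the first frame
    return alpha * motion + (1 - alpha) * static

def shuffled_negative(clip):
    """Corrupt the temporal order of the frames to build an extra negative."""
    perm = torch.randperm(clip.shape[2])
    return clip[:, :, perm]

def nce_loss(anchor, positive, negatives, tau=0.07):
    """InfoNCE-style loss: pull the positive close, push negatives away."""
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negatives = F.normalize(negatives, dim=1)
    pos = (anchor * positive).sum(dim=1, keepdim=True) / tau  # (B, 1)
    neg = anchor @ negatives.t() / tau                        # (B, B)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(anchor.shape[0], dtype=torch.long)   # positive is index 0
    return F.cross_entropy(logits, labels)

# Toy usage with a stand-in encoder; a 3D CNN backbone would be used in practice.
if __name__ == "__main__":
    encoder = torch.nn.Sequential(
        torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 16 * 16, 128))
    clip = torch.randn(4, 3, 8, 16, 16)                # (B, C, T, H, W)
    z_anchor = encoder(clip)
    z_pos = encoder(distracting_positive(clip))
    z_neg = encoder(shuffled_negative(clip))
    print(nce_loss(z_anchor, z_pos, z_neg).item())
```

In this sketch the temporally shuffled clips simply join the negative pool; the paper additionally distinguishes them from the originals by their temporal differences, a detail omitted here for brevity.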


Data availability

The datasets generated and/or analyzed during the current study (UCF101 and HMDB51) are publicly available online.


Funding

This work was supported by the National Natural Science Foundation of China under Grants 61771420 and 62001413, and by the Natural Science Foundation of Hebei Province under Grant F2020203064.

Author information


Contributions

SB and ZH prepared the first draft of the manuscript, MZ contributed significantly to the analysis and manuscript preparation, SL helped perform the analysis through constructive discussions, and ZS reviewed the manuscript and checked the grammar. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zhengping Hu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest related to this work.

Ethics approval and consent to participate

This study did not involve any animals or human participants and does not raise ethical concerns.

Consent for publication

All authors consented to the publication of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Bi, S., Hu, Z., Zhao, M. et al. Spatiotemporal consistency enhancement self-supervised representation learning for action recognition. SIViP 17, 1485–1492 (2023). https://doi.org/10.1007/s11760-022-02357-2

