Abstract
Self-supervised learning has shown enormous potential for extracting valuable features from abundant unlabeled image data. For video, however, it requires models with powerful representation capabilities that can exploit the rich spatiotemporal information and fully explore the internal relationships between different instances. This paper describes a novel spatiotemporal consistency enhancement method for self-supervised representation learning in action recognition. In contrast to typical contrastive learning methods, which merely use positive–negative pairs to learn invariant features, we design data augmentations of spatiotemporal information for feature similarity comparison. Specifically, we first extract motion information from the video frames, which preserves the same action as the original video. We then add static frames to these motion features to construct distracting positive video samples, which mitigates the effect of irrelevant variables on model discrimination. In addition, we corrupt the order of the video frames to generate extra categories of negative samples and distinguish them from the original frames by their temporal differences. Finally, the learned features are used for the downstream action recognition task, and experimental results show that the method improves recognition accuracy on the UCF101 and HMDB51 video action datasets.
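The three kinds of samples named in the abstract (motion-based positives, distracting positives with added static frames, and order-corrupted negatives) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how motion is extracted or how static frames are combined with it, so frame differencing, the blending weight `alpha`, and all function names below are assumptions made only for illustration.

```python
import numpy as np


def motion_features(clip):
    """Frame differencing as a stand-in for motion extraction (assumption)."""
    clip = np.asarray(clip, dtype=np.float32)
    return clip[1:] - clip[:-1]              # shape (T-1, H, W, C)


def distracting_positive(clip, rng, alpha=0.5):
    """Overlay one static frame on every motion frame.

    The weight alpha is hypothetical; the source gives no value.
    """
    clip = np.asarray(clip, dtype=np.float32)
    motion = motion_features(clip)
    static = clip[rng.integers(len(clip))]   # one frame, broadcast over time
    return motion + alpha * static


def shuffled_negative(clip, rng):
    """Permute the frame order, retrying until the order actually changes."""
    clip = np.asarray(clip)
    t = len(clip)
    if t < 2:
        raise ValueError("need at least two frames to shuffle")
    order = np.arange(t)
    while np.array_equal(order, np.arange(t)):
        order = rng.permutation(t)
    return clip[order]


rng = np.random.default_rng(0)
clip = rng.random((8, 16, 16, 3), dtype=np.float32)   # toy clip: T, H, W, C
motion = motion_features(clip)
positive = distracting_positive(clip, rng)
negative = shuffled_negative(clip, rng)
```

In a contrastive setup, `motion` and `positive` would be pulled toward the representation of the original clip, while `negative` would be pushed away from it; the encoder and loss are outside the scope of this sketch.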
Data availability
The datasets generated and/or analyzed during the current study are publicly available online.
Funding
This work was supported by the National Natural Science Foundation of China under Grants 61771420 and 62001413, and the National Natural Science Foundation of Hebei Province under Grant F2020203064.
Author information
Contributions
SB and ZH prepared the first draft of the manuscript. MZ contributed significantly to the analysis and manuscript preparation. SL helped perform the analysis through constructive discussions. ZS reviewed the manuscript and checked the grammar. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest regarding this work.
Ethics approval and consent to participate
This study did not involve any human or animal subjects and raises no ethical concerns.
Consent for publication
All the authors agreed to publish the article.
About this article
Cite this article
Bi, S., Hu, Z., Zhao, M. et al. Spatiotemporal consistency enhancement self-supervised representation learning for action recognition. SIViP 17, 1485–1492 (2023). https://doi.org/10.1007/s11760-022-02357-2
DOI: https://doi.org/10.1007/s11760-022-02357-2