Abstract
Self-supervised learning has become a prevalent paradigm in both the image and video domains because of the difficulty of obtaining large amounts of annotated data. In this paper, we adopt the self-supervised learning paradigm and propose to learn 3D video representations by identifying spatio-temporal transformations. Specifically, we choose a set of transformations and apply them to unlabelled videos to change the spatio-temporal structure of these videos. By identifying which transformation was applied, the network learns about both the spatial appearance and the temporal relations of video frames. In this paper, we choose spatio-temporal rotations as the transformations. We conduct extensive experiments to validate the effectiveness of the proposed method. After fine-tuning on action recognition benchmarks, our model yields a remarkable gain of 29.6% on UCF101 and 25.1% on HMDB51 over models trained from scratch, which is competitive with current advanced methods.
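The pretext task described above can be sketched in a few lines: each unlabelled clip is expanded into several transformed copies, and the index of the applied transformation serves as a free supervision label. The sketch below uses only planar 90-degree spatial rotations for illustration; the paper's full transformation set (spatio-temporal rotations) and the 3D network that classifies them are not reproduced here, and all function names are hypothetical.

```python
import numpy as np

def spatial_rotations(clip):
    """Apply each of the four planar rotations (0, 90, 180, 270 degrees)
    to every frame of a video clip of shape (T, H, W, C)."""
    # np.rot90 with axes=(1, 2) rotates in the (H, W) plane of each frame.
    return [np.rot90(clip, k=k, axes=(1, 2)) for k in range(4)]

def make_rotation_batch(clip):
    """Turn one unlabelled clip into four (transformed_clip, label) pairs
    for the self-supervised rotation-identification task."""
    rotated = spatial_rotations(clip)
    labels = list(range(4))  # class k corresponds to rotation by 90*k degrees
    return rotated, labels

# Example: a random 16-frame RGB clip of 32x32 frames.
clip = np.random.rand(16, 32, 32, 3)
rotated, labels = make_rotation_batch(clip)
```

A 3D CNN would then be trained with a standard cross-entropy loss to predict `labels` from `rotated`, forcing it to encode both appearance and frame-to-frame structure without any human annotation.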
Cite this article
Geng, S., Zhao, S. & Liu, H. Video representation learning by identifying spatio-temporal transformations. Appl Intell 52, 6613–6622 (2022). https://doi.org/10.1007/s10489-021-02790-9