
Video representation learning by identifying spatio-temporal transformations

Published in: Applied Intelligence

Abstract

Self-supervised learning has become a prevalent paradigm in both the image and video domains because large amounts of annotated data are difficult to obtain. In this paper, we adopt the self-supervised learning paradigm and propose to learn 3D video representations by identifying spatio-temporal transformations. Specifically, we choose a set of transformations and apply them to unlabelled videos to change their spatio-temporal structure. By identifying which transformation was applied, the network learns about both the spatial appearance and the temporal relations of video frames. In this paper, we choose spatio-temporal rotations as the transformations. We conduct extensive experiments to validate the effectiveness of the proposed method. After fine-tuning on action recognition benchmarks, our model yields a remarkable gain of 29.6% on UCF101 and 25.1% on HMDB51 over models trained from scratch, which is competitive with current state-of-the-art methods.
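The pretext task described above can be sketched in code: apply a fixed set of transformations to an unlabelled clip, and have the network classify which one was applied. The sketch below is illustrative only; the particular transformation set (four spatial 90° rotations, each with and without time reversal) and the clip shape are assumptions for demonstration, not necessarily the paper's exact configuration.

```python
import numpy as np

def spatio_temporal_rotations(clip):
    """Generate pseudo-labelled training pairs from one unlabelled clip.

    `clip` has shape (T, H, W, C). Each pair is (transformed clip, label),
    where the label indexes which spatio-temporal transformation was applied.
    """
    pairs = []
    label = 0
    for reverse_time in (False, True):
        # Optionally reverse the temporal axis.
        t = clip[::-1] if reverse_time else clip
        for k in range(4):  # 0, 90, 180, 270 degrees in the spatial plane
            pairs.append((np.rot90(t, k=k, axes=(1, 2)), label))
            label += 1
    return pairs

# A dummy 16-frame clip of 112x112 RGB frames.
clip = np.random.rand(16, 112, 112, 3)
pairs = spatio_temporal_rotations(clip)
assert len(pairs) == 8
```

Each transformed clip would then be fed to a 3D CNN trained on an 8-way classification of the applied transformation; the learned backbone is afterwards fine-tuned on the downstream action recognition task.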




Corresponding author

Correspondence to Hu Liu.


About this article


Cite this article

Geng, S., Zhao, S. & Liu, H. Video representation learning by identifying spatio-temporal transformations. Appl Intell 52, 6613–6622 (2022). https://doi.org/10.1007/s10489-021-02790-9
