
Unsupervised video-based action recognition using two-stream generative adversarial network

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Video-based action recognition faces many challenges, such as complex and varied dynamic motion, spatio-temporally similar action factors, and the cost of manually labeling archived videos in large datasets. Extracting discriminative spatio-temporal action features from videos, while resisting the effect of similar factors, in an unsupervised manner is therefore pivotal. To this end, this paper proposes an unsupervised video-based action recognition method, called the two-stream generative adversarial network (TS-GAN), which comprehensively learns the static texture and dynamic motion information inherent in videos, taking both detailed and global information into account. Specifically, the spatio-temporal information in videos is extracted by a two-stream GAN. Considering that proper attention to detail can alleviate the influence of spatio-temporally similar factors on the network, a global-detailed layer is proposed to resist such factors by fusing intermediate features (i.e., detailed action information) with high-level semantic features (i.e., global action information). It is worth mentioning that, unlike recent unsupervised video-based action recognition methods, the proposed TS-GAN requires neither complex pretext tasks nor the construction of positive and negative sample pairs. Extensive experiments on the UCF101 and HMDB51 datasets demonstrate that the proposed TS-GAN is superior to multiple classical and state-of-the-art unsupervised action recognition methods.
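To make the idea concrete, below is a minimal PyTorch-style sketch of the two core components the abstract describes: a pair of stream encoders (one for RGB appearance, one for stacked optical flow) and a global-detailed fusion layer that combines intermediate (detailed) features with high-level (global) features. This is not the authors' implementation; all module names, layer sizes, and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """Toy convolutional encoder standing in for one GAN discriminator stream."""
    def __init__(self, in_channels):
        super().__init__()
        # Early layers: intermediate feature map (detailed action information).
        self.low = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        # Later layers: high-level semantic map (global action information).
        self.high = nn.Sequential(
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        detail = self.low(x)
        global_ = self.high(detail)
        return detail, global_

class GlobalDetailedFusion(nn.Module):
    """Fuse detailed and global features into a single descriptor."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fuse = nn.Linear(128 + 512, out_dim)

    def forward(self, detail, global_):
        d = self.pool(detail).flatten(1)   # (B, 128)
        g = self.pool(global_).flatten(1)  # (B, 512)
        return torch.relu(self.fuse(torch.cat([d, g], dim=1)))

class TwoStreamFeatures(nn.Module):
    """Spatial stream on an RGB frame, temporal stream on stacked optical flow."""
    def __init__(self, flow_channels=2):
        super().__init__()
        self.spatial = StreamEncoder(3)
        self.temporal = StreamEncoder(flow_channels)
        self.fuse_s = GlobalDetailedFusion()
        self.fuse_t = GlobalDetailedFusion()

    def forward(self, rgb, flow):
        feat_s = self.fuse_s(*self.spatial(rgb))
        feat_t = self.fuse_t(*self.temporal(flow))
        # Concatenated descriptor for a downstream (e.g., linear) classifier.
        return torch.cat([feat_s, feat_t], dim=1)

if __name__ == "__main__":
    model = TwoStreamFeatures()
    rgb = torch.randn(2, 3, 64, 64)
    flow = torch.randn(2, 2, 64, 64)
    print(model(rgb, flow).shape)  # torch.Size([2, 512])
```

In the actual TS-GAN these features would come from the adversarially trained discriminators of the two streams; the sketch only illustrates how detailed and global features can be fused into one descriptor.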


Data availability

No new data were created during the study.

Notes

  1. UCF101 data source: https://www.crcv.ucf.edu/research/data-sets/ucf101/.

  2. HMDB51 data source: http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/.
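For readers who wish to experiment with these datasets, the following is a minimal sketch, assuming OpenCV and NumPy are available, of uniformly sampling RGB frames from a clip; the frame count and example path are placeholders, not values from the paper.

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=16):
    """Uniformly sample up to `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = set(np.linspace(0, max(total - 1, 0), num_frames, dtype=int))
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i in indices:
            # OpenCV decodes to BGR; convert to RGB for typical model input.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        i += 1
    cap.release()
    return np.stack(frames) if frames else np.empty((0,))

# Hypothetical usage with a UCF101 clip path:
# clip = sample_frames("UCF101/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi")
```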


Acknowledgements

This work was supported in part by the National Key R&D Program of China under Grant 2021YFE0205400, in part by the National Natural Science Foundation of China under Grants 61871434 and 61976098, in part by the Natural Science Foundation for Outstanding Young Scholars of Fujian Province under Grant 2022J06023, in part by the Natural Science Foundation of Fujian Province under Grant 2022J01294, and in part by the Collaborative Innovation Platform Project of the Fuzhou-Xiamen-Quanzhou National Independent Innovation Demonstration Zone under Grant 2021FX03.

Author information


Corresponding author

Correspondence to Huanqiang Zeng.

Ethics declarations

Conflict of interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled 'Unsupervised Video-Based Action Recognition Using Two-Stream Generative Adversarial Network.'

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lin, W., Zeng, H., Zhu, J. et al. Unsupervised video-based action recognition using two-stream generative adversarial network. Neural Comput & Applic 36, 5077–5091 (2024). https://doi.org/10.1007/s00521-023-09333-y

