Abstract
Video-based action recognition faces many challenges, such as complex and varied dynamic motion, spatio-temporally similar action factors, and the cost of manually labeling archived videos across large datasets. Extracting discriminative spatio-temporal action features from videos in an unsupervised manner, while resisting the influence of similar factors, is therefore pivotal. To this end, this paper proposes an unsupervised video-based action recognition method, called two-stream generative adversarial network (TS-GAN), which comprehensively learns the static texture and dynamic motion information inherent in videos, taking both detailed and global information into account. Specifically, spatio-temporal information in videos is extracted by a two-stream GAN. Since proper attention to detail can alleviate the influence of spatio-temporally similar factors on the network, a global-detailed layer is proposed to resist such factors by fusing intermediate features (i.e., detailed action information) with high-level semantic features (i.e., global action information). It is worth mentioning that, compared with recent unsupervised video-based action recognition methods, the proposed TS-GAN requires neither complex pretext tasks nor the construction of positive and negative sample pairs. Extensive experiments conducted on the UCF101 and HMDB51 datasets demonstrate that the proposed TS-GAN is superior to multiple classical and state-of-the-art unsupervised action recognition methods.
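The global-detailed fusion idea described in the abstract, combining intermediate (detailed) features with high-level semantic (global) features, can be sketched roughly as follows. The array shapes, function names, and simple pool-and-concatenate strategy here are illustrative assumptions for exposition, not the authors' exact layer design.

```python
import numpy as np

def global_average_pool(feature_map):
    """Pool an intermediate feature map (H, W, C) down to a C-dim vector."""
    return feature_map.mean(axis=(0, 1))

def global_detailed_fusion(intermediate_map, global_features):
    """Fuse detailed (intermediate) and global (high-level) action features.

    intermediate_map : (H, W, C) activations from a middle layer (detail info)
    global_features  : (D,) high-level semantic vector (global info)
    Returns a single fused descriptor via concatenation.
    """
    detail = global_average_pool(intermediate_map)
    return np.concatenate([detail, global_features])

# Toy example: an 8x8 intermediate map with 16 channels fused with a
# 32-dim global semantic vector yields a 48-dim fused descriptor.
fused = global_detailed_fusion(np.random.rand(8, 8, 16), np.random.rand(32))
print(fused.shape)  # (48,)
```

The intent of such a fusion is that the pooled mid-level activations retain fine-grained cues that distinguish visually similar actions, which the high-level vector alone may discard.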
Data availability
No new data were created during the study.
Notes
UCF101 data sources: https://www.crcv.ucf.edu/research/data-sets/ucf101/.
HMDB51 data sources: http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/.
Acknowledgements
This work was supported in part by the National Key R&D Program of China under the Grant 2021YFE0205400, in part by the National Natural Science Foundation of China under the Grants 61871434 and 61976098, in part by the Natural Science Foundation for Outstanding Young Scholars of Fujian Province under the Grant 2022J06023, in part by the Natural Science Foundation of Fujian Province under the Grant 2022J01294, and in part by the Collaborative Innovation Platform Project of Fuzhou-Xiamen-Quanzhou National Independent Innovation Demonstration Zone under the Grant 2021FX03.
Ethics declarations
Conflict of interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled ‘Unsupervised Video-Based Action Recognition Using Two-Stream Generative Adversarial Network.’
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lin, W., Zeng, H., Zhu, J. et al. Unsupervised video-based action recognition using two-stream generative adversarial network. Neural Comput & Applic 36, 5077–5091 (2024). https://doi.org/10.1007/s00521-023-09333-y