Abstract
Compressed video action recognition aims to classify actions directly in compressed videos rather than in decoded (standard) videos. It benefits from fast training and inference because it avoids processing redundant information. However, off-the-shelf methods still rely on costly manual labels for training. In this paper, we propose a self-supervised compressed video action recognition method based on Momentum Contrast (MoCo) and temporal-consistent sampling. We integrate temporal-consistent sampling into MoCo to improve the feature representations learned from each input modality of compressed video. We further introduce modality-oriented fine-tuning for the downstream compressed video action recognition task.
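The core idea of temporal-consistent sampling is that the two contrastive views of a clip share the same temporal positions, so a MoCo positive pair differs only in spatial augmentation, not in time. The sketch below is illustrative only, assuming a simple contiguous-clip sampler; the paper's exact sampling strategy and modality handling are not specified in the abstract.

```python
import random


def temporal_consistent_sample(num_frames, clip_len, num_views=2, seed=None):
    """Sample one set of frame indices and share it across all views.

    Illustrative sketch: both contrastive views receive identical
    temporal indices (temporal consistency), so any difference between
    the positive pair comes from spatial augmentation alone.
    """
    rng = random.Random(seed)
    start = rng.randint(0, num_frames - clip_len)
    indices = list(range(start, start + clip_len))
    # Every view gets a copy of the same temporal indices.
    return [list(indices) for _ in range(num_views)]


# Two MoCo views of a 64-frame clip, each an 8-frame window.
views = temporal_consistent_sample(num_frames=64, clip_len=8, seed=0)
```

In an actual pipeline, each list of indices would select frames (or motion vectors / residuals, per modality) before independent spatial augmentation is applied to each view.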
Extensive experiments demonstrate the effectiveness of our method across datasets and backbones. Compared with state-of-the-art self-supervised learning methods for decoded videos, our method achieves the highest accuracy, 57.8%, on the HMDB51 dataset.
Supported by NSFC (No. 61972157), the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), the Shanghai Science and Technology Commission (21511101200), and the Art Major Project of the National Social Science Fund (18ZD22).
References
Alwassel, H., Mahajan, D., Korbar, B., Torresani, L., Ghanem, B., Tran, D.: Self-supervised learning by cross-modal audio-video clustering. In: NeurIPS (2020)
Benaim, S., et al.: SpeedNet: learning the speediness in videos. In: CVPR (2020)
Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the kinetics dataset. In: CVPR (2017)
Chen, P., et al.: RSPNet: relative speed perception for unsupervised video representation learning. In: AAAI (2021)
Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning. arXiv:2003.04297 (2020)
Dave, I., Gupta, R., Rizve, M.N., Shah, M.: TCLR: temporal contrastive learning for video representation. arXiv:2101.07974 (2021)
Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. TPAMI 39(4), 677–691 (2017)
Han, T., Xie, W., Zisserman, A.: Video representation learning by dense predictive coding. In: ICCV Workshop (2019)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR (2020)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Huo, Y., et al.: Lightweight action recognition in compressed videos. In: Bartoli, A., Fusiello, A. (eds.) ECCV 2020. LNCS, vol. 12536, pp. 337–352. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-66096-3_24
Jing, L., Yang, X., Liu, J., Tian, Y.: Self-supervised spatiotemporal feature learning via video rotation prediction. arXiv:1811.11387 (2019)
Kalfaoglu, M.E., Kalkan, S., Alatan, A.A.: Late temporal modeling in 3D CNN architectures with BERT for action recognition. In: Bartoli, A., Fusiello, A. (eds.) ECCV 2020. LNCS, vol. 12539, pp. 731–747. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-68238-5_48
Kay, W., et al.: The kinetics human action video dataset. arXiv:1705.06950 (2017)
Kim, D., Cho, D., Kweon, I.S.: Self-supervised video representation learning with space-time cubic puzzles. In: AAAI (2019)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Komkov, S., Dzabraev, M., Petiushko, A.: Mutual modality learning for video action classification. arXiv:2011.02543 (2020)
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: ICCV (2011)
Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: ICCV (2019)
Ma, C.Y., Chen, M.H., Kira, Z., AlRegib, G.: TS-LSTM and temporal-inception: exploiting spatiotemporal dynamics for activity recognition. Sig. Process.: Image Commun. 71, 76–87 (2019)
Morgado, P., Vasconcelos, N., Misra, I.: Audio-visual instance discrimination with cross-modal agreement. arXiv:2004.12943 (2020)
Oord, A.V.D., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv:1807.03748 (2019)
Pan, T., Song, Y., Yang, T., Jiang, W., Liu, W.: VideoMoCo: contrastive video representation learning with temporally adversarial examples. arXiv:2103.05905 (2021)
Qian, R., et al.: Spatiotemporal contrastive video representation learning. arXiv:2008.03800 (2021)
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. IJCV 128(2), 336–359 (2019)
Shou, Z., et al.: DMC-Net: generating discriminative motion cues for fast compressed video action recognition. In: CVPR (2019)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NeurIPS (2014)
Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402 (2012)
Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV (2015)
Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: CVPR (2018)
Wang, L., et al.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_2
Wu, C.Y., Zaheer, M., Hu, H., Manmatha, R., Smola, A.J., Krähenbühl, P.: Compressed video action recognition. In: CVPR (2018)
Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: CVPR (2018)
Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV (2018)
© 2021 Springer Nature Switzerland AG
Cite this paper
Chen, P., Lin, S., Zhang, Y., Xu, J., Tan, X., Ma, L. (2021). Self-supervised Compressed Video Action Recognition via Temporal-Consistent Sampling. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Lecture Notes in Computer Science(), vol 13111. Springer, Cham. https://doi.org/10.1007/978-3-030-92273-3_20
Print ISBN: 978-3-030-92272-6
Online ISBN: 978-3-030-92273-3