Multi-view Action Recognition Using Cross-View Video Prediction

Vyas, Shruti; Rawat, Yogesh S.; Shah, Mubarak

doi:10.1007/978-3-030-58583-9_26

Shruti Vyas¹²,
Yogesh S. Rawat¹² &
Mubarak Shah¹²

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 12372))

Included in the following conference series:

European Conference on Computer Vision

3813 Accesses
19 Citations

Abstract

In this work, we address the problem of action recognition in a multi-view environment. Most of the existing approaches utilize pose information for multi-view action recognition. We focus on RGB modality instead and propose an unsupervised representation learning framework, which encodes the scene dynamics in videos captured from multiple viewpoints via predicting actions from unseen views. The framework takes multiple short video clips from different viewpoints and time as input and learns an holistic internal representation which is used to predict a video clip from an unseen viewpoint and time. The ability of the proposed network to render unseen video frames enables it to learn a meaningful and robust representation of the scene dynamics. We evaluate the effectiveness of the learned representation for multi-view video action recognition in a supervised approach. We observe a significant improvement in the performance with RGB modality on NTU-RGB+D dataset, which is the largest dataset for multi-view action recognition. The proposed framework also achieves state-of-the-art results with depth modality, which validates the generalization capability of the approach to other data modalities. The code is publicly available at https://github.com/svyas23/cross-view-action.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Baradel, F., Wolf, C., Mille, J.: Human action recognition: pose-based attention draws focus to hands. In: The IEEE ICCV Workshops, October 2017
Google Scholar
Ben Tanfous, A., Drira, H., Ben Amor, B.: Coding Kendall’s shape trajectories for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2840–2849 (2018)
Google Scholar
Byeon, W., et al.: ContextVP: fully context-aware video prediction. In: Proceedings of the IEEE CVPR Workshops (2018)
Google Scholar
Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR (2017)
Google Scholar
Clark, A., Donahue, J., Simonyan, K.: Efficient video generation on complex datasets. arXiv preprint arXiv:1907.06571 (2019)
Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. In: IEEE Conference on CVPR (2015)
Google Scholar
Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1110–1118 (2015)
Google Scholar
Eslami, S.A., et al.: Neural scene representation and rendering. Science (2018)
Google Scholar
Fernando, B., Bilen, H., Gavves, E., Gould, S.: Self-supervised video representation learning with odd-one-out networks. In: IEEE Conference on CVPR (2017)
Google Scholar
Goyal, P., Hu, Z., Liang, X., Wang, C., Xing, E.P., Mellon, C.: Nonparametric variational auto-encoders for hierarchical representation learning. In: ICCV, pp. 5104–5112 (2017)
Google Scholar
Gupta, A., Martinez, J., Little, J.J., Woodham, R.J.: 3D pose from motion for cross-view action recognition via non-linear circulant temporal encoding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2601–2608 (2014)
Google Scholar
Hochreiter, S., Schmidhuber, J.: LSTM can solve hard long time lag problems. In: NeurIPS (1997)
Google Scholar
Isik, L., Tacchetti, A., Poggio, T.A.: A fast, invariant representation for human action in the visual system. J. Neurophysiol. 119, 631–640 (2017)
Google Scholar
Jayaraman, D., Gao, R., Grauman, K.: ShapeCodes: self-supervised feature learning by lifting views to viewgrids. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11220, pp. 126–144. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01270-0_8
Chapter Google Scholar
Ke, Q., Bennamoun, M., An, S., Sohel, F., Boussaid, F.: A new representation of skeleton sequences for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3288–3297 (2017)
Google Scholar
Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
Lakhal, M.I., Lanz, O., Cavallaro, A.: View-LSTM: novel-view video synthesis through view decomposition. In: The IEEE International Conference on Computer Vision (ICCV), October 2019
Google Scholar
Ledig, C., Theis, L., Huszár, F., Caballero, J., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: IEEE Conference on CVPR (2017)
Google Scholar
Lee, H.Y., Huang, J.B., Singh, M., Yang, M.H.: Unsupervised representation learning by sorting sequences. In: International Conference on Computer Vision (ICCV) (2017)
Google Scholar
Li, B., Camps, O.I., Sznaier, M.: Cross-view activity recognition using hankelets. In: IEEE CVPR (2012)
Google Scholar
Li, C., Cui, Z., Zheng, W., Xu, C., Yang, J.: Spatio-temporal graph convolution for skeleton based action recognition. In: AAAI Conference on Artificial Intelligence (2018)
Google Scholar
Li, J., Wong, Y., Zhao, Q., Kankanhalli, M.: Unsupervised learning of view-invariant action representations. In: Advances in Neural Information Processing Systems (2018)
Google Scholar
Li, R., Zickler, T.: Discriminative virtual views for cross-view action recognition. In: IEEE CVPR (2012)
Google Scholar
Liu, J., Shahroudy, A., Xu, D., Kot, A.C., Wang, G.: Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE Trans. Pattern Anal. Mach. Intell. 40(12), 3007–3021 (2017)
Article Google Scholar
Liu, J., Shahroudy, A., Xu, D., Wang, G.: Spatio-temporal LSTM with trust gates for 3D human action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 816–833. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_50
Chapter Google Scholar
Liu, M., Liu, H., Chen, C.: Enhanced skeleton visualization for view invariant human action recognition. Pattern Recogn. 68, 346–362 (2017)
Article Google Scholar
Luo, Z., Peng, B., Huang, D.A., Alahi, A., Fei-Fei, L.: Unsupervised learning of long-term motion dynamics for videos. In: IEEE Conference on CVPR (2017)
Google Scholar
Luvizon, D.C., Picard, D., Tabia, H.: 2D/3D pose estimation and action recognition using multitask deep learning. In: IEEE Conference on CVPR (2018)
Google Scholar
Maaten, L.V.D., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
Google Scholar
Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: ICLR (2016)
Google Scholar
Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 527–544. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_32
Chapter Google Scholar
Ohn-Bar, E., Trivedi, M.: Joint angles similarities and HOG2 for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 465–470 (2013)
Google Scholar
Oreifej, O., Liu, Z.: HON4D: histogram of oriented 4d normals for activity recognition from depth sequences. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 716–723 (2013)
Google Scholar
Rahmani, H., Mahmood, A., Huynh, D., Mian, A.: Histogram of oriented principal components for cross-view action recognition. IEEE Trans. PAMI (2016)
Google Scholar
Rahmani, H., Mian, A.: Learning a non-linear knowledge transfer model for cross-view action recognition. In: Proceedings of the IEEE Conference on CVPR (2015)
Google Scholar
Regmi, K., Borji, A.: Cross-view image synthesis using conditional GANs. In: IEEE Conference on CVPR (2018)
Google Scholar
Saito, M., Matsumoto, E., Saito, S.: Temporal generative adversarial nets with singular value clipping. In: IEEE International Conference on Computer Vision (ICCV) (2017)
Google Scholar
Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D: a large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE Conference on CVPR (2016)
Google Scholar
Shahroudy, A., Ng, T.T., Gong, Y., Wang, G.: Deep multimodal feature analysis for action recognition in RGB+D videos. IEEE Trans. PAMI (2018)
Google Scholar
Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using LSTMs. In: International Conference on Machine Learning, pp. 843–852 (2015)
Google Scholar
Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J.: MoCoGAN: decomposing motion and content for video generation. arXiv preprint arXiv:1707.04993 (2017)
Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: NeurIPS (2016)
Google Scholar
Wang, D., Ouyang, W., Li, W., Xu, D.: Dividing and aggregating network for multi-view action recognition. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 457–473. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_28
Chapter Google Scholar
Wang, J., Nie, X., Xia, Y., Wu, Y., Zhu, S.C.: Cross-view action modeling, learning and recognition. In: IEEE Conference on CVPR (2014)
Google Scholar
Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. In: IEEE ICCV (2015)
Google Scholar
Wen, Y.H., Gao, L., Fu, H., et al.: Graph CNNs with motif and variable temporal block for skeleton-based action recognition. In: AAAI Conference on Artificial Intelligence (2019)
Google Scholar
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500 (2017)
Google Scholar
Xu, X., Chen, Y.C., Jia, J.: View independent generative adversarial network for novel view synthesis. In: The IEEE International Conference on Computer Vision (ICCV), October 2019
Google Scholar
Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: AAAI Conference on Artificial Intelligence (2018)
Google Scholar
Yang, X., Tian, Y.: Super normal vector for activity recognition using depth sequences. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 804–811 (2014)
Google Scholar
Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive neural networks for high performance skeleton-based human action recognition. IEEE PAMI (2019)
Google Scholar
Zhang, P., Xue, J., Lan, C., Zeng, W., Gao, Z., Zheng, N.: Adding Attentiveness to the Neurons in Recurrent Neural Networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 136–152. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_9
Chapter Google Scholar

Download references

Acknowledgement

This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. D17PC00345. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

Author information

Authors and Affiliations

Center for Research in Computer Vision, University of Central Florida, Orlando, USA
Shruti Vyas, Yogesh S. Rawat & Mubarak Shah

Authors

Shruti Vyas
View author publications
You can also search for this author in PubMed Google Scholar
Yogesh S. Rawat
View author publications
You can also search for this author in PubMed Google Scholar
Mubarak Shah
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yogesh S. Rawat .

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1835 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Vyas, S., Rawat, Y.S., Shah, M. (2020). Multi-view Action Recognition Using Cross-View Video Prediction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12372. Springer, Cham. https://doi.org/10.1007/978-3-030-58583-9_26

Download citation

DOI: https://doi.org/10.1007/978-3-030-58583-9_26
Published: 19 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58582-2
Online ISBN: 978-3-030-58583-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics