Abstract
In this work, we address the problem of action recognition in a multi-view environment. Most of the existing approaches utilize pose information for multi-view action recognition. We focus on RGB modality instead and propose an unsupervised representation learning framework, which encodes the scene dynamics in videos captured from multiple viewpoints via predicting actions from unseen views. The framework takes multiple short video clips from different viewpoints and time as input and learns an holistic internal representation which is used to predict a video clip from an unseen viewpoint and time. The ability of the proposed network to render unseen video frames enables it to learn a meaningful and robust representation of the scene dynamics. We evaluate the effectiveness of the learned representation for multi-view video action recognition in a supervised approach. We observe a significant improvement in the performance with RGB modality on NTU-RGB+D dataset, which is the largest dataset for multi-view action recognition. The proposed framework also achieves state-of-the-art results with depth modality, which validates the generalization capability of the approach to other data modalities. The code is publicly available at https://github.com/svyas23/cross-view-action.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Baradel, F., Wolf, C., Mille, J.: Human action recognition: pose-based attention draws focus to hands. In: The IEEE ICCV Workshops, October 2017
Ben Tanfous, A., Drira, H., Ben Amor, B.: Coding Kendall’s shape trajectories for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2840–2849 (2018)
Byeon, W., et al.: ContextVP: fully context-aware video prediction. In: Proceedings of the IEEE CVPR Workshops (2018)
Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR (2017)
Clark, A., Donahue, J., Simonyan, K.: Efficient video generation on complex datasets. arXiv preprint arXiv:1907.06571 (2019)
Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. In: IEEE Conference on CVPR (2015)
Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1110–1118 (2015)
Eslami, S.A., et al.: Neural scene representation and rendering. Science (2018)
Fernando, B., Bilen, H., Gavves, E., Gould, S.: Self-supervised video representation learning with odd-one-out networks. In: IEEE Conference on CVPR (2017)
Goyal, P., Hu, Z., Liang, X., Wang, C., Xing, E.P., Mellon, C.: Nonparametric variational auto-encoders for hierarchical representation learning. In: ICCV, pp. 5104–5112 (2017)
Gupta, A., Martinez, J., Little, J.J., Woodham, R.J.: 3D pose from motion for cross-view action recognition via non-linear circulant temporal encoding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2601–2608 (2014)
Hochreiter, S., Schmidhuber, J.: LSTM can solve hard long time lag problems. In: NeurIPS (1997)
Isik, L., Tacchetti, A., Poggio, T.A.: A fast, invariant representation for human action in the visual system. J. Neurophysiol. 119, 631–640 (2017)
Jayaraman, D., Gao, R., Grauman, K.: ShapeCodes: self-supervised feature learning by lifting views to viewgrids. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11220, pp. 126–144. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01270-0_8
Ke, Q., Bennamoun, M., An, S., Sohel, F., Boussaid, F.: A new representation of skeleton sequences for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3288–3297 (2017)
Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
Lakhal, M.I., Lanz, O., Cavallaro, A.: View-LSTM: novel-view video synthesis through view decomposition. In: The IEEE International Conference on Computer Vision (ICCV), October 2019
Ledig, C., Theis, L., Huszár, F., Caballero, J., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: IEEE Conference on CVPR (2017)
Lee, H.Y., Huang, J.B., Singh, M., Yang, M.H.: Unsupervised representation learning by sorting sequences. In: International Conference on Computer Vision (ICCV) (2017)
Li, B., Camps, O.I., Sznaier, M.: Cross-view activity recognition using hankelets. In: IEEE CVPR (2012)
Li, C., Cui, Z., Zheng, W., Xu, C., Yang, J.: Spatio-temporal graph convolution for skeleton based action recognition. In: AAAI Conference on Artificial Intelligence (2018)
Li, J., Wong, Y., Zhao, Q., Kankanhalli, M.: Unsupervised learning of view-invariant action representations. In: Advances in Neural Information Processing Systems (2018)
Li, R., Zickler, T.: Discriminative virtual views for cross-view action recognition. In: IEEE CVPR (2012)
Liu, J., Shahroudy, A., Xu, D., Kot, A.C., Wang, G.: Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE Trans. Pattern Anal. Mach. Intell. 40(12), 3007–3021 (2017)
Liu, J., Shahroudy, A., Xu, D., Wang, G.: Spatio-temporal LSTM with trust gates for 3D human action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 816–833. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_50
Liu, M., Liu, H., Chen, C.: Enhanced skeleton visualization for view invariant human action recognition. Pattern Recogn. 68, 346–362 (2017)
Luo, Z., Peng, B., Huang, D.A., Alahi, A., Fei-Fei, L.: Unsupervised learning of long-term motion dynamics for videos. In: IEEE Conference on CVPR (2017)
Luvizon, D.C., Picard, D., Tabia, H.: 2D/3D pose estimation and action recognition using multitask deep learning. In: IEEE Conference on CVPR (2018)
Maaten, L.V.D., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: ICLR (2016)
Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 527–544. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_32
Ohn-Bar, E., Trivedi, M.: Joint angles similarities and HOG2 for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 465–470 (2013)
Oreifej, O., Liu, Z.: HON4D: histogram of oriented 4d normals for activity recognition from depth sequences. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 716–723 (2013)
Rahmani, H., Mahmood, A., Huynh, D., Mian, A.: Histogram of oriented principal components for cross-view action recognition. IEEE Trans. PAMI (2016)
Rahmani, H., Mian, A.: Learning a non-linear knowledge transfer model for cross-view action recognition. In: Proceedings of the IEEE Conference on CVPR (2015)
Regmi, K., Borji, A.: Cross-view image synthesis using conditional GANs. In: IEEE Conference on CVPR (2018)
Saito, M., Matsumoto, E., Saito, S.: Temporal generative adversarial nets with singular value clipping. In: IEEE International Conference on Computer Vision (ICCV) (2017)
Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D: a large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE Conference on CVPR (2016)
Shahroudy, A., Ng, T.T., Gong, Y., Wang, G.: Deep multimodal feature analysis for action recognition in RGB+D videos. IEEE Trans. PAMI (2018)
Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using LSTMs. In: International Conference on Machine Learning, pp. 843–852 (2015)
Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J.: MoCoGAN: decomposing motion and content for video generation. arXiv preprint arXiv:1707.04993 (2017)
Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: NeurIPS (2016)
Wang, D., Ouyang, W., Li, W., Xu, D.: Dividing and aggregating network for multi-view action recognition. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 457–473. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_28
Wang, J., Nie, X., Xia, Y., Wu, Y., Zhu, S.C.: Cross-view action modeling, learning and recognition. In: IEEE Conference on CVPR (2014)
Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. In: IEEE ICCV (2015)
Wen, Y.H., Gao, L., Fu, H., et al.: Graph CNNs with motif and variable temporal block for skeleton-based action recognition. In: AAAI Conference on Artificial Intelligence (2019)
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500 (2017)
Xu, X., Chen, Y.C., Jia, J.: View independent generative adversarial network for novel view synthesis. In: The IEEE International Conference on Computer Vision (ICCV), October 2019
Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: AAAI Conference on Artificial Intelligence (2018)
Yang, X., Tian, Y.: Super normal vector for activity recognition using depth sequences. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 804–811 (2014)
Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive neural networks for high performance skeleton-based human action recognition. IEEE PAMI (2019)
Zhang, P., Xue, J., Lan, C., Zeng, W., Gao, Z., Zheng, N.: Adding Attentiveness to the Neurons in Recurrent Neural Networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 136–152. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_9
Acknowledgement
This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. D17PC00345. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Vyas, S., Rawat, Y.S., Shah, M. (2020). Multi-view Action Recognition Using Cross-View Video Prediction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12372. Springer, Cham. https://doi.org/10.1007/978-3-030-58583-9_26
Download citation
DOI: https://doi.org/10.1007/978-3-030-58583-9_26
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58582-2
Online ISBN: 978-3-030-58583-9
eBook Packages: Computer ScienceComputer Science (R0)