Spatial–Temporal Relation Reasoning for Action Prediction in Videos

Published in: International Journal of Computer Vision

Abstract

Action prediction in videos refers to inferring the action category label from an early, partial observation of a video. Existing studies mainly focus on exploiting multiple visual cues to enhance the discriminative power of feature representations, while neglecting important structural information in videos, such as the interactions and correlations between different object entities. In this paper, we focus on reasoning about the spatial–temporal relations between persons and contextual objects to interpret the observed part of a video and predict its action category. With this in mind, we propose a novel spatial–temporal relation reasoning approach that extracts the spatial relations between persons and objects in still frames and explores how these spatial relations change over time. Specifically, for spatial relation reasoning, we propose an improved gated graph neural network that performs spatial relation reasoning among the visual objects in video frames. For temporal relation reasoning, we propose a long short-term graph network that models both the short-term and long-term varying dynamics of the spatial relations with multi-scale receptive fields. In this way, our approach can accurately recognize video content in terms of fine-grained object relations in both the spatial and temporal domains when making prediction decisions. Moreover, to learn the latent correlations between spatial–temporal object relations and action categories in videos, a visual semantic relation loss is proposed that models the triple constraints between objects in the semantic domain via VTransE. Extensive experiments on five public video datasets (20BN-something-something, CAD120, UCF101, BIT-Interaction and HMDB51) demonstrate the effectiveness of the proposed spatial–temporal relation reasoning for action prediction.
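To make the components above concrete, the sketch below gives loose, illustrative analogues in PyTorch of the three ideas named in the abstract: GGNN-style gated message passing for spatial relation reasoning, multi-scale dilated temporal convolutions standing in for the long short-term graph network, and a VTransE-style margin loss enforcing subject + predicate ≈ object. This is not the authors' implementation: the class names, feature dimensions, single-layer structure, and the convolutional stand-in for the temporal graph are all hypothetical simplifications.

# A minimal, illustrative sketch -- not the authors' code. All names,
# dimensions, and structural choices are hypothetical simplifications.

import torch
import torch.nn as nn


class GatedGraphLayer(nn.Module):
    """One round of gated graph propagation: aggregate neighbour messages
    through a (soft) adjacency matrix, then gate them into each node state
    with a GRU cell, as in gated graph neural networks (Li et al., 2016)."""

    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)   # message transform
        self.gru = nn.GRUCell(dim, dim)  # gated node-state update

    def forward(self, h, adj):
        # h:   (N, dim) features of N detected entities (persons and objects)
        # adj: (N, N) pairwise relation weights between the entities
        m = adj @ self.msg(h)            # aggregate incoming messages
        return self.gru(m, h)            # update each node state


class MultiScaleTemporal(nn.Module):
    """Parallel 1-D convolutions with increasing dilation rates over the
    frame axis, giving both short-term and long-term receptive fields (a
    loose stand-in for the paper's long short-term graph network)."""

    def __init__(self, dim, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations)

    def forward(self, x):                # x: (batch, dim, T) relation features
        return torch.stack([b(x) for b in self.branches]).sum(dim=0)


def vtranse_loss(subj, pred, obj, margin=1.0):
    """VTransE-style translation constraint: subject + predicate should lie
    close to the object embedding; negatives are in-batch shuffled objects."""
    pos = (subj + pred - obj).norm(dim=-1)
    neg = (subj + pred - obj[torch.randperm(obj.size(0))]).norm(dim=-1)
    return torch.clamp(margin + pos - neg, min=0).mean()


if __name__ == "__main__":
    N, dim, T = 5, 64, 16                       # hypothetical sizes
    h = torch.randn(N, dim)                     # per-frame entity features
    adj = torch.softmax(torch.randn(N, N), -1)  # learned in the real model
    h = GatedGraphLayer(dim)(h, adj)            # spatial relation reasoning
    seq = torch.randn(1, dim, T)                # relation features over T frames
    out = MultiScaleTemporal(dim)(seq)          # temporal relation reasoning
    s, p, o = torch.randn(3, 8, dim)            # 8 (subj, pred, obj) triples
    print(out.shape, vtranse_loss(s, p, o).item())

In the paper's setting, the node features would come from detected person and object regions, the adjacency from learned relation weights, and the triple loss would tie visual relations to semantic action labels; the demo above only checks shapes and the loss computation.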

References

  • Aditya, S., Yang, Y., & Baral, C. (2018). Explicit reasoning over end-to-end neural architectures for visual question answering. In Thirty-second AAAI conference on artificial intelligence.

  • Aliakbarian, M. S., Saleh, F. S., Salzmann, M., Fernando, B., Petersson, L., & Andersson, L. (2017). Encouraging LSTMs to anticipate actions very early. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., et al. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Eighteenth ACM-SIAM symposium on discrete algorithms.

  • Bhoi, A. (2019). Spatio-temporal action recognition: A survey. arXiv preprint arXiv:1901.09403.

  • Cai, Y., Li, H., Hu, J. F., & Zheng, W. S. (2019). Action knowledge transfer for action prediction with partial videos. In Proceedings of the AAAI conference on artificial intelligence.

  • Cao, Y., Barrett, D., Barbu, A., Narayanaswamy, S., Yu, H., Michaux, A., et al. (2013). Recognize human activities from partially observed videos. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Chen, L., Lu, J., Song, Z., & Zhou, J. (2018a). Part-activated deep reinforcement learning for action prediction. In European conference on computer vision.

  • Chen, X., Li, L. J., Fei-Fei, L., & Gupta, A. (2018b). Iterative visual reasoning beyond convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 conference on empirical methods in natural language processing.

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Evans, J. S. B., Over, D. E., & Manktelow, K. I. (1993). Reasoning, decision making and rationality. Cognition, 49(1–2), 165–187.

  • Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., & Dahl, G. E. (2017). Neural message passing for quantum chemistry. In Proceedings of the 34th international conference on machine learning.

  • Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE international conference on computer vision.

  • Girshick, R., Radosavovic, I., Gkioxari, G., Dollár, P., & He, K. (2018). Detectron. https://github.com/facebookresearch/detectron.

  • Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., et al. (2017). The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision.

  • Zhang, H., Kyaw, Z., Chang, S. F., & Chua, T. S. (2017). Visual translation embedding network for visual relation detection. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Hara, K., Kataoka, H., & Satoh, Y. (2018). Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Herzig, R., Levi, E., Xu, H., Gao, H., Brosh, E., Wang, X., et al. (2019). Spatio-temporal action graph networks. In Proceedings of the IEEE international conference on computer vision workshops.

  • Hu, J. F., Zheng, W. S., Ma, L., Wang, G., Lai, J., & Zhang, J. (2018). Early action prediction by soft regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11), 2568–2583.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning.

  • Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221–231.

  • Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., et al. (2017). The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

  • Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In 3rd International conference on learning representations.

  • Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In 5th International conference on learning representations.

  • Kong, Y., & Fu, Y. (2016). Max-margin action prediction machine. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9), 1844–1858.

  • Kong, Y., Gao, S., Sun, B., & Fu, Y. (2018). Action prediction from videos via memorizing hard-to-predict samples. In AAAI conference on artificial intelligence.

  • Kong, Y., Jia, Y., & Fu, Y. (2014a). Interactive phrases: Semantic descriptions for human interaction recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(9), 1775–1788.

  • Kong, Y., Kit, D., & Fu, Y. (2014b). A discriminative model with multiple temporal scales for action prediction. In European conference on computer vision.

  • Kong, Y., Tao, Z., & Fu, Y. (2017). Deep sequential context networks for action prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Kong, Y., Tao, Z., & Fu, Y. (2020). Adversarial action prediction networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3), 539–553.

  • Koppula, H. S., Gupta, R., & Saxena, A. (2013). Learning human activities and object affordances from RGB-D videos. The International Journal of Robotics Research, 32(8), 951–970.

  • Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In Proceedings of the international conference on computer vision.

  • Lai, S., Zheng, W. S., Hu, J. F., & Zhang, J. (2018). Global-local temporal saliency action prediction. IEEE Transactions on Image Processing, 27(5), 2272–2285.

  • Lan, T., Chen, T. C., & Savarese, S. (2014). A hierarchical representation for future action prediction. In European conference on computer vision (pp. 689–704).

  • Li, K., & Fu, Y. (2014). Prediction of human activity by discovering temporal sequence patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), 1644–1657.

  • Li, Y., Tarlow, D., Brockschmidt, M., & Zemel, R. S. (2016). Gated graph sequence neural networks. In 4th International conference on learning representations.

  • Liang, K., Guo, Y., Chang, H., & Chen, X. (2018). Visual relationship detection with deep structural ranking. In AAAI conference on artificial intelligence.

  • Liao, W., Rosenhahn, B., Shuai, L., & Ying Yang, M. (2019). Natural language guided visual relationship detection. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops.

  • Liu, J., Luo, J., & Shah, M. (2009). Recognizing realistic actions from videos. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Lu, C., Krishna, R., Bernstein, M., & Fei-Fei, L. (2016). Visual relationship detection with language priors. In European conference on computer vision.

  • Newell, A., & Deng, J. (2017). Pixels to graphs by associative embedding. In Advances in neural information processing systems.

  • Nicolicioiu, A., Duta, I., & Leordeanu, M. (2019). Recurrent space-time graph neural networks. In Advances in neural information processing systems.

  • Pang, G., Wang, X., Hu, J. F., Zhang, Q., & Zheng, W. S. (2019). DBDNet: Learning bi-directional dynamics for early action prediction. In Proceedings of the 28th international joint conference on artificial intelligence.

  • Qi, M., Li, W., Yang, Z., Wang, Y., & Luo, J. (2019). Attentive relational networks for mapping images to scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Ryoo, M. S. (2011). Human activity prediction: Early recognition of ongoing activities from streaming videos. In IEEE international conference on computer vision.

  • Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., & Monfardini, G. (2008). The graph neural network model. IEEE Transactions on Neural Networks, 20(1), 61–80.

  • Shang, X., Ren, T., Guo, J., Zhang, H., & Chua, T. S. (2017). Video visual relation detection. In Proceedings of the 25th ACM international conference on multimedia.

  • Si, C., Jing, Y., Wang, W., Wang, L., & Tan, T. (2018). Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European conference on computer vision.

  • Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

  • Sun, C., Shrivastava, A., Vondrick, C., Sukthankar, R., Murphy, K., & Schmid, C. (2019). Relational action forecasting. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Tsai, Y. H. H., Divvala, S., Morency, L. P., Salakhutdinov, R., & Farhadi, A. (2019). Video relationship reasoning using gated spatio-temporal energy graph. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2018). Graph attention networks. In 6th International conference on learning representations.

  • Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., et al. (2016). Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision.

  • Wang, X., & Gupta, A. (2018). Videos as space-time region graphs. In Proceedings of the European conference on computer vision.

  • Wang, X., Hu, J. F., Lai, J. H., Zhang, J., & Zheng, W. S. (2019). Progressive teacher-student learning for early action prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Woo, S., Kim, D., Cho, D., & Kweon, I. S. (2018). LinkNet: Relational embedding for scene graph. In Advances in neural information processing systems.

  • Xu, H., Jiang, C., Liang, X., & Li, Z. (2019). Spatial-aware graph relation network for large-scale object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Zhang, J., Kalantidis, Y., Rohrbach, M., Paluri, M., Elgammal, A., & Elhoseiny, M. (2019). Large-scale visual relationship understanding. In Proceedings of the AAAI conference on artificial intelligence.

  • Zhao, H., & Wildes, R. P. (2019). Spatiotemporal feature residual propagation for action prediction. In Proceedings of the IEEE international conference on computer vision.

  • Zhou, B., Andonian, A., Oliva, A., & Torralba, A. (2018). Temporal relational reasoning in videos. In Proceedings of the European conference on computer vision.

Download references

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61673062 and 62072041.

Author information

Corresponding author

Correspondence to Xinxiao Wu.

Additional information

Communicated by Sven J. Dickinson.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wu, X., Wang, R., Hou, J. et al. Spatial–Temporal Relation Reasoning for Action Prediction in Videos. Int J Comput Vis 129, 1484–1505 (2021). https://doi.org/10.1007/s11263-020-01409-9
