Spatial–Temporal Relation Reasoning for Action Prediction in Videos

Published in: International Journal of Computer Vision

Abstract

Action prediction in videos refers to inferring the action category label from an early, partial observation of a video. Existing studies mainly focus on exploiting multiple visual cues to enhance the discriminative power of feature representations, while neglecting important structural information in videos, such as the interactions and correlations between different object entities. In this paper, we focus on reasoning about the spatial–temporal relations between persons and contextual objects to interpret the observed part of a video and predict its action category. With this in mind, we propose a novel spatial–temporal relation reasoning approach that extracts the spatial relations between persons and objects in still frames and explores how these spatial relations change over time. Specifically, for spatial relation reasoning, we propose an improved gated graph neural network that performs spatial relation reasoning among the visual objects in video frames. For temporal relation reasoning, we propose a long short-term graph network that models both the short-term and long-term varying dynamics of the spatial relations with multi-scale receptive fields. In this way, our approach can accurately recognize video content in terms of fine-grained object relations in both the spatial and temporal domains when making prediction decisions. Moreover, to learn the latent correlations between spatial–temporal object relations and action categories in videos, a visual semantic relation loss is proposed that models the triple constraints between objects in the semantic domain via VTransE. Extensive experiments on five public video datasets (20BN-something-something, CAD120, UCF101, BIT-Interaction and HMDB51) demonstrate the effectiveness of the proposed spatial–temporal relation reasoning for action prediction.
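To make the components above concrete, the sketch below gives loose, illustrative analogues in PyTorch of the three ideas named in the abstract: GGNN-style gated message passing for spatial relation reasoning, multi-scale dilated temporal convolutions standing in for the long short-term graph network, and a VTransE-style margin loss enforcing subject + predicate ≈ object. This is not the authors' implementation: the class names, feature dimensions, single-layer structure, and the convolutional stand-in for the temporal graph are all hypothetical simplifications.

# A minimal, illustrative sketch -- not the authors' code. All names,
# dimensions, and structural choices are hypothetical simplifications.

import torch
import torch.nn as nn


class GatedGraphLayer(nn.Module):
    """One round of gated graph propagation: aggregate neighbour messages
    through a (soft) adjacency matrix, then gate them into each node state
    with a GRU cell, as in gated graph neural networks (Li et al., 2016)."""

    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)   # message transform
        self.gru = nn.GRUCell(dim, dim)  # gated node-state update

    def forward(self, h, adj):
        # h:   (N, dim) features of N detected entities (persons and objects)
        # adj: (N, N) pairwise relation weights between the entities
        m = adj @ self.msg(h)            # aggregate incoming messages
        return self.gru(m, h)            # update each node state


class MultiScaleTemporal(nn.Module):
    """Parallel 1-D convolutions with increasing dilation rates over the
    frame axis, giving both short-term and long-term receptive fields (a
    loose stand-in for the paper's long short-term graph network)."""

    def __init__(self, dim, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations)

    def forward(self, x):                # x: (batch, dim, T) relation features
        return torch.stack([b(x) for b in self.branches]).sum(dim=0)


def vtranse_loss(subj, pred, obj, margin=1.0):
    """VTransE-style translation constraint: subject + predicate should lie
    close to the object embedding; negatives are in-batch shuffled objects."""
    pos = (subj + pred - obj).norm(dim=-1)
    neg = (subj + pred - obj[torch.randperm(obj.size(0))]).norm(dim=-1)
    return torch.clamp(margin + pos - neg, min=0).mean()


if __name__ == "__main__":
    N, dim, T = 5, 64, 16                       # hypothetical sizes
    h = torch.randn(N, dim)                     # per-frame entity features
    adj = torch.softmax(torch.randn(N, N), -1)  # learned in the real model
    h = GatedGraphLayer(dim)(h, adj)            # spatial relation reasoning
    seq = torch.randn(1, dim, T)                # relation features over T frames
    out = MultiScaleTemporal(dim)(seq)          # temporal relation reasoning
    s, p, o = torch.randn(3, 8, dim)            # 8 (subj, pred, obj) triples
    print(out.shape, vtranse_loss(s, p, o).item())

In the paper's setting, the node features would come from detected person and object regions, the adjacency from learned relation weights, and the triple loss would tie visual relations to semantic action labels; the demo above only checks shapes and the loss computation.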

References

  • Aditya, S., Yang, Y., & Baral, C. (2018). Explicit reasoning over end-to-end neural architectures for visual question answering. In Thirty-second AAAI conference on artificial intelligence.

  • Aliakbarian, M. S., Saleh, F. S., Salzmann, M., Fernando, B., Petersson, L., & Andersson, L. (2017). Encouraging LSTMs to anticipate actions very early. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., et al. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Eighteenth ACM-SIAM symposium on discrete algorithms.

  • Bhoi, A. (2019). Spatio-temporal action recognition: A survey. arXiv preprint arXiv:1901.09403.

  • Cai, Y., Li, H., Hu, J. F., & Zheng, W. S. (2019). Action knowledge transfer for action prediction with partial videos. In Proceedings of the AAAI conference on artificial intelligence.

  • Cao, Y., Barrett, D., Barbu, A., Narayanaswamy, S., Yu, H., Michaux, A., et al. (2013). Recognize human activities from partially observed videos. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Chen, L., Lu, J., Song, Z., & Zhou, J. (2018a). Part-activated deep reinforcement learning for action prediction. In European conference on computer vision.

  • Chen, X., Li, L. J., Fei-Fei, L., & Gupta, A. (2018b). Iterative visual reasoning beyond convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 conference on empirical methods in natural language processing.

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Evans, J. S. B., Over, D. E., & Manktelow, K. I. (1993). Reasoning, decision making and rationality. Cognition, 49(1–2), 165–187.

  • Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., & Dahl, G. E. (2017). Neural message passing for quantum chemistry. In Proceedings of the 34th international conference on machine learning.

  • Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE international conference on computer vision.

  • Girshick, R., Radosavovic, I., Gkioxari, G., Dollár, P., & He, K. (2018). Detectron. https://github.com/facebookresearch/detectron.

  • Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., et al. (2017). The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision.

  • Zhang, H., Kyaw, Z., Chang, S. F., & Chua, T. S. (2017). Visual translation embedding network for visual relation detection. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Hara, K., Kataoka, H., & Satoh, Y. (2018). Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Herzig, R., Levi, E., Xu, H., Gao, H., Brosh, E., Wang, X., et al. (2019). Spatio-temporal action graph networks. In Proceedings of the IEEE international conference on computer vision workshops.

  • Hu, J. F., Zheng, W. S., Ma, L., Wang, G., Lai, J., & Zhang, J. (2018). Early action prediction by soft regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11), 2568–2583.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning.

  • Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221–231.

  • Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., et al. (2017). The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

  • Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In 3rd International conference on learning representations.

  • Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In 5th International conference on learning representations.

  • Kong, Y., & Fu, Y. (2016). Max-margin action prediction machine. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9), 1844–1858.

  • Kong, Y., Gao, S., Sun, B., & Fu, Y. (2018). Action prediction from videos via memorizing hard-to-predict samples. In AAAI conference on artificial intelligence.

  • Kong, Y., Jia, Y., & Fu, Y. (2014a). Interactive phrases: Semantic descriptions for human interaction recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(9), 1775–1788.

  • Kong, Y., Kit, D., & Fu, Y. (2014b). A discriminative model with multiple temporal scales for action prediction. In European conference on computer vision.

  • Kong, Y., Tao, Z., & Fu, Y. (2017). Deep sequential context networks for action prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Kong, Y., Tao, Z., & Fu, Y. (2020). Adversarial action prediction networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3), 539–553.

  • Koppula, H. S., Gupta, R., & Saxena, A. (2013). Learning human activities and object affordances from RGB-D videos. The International Journal of Robotics Research, 32(8), 951–970.

  • Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In Proceedings of the international conference on computer vision.

  • Lai, S., Zheng, W. S., Hu, J. F., & Zhang, J. (2018). Global-local temporal saliency action prediction. IEEE Transactions on Image Processing, 27(5), 2272–2285.

  • Lan, T., Chen, T. C., & Savarese, S. (2014). A hierarchical representation for future action prediction. In European conference on computer vision (pp. 689–704).

  • Li, K., & Fu, Y. (2014). Prediction of human activity by discovering temporal sequence patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), 1644–1657.

  • Li, Y., Tarlow, D., Brockschmidt, M., & Zemel, R. S. (2016). Gated graph sequence neural networks. In 4th International conference on learning representations.

  • Liang, K., Guo, Y., Chang, H., & Chen, X. (2018). Visual relationship detection with deep structural ranking. In AAAI conference on artificial intelligence.

  • Liao, W., Rosenhahn, B., Shuai, L., & Ying Yang, M. (2019). Natural language guided visual relationship detection. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops.

  • Liu, J., Luo, J., & Shah, M. (2009). Recognizing realistic actions from videos. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Lu, C., Krishna, R., Bernstein, M., & Fei-Fei, L. (2016). Visual relationship detection with language priors. In European conference on computer vision.

  • Newell, A., & Deng, J. (2017). Pixels to graphs by associative embedding. In Advances in neural information processing systems.

  • Nicolicioiu, A., Duta, I., & Leordeanu, M. (2019). Recurrent space-time graph neural networks. In Advances in neural information processing systems.

  • Pang, G., Wang, X., Hu, J. F., Zhang, Q., & Zheng, W. S. (2019). DBDNet: Learning bi-directional dynamics for early action prediction. In Proceedings of the 28th international joint conference on artificial intelligence.

  • Qi, M., Li, W., Yang, Z., Wang, Y., & Luo, J. (2019). Attentive relational networks for mapping images to scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Ryoo, M. S. (2011). Human activity prediction: Early recognition of ongoing activities from streaming videos. In IEEE international conference on computer vision.

  • Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., & Monfardini, G. (2008). The graph neural network model. IEEE Transactions on Neural Networks, 20(1), 61–80.

  • Shang, X., Ren, T., Guo, J., Zhang, H., & Chua, T. S. (2017). Video visual relation detection. In Proceedings of the 25th ACM international conference on multimedia.

  • Si, C., Jing, Y., Wang, W., Wang, L., & Tan, T. (2018). Skeleton-based action recognition with spatial reasoning and temporal stack learning. In Proceedings of the European conference on computer vision.

  • Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

  • Sun, C., Shrivastava, A., Vondrick, C., Sukthankar, R., Murphy, K., & Schmid, C. (2019). Relational action forecasting. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Tsai, Y. H. H., Divvala, S., Morency, L. P., Salakhutdinov, R., & Farhadi, A. (2019). Video relationship reasoning using gated spatio-temporal energy graph. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2018). Graph attention networks. In 6th International conference on learning representations.

  • Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., et al. (2016). Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision.

  • Wang, X., & Gupta, A. (2018). Videos as space-time region graphs. In Proceedings of the European conference on computer vision.

  • Wang, X., Hu, J. F., Lai, J. H., Zhang, J., & Zheng, W. S. (2019). Progressive teacher-student learning for early action prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Woo, S., Kim, D., Cho, D., & Kweon, I. S. (2018). LinkNet: Relational embedding for scene graph. In Advances in neural information processing systems.

  • Xu, H., Jiang, C., Liang, X., & Li, Z. (2019). Spatial-aware graph relation network for large-scale object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Zhang, J., Kalantidis, Y., Rohrbach, M., Paluri, M., Elgammal, A., & Elhoseiny, M. (2019). Large-scale visual relationship understanding. In Proceedings of the AAAI conference on artificial intelligence.

  • Zhao, H., & Wildes, R. P. (2019). Spatiotemporal feature residual propagation for action prediction. In Proceedings of the IEEE international conference on computer vision.

  • Zhou, B., Andonian, A., Oliva, A., & Torralba, A. (2018). Temporal relational reasoning in videos. In Proceedings of the European conference on computer vision.

Download references

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61673062 and 62072041.

Author information

Corresponding author

Correspondence to Xinxiao Wu.

Additional information

Communicated by Sven J. Dickinson.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wu, X., Wang, R., Hou, J. et al. Spatial–Temporal Relation Reasoning for Action Prediction in Videos. Int J Comput Vis 129, 1484–1505 (2021). https://doi.org/10.1007/s11263-020-01409-9
