Abstract
Action prediction is defined as the inference of an action's label while the action is still ongoing. Such a capability is extremely useful for early response and for planning subsequent actions. In this paper, we consider the problem of action prediction in scenarios involving humans interacting with objects. We formulate an approach that builds time series representations of the motion of both the humans and the objects involved in an action. Such a representation of an ongoing action is then compared to prototype actions. This is achieved by a Dynamic Time Warping (DTW)-based time series alignment framework, which identifies the best match between the ongoing action and the prototype ones. Our approach is evaluated quantitatively on three standard benchmark datasets. Our experimental results reveal the importance of fusing human- and object-centered action representations for the accuracy of action prediction. Moreover, we demonstrate that the proposed approach achieves significantly higher action prediction accuracy than competing methods.
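The pipeline sketched in the abstract can be illustrated with a minimal, self-contained example. This is not the authors' implementation: the hypothetical `fuse` and `predict_action` helpers and the synthetic numpy arrays stand in for the paper's feature extraction, and a standard DTW recurrence stands in for its alignment framework. It shows only the core idea: concatenate per-frame human and object features (early fusion), then assign the ongoing, partially observed action the label of the prototype with the smallest DTW distance.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two multivariate time series of shape (T1, D) and (T2, D)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Per-frame cost: Euclidean distance between fused feature vectors.
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Standard DTW recurrence over insertion, deletion, and match.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def fuse(human_feats, object_feats):
    """Early fusion: concatenate per-frame human- and object-centered features."""
    return np.concatenate([human_feats, object_feats], axis=1)

def predict_action(partial, prototypes):
    """Label of the prototype best aligned (lowest DTW cost) with the ongoing action."""
    return min(prototypes, key=lambda lbl: dtw_distance(partial, prototypes[lbl]))

# Synthetic example with two made-up prototype actions and a 1/3-observed query.
A = fuse(np.linspace(0, 1, 30)[:, None], np.linspace(0, 1, 30)[:, None])
B = fuse(np.linspace(5, 6, 30)[:, None], np.linspace(5, 6, 30)[:, None])
print(predict_action(A[:10], {"pick": A, "place": B}))  # → pick
```

In practice the paper's framework matches a prefix of an action against full-length prototypes, for which open-end DTW variants (cf. Tormene et al. in the references) are better suited than the plain recurrence above; the sketch keeps the simplest form for clarity.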
The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 1592).
Notes
1. In our work, the terms “video recordings” and “skeletal data” are used interchangeably.
2. MHAD is not included in this investigation, as the vast majority of its actions do not involve human-object interactions.
References
Afrasiabi, M., Mansoorizadeh, M., et al.: DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features. Vis. Comput. 36, 1127–1139 (2019)
Alfaifi, R., Artoli, A.: Human action prediction with 3D-CNN. SN Comput. Sci. 1, 1–15 (2020)
Arzani, M.M., Fathy, M., Azirani, A.A., Adeli, E.: Skeleton-based structured early activity prediction. Multimedia Tools Appl. 80(15), 23023–23049 (2020). https://doi.org/10.1007/s11042-020-08875-w
Bao, W., Yu, Q., Kong, Y.: Uncertainty-based traffic accident anticipation with spatio-temporal relational learning. In: ACM International Conference on Multimedia (2020)
Bochkovskiy, A., Wang, C., Liao, H.: YOLOv4: optimal speed and accuracy of object detection. arXiv:2004.10934 (2020)
Cuturi, M.: Fast global alignment kernels. In: ICML 2011 (2011)
Cuturi, M., Blondel, M.: Soft-DTW: a differentiable loss function for time-series. arXiv:1703.01541 (2017)
Dutta, V., Zielinska, T.: Predicting human actions taking into account object affordances. J. Intell. Robot. Syst. 93, 745–761 (2019)
Dutta, V., Zielińska, T.: An adversarial explainable artificial intelligence (XAI) based approach for action forecasting. J. Autom. Mob. Robot. Intell. Syst. (2021)
Farha, A., Richard, A., Gall, J.: When will you do what? Anticipating temporal occurrences of activities. In: IEEE CVPR (2018)
Gammulle, H., Denman, S., Sridharan, S., Fookes, C.: Predicting the future: a jointly learnt model for action anticipation. In: IEEE ICCV (2019)
Ghoddoosian, R., Sayed, S., Athitsos, V.: Action duration prediction for segment-level alignment of weakly-labeled videos. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2053–2062 (2021)
Hadji, I., Derpanis, K.G., Jepson, A.D.: Representation learning via global temporal alignment and cycle-consistency. arXiv preprint arXiv:2105.05217 (2021)
Haresh, S., et al.: Learning by aligning videos in time. arXiv preprint arXiv:2103.17260 (2021)
Ke, Q., Bennamoun, M., Rahmani, H., An, S., Sohel, F., Boussaid, F.: Learning latent global network for skeleton-based action prediction. IEEE Trans. Image Process. (2019)
Ke, Q., Fritz, M., Schiele, B.: Time-conditioned action anticipation in one shot. In: IEEE CVPR (2019)
Koppula, H., Gupta, R., Saxena, A.: Learning human activities and object affordances from RGB-D videos. Int. J. Robot. Res. 32, 951–970 (2013)
Li, T., Liu, J., Zhang, W., Duan, L.: HARD-Net: hardness-AwaRe discrimination network for 3D early activity prediction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 420–436. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_25
Liu, J., Shahroudy, A., Wang, G., Duan, L., Kot, A.: Skeleton-based online action prediction using scale selection network. IEEE PAMI 42, 1453–1467 (2019)
Manousaki, V., Papoutsakis, K., Argyros, A.: Evaluating method design options for action classification based on bags of visual words. In: VISAPP (2018)
Mao, W., Liu, M., Salzmann, M.: History repeats itself: human motion prediction via motion attention. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 474–489. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_28
Mavrogiannis, A., Chandra, R., Manocha, D.: B-GAP: behavior-guided action prediction for autonomous navigation. arXiv:2011.03748 (2020)
Mavroudi, E., Haro, B.B., Vidal, R.: Representation learning on visual-symbolic graphs for video understanding. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12374, pp. 71–90. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58526-6_5
Miech, A., Laptev, I., Sivic, J., Wang, H., Torresani, L., Tran, D.: Leveraging the present to anticipate the future in videos. In: IEEE CVPR Workshops (2019)
Ng, Y., Basura, F.: Forecasting future action sequences with attention: a new approach to weakly supervised action forecasting. IEEE Trans. Image Process. 29, 8880–8891 (2020)
Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., Bajcsy, R.: Berkeley MHAD: a comprehensive multimodal human action database. In: IEEE Workshop on Applications of Computer Vision (WACV) (2013)
Oprea, S., et al.: A review on deep learning techniques for video prediction. IEEE PAMI (2020)
Papoutsakis, K., Panagiotakis, C., Argyros, A.: Temporal action co-segmentation in 3D motion capture data and videos. In: CVPR (2017)
Qammaz, A., Argyros, A.: Occlusion-tolerant and personalized 3D human pose estimation in RGB images. In: IEEE ICPR (2021)
Qin, Y., Mo, L., Li, C., Luo, J.: Skeleton-based action recognition by part-aware graph convolutional networks. Vis. Comput. 36, 621–631 (2020)
Rasouli, A.: Deep learning for vision-based prediction: a survey. arXiv:2007.00095 (2020)
Rasouli, A., Yau, T., Rohani, M., Luo, J.: Multi-modal hybrid architecture for pedestrian action prediction. arXiv:2012.00514 (2020)
Reily, B., Han, F., Parker, L., Zhang, H.: Skeleton-based bio-inspired human activity prediction for real-time human-robot interaction. Auton. Robots 42, 1281–1298 (2018)
Rius, I., Gonzàlez, J., Varona, J., Roca, F.: Action-specific motion prior for efficient Bayesian 3D human body tracking. Pattern Recogn. 42, 2907–2921 (2009)
Ryoo, M.: Human activity prediction: early recognition of ongoing activities from streaming videos. In: IEEE ICCV (2011)
Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 26, 43–49 (1978)
Tavenard, R., et al.: Tslearn, a machine learning toolkit for time series data. J. Mach. Learn. Res. 21, 1–6 (2020)
Tormene, P., Giorgino, T., Quaglini, S., Stefanelli, M.: Matching incomplete time series with dynamic time warping: an algorithm and an application to post-stroke rehabilitation. Artif. Intell. Med. 45, 11–34 (2009)
Wang, J., Liu, Z., Wu, Y., Yuan, J.: Mining actionlet ensemble for action recognition with depth cameras. In: IEEE CVPR (2012)
Wang, X., Hu, J., Lai, J., Zhang, J., Zheng, W.: Progressive teacher-student learning for early action prediction. In: IEEE CVPR (2019)
Wu, M., et al.: Gaze-based intention anticipation over driving manoeuvres in semi-autonomous vehicles (2020)
Xia, L., Aggarwal, J.: Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In: IEEE CVPR (2013)
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Manousaki, V., Papoutsakis, K., Argyros, A. (2021). Action Prediction During Human-Object Interaction Based on DTW and Early Fusion of Human and Object Representations. In: Vincze, M., Patten, T., Christensen, H.I., Nalpantidis, L., Liu, M. (eds) Computer Vision Systems. ICVS 2021. Lecture Notes in Computer Science, vol 12899. Springer, Cham. https://doi.org/10.1007/978-3-030-87156-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87155-0
Online ISBN: 978-3-030-87156-7