Action Prediction During Human-Object Interaction Based on DTW and Early Fusion of Human and Object Representations

  • Conference paper
  • In: Computer Vision Systems (ICVS 2021)

Abstract

Action prediction is the inference of an action label while the action is still ongoing. Such a capability is extremely useful for early response and further action planning. In this paper, we consider the problem of action prediction in scenarios involving humans interacting with objects. We formulate an approach that builds time series representations of the motion of the humans and the objects involved in an action. Such a representation of an ongoing action is then compared to a set of prototype actions. This is achieved by a Dynamic Time Warping (DTW)-based time series alignment framework that identifies the best match between the ongoing action and the prototypes. Our approach is evaluated quantitatively on three standard benchmark datasets. The experimental results reveal the importance of fusing human- and object-centered action representations for accurate action prediction. Moreover, we demonstrate that the proposed approach achieves significantly higher action prediction accuracy than competitive methods.
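The abstract describes, at a high level, an algorithmic pipeline: per-frame human and object features are fused early into a single multivariate time series, and the observed prefix of an ongoing action is aligned against complete prototype actions with DTW; the prototype yielding the cheapest alignment gives the predicted label. The Python sketch below illustrates that idea only and is not the authors' implementation: the names fuse_features, open_end_dtw, and predict_action, the Euclidean per-frame cost, the open-end (prefix) alignment in the spirit of Tormene et al. [39], and the length normalisation are all illustrative assumptions. DTW toolkits such as dtwalign [1] or tslearn [36], both cited in the references, could replace the hand-rolled recursion.

```python
import numpy as np

def fuse_features(human_seq, object_seq):
    """Early fusion: concatenate per-frame human features (e.g., skeleton
    joints) and object features into a single multivariate time series of
    shape (T, D_human + D_object)."""
    return np.concatenate([human_seq, object_seq], axis=1)

def open_end_dtw(partial, prototype):
    """Open-end DTW: align the whole partial observation against *any*
    prefix of the prototype and return the cheapest accumulated cost.
    Classic O(T1 * T2) dynamic program with Euclidean frame distances."""
    t1, t2 = len(partial), len(prototype)
    acc = np.full((t1 + 1, t2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            cost = np.linalg.norm(partial[i - 1] - prototype[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j],      # insertion
                                   acc[i, j - 1],      # deletion
                                   acc[i - 1, j - 1])  # match
    # Open end: the ongoing action may stop at any prototype frame j.
    j_star = int(np.argmin(acc[t1, 1:])) + 1
    return acc[t1, j_star] / (t1 + j_star)  # length-normalised cost

def predict_action(partial, prototypes):
    """Label the ongoing action with the prototype (a dict mapping
    label -> full time series) whose prefix it matches best."""
    return min(prototypes, key=lambda lbl: open_end_dtw(partial, prototypes[lbl]))

# Hypothetical usage with random stand-in features (34-D fused vectors).
rng = np.random.default_rng(0)
partial = fuse_features(rng.normal(size=(20, 30)), rng.normal(size=(20, 4)))
prototypes = {"drink": rng.normal(size=(60, 34)),
              "pour": rng.normal(size=(70, 34))}
print(predict_action(partial, prototypes))
```

The open-end variant matters here because the ongoing action is, by definition, incomplete: a plain DTW distance against a full prototype would penalise the unobserved remainder of the action, biasing prediction toward short prototypes.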

The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 1592).

Notes

  1. In our work the terms “video recordings” and “skeletal data” are used interchangeably.

  2. MHAD is not included in this investigation as the vast majority of its actions do not involve human-object interactions.

References

  1. https://github.com/statefb/dtwalign

  2. Afrasiabi, M., Mansoorizadeh, M., et al.: DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features. Vis. Comput. 36, 1127–1139 (2019)

  3. Alfaifi, R., Artoli, A.: Human action prediction with 3D-CNN. SN Comput. Sci. 1, 1–15 (2020)

  4. Arzani, M.M., Fathy, M., Azirani, A.A., Adeli, E.: Skeleton-based structured early activity prediction. Multimedia Tools Appl. 80(15), 23023–23049 (2020). https://doi.org/10.1007/s11042-020-08875-w

  5. Bao, W., Yu, Q., Kong, Y.: Uncertainty-based traffic accident anticipation with spatio-temporal relational learning. In: ACM International Conference on Multimedia (2020)

  6. Bochkovskiy, A., Wang, C., Liao, H.: YOLOv4: optimal speed and accuracy of object detection. arXiv:2004.10934 (2020)

  7. Cuturi, M.: Fast global alignment kernels. In: ICML 2011 (2011)

  8. Cuturi, M., Blondel, M.: Soft-DTW: a differentiable loss function for time-series. arXiv:1703.01541 (2017)

  9. Dutta, V., Zielinska, T.: Predicting human actions taking into account object affordances. J. Intell. Robot. Syst. 93, 745–761 (2019)

  10. Dutta, V., Zielińska, T.: An adversarial explainable artificial intelligence (XAI) based approach for action forecasting. J. Autom. Mob. Robot. Intell. Syst. (2021)

  11. Farha, A., Richard, A., Gall, J.: When will you do what? - Anticipating temporal occurrences of activities. In: IEEE CVPR (2018)

  12. Gammulle, H., Denman, S., Sridharan, S., Fookes, C.: Predicting the future: a jointly learnt model for action anticipation. In: IEEE ICCV (2019)

  13. Ghoddoosian, R., Sayed, S., Athitsos, V.: Action duration prediction for segment-level alignment of weakly-labeled videos. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2053–2062 (2021)

  14. Hadji, I., Derpanis, K.G., Jepson, A.D.: Representation learning via global temporal alignment and cycle-consistency. arXiv preprint arXiv:2105.05217 (2021)

  15. Haresh, S., et al.: Learning by aligning videos in time. arXiv preprint arXiv:2103.17260 (2021)

  16. Ke, Q., Bennamoun, M., Rahmani, H., An, S., Sohel, F., Boussaid, F.: Learning latent global network for skeleton-based action prediction. IEEE Trans. Image Process. (2019)

  17. Ke, Q., Fritz, M., Schiele, B.: Time-conditioned action anticipation in one shot. In: IEEE CVPR (2019)

  18. Koppula, H., Gupta, R., Saxena, A.: Learning human activities and object affordances from RGB-D videos. Int. J. Robot. Res. 32, 951–970 (2013)

  19. Li, T., Liu, J., Zhang, W., Duan, L.: HARD-Net: hardness-AwaRe discrimination network for 3D early activity prediction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 420–436. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_25

  20. Liu, J., Shahroudy, A., Wang, G., Duan, L., Kot, A.: Skeleton-based online action prediction using scale selection network. IEEE PAMI 42, 1453–1467 (2019)

  21. Manousaki, V., Papoutsakis, K., Argyros, A.: Evaluating method design options for action classification based on bags of visual words. In: VISAPP (2018)

  22. Mao, W., Liu, M., Salzmann, M.: History repeats itself: human motion prediction via motion attention. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 474–489. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_28

  23. Mavrogiannis, A., Chandra, R., Manocha, D.: B-GAP: behavior-guided action prediction for autonomous navigation. arXiv:2011.03748 (2020)

  24. Mavroudi, E., Haro, B.B., Vidal, R.: Representation learning on visual-symbolic graphs for video understanding. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12374, pp. 71–90. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58526-6_5

  25. Miech, A., Laptev, I., Sivic, J., Wang, H., Torresani, L., Tran, D.: Leveraging the present to anticipate the future in videos. In: IEEE CVPR Workshops (2019)

  26. Ng, Y., Fernando, B.: Forecasting future action sequences with attention: a new approach to weakly supervised action forecasting. IEEE Trans. Image Process. 29, 8880–8891 (2020)

  27. Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., Bajcsy, R.: Berkeley MHAD: a comprehensive multimodal human action database. In: IEEE Workshop on Applications of Computer Vision (WACV) (2013)

  28. Oprea, S., et al.: A review on deep learning techniques for video prediction. IEEE PAMI (2020)

  29. Papoutsakis, K., Panagiotakis, C., Argyros, A.: Temporal action co-segmentation in 3D motion capture data and videos. In: CVPR (2017)

  30. Qammaz, A., Argyros, A.: Occlusion-tolerant and personalized 3D human pose estimation in RGB images. In: 2020 ICPR. IEEE (2021)

  31. Qin, Y., Mo, L., Li, C., Luo, J.: Skeleton-based action recognition by part-aware graph convolutional networks. Vis. Comput. 36, 621–631 (2020)

  32. Rasouli, A.: Deep learning for vision-based prediction: a survey. arXiv:2007.00095 (2020)

  33. Rasouli, A., Yau, T., Rohani, M., Luo, J.: Multi-modal hybrid architecture for pedestrian action prediction. arXiv:2012.00514 (2020)

  34. Reily, B., Han, F., Parker, L., Zhang, H.: Skeleton-based bio-inspired human activity prediction for real-time human-robot interaction. Auton. Robots 42, 1281–1298 (2018)

  35. Rius, I., Gonzàlez, J., Varona, J., Roca, F.: Action-specific motion prior for efficient Bayesian 3D human body tracking. Pattern Recogn. 42, 2907–2921 (2009)

  36. Tavenard, R., et al.: Tslearn, a machine learning toolkit for time series data. J. Mach. Learn. Res. 21, 1–6 (2020)

  37. Ryoo, M.: Human activity prediction: early recognition of ongoing activities from streaming videos. In: IEEE ICCV (2011)

  38. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 26, 43–49 (1978)

  39. Tormene, P., Giorgino, T., Quaglini, S., Stefanelli, M.: Matching incomplete time series with dynamic time warping: an algorithm and an application to post-stroke rehabilitation. Artif. Intell. Med. 45, 11–34 (2009)

  40. Wang, J., Liu, Z., Wu, Y., Yuan, J.: Mining actionlet ensemble for action recognition with depth cameras. In: IEEE CVPR (2012)

  41. Wang, X., Hu, J., Lai, J., Zhang, J., Zheng, W.: Progressive teacher-student learning for early action prediction. In: IEEE CVPR (2019)

  42. Wu, M., et al.: Gaze-based intention anticipation over driving manoeuvres in semi-autonomous vehicles (2020)

  43. Xia, L., Aggarwal, J.: Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In: IEEE CVPR (2013)

Author information

Corresponding author

Correspondence to Victoria Manousaki.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Manousaki, V., Papoutsakis, K., Argyros, A. (2021). Action Prediction During Human-Object Interaction Based on DTW and Early Fusion of Human and Object Representations. In: Vincze, M., Patten, T., Christensen, H.I., Nalpantidis, L., Liu, M. (eds) Computer Vision Systems. ICVS 2021. Lecture Notes in Computer Science, vol. 12899. Springer, Cham. https://doi.org/10.1007/978-3-030-87156-7_14

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-87156-7_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87155-0

  • Online ISBN: 978-3-030-87156-7

  • eBook Packages: Computer Science, Computer Science (R0)
